Rethinking the DBMS

2009/02/24

How unique are GUIDs?

Filed under: Binary implementation — Ben Samuel @ 22:31

How did GUIDs come about?

Naming things can be difficult. One approach is to have a centralized system, the way a book gets an ISBN. The downside is that some central agency has to be notified of every new name. The other approach is to have many agencies and let each allocate from its own range of numbers; this is the way the Social Security Administration operated initially, and it is the way IP addresses are allocated. These systems sit at the end of a spectrum that gains shorter identifiers but pays for them with centralization.

Another approach is to pick a big number at random, big enough that the odds that two numbers will collide are vanishingly small. The downside is that it’s a big, random number, which is difficult for humans to work with. The upside is that you don’t have to plan anything or do any synchronization: it’s just picking a number at random. This is common in nature: proteins on a cell contain identifying information that is used by the immune system to determine friend and foe.

Many identifying systems that made poor assumptions in planning have run into trouble. IPv4 has this problem: the countries that grabbed blocks early on hold lots of address blocks, while countries that are only now expanding their usage, like China and India, have relatively few.

The bits and bytes of a GUID

GUID stands for Globally Unique Identifier; it’s a more common term than Universally Unique Identifier, but they mean essentially the same thing. And to be clear, there are a number of ways of designing a GUID, and different people with different needs have independently arrived at various solutions. To keep this article to a reasonable length, I’m only addressing GUIDs as defined by RFC 4122, and even then only a specific type of GUID.

By a minimal reading of the spec, a GUID is 128 bits of random binary data that doesn’t equal anything else unless you copy it bit for bit, and for most software developers, that’s all they care about. Going deeper into the standard, the layout reveals the various techniques the designers used to ensure uniqueness.

Section 4.1.2 of the RFC describes the actual layout, though you have to read further, because the layout is filled in differently by different algorithms. The basic algorithm (section 4.2.1) uses the precise time and a 48-bit node identifier handed out by a central registry. The big-number-picked-at-random method that I’m addressing in this article is covered by section 4.4, “Algorithms for Creating a UUID from Truly Random or Pseudo-Random Numbers.” When you parse what they mean (or just look at the source code in the appendix), you can see that of the 128 bits, 28 are fixed and 100 are completely random.
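To sketch how the section 4.4 flavor gets built (this is my own illustration, not the RFC’s reference code, and Perl’s built-in rand is not a serious source of randomness), the mechanics are simply: fill 16 bytes with random data, then force the version and variant fields:

    use strict;
    use warnings;

    # Fill 16 bytes with random data, then overwrite the fixed fields.
    # Caveat: rand() is a toy source; real code needs a much better RNG.
    my @bytes = map { int rand 256 } 1 .. 16;
    $bytes[6] = ($bytes[6] & 0x0F) | 0x40;   # version field: 4, "random"
    $bytes[8] = ($bytes[8] & 0x3F) | 0x80;   # variant field: binary 10xx
    printf "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-" .
           "%02x%02x%02x%02x%02x%02x\n", @bytes;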

How do we measure uniqueness?

It’s easy to make the trivial observation, as many articles do, that the odds of picking the same two numbers from 0 through 2^100 - 1 are ridiculously small. They are easy to calculate: we’ll call the first number you pick ‘n’. It doesn’t matter what n is, just that it’s a number, because it has no effect on what the second number will be. The odds that the second number you pick will be equal to ‘n’ are 1 : 2^100 - 1.

(Caveat: I’m making a big assumption here: that you used 100 bits of uniformly distributed random data. Although there are very high quality RNGs available, uniformity of the distribution may require specialized algorithms, hardware, or simply more CPU time. On the bright side, all of those things are often available.)
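On most Unix-like systems, for instance, the kernel will hand you high-quality random bytes directly. A minimal sketch, assuming /dev/urandom exists on your platform:

    use strict;
    use warnings;

    # Pull 16 raw bytes (128 bits) from the kernel's entropy pool.
    open my $fh, '<:raw', '/dev/urandom' or die "can't open /dev/urandom: $!";
    read $fh, my $buf, 16 or die "short read: $!";
    close $fh;
    print unpack('H32', $buf), "\n";   # the 128 bits, printed as hex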

But we’re not using just 2 GUIDs at a time. What are the odds if we pick out many GUIDs?

We’ll try picking three. This time we’ll pick 3 from the full space of 2^100 numbers, which we’ll call ‘s’ to make it easier to read. The logic works like so:

  1. Pick #1. Odds that it is the same as anything else are 0, trivially.
  2. Pick #2. Odds are 1 in s that it is the same as #1, s-1 in s that it is not.
  3. Pick #3. 2 in s that it is the same as #1 or #2, s-2 in s that it is not.

If we continue picking more, then at pick number n there is an (n-1) in s chance that it is the same as one we picked before and an (s-n+1) in s chance that it is not.

It’s easier to do the math if we phrase the question as a negative: what are the odds that we never pick the same number twice? We therefore make n picks and must avoid a repeat on every single pick, and it gets progressively harder each time we pick a number.

The picks, however, are mutually independent, which means that one pick doesn’t affect the odds of the next. The advantage of this is that we can simply multiply the odds of surviving each pick:

(s-1)/s * (s-2)/s * (s-3)/s * ... * (s-n+1)/s
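For modest values of n you can evaluate that product directly with a loop; here’s a quick Perl sketch (the function name and the sample numbers are mine):

    use strict;
    use warnings;

    # Exact odds that n picks from a space of size s are all distinct,
    # multiplied out term by term.  Only practical while n is loopable.
    sub no_collision_prob {
        my ($s, $n) = @_;
        my $p = 1;
        $p *= ($s - $_) / $s for 1 .. $n - 1;
        return $p;
    }

    printf "%.6f\n", no_collision_prob(2 ** 40, 100_000);   # about 0.9955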

Now, that’s easy when n is small, but we already knew the answer for very small values of n. The real question is what happens when we pick millions, billions or even trillions of numbers. So we need to rewrite the product using factorials:

(s-1)! / ( (s-n)! * s^(n-1) )

A first answer to a first question

The problem is that big numbers are hard to calculate with. We’re starting with 2^100, which is larger than even a 64-bit computer can comfortably work with, and we’re taking a factorial of that.

There are two ways to work with big numbers: use logarithms, or use a bignum package. These numbers are big enough that they require both tricks. Fortunately, they are also big enough that we can cheat and approximate the factorial.
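In fact, once s is this large, the whole calculation collapses into the standard birthday-problem shortcut, P ≈ exp(-n(n-1)/2s), which reproduces the table below to within a unit in the last decimal place. Here’s a minimal Perl sketch of that shortcut (a condensed stand-in, not the script linked below):

    use strict;
    use warnings;

    # Birthday-problem shortcut: P(no collision) ~ exp(-n(n-1) / 2s).
    # Assumes n uniformly random picks from a space of s = 2^100 values;
    # plain doubles suffice here because only the ratio n^2 / s matters.
    my $s = 2 ** 100;
    for my $bits (26 .. 49) {
        my $n = 2 ** $bits;
        printf "%d of 100 : %.15f\n", $bits, exp(-$n * ($n - 1) / (2 * $s));
    }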

The code, available here, goes through the steps and tries many different ways of getting the answer, mostly to verify that I wasn’t cheating too much. It should work fine on a typical installation of Perl, but Math::BigInt::GMP would certainly help. Most of the functions, except the last one, will hang unless you’re using small numbers or have a powerful computer. For casual readers, I’ll simply show the results here. What they say is that if you pick 2^26 numbers from a search space of 2^100, there’s a 99.9999999999997% chance that none of them will be the same.

Log of the expanded Stirling approximation, with Gosper’s improvement. Each row reads: picking 2^n numbers from a space of 2^100, the probability that no two are the same.

(Data for before 2^26 is omitted as the system didn’t use high enough precision.)
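(For reference, Gosper’s improvement replaces the sqrt(2*pi*n) factor in Stirling’s formula, giving n! ≈ sqrt((2n + 1/3) * pi) * n^n * e^(-n), which stays accurate even for small n.)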

26 of 100 : 0.999999999999997
27 of 100 : 0.999999999999992
28 of 100 : 0.999999999999971
29 of 100 : 0.999999999999885
30 of 100 : 0.999999999999544 — about a billion objects
31 of 100 : 0.999999999998180
32 of 100 : 0.999999999992723
33 of 100 : 0.999999999970895
34 of 100 : 0.999999999883584
35 of 100 : 0.999999999534338
36 of 100 : 0.999999998137354
37 of 100 : 0.999999992549419
38 of 100 : 0.999999970197677
39 of 100 : 0.999999880790717
40 of 100 : 0.999999523162955 — about a trillion objects
41 of 100 : 0.999998092653185
42 of 100 : 0.999992370634572
43 of 100 : 0.999969482887531
44 of 100 : 0.999877937137777
45 of 100 : 0.999511837939888
46 of 100 : 0.998048781107475
47 of 100 : 0.992217938260243
48 of 100 : 0.969233234476343
49 of 100 : 0.882496902584595 — about half a quadrillion objects

Conclusion

This is certainly better than the vague assurance we had before that the chance of a collision was infinitesimal. But, really, what do these numbers mean? They say what they say: if your system identifies a trillion objects, you have about a 99.99995% chance that there will not be a single collision. So, if you were hoping for a headline that would say “GUIDs proven unworkable,” sorry to disappoint you.

I’d like to address two issues in a follow-up to this article: the effects of uniformity of the random number generator, and a better answer to the question of how you actually determine the optimal size of a unique identifier, based on various uses.

Defining relation value and relation variable

Filed under: Theory — Ben Samuel @ 15:49

This entry lays a little groundwork, as many readers aren’t familiar with the relational algebra or the relational calculus. (Incidentally, I recently got a copy of Logic and Databases, where Date goes into exactly what the two terms mean; more to come when I’ve finished reading it.) Even if you are familiar with them, there are a lot of wrong ideas floating around about the basics, and anyone can be stuck with them.

  • A relation value, in the proper context, is a structured assertion about the world.

Context is important: if someone asks, “How many fingers do you have?” then the number 5, an integer value, is an assertion about the world. A value has meaning only inasmuch as it is placed in context.

As an example of what it means to place something in context, consider a computer program. By itself it is just a long stream of symbols, but if you run it on a computer it actually does something. This sounds like common sense, but in reality many people don’t understand this and often succumb to magical thinking. You see this often in the phrase “there ought to be a law,” which carries the implicit assumption that merely writing a proscription against something will make it stop.

Structure means that you can derive new values from a value in a reproducible fashion. In particular, I’m talking about deterministic structure, such as the fact that you can add integers to get a certain sum, or decompose an integer into a product of primes. Nondeterministic structure would include statistical operations. For example, say you have a plain text representation of a book, written in English. You could analyze the frequency of various letters or letter combinations and assert with a high degree of confidence that the language is English, but there’s no mathematical operation that can say for certain what the language is, and no algorithm that can take an argument from the book and produce a counter-argument. The structure is there: the book definitely is written in English; it has sentences, ideas, themes, &c. But you’re forced to compare it indirectly with a mathematical model.

  • A relation variable is a portion of the larger context adding meaning to the value it is assigned.

The structure of a relation variable is typically represented as a table with labeled columns and unlabeled rows. The problem with any representation is that it tends to add stuff that shouldn’t be there. Here’s an example:

[Figure: simple-table.png, a sign-in sheet recorded as a simple table of names and times.]

This is a pretty typical example of how people record information. There are some minor issues, such as the fact that this representation suggests ordered columns and rows, when neither is ordered in a relation value. But the biggest thing this representation adds, and that simply doesn’t exist in an individual relation variable, is meaning. Even if you weren’t sure what a sign-in sheet was, you could plainly see a list of people signing their names in chronological order, all on the same day, during business hours.

In a typical DBMS, you might have a comment that explains roughly what your table is for, sometimes euphemistically known as a natural language predicate, but that’s just a comment to you. All you can really attach to a variable is a set of constraints saying which values aren’t allowed. Even the predicate logic view doesn’t get you very far. The idea is that each row is a proposition: a proposition says that p(a, b, c, d) is true. And the predicate it belongs to is just the pattern of the proposition; it adds no meaning.
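To make that concrete, here’s a small sketch of a relation variable holding the sign-in sheet from above, with one constraint attached (the column names and the business-hours rule are my inventions for the example):

    use strict;
    use warnings;

    # A relation value as an unordered set of tuples: hash keys are the
    # column labels, and neither column order nor row order means anything.
    my @sign_ins = (
        { name => 'A. Adams', time_in => '09:05' },
        { name => 'B. Brown', time_in => '09:12' },
    );

    # A constraint can only rule values out; it says nothing about what
    # the relation means.  Here: no sign-ins outside business hours.
    for my $row (@sign_ins) {
        my ($hour) = $row->{time_in} =~ /^(\d+):/;
        die "constraint violated: $row->{time_in}"
            if $hour < 9 || $hour >= 17;
    }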

The reason I bring all this up is that plenty of people did buy into the idea that adding labels and comments can add meaning; the latest example of attempting this is XML. Mathematically, XML gets you nothing you didn’t already have with symbolic expressions, which date back at least to the ’60s. (This isn’t to say that the technologies that XML replaced weren’t worse.)

Conclusion

What I’m going to describe in later posts is a minimalist representation for relation values, based on a small addition to the relational algebra. We’ll explore some basic relation operators and derive a standard form.

An Overview

Filed under: General — Ben Samuel @ 13:29

The premise behind this blog is that wherever software systems use a general purpose database management system, data becomes manageable, reliable, and accessible to ad hoc queries, and the user interface is well separated from the underlying storage. This isn’t to say that general purpose DBMSs are perfect or that every use of a DBMS works as it ought to, but it’s clear that they are head and shoulders above purpose-built or home-brewed systems.

My goal in writing this is to look at various applications and application domains, survey what they presently use to manage data, and consider how that might change. I will also examine the theoretical basis of what they’re trying to accomplish.
