Rethinking the DBMS

2009/02/24

How unique are GUIDs?

Filed under: Binary implementation — Ben Samuel @ 22:31

How did GUIDs come about?

Naming things can be difficult. One approach is to have a centralized system, the way a book gets an ISBN. The downside is that some central agency has to be notified. The other approach is to have many agencies and let each allocate a range of numbers; this is how the Social Security Administration operated initially and how IP addresses are allocated. These systems sit toward one end of a spectrum: they gain shorter identifiers but pay for it with centralization.

Another approach is to pick a big number at random, big enough that the odds that two numbers will collide are vanishingly small. The downside is that it’s a big, random number, which is difficult for humans to work with. The upside is that you don’t have to plan anything or do any synchronization: it’s just picking a number at random. This is common in nature: proteins on a cell contain identifying information that the immune system uses to tell friend from foe.

Many identifying systems that made poor assumptions in planning run out of room or suffer collisions. IPv4 has the former problem: the countries that grabbed blocks early on hold large swaths of the address space, while countries that are only now expanding their usage, like China and India, hold relatively few blocks.

The bits and bytes of a GUID

GUID stands for Globally Unique Identifier; it’s a more common term than Universally Unique Identifier, but they mean essentially the same thing. And to be clear, there are a number of ways of designing a GUID and different people with different needs have arrived at various solutions concurrently. To keep this article to a reasonable length, I’m only addressing GUIDs as defined by RFC 4122, and even then only a specific type of GUID.

By a minimal reading of the spec, a GUID is 128 bits of random binary data that doesn’t equal anything else unless you copy it bit for bit, and for most software developers, that’s all they care about. Going deeper into the standards, the layout shows the various techniques that the designers used to ensure uniqueness.

Section 4.1.2 of the RFC describes the actual layout, though you have to read further because the fields are used differently by different algorithms. The basic, time-based algorithm uses the precise time plus a 48-bit “spatially unique” node identifier, typically the network card’s MAC address, whose prefix is handed out by a central registry. The pick-a-big-number-at-random method that I’m addressing with this article is under section 4.4, “Algorithms for Creating a UUID from Truly Random or Pseudo-Random Numbers.” When you parse what they mean (or just look at the source code in the appendix), you can see that of the 128 bits, 28 are fixed and 100 are completely random.
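To make the random flavor concrete, here is a minimal Perl sketch of the simplest reading of section 4.4: fill a 16-byte buffer with random data, then overwrite the version and variant fields. It fixes only the handful of bits that every conforming generator must set, not the full 28 bits counted above, so treat it as an illustration of the mechanism rather than of the exact layout this article analyzes. Perl’s rand() is also used purely for illustration; a real generator would draw from a much better source such as /dev/urandom.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a random (version 4) GUID, per the simplest reading of RFC 4122
# section 4.4. rand() is illustrative only and is not a high-quality RNG.
my @bytes = map { int rand 256 } 1 .. 16;

$bytes[6] = ($bytes[6] & 0x0f) | 0x40;   # version field: 0100 = "random"
$bytes[8] = ($bytes[8] & 0x3f) | 0x80;   # variant field: 10xx = RFC 4122

printf "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x\n",
    @bytes;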

How do we measure uniqueness?

It’s easy to make the trivial observation, as many articles do, that the odds of two numbers picked from 0 through 2^100 – 1 colliding are ridiculously small. They are easy to calculate: we’ll call the first number you pick ‘n’. It doesn’t matter what n is, just that it’s a number, because it has no effect on what the second number will be. The odds that the second number you pick will be equal to ‘n’ are 1 : 2^100 – 1, which works out to a probability of about 1 in 1.27 × 10^30.

(Caveat: I’m making a big assumption here: that you used 100 bits of uniformly distributed random data. Although there are very high quality RNGs available, uniformity of the distribution may require specialized algorithms, hardware, or simply more CPU time. On the bright side, all of those things are often available.)

But we’re not using just 2 GUIDs at a time. What are the odds if we pick out many GUIDs?

We’ll try picking three. Each GUID is picked from a space of 2^100 values, which we’ll call ‘s’ to make it easier to read. The logic works like so:

  1. Pick #1. Odds that it is the same as anything else are 0, trivially.
  2. Pick #2. Odds are 1 in s that it is the same as #1, s-1 in s that it is not.
  3. Pick #3. 2 in s that it is the same as #1 or #2, s-2 in s that it is not.

If we continue picking more, then at pick #n there is an n-1 in s chance that it matches one of the numbers we picked before and an s-n+1 in s chance that it does not.

It’s easier to do the math if we phrase the question as a negative: what are the odds that we never pick the same number twice? We make n picks and must avoid a repeat on every single one of them, and that gets progressively harder with each pick.

The picks, however, are made independently of one another, which means that one pick doesn’t affect the odds of the next. The advantage of this is that we can simply multiply the per-pick odds together:

(s-1)/s * (s-2)/s * (s-3)/s * ... * (s-(n-1))/s
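If that chain of reasoning feels slippery, it is easy to check empirically on a small space. The sketch below is my own addition, with made-up example values (s = 1000, n = 40, 100,000 trials), not part of the article’s script: it repeatedly draws n numbers from a space of s and counts how often they all come out distinct, which should land near the product above, roughly 0.45 for these values.

#!/usr/bin/perl
use strict;
use warnings;

# Empirical check of the product formula on a deliberately tiny space.
my ($s, $n, $trials) = (1000, 40, 100_000);

my $clean = 0;   # trials in which every pick was distinct
for (1 .. $trials) {
    my %seen;
    my $collided = 0;
    for (1 .. $n) {
        my $pick = int rand $s;
        if ($seen{$pick}++) { $collided = 1; last; }
    }
    $clean++ unless $collided;
}

# Should print a value close to (s-1)/s * (s-2)/s * ... * (s-39)/s,
# which is roughly 0.45 for s = 1000 and n = 40.
printf "Fraction of trials with no collision: %.4f\n", $clean / $trials;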

The product is easy to compute when n is small, but we already know the odds are good for small values of n. The real question is what happens when we pick millions, billions or even trillions of numbers. So we need to rewrite it in terms of factorials:

(s-1)! / ( (s-n)! * s^(n-1) )
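To convince yourself that the factorial form really is just the product written more compactly, here is a small check on the classic birthday problem, with my own example values of s = 365 and n = 23, where both forms are small enough to evaluate with ordinary floating point.

#!/usr/bin/perl
use strict;
use warnings;

# Check that the product form and the factorial form agree on a small space:
# the classic birthday problem, s = 365 days, n = 23 people.
my ($s, $n) = (365, 23);

# Product form: (s-1)/s * (s-2)/s * ... * (s-(n-1))/s
my $product = 1;
$product *= ($s - $_) / $s for 1 .. $n - 1;

# Factorial form: (s-1)! / ( (s-n)! * s^(n-1) )
# The ratio (s-1)!/(s-n)! is just the product of the integers from s-n+1 to s-1.
my $ratio = 1;
$ratio *= $_ for ($s - $n + 1) .. ($s - 1);
my $factorial_form = $ratio / $s ** ($n - 1);

printf "Product form:   %.6f\n", $product;          # about 0.4927
printf "Factorial form: %.6f\n", $factorial_form;   # the same value

Both print about 0.4927, the familiar result that 23 people have roughly even odds of sharing a birthday.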

A first answer to a first question

The problem is that these are hard numbers to calculate with. We’re starting with 2^100, which is larger than even a 64-bit computer can comfortably handle, and we’re taking a factorial of that.

There are two ways to work with big numbers: 1. use logarithms and 2. use a bignum package. These numbers are big enough that they require both tricks. Fortunately, they are also big enough that we can cheat and approximate the factorial.
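For reference, since the table below is labeled with it: Stirling’s approximation estimates the factorial as n! ≈ sqrt(2 * pi * n) * (n/e)^n, and Gosper’s improvement tightens the square-root term to give n! ≈ sqrt((2n + 1/3) * pi) * (n/e)^n. Taking the logarithm turns this into ln(n!) ≈ n*ln(n) - n + ln(sqrt((2n + 1/3) * pi)), which is what makes the enormous factorials above tractable.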

The code, available here, goes through the steps and tries many different ways of getting the answer, mostly to verify that I wasn’t cheating too much. It should work fine on a typical installation of Perl, though Math::BigInt::GMP would certainly help. Most of the functions, except the last one, will hang unless you’re using small numbers or have a powerful computer. For casual readers, I’ll simply show the results here. What they say is that if you pick 2^26 numbers from a search space of 2^100, there’s a 99.9999999999997% chance that none of them will be the same.

Computed via the log of the expanded Stirling approximation, with Gosper’s improvement. Each row reads “k of 100”: the probability that 2^k numbers picked from a space of 2^100 are all distinct.

(Data for fewer than 2^26 picks is omitted, as the system didn’t use high enough precision.)

26 of 100 : 0.999999999999997
27 of 100 : 0.999999999999992
28 of 100 : 0.999999999999971
29 of 100 : 0.999999999999885
30 of 100 : 0.999999999999544 — about a billion objects
31 of 100 : 0.999999999998180
32 of 100 : 0.999999999992723
33 of 100 : 0.999999999970895
34 of 100 : 0.999999999883584
35 of 100 : 0.999999999534338
36 of 100 : 0.999999998137354
37 of 100 : 0.999999992549419
38 of 100 : 0.999999970197677
39 of 100 : 0.999999880790717
40 of 100 : 0.999999523162955 — about a trillion objects
41 of 100 : 0.999998092653185
42 of 100 : 0.999992370634572
43 of 100 : 0.999969482887531
44 of 100 : 0.999877937137777
45 of 100 : 0.999511837939888
46 of 100 : 0.998048781107475
47 of 100 : 0.992217938260243
48 of 100 : 0.969233234476343
49 of 100 : 0.882496902584595 — about half a quadrillion objects
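If you want to sanity-check these figures without the bignum machinery, the standard first-order birthday approximation, P(no collision) ≈ exp(-n*(n-1)/(2s)), is accurate in this regime and needs nothing but ordinary floating point. The sketch below is my shortcut, not the article’s script, and its last digit or two can differ from the higher-precision values above.

#!/usr/bin/perl
use strict;
use warnings;

# Reproduce the shape of the table above using the first-order birthday
# approximation, P is roughly exp(-n(n-1)/(2s)), instead of the
# Stirling/Gosper route the article's script takes.
my $s = 2 ** 100;
for my $k (26 .. 49) {
    my $n = 2 ** $k;
    my $p = exp( -$n * ($n - 1) / (2 * $s) );
    printf "%d of 100 : %.15f\n", $k, $p;
}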

Conclusion

This is certainly better than the hand-waving assumption we started with, that the chance of a collision is simply infinitesimal. But, really, what do these numbers mean? They say what they say: if your system identifies a trillion objects, you have a 99.9999% chance that there will not be a single collision. So, if you were hoping for a headline that would say “GUIDs proven unworkable,” sorry to disappoint you.

I’d like to address two issues in a followup to this article: the effects of uniformity of the random number generator, and a better answer to the question of how you actually determine the optimal size of a unique identifier, based on various uses.
