A browser cookie contains a 128 bit "unique" identifier. 128 bits is an inconvenient size, being longer than native integer size on nearly all CPUs. The 128 bit number can be split into two 64-bit numbers; are both needed to correlate user records?
In my sample of 103 million cookie identifers, there were only 2 collisions when grouping by the left half of the identifier. For statistical calculations, this 0.000002% error rate is negligible.
In a sample of 93 million cookie identifiers, there was only 1 collision when grouping by the right half. In a sample of 129 million, there were 4 collisions when grouping by the right half.
The expected number of collisions among k samples out of n possible codes is:
|k ( 1 − ( 1 − 1/n )k−1 )|
The expected number of matching birthdays among 20 people is 1.016. How many collisions are expected from 103 million samples of randomly distributed 64 bit numbers?
In order to keep the numbers within double-precision range, use the fact that ln(1+ε) ≈ ε for ε near 0:
|k ⋅ ( 1 − ( 1 − 1/n )k−1 )||≈||k ⋅ ( 1 − e(1−k)/n )|
The expected number of collisions for 103×106 random 64.bit numbers is 575.115×10−6. So the cookie identifiers are not uniformly distributed. Another way of saying this is that the distribution of cookies has less entropy than a 64.bit random variable. How much entropy does the distribution have?
Solving the expected-collisions formula for n will give an estimate of the entropy (c is the number of collisions):
1 − k
ln( 1−c/k )
Two collisions among 103×106 cookie-IDs indicate around 52.2 bits of entropy (log2 5.3×1015). Looking at a sample of cookie-IDs notice that the digits under the "M" are all "4" and the digits under "N" are always "8", "9", "a", or "b".
M N 3001457e-e07a-4b20-ba16-f3f8624da98f 73f9436b-2868-4325-a62c-bf5a4f38dfdf 1c4df5b6-210b-4aba-9660-bdb129ea6f31 34c3fcb1-bf16-4a96-a7e9-ec377bface30 a1ddb5fc-4bb8-4321-a1bf-753668387197 3edd32e0-5bad-476d-bc1c-8cd8c80b88f8 2ee886c8-d877-480c-b109-455cd3c1b099 5c76c27d-7caf-4adf-8aec-65fee5ce281a 93af502b-61db-4596-bcfe-188c352296b5 ed981151-6179-45cf-8e0d-e4293caf62a0 54ea634b-ebfc-4371-97a7-5b1123264e4b 9a3cce97-2a4b-4a56-8fd9-85d9f07abce4 0bfdbd6f-5fd8-4c74-ab7e-1ccb1a81fd70 84ac5d8a-c64f-48a3-8d11-653aea56e2b4 f5e3afb2-e87b-45e1-9b1c-a49d3e106124 b5f832f5-0771-4554-8a00-da142aab1832 ca025994-9412-4311-bb62-c7e6458a94dc cf1fc6a6-b3d4-47a5-8aa5-063120d5b8c4 83add10c-b266-4995-8803-805d32c58147 d85c3ec5-3d77-4767-b87b-a42a20dc9f74 b6104d07-b29e-4b07-b3c8-65efd08bd003 21111ab5-83ab-4af8-a39f-1824a4055a9a 78346096-3202-48d8-9ed8-797a0f233804 1d124d3d-8ac2-405b-9a8a-afb70dac4720 e187bd94-c091-4fb5-a3c4-2a50e2fe4e35 8dca4524-cb3d-4a40-ae4d-881976a6173f 94e458b1-04c8-4495-b3fb-4cbbc4c37561 7d0819df-4a08-4553-b10b-379433b09e35 74332f07-3183-479b-a285-04c1a9e61bc4 4ac04ba2-950e-4315-bb14-eb3369dd1591 f8a20163-1db1-47e4-a6b5-d8fa4519eb79 ba778464-0bf4-4575-93e9-ac5b41923e3a 6dda046c-651d-4d9c-9482-4d3c1247a14a
This turns out to be RFC-4122 A Universally Unique IDentifier (UUID) URN Namespace format version 4. The left half of the UUID should have 60 bits of entropy if the random or psuedo-random source is uniformly distributed. The left half having only 52 bits of entropy indicates that the source quality could be improved. Even though the right half should have 2 more bits of entropy than the left, it also has about 52 bits.
Note that a linearly increasing sequence would not have any collisions although, as a sequence, it has low sequence-entropy. Although having the expected hash entropy is a test which a pseudo-random-number-generator should pass, PRNG sequences must also be "random". So low collision counts are a necessary, but not sufficient condition for PRNG quality.
This collision test's insensitivity to sequence makes it useful for testing hash functions. A good hash function will have have entropy near the number of codes it can output.
Sarah V. Jaffer has a much simpler way to compute the expected number of collisions. k is much smaller than n. The probability of one choice hitting another is k/n. There are k chances to hit another choice; so the expected number of collisions is roughly k2/n. In the cookie calculation, Sarah's formula matches within one part-per-million.
How are these formulas related? Because ln(1+ε)≈ε, it follows that eε−1≈ε (for small ε). Thus the expected number of collisions is:
k ⋅ ( 1 − e(1−k)/n ) ≈ k ⋅ (k−1)
Realizing that k≫1 leads to Sarah's formula k2/n.
I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.
My actions and comments do not reflect in any way on MIT.|
|CS and Numerical Analysis|
|agj @ alum.mit.edu||Go Figure!|