Hash Entropy

`3001457e- e07a-4b20 -ba16-f3f 8624da98f`	http://people.csail.mit.edu/jaffer/III/Hash-Entropy
	Hash Entropy

A browser cookie contains a 128 bit "unique" identifier. 128 bits is an inconvenient size, being longer than native integer size on nearly all CPUs. The 128 bit number can be split into two 64-bit numbers; are both needed to correlate user records?

In my sample of 103 million cookie identifers, there were only 2 collisions when grouping by the left half of the identifier. For statistical calculations, this 0.000002% error rate is negligible.

In a sample of 93 million cookie identifiers, there was only 1 collision when grouping by the right half. In a sample of 129 million, there were 4 collisions when grouping by the right half.

The expected number of collisions among k samples out of n possible codes is:

k ( 1 − ( 1 − 1/n )^k−1 )

The expected number of matching birthdays among 20 people is 1.016. How many collisions are expected from 103 million samples of randomly distributed 64 bit numbers?

In order to keep the numbers within double-precision range, use the fact that ln(1+ε) ≈ ε for ε near 0:

k ⋅ ( 1 − ( 1 − 1/n )^k−1 )

≈

k ⋅ ( 1 − e^(1−k)/n )

The expected number of collisions for 103×10⁶ random 64.bit numbers is 575.115×10⁻⁶. So the cookie identifiers are not uniformly distributed. Another way of saying this is that the distribution of cookies has less entropy than a 64.bit random variable. How much entropy does the distribution have?

Solving the expected-collisions formula for n will give an estimate of the entropy (c is the number of collisions):

n ≈	1 − k ln( 1−c/k )

Two collisions among 103×10⁶ cookie-IDs indicate around 52.2 bits of entropy (log₂ 5.3×10¹⁵). Looking at a sample of cookie-IDs notice that the digits under the "M" are all "4" and the digits under "N" are always "8", "9", "a", or "b".

              M    N
3001457e-e07a-4b20-ba16-f3f8624da98f
73f9436b-2868-4325-a62c-bf5a4f38dfdf
1c4df5b6-210b-4aba-9660-bdb129ea6f31
34c3fcb1-bf16-4a96-a7e9-ec377bface30
a1ddb5fc-4bb8-4321-a1bf-753668387197
3edd32e0-5bad-476d-bc1c-8cd8c80b88f8
2ee886c8-d877-480c-b109-455cd3c1b099
5c76c27d-7caf-4adf-8aec-65fee5ce281a
93af502b-61db-4596-bcfe-188c352296b5
ed981151-6179-45cf-8e0d-e4293caf62a0
54ea634b-ebfc-4371-97a7-5b1123264e4b
9a3cce97-2a4b-4a56-8fd9-85d9f07abce4
0bfdbd6f-5fd8-4c74-ab7e-1ccb1a81fd70
84ac5d8a-c64f-48a3-8d11-653aea56e2b4
f5e3afb2-e87b-45e1-9b1c-a49d3e106124
b5f832f5-0771-4554-8a00-da142aab1832
ca025994-9412-4311-bb62-c7e6458a94dc
cf1fc6a6-b3d4-47a5-8aa5-063120d5b8c4
83add10c-b266-4995-8803-805d32c58147
d85c3ec5-3d77-4767-b87b-a42a20dc9f74
b6104d07-b29e-4b07-b3c8-65efd08bd003
21111ab5-83ab-4af8-a39f-1824a4055a9a
78346096-3202-48d8-9ed8-797a0f233804
1d124d3d-8ac2-405b-9a8a-afb70dac4720
e187bd94-c091-4fb5-a3c4-2a50e2fe4e35
8dca4524-cb3d-4a40-ae4d-881976a6173f
94e458b1-04c8-4495-b3fb-4cbbc4c37561
7d0819df-4a08-4553-b10b-379433b09e35
74332f07-3183-479b-a285-04c1a9e61bc4
4ac04ba2-950e-4315-bb14-eb3369dd1591
f8a20163-1db1-47e4-a6b5-d8fa4519eb79
ba778464-0bf4-4575-93e9-ac5b41923e3a
6dda046c-651d-4d9c-9482-4d3c1247a14a

This turns out to be RFC-4122 A Universally Unique IDentifier (UUID) URN Namespace format version 4. The left half of the UUID should have 60 bits of entropy if the random or psuedo-random source is uniformly distributed. The left half having only 52 bits of entropy indicates that the source quality could be improved. Even though the right half should have 2 more bits of entropy than the left, it also has about 52 bits.

Note that a linearly increasing sequence would not have any collisions although, as a sequence, it has low sequence-entropy. Although having the expected hash entropy is a test which a pseudo-random-number-generator should pass, PRNG sequences must also be "random". So low collision counts are a necessary, but not sufficient condition for PRNG quality.

This collision test's insensitivity to sequence makes it useful for testing hash functions. A good hash function will have have entropy near the number of codes it can output.

Sarah V. Jaffer has a much simpler way to compute the expected number of collisions. k is much smaller than n. The probability of one choice hitting another is k/n. There are k chances to hit another choice; so the expected number of collisions is roughly k²/n. In the cookie calculation, Sarah's formula matches within one part-per-million.

How are these formulas related? Because ln(1+ε)≈ε, it follows that e^ε−1≈ε (for small ε). Thus the expected number of collisions is:

k ⋅ ( 1 − e^(1−k)/n ) ≈ k ⋅ (k−1)
n

Realizing that k≫1 leads to Sarah's formula k²/n.

I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory. My actions and comments do not reflect in any way on MIT.
		CS and Numerical Analysis
	agj @ alum.mit.edu	Go Figure!