Tuesday, 1 July 2008

Hash functions.

Hashes reduce the amount of information in an object down to some limited form. They take all the bytes that make up the original, and combine them in all sorts of ways to generate a hash value. A good hash function will modify approximately half the bits of the hash value whenever 1 bit of the original byte stream has changed. There are some very good hash functions out there, and rolling your own isn't that difficult to do well.

Using hash functions and values can be a little more demanding though as you always have to remember that a hash code will be able to tell you, for certain, that two objects are different, but never be able to guarantee that they are the same. Also, because of the birthday problem (explained in a moment) the number of possible objects going into a hash that can come out unique is quite small to the size of the possible variants of the hash value. Even a perfect hash (one that has 1 bit difference half hash modification bit per bit 50% variance also) cannot get around the problem of a hash being a proof of non-similiarity. The reason is simple. A hash is a reduced amount of information.

The Birthday problem:
How likely do you think it is that someone at your workplace has the same birthday as you?
That is quite a simple calculation, your take the number of people in the office, minus one for yourself, then calculate 364 over 365 to the power of the number of other workers to figure out the probability that you dont share a birthday.

20 workers = 95% that you wont share.
100 workers = 76% that you wont share.

but, what about the probability that someone in the work place shares a birthday with someone else?

well, if you're out of the picture, then there's the rest of the office against random dude.

20 workers = 95% for you, 95.2% for dude...

what about if we consider the number of people in the office in total with those that they could check against? that's 190 possible birthday comparisons for a room of 20 people...

20 workers = 59.4% chance of not sharing a birthday!
oh dear...
100 workers = 0.00012% chance of not sharing a birthday...

office count combinations probability of unshared birthday
5
10
0.972938
10
45
0.88386
15
105
0.749712
20
190
0.593771
25
300
0.439092
30
435
0.303184
35
595
0.195465
40
780
0.117664
45
990
0.066135
50
1225
0.034709
55
1485
0.017008
60
1770
0.007782
65
2080
0.003324
70
2415
0.001326


okay, so that's fairly mad. A better view of the combinatoric reasoning for the sudden growth can be found here: http://en.wikipedia.org/wiki/Birthday_paradox

so... for a hash code that that has 4 billion values, we can expect to get quite a few hashes, but we will hit a wall at about 77000 as at that point, the probability of having a collision is just over 50%, and no matter how many times you adjust the seed, once you get to about 200000, it will be almost impossible to find a valid non-colliding hash. That's 4 billion 4,000,000,000 can't support 200,000 different hashes, just because of some silly factorial law.
Post a Comment