## Tuesday, 1 July 2008

### Hash functions.

Hashes reduce the amount of information in an object down to some limited form. They take all the bytes that make up the original, and combine them in all sorts of ways to generate a hash value. A good hash function will modify approximately half the bits of the hash value whenever 1 bit of the original byte stream has changed. There are some very good hash functions out there, and rolling your own isn't that difficult to do well.

Using hash functions and values can be a little more demanding though as you always have to remember that a hash code will be able to tell you, for certain, that two objects are different, but never be able to guarantee that they are the same. Also, because of the birthday problem (explained in a moment) the number of possible objects going into a hash that can come out unique is quite small to the size of the possible variants of the hash value. Even a perfect hash (one that has 1 bit difference half hash modification bit per bit 50% variance also) cannot get around the problem of a hash being a proof of non-similiarity. The reason is simple. A hash is a reduced amount of information.

The Birthday problem:
How likely do you think it is that someone at your workplace has the same birthday as you?
That is quite a simple calculation, your take the number of people in the office, minus one for yourself, then calculate 364 over 365 to the power of the number of other workers to figure out the probability that you dont share a birthday.

20 workers = 95% that you wont share.
100 workers = 76% that you wont share.

but, what about the probability that someone in the work place shares a birthday with someone else?

well, if you're out of the picture, then there's the rest of the office against random dude.

20 workers = 95% for you, 95.2% for dude...

what about if we consider the number of people in the office in total with those that they could check against? that's 190 possible birthday comparisons for a room of 20 people...

20 workers = 59.4% chance of not sharing a birthday!
oh dear...
100 workers = 0.00012% chance of not sharing a birthday...

 office count combinations probability of unshared birthday 5 10 0.972938 10 45 0.88386 15 105 0.749712 20 190 0.593771 25 300 0.439092 30 435 0.303184 35 595 0.195465 40 780 0.117664 45 990 0.066135 50 1225 0.034709 55 1485 0.017008 60 1770 0.007782 65 2080 0.003324 70 2415 0.001326

okay, so that's fairly mad. A better view of the combinatoric reasoning for the sudden growth can be found here: http://en.wikipedia.org/wiki/Birthday_paradox

so... for a hash code that that has 4 billion values, we can expect to get quite a few hashes, but we will hit a wall at about 77000 as at that point, the probability of having a collision is just over 50%, and no matter how many times you adjust the seed, once you get to about 200000, it will be almost impossible to find a valid non-colliding hash. That's 4 billion 4,000,000,000 can't support 200,000 different hashes, just because of some silly factorial law.