Using hash functions and values can be a little more demanding though as you always have to remember that a hash code will be able to tell you, for certain, that two objects are different, but never be able to guarantee that they are the same. Also, because of the birthday problem (explained in a moment) the number of possible objects going into a hash that can come out unique is quite small to the size of the possible variants of the hash value. Even a perfect hash (one that has 1 bit difference half hash modification bit per bit 50% variance also) cannot get around the problem of a hash being a proof of non-similiarity. The reason is simple. A hash is a reduced amount of information.

The Birthday problem:

How likely do you think it is that someone at your workplace has the same birthday as you?

That is quite a simple calculation, your take the number of people in the office, minus one for yourself, then calculate 364 over 365 to the power of the number of other workers to figure out the probability that you dont share a birthday.

20 workers = 95% that you wont share.

100 workers = 76% that you wont share.

but, what about the probability that someone in the work place shares a birthday with someone else?

well, if you're out of the picture, then there's the rest of the office against random dude.

20 workers = 95% for you, 95.2% for dude...

what about if we consider the number of people in the office in total with those that they could check against? that's 190 possible birthday comparisons for a room of 20 people...

20 workers = 59.4% chance of not sharing a birthday!

oh dear...

100 workers = 0.00012% chance of not sharing a birthday...

office count | combinations | probability of unshared birthday | ||

5 | 10 | 0.972938 | ||

10 | 45 | 0.88386 | ||

15 | 105 | 0.749712 | ||

20 | 190 | 0.593771 | ||

25 | 300 | 0.439092 | ||

30 | 435 | 0.303184 | ||

35 | 595 | 0.195465 | ||

40 | 780 | 0.117664 | ||

45 | 990 | 0.066135 | ||

50 | 1225 | 0.034709 | ||

55 | 1485 | 0.017008 | ||

60 | 1770 | 0.007782 | ||

65 | 2080 | 0.003324 | ||

70 | 2415 | 0.001326 |

okay, so that's fairly mad. A better view of the combinatoric reasoning for the sudden growth can be found here: http://en.wikipedia.org/wiki/Birthday_paradox

so... for a hash code that that has 4 billion values, we can expect to get quite a few hashes, but we will hit a wall at about 77000 as at that point, the probability of having a collision is just over 50%, and no matter how many times you adjust the seed, once you get to about 200000, it will be almost impossible to find a valid non-colliding hash. That's 4 billion 4,000,000,000 can't support 200,000 different hashes, just because of some silly factorial law.

## No comments:

Post a comment