Thursday, 24 November 2011

Existence checks

Let's say you have a large table of data, lots of rows with a few columns. You know that every row is unique because that's a basic feature of your table, no duplicates. There are only a few instances where non-unique elements are a good idea in tables, so we're going to ignore that corner case for now.

Let's look at what we have with an example:

Find out if Bob is the leader of team A
Name Clan Rank
alice team A leader
bob team B private
bob A team leader
charlie A team private

We can see that we can't just check for any one column and return on success, this is much like checking for a string, we have to check every element before we're sure we have a match. You could maintain the table's alphabetical order, reducing the string queries down to only the rows that match "bob" or "team A" (dependent on which column you sorted by). If you walk through the table, you have to string match at least one column.

But, we know that the table is made up of unique elements. There will be only one row returned if at all, so that makes this table a set. It's the set of Name Clan Rank associations that are known to be true. There are a very large number of potentially true values in this set, the product of the list of all potential names, all potential clans, and all potential ranks. Let's assume 4 billion names, 1 million clans, and 100 ranks. If we were to use a bit field to identify all the possible permutations, we'd not be able to fit the array into our computers. 400 trillion combinations would be quite literally, larger than we could handle. But sparse bit fields have better solutions than just simple array lookups.

What we've done is reduced the problem to its essential components. We're not looking for the row that contains the data, we're looking to see if the data exists. We don't want to access the data, only find out if it's in there. For this task, there are many solutions, but I'll just mention one.

I like hashes. My solution to this is to store the hashes of all the rows in an array of small arrays. I index into those small arrays by a small number of bits, the smallest amount of bits I can that differentiates the hashes enough that all the associated ones don't overflow the capacity of the cacheline sized mini arrays.

H1[bits0-n] -> index of array it should be in.

With this, I do one cache line load to get at the array of hashes, and if the hash is in there, then I can return that the row does exist. If not, then I can assert that it doesn't. And that's quick.

Post a Comment