Wednesday, 4 December 2013

A Primer in Compressing things.

Encoding into bits

If you want to store a number in a computer, it has to be stored as a sequence of bits. These bits can be used to represent large numbers, but the number space represented by whole number counts of bits is always a power of two.

  • 1 bit = 2 options, or a range of 2 values = 2^1
  • 2 bits = 4 options, or a range of 4 values = 2^2
  • 3 bits = 8 options, or a range of 8 values = 2^3
  • 10 bit = 1024 options, or a range of 1024 values = 2^10

As long as you can fit your value in under the number of options, you can encode it in that many bits. You can encode numbers from zero up to 1023 in 10 bits as it is under 1024.

For example :

  • to store a 7 or a 4, you could use 3 bits (b111 and b100) as 7 and 4 are both under 8.
  • to store a 10, you could use 4 bits (b1010), as 10 is under 16 (2^4=16)

Using your knowledge

You don't have to store your numbers starting at zero. As long as you know the minimum and maximum possible, you can shift the options into your needed range.

For example, I know that the number of legs on any table I own is always three or four. That's only two options, so I can store it in 1 bit. I store the number of legs as 3+(encoded bit)

  • for 3 legs, encode a 0
  • for 4 legs, encode a 1
I know that the valid temperature range on a piece of equipment ranges from 45 to 55 degrees, so I can encode it as a range of 11 (so, really 16 as that's the smallest power of two above 11), so I encode the data at 45+encoded value
  • for 45 degrees, encode a 0 (b0000)
  • for 50 degrees, encode a 5 (b0101)
  • for 55 degrees, encode a 10 (b1010)

Wasting space

What's noticable from the temperature example is that the values 12-15 are never used. We're wasting potentially usable bits, except they aren't whole bits, so there is no obvious way of using them. How many options can you store in half a bit?

Using different bases

The reason for the waste is the number space being used. If we restrict ourselves to the number space of the computer, we make encoding and decoding easy for the CPU, but we waste storage space. If we use a base other than 2, we can pack the data really tight. In theory, it's possible to pack your data so that you only waste less than one bit of data.

A simple example is in order. Assume you have a list of booleans with an additional unknown state. These tribools come up frequently so this might be of practical use to some of you. Tribools require three options. How many can you store in a normal 8 bit byte? Using base 2, you have to use two bits per tribool, meaning that in 8 bits, you can store 4 tribools.

But 8 bits can store up to 256 different options, and if you slice it differently, you can pack more tribools in. Tribools (3) stored 5 times has less options than the bools (2) stored 4 times.
  • 2^4 = 256
  • 3^5 = 243
To store the tribools, we merely need to generate the "option" of that set of 5 tribools, and then encode that value instead. The trivial example is all tribools but the first are zero:

  • 0,0,0,0,0 => 0
  • 1,0,0,0,0 => 1
  • 2,0,0,0,0 => 2
what is interesting is what happens when we add to the other elements
  • 0,1,0,0,0 => 3 (1*0 + 3*1)
  • 0,0,1,0,0 => 9 (1*0 + 3*0 + 9*1)
  • 0,0,0,1,0 => 27 (1*0 + 3*0 + 9*0 +27*1)
  • 1,2,1,2,1 => 151 (1*1 + 3*2 + 9*1 + 27*2 + 81*1)
We multiply out the number by it's base. Element 0 * 3^0, element 1 * 3^1, element 2 * 3^2, element 3 * 3^3, and element 4 * 3^4. Now we can store 5 tribools in 8 bits because we are actually storing a number from 0 to 242 in 8 bits, but that number can be decoded into 5 different 3 option values. Decoding is handled by taking modulus and then dividing to remove that part of the encoded value.
  • 151 % 3 = 1
  • 151 / 3 => 50
  • 50 % 3 = 2
  • 50 / 3 = 16
  • 16 % 3 = 1
  • 16 / 3 = 5
  • 5 % 3 = 2
  • 5 / 3 => 1
  • 1 % 3 = 1
Decoding is just like figuring out the time of day from the number of seconds elapsed.

Probability of a value

Apart from knowing the range, we can also reduce the storage used by using our knowledge of how likely a thing is. This is known as probabilistic compression, where the chance of a value being selected affects how many bits it uses. The one you learn about first is Huffman coding, but an easier to swallow example might be useful.
Let's say we have a game about sewing seeds in a garden full of plots, and letting them grow. If I am trying to compress the type of seed planted in a plot, but most of the plots are empty, I can say that the most common type of plant is zero, the NULL plant. I can encode the plot data with only one bit for whether there is a plant, and then the actual plant type if there is. If we assume 8 types of plant, then:
  • no plant => (b0)
  • plant of type 4 (b1100) (bit 1 set because there is a plant, then bits 2-4 set as the plant type.
If the garden is empty, then the compressed sequence takes as many bits as there are plots. If all the plots are full, then the compressed sequence takes as many bits as there are plots more than a naive solution (actually, if there are 8 types of plant, then it's actually the same for a naive solution, as 8 types of plant means 9 options, which would require another bit as the lowest power of two greater than or equal to 9 is 4)

It's possible to encode like this automatically, generating the probability of each option, then generating a binary tree which defines how you encode and decode values. This is Huffman coding, and is very useful when you have no domain knowledge.

Compressing to less than 1 bit

If you have probability information, Huffman coding can be great, but it's not quite as efficient as it can be for the same reason why we found that we could save storage by changing the base when compressing the tribools. Huffman must encode at least one bit for even the most common data. As an example of how we can do better, let's compress a garden with an additional bit of information: whether there is anything in a whole row of plots.
  • no plants in row => (b00)
  • no plant in plot => (b01)
  • plant of type 4 => (b1100)
What happens here is that the save takes up two bits per row rather than 1 bit per plot. That might be a significant saving. You can probably think of many other ways to mark larger areas of empty data too. Gridding it out at a lower resolution, then setting bits for all cells that contain any data, for example. With domain knowledge comes much better compression. Huffman coding isn't readily adaptable to wildly biased data, but it can be if you give it more symbols to work with. Therein lies the tradeoff. If you have to send a large palette of symbols, does the palette cost more to send than the data? Run length encoding can get down to less than one bit per byte on some data because runs of 128 bytes can be encoded in a single byte, or runs of millions can be encoded in 32 bit RLE encoders.

Domain knowledge

Really powerful compression comes from understanding the probabilities and rules of data being compressed. If you have a look at the lossy compression algorithms, this becomes more easy to understand. In lossy compression, you have domain knowledge of what humans can and regularly perceive. If the computer is able to analyse the data and find the information and only compress that, then it can reduce a very large raw file down to something significantly smaller.

Let's compress our garden again. This time, we know that the garden cannot have any plant directly next to another plant.
  • no plant in plot => (b0)
  • plant of type 4 => (b1100)
  • plot where plan cannot exist because of following rules and relying on already encoded data => (b)
Yes, zero bits. Zero bits to encode the fact that a cell is empty when it's not possible to have a plant in it. It makes sense because the actual fact that the cell is empty was already encoded by the plot encoding a plant type. It's not really zero bits, it's 1/5th of a bit really as the plot with a plant in it defines the state for 5 cells. (the one following, and the three on the next row.

There are often many places where rules give you information for free, you just need to look out for them.
Post a Comment