Introduction to Data Compression – Huffman Coding

Posted by Jaran on 2011/08/14 Leave a comment (1) Go to comments

Despite unprecedented storage capacity and Internet bandwidth available to everyone the amount of information growing daily as well means data compression is a must. Without it mobile music players would only be able to store a tenth of the songs they do, the same factor of about ten goes for digital cameras and HD television would be mostly impossible at all.

I hope to be able to describe how data compression actually works. However, the topic is somewhat complex, especially when coming to audio, image and video compression. While those topics will be discussed in articles (hopefully) to come I want to start with one of the most simple methods to compress data: Huffman coding.

Let’s consider the following sentence as something we want to compress: “Brevity is the soul of wit”
What does compression mean in regard to the example anyway? Usually a single character takes one byte of memory. As such the above sentence needs 26 bytes (21 letters and 5 spaces). Is there any way to store it in less than 26 bytes or 208 bits? The most simple solution would be to use less than all eight bits which compose one byte to store a single character. After all a byte, that is eight bits, can represent 256 ( $2^8=256$ ) different values while our alphabet has only 26 letters. Differentiating between upper and lower case that makes 52 and even if we add punctuation marks and a space symbol to separate words (and neglect numerical digits) we’re below 64 different characters which would need only 6 bits to encode (since $2^6=64$ ). The 26 characters of our example times 6 bits makes a total of 156 bits.
Basically what has happened here is that the normal way of storing characters in a computer, i.e. using 8 bits per character, can be substituted with one that uses less, in our example 6 bits per character.

This cuts down memory usage to 3/4s but this isn’t the end. Instead of trying to encode all possible characters (actually we already omitted numerals digits in the thoughts above) we could use a special encoding scheme only for the characters of our example sentence. If we take a closer look at it we can count that is only uses 14 different letters and the space symbol, giving a total of 15 different characters to encode. With just 4 bits we can encode 16 different characters, enough for our example and that would give us a total of 26 times 4 bits, which is 104 bits in total and thus a saving of 50%!

Can we go even further? Why use the same amount of bits for every character? Some characters, like the capital B appear only once in our sentence and other characters, like the i appear more often. If one could use less than four bits for letters like the i then maybe we can save a few bits more even if that makes uncommon letters like the capital B a bit more “expensive”. This is basically the idea of entropy coding of which Huffman coding is a simple to understand and very effective method. Before coming to Huffman coding though which David. A. Huffman published in 1952 we take a quick look at a coding scheme invented a century before: Morse code

Morse code already does what is suggested in the last paragraph: it uses less symbols to encode common letters at the cost of encoding uncommon letters. For example, the letter e is encoded by a single short pulse and a t by a single long pulse. Less common letters like the q use 4 pulses, which also is the maximum amount of pulses needed to encode a letter in Morse code. Since there are only two different symbols in Morse code, a long and a short pulse, can’t we just replace them by 0 and 1 and thus use at most 4 bits for a letter and far less for common letters? Unfortunately not: Morse code might consist of only two different audible pulses but it also needs pauses between each letter which translates to a third (inaudible) symbol used in transmission. A single bit however can only distinguish between two different values or pulses but not between three. Can’t we just omit the pauses giving us a stream of long and short pulses, of zeroes and ones? Again the answer is no because without pauses Morse code can not be decoded properly: for example the letter e is a single short pulse, the letter i is two short pulses and the letter s is three short pulses. Without a pause between two letters you can’t be sure if a sequence of three short pulses is actually the letter s or the two letters e and i. And that is not the only example in Morse code but one example too many to make it usable to us. In short: we don’t want a code in which a symbol is encoded into a sequence which also is the beginning of the encoded sequence of a different symbol. Beginnings of sequences are called prefixes. What we need is a prefix-free code.

Huffman Coding

In 1952 David A. Huffman presented an encoding scheme that combines the features we identified and needed above: use fewer bits for common characters (entropy encoding) and be prefix-free (a necessity for decoding). It works by creating a special “encoding alphabet” for the data to be compressed. That is, for each (different) letter or symbol in the sequence to compress it calculates a sequence of bits in such a way that if instead of the letters the bit sequences are used the resulting sequence of bits is the shortest possible (in any scheme that substitutes letter by letter).

It works by building what computer scientists call a binary tree. Basically a binary tree has one root which splits into two child nodes which themselves can be split into two child nodes and so on. If a node has no further “children” it is called a leaf.

diagram showing a simple tree

The little image to the right shows a simple tree in which node A is the root that splits into the inner node B and the leaf E while B splits into the two leafs C and D.

In Huffman coding the tree will be constructed in such a way that the (distinct) characters to be encoded will be the leaves of the tree. This is nice in two ways:

Beginning from the root, each step we go down to a leaf is either to the left or to the right (because it is a binary tree). Instead of left and right we can say 0 and 1 and thus give every letter a unique bit sequence that encodes it.
Since all the letters are leaves the path down from the root ends at a letter and doesn’t go further. This ensures the prefix-free property since now whenever a leaf is reached we automatically now that the next bit in our encoded sequence is the first bit of the next encoded character.

Construction of a Huffman tree

The tree will be constructed “upwards” from its leaves (in computer science the root is at the top, the leaves are at the bottom). For our example sentence the leaves are the letters “B”, “r”, “e”, “v”, “i”, “t”, “y”, “_”, “s”, “h”, “o”, “u”, “l”, “f”, “w”. Instead of the space character I will use the underscore (“_”) to make it more distinguishable.

Since Huffman coding optimizes the code length for more frequent characters the Huffman-algorithm does need to know about the frequency of the different letters. For our example we will just write the number of occurrences of each letter into our tree-nodes, together with the letter itself. Also, for reasons that become more clear soon, we will start with those characters that appear least often and order them by the number of their occurrences:

(Click the image for a larger version)

Now the algorithm works like this:

Select the two (parent-less) nodes with the lowest (total) character occurrences and create a new node that becomes the parent of those two nodes. Add the character occurrences and use it for the new node.
Go back to step one and continue until all nodes are making up a single tree with a single root

So what are the two nodes with the lowest total character occurrences? Obviously there are 9 nodes with “weight” 1 (I’ll be using “weight” instead of character occurrence from now on). We can put any two of them under a new node but for beauty’s sake we start with the leftmost two and unite the nodes “B” and “r” under a new node “B,r” with weight 2. We do that also for “v” and “y”, “h” and “l”, “u” and “f” and get the following graph. (I call it a graph because it isn’t a tree yet: a tree only got a single root but all the unconnected nodes are roots of themselves making up their own tree. A collection of unconnected trees is actually called… a forrest of course! But you can call everything with nodes and connections in between (which are called vertices) a graph as well).

(Click the image for a larger version)

Of the (parent-less) nodes we got only one left with weight 1. We can connect it to any node with weight 2 (since 2 is the next highest weight). We could choose one of our original node (and leaves) “e”, “s,”, “o” but we could also take any of our newly created weight-2 nodes. Our choice does of course alter the final Huffman tree but it won’t change the outcome in terms of encoded length (believe me or try to find a counter-example!). Again, for beauty’s and simplicities sake I will choose the next node in line and unite “w” with “e” to create a new weight-3 node “w,e”.
“s” and “o” are the only leaves left with weight 2 and although I could unite any of them with one of our newly created weight-2 nodes I rather join those two (because I know it will make the final tree look better).
Now the only leaves left are those with weight 3 and above (“i”, “t” and “_”) but we still got all of our four newly created weight-2 nodes left. Since the algorithm demands to first join those nodes with lowest weight we’ll just do so and join “B,r” with “v,y” and “h,l” with “u,f” creating two new weight-4 nodes “B,r,v,y” and “h,l,u,f” as the following graph shows:

(Click the image for a larger version)

This continues over and over again: always choose two nodes with lowest weight and unite them under a new node. Often there is more than one combination of nodes that can be combined (as already seen above), for example we now have three nodes with weight 3 left (“w,e”, “i” and “t”) and could combine any two of them. Re-iterating this procedure of joining (parent-less) least-weight-nodes until they are all united under a single root node you end up with a tree like this (or a different one, since whenever there is more than one way to join nodes you got free choice!):

(Click the image for a larger version)

The naming of the nodes does actually only matter for the leaves. The most important property is the weight. So it doesn’t really matter that the root node is undecipherable but you may notice that its weight of 26 is exactly the number of letters in our example sentence. This should be no surprise as we added up the occurrences of the letters when we went our way up the tree.

Using the Huffman tree to derive a minimal code

Now that we got the tree it’s easy to derive a binary encoding for each character. We’ll just start at the root node and whenever we go down to the left we add a 0 and whenever we go right we add a 1. When finally a leaf is reached we got its code as a sequence of 0’s and 1’s.
The following final tree with annotated bits shows the result of our efforts:
(Click the image for a larger version)

As you can see quite a few letters do now have a 5-bit sequence making them longer than our pre-Huffman attempts with 4 bits per character! However, those long codes are only used for letters that occur least often and the letters that make up most of the example sentence are shorter: the space or “_” only uses 2 bits!

In terms of the total encoded length we now get the following result:
The 8 5-bit codes occur only once each giving us a total of 40 bits. Of the 4-bit codes only the one for “w” occurs once, the three others occur twice each giving us another $4 + 3\cdot2\cdot4 = 28$ bits bringing the sum up to 68 bits. The two 3-bit codes occur 3 times each adding another 18 bits for a running total of 86. The space-character with its 2-bit code occurs 5 times, adding 10 bits for a final result of 96 bits. This is another improvement compared to the 104 bits we needed before! Saving those additional 8 bits might not sound like being worth the effort but for longer texts (or better examples) the savings add up!

The pros and the cons of Huffman coding

Huffman coding is one of the most simple compressing encoding schemes and can be implemented easily and efficiently. It also has the advantage of not being patented like other methods (e.g. arithmetic coding for example) which however are superior to Huffman coding in terms of resulting code length.

One thing not mentioned so far shall not be kept secret however: to decode our 96 bit of “brief wit” the potential receiver of the bit sequence does need the codes for all letters! In fact he doesn’t even know which letters are encoded at all! Adding this information, which is also called the “Huffman table” might use up more space than the original uncompressed sentence!
However: for longer texts the savings outweigh the added Huffman table length. One can also agree on a Huffman table to use that isn’t optimized for the exact text to be transmitted but is good in general. In the English language for example the letters “e” and “t” occur most often while “q” and “z” make up the least part of an average text and one can agree on one Huffman table to use that on average produces a good (=short) result. Once agreed upon it doesn’t have to be transmitted with every encoded text again.

One last thing to remember is that Huffman coding is not restricted to letters and text: it can be used for just any symbols, numbers or “abstract things” that can be assigned a bit sequence to. As such Huffman coding plays an important role in other compression algorithms like JPG compression for photos and MP3 for audio files.