Huffman coding

© 2015 Attila Tarpai (tarpai76 gmail)

As in JPEG and PNG Deflate.

Intro

Huffman coding assigns variable-bit-length codes to symbols (see prefix-free codes).

The more frequent a symbol is in the original data, the shorter its code (thus achieving compression).

Symbols represent the original data and come from an Alphabet. E.g. bytes: an alphabet of 8-bit values consists of 256 symbols, 0 to 255.

The Decoder needs a 'table' of the Symbol-Length-Code assignments in some form, for example:

Symbol  Length  Code
------  ------  ----
 A       3       010
 B       3       011
 C       3       100
 D       3       101
 E       3       110
 F       2        00
 G       4      1110
 H       4      1111     example from RFC1951

Using the same table, encoders and decoders can compress and decompress the same data. This also means that the Symbol-Length-Code table is part of the compressed data:

             encoding
          -------------->
symbol                       Huffman code
          <--------------
             decoding

Huffman codes as binary trees

This is because of binary computer storage: Huffman codes in this world are zeroes and ones (bits). We can draw a binary tree and place the coded symbols on its leaves. Because Huffman codes are prefix-free, it is guaranteed that no symbol sits 'under' another (no code is the prefix, i.e. the beginning, of another code).

Here is the binary tree for the RFC1951 example above. Huffman codes are in parentheses. An edge represents a new bit, 0 or 1, which is appended to the code 'from the right'. The new code is either a symbol (leaf) or a branch, in which case we read another bit:

                               /                    \                                              
                              /                      \                                             
                            (0)                      (1)                                           
                           /   \                   /     \                                          
                          /     \                 /       \                                         
                         /       \               /         \                                        
                        /         \             /           \                                       
                      (00)        (01)        (10)          (11)                                     
                       F          / \         /  \          / \                                     
                                 /   \       /    \        /   \                                     
                                /     \     /      \      /     \                                    
                              (010) (011) (100) (101)  (110)  (111)    
                               A      B     C     D      E     /  \   
                                                              /    \  
                                                             /      \
                                                          (1110)  (1111)                      
                                                            G        H

During decoding we walk this tree based on bits read from the compressed data to get to a symbol.

There are two issues here: how to carry the Symbol-Length-Code table compactly as part of the compressed data, and how to use Huffman tables efficiently during encoding/decoding.

Huffman table definition

Symbol-Length-Code assignments will be part of the compressed data. To keep that compact, there are rules for how codes are assigned to symbols:

Rule #1: shorter codes are assigned before longer codes, and within one code length the codes simply count upwards (canonical codes).

Rule #2: symbols with the same code length get their codes in alphabet order.

With these rules the codes follow from the lengths alone, so the table can be carried cheaply as part of the compressed data:

As in the example above - for those 8 symbols - the encoder has decided that 'F' is the most frequent symbol in the original data and gives it a 2-bit-length code: (00). A, B, C, D and E seemed equally frequent; they get 3-bit-length codes starting from (010), etc. If 'F' got a 1-bit code, (0), that symbol would compress better, but the rest of the symbols would have to share longer codes. So there is a fine balance on the encoder side in how symbol-code assignments are made.

On the decoder side, given those rules, the only information needed is the number of symbols per code length and the symbols themselves. The JPEG and PNG Deflate Huffman table definitions (HTD) are a little different, but the decoder (at least mine) builds the same structure for bit-by-bit decoding.

JPEG Huffman table definition

= an array of the number of symbols per code length, from 1 up to 16 (the max code length in JPEG), plus the symbols themselves (JPEG doesn't need rule #2: the symbols are listed explicitly, in assignment order)

For the RFC1951 example above this would be the JPEG-style HTD:

0, 1, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

F, A, B, C, D, E, G, H

Building the Symbol-Length-Code table:

Deflate Huffman table definition

= an array of code-length for every symbol

For the RFC1951 example above this would be the PNG Deflate style HTD:

3, 3, 3, 3, 3, 2, 4, 4

In order to make this work, Deflate has to pre-define the alphabet. For the example above this is ABCDEFGH, 8 symbols, which gives 8 code-length values in the HTD. (For the real Deflate, the alphabet is bytes: 256 code-length values for 0 up to 255.)

Building the Symbol-Length-Code table is a little more work - that is the price we pay for this simple HTD:


Eventually this also builds the original Symbol-Length-Code table.

Huffman decoding

This is one possible implementation of a Huffman decoder, when reading the compressed data bit-by-bit.

It's based on the JPEG style HTD above, where an array of N (number of symbols per code-length) and the array of S (symbol values in packed form) are given.

Grouping symbols by code-length:

N:     0, | 1, | 5,             | 2,    | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
          |    |                |       | 
S:        | F, | A, B, C, D, E, | G, H  | 
          |    |                |       |      
Sidx:     | 0  | 1  2  3  4  5  | 6  7  |      

The goal is to find Sidx quickly reading the input bits.

Let's investigate the Huffman binary tree for this example, extended with the corresponding symbols and Sidx:

                               /                    \                                              
                              /                      \                                             
                            (0)                      (1)                                           
                           /   \                   /     \                                          
                          /     \                 /       \                                         
                         /       \               /         \                                        
                        /         \             /           \                                       
 Huffman code --->    (00)        (01)        (10)          (11)    
 symbol --------->     F          / \         /  \          / \                                     
 Sidx ----------->    [0]        / 1 \       / 2  \        / 3 \                                     
                                /     \     /      \      /     \                                    
                              (010) (011) (100) (101)  (110)  (111)    
                               A      B     C     D      E     /  \   
                              [1]    [2]   [3]   [4]    [5]   / 6  \  
                                                             /      \
                                                          (1110)  (1111)  
                                                            G        H
                                                           [6]      [7] 

The idea is to use Sidx itself as the code during Huffman decoding. On each level: if the code collected so far is less than B (the running count of symbols with codes of this length or shorter), we are at a leaf and the code is already the Sidx; otherwise read another bit, append it to the code and subtract B, then step down one level.

This routine needs only the B-values, computed from the N-array as a running sum:

  N    B
 ---  -----
  0    0
  1    1
  5    6           B[i]= B[i-1] + N[i]
  2    8
  0    8
  0    8
  0    8
  ..   ..

It doesn't return the Huffman-coded symbol itself, only Sidx. The symbol is at S[Sidx]:

int ReadHuffman(int *b)
{
    int code = Get1bit();                    /* start with the first bit */
    while ( code >= *b )                     /* not a leaf on this level: */
        code = 2*code + Get1bit() - *b++;    /* append next bit, rebase, descend */
    return code;                             /* code is now Sidx */
}

calling the function in code: 

	...
	symbol= S[ReadHuffman(B)];

I've been using this method in both my JPEG and PNG decoders. It hasn't broken yet..