Huffman coding

© 2015 A. Tarpai (tarpai76 gmail)

As used in JPEG and in PNG's Deflate.

Intro

Huffman coding assigns variable-bit-length codes to symbols (see prefix-free codes).

The more frequent the symbol is in the original data the shorter code it gets (thus achieving compression).

Symbols represent the original data and come from an Alphabet. E.g. bytes: an alphabet of 8-bit values consists of the 256 symbols 0 to 255.

The Decoder needs a 'table' of Symbol-Length-Code assignments in some form, for example:

```
Symbol  Length  Code
------  ------  ----
A       3       010
B       3       011
C       3       100
D       3       101
E       3       110
F       2        00
G       4      1110
H       4      1111     (example from RFC 1951)
```

Using the same table, encoders and decoders can compress and decompress the same data. This also means that the Symbol-Length-Code table is part of the compressed data:

```
             encoding
         -------------->
symbol                    Huffman code
         <--------------
             decoding
```

Huffman codes as binary trees

This view follows from binary computer storage: Huffman codes in this world are zeroes and ones (bits). We can draw a tree and place the coded symbols on its leaves. Because Huffman codes are prefix-free, it is guaranteed that no symbol sits 'under' another (meaning no code is the prefix, the beginning, of another code).

Here is the binary tree for the RFC 1951 example above; Huffman codes are in parentheses. An edge represents a new bit, 0 or 1, which is added to the code 'from the right'. The new code is either a symbol (leaf) or a branch, in which case we read another bit:

```
                      /                      \
                     /                        \
                   (0)                        (1)
                  /   \                      /   \
                 /     \                    /     \
                /       \                  /       \
             (00)       (01)            (10)      (11)
              F         /  \            /  \      /  \
                       /    \          /    \    /    \
                   (010)  (011)    (100) (101) (110) (111)
                     A      B        C     D     E    /  \
                                                     /    \
                                                 (1110)  (1111)
                                                    G       H
```

During decoding we walk this tree based on bits read from the compressed data to get to a symbol.

There are two issues here:

• how to define
• how to use

Huffman tables efficiently during encoding/decoding?

Huffman table definition

Symbol-Length-Code assignments will be part of the compressed data. To keep this efficient, there are rules for how codes and symbols are assigned, which make them cheap to carry as part of the compressed data:

• codes of equal bit-length are assigned consecutive values, populated from the left
• symbols of equal bit-length are assigned in their natural order (this rule is for Deflate)

As in the example above: for those 8 symbols the encoder has decided that 'F' is the most frequent symbol in the original data and gives it a 2-bit code: (00). A, B, C, D and E seemed equally frequent, so they get 3-bit codes starting from (010), etc. If 'F' got a 1-bit code, (0), that symbol would compress better, but the rest of the symbols would have to share longer codes. So there is a fine balance on the encoder side in how symbol-code assignments are made.

On the decoder side, given those rules, the only information needed is the number of symbols per code-length and the symbols themselves. The JPEG and PNG Deflate HTDs differ a little, but the decoder (at least mine) builds the same structure for bit-by-bit decoding.

JPEG Huffman table definition

= an array of the number of symbols per code-length, from 1 up to 16 (the max code length in JPEG), followed by the symbols themselves (JPEG doesn't need rule #2)

For the RFC1951 example above this would be the JPEG-style HTD:

```
0, 1, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

F, A, B, C, D, E, G, H
```

Building the Symbol-Length-Code table:

• there are no codes of 1-bit length
• there is 1 code of 2-bit length, that is 'F'; it gets the first available 2-bit code:
• F: (00)
• there are 5 codes of 3-bit length; the first available 3-bit code is (010), so
• A: (010)
• B: (011)
• C: (100)
• D: (101)
• E: (110)
• there are 2 codes of 4-bit length; the first available 4-bit code is (1110):
• G: (1110)
• H: (1111)
• no more codes/symbols (the rest is all zero)

Deflate Huffman table definition

= an array of code-lengths, one for every symbol

For the RFC1951 example above this would be the PNG Deflate style HTD:

```
3, 3, 3, 3, 3, 2, 4, 4
```

In order to make this work, Deflate has to pre-define the alphabet. For the example above this is ABCDEFGH, 8 symbols, which gives 8 code-length values in the HTD. (For the real Deflate, the alphabet is bytes: 256 code-length values for 0 up to 255.)

Building the Symbol-Length-Code table is a little more work; this is the price we pay for the simpler HTD:

• assume there are symbols of 1-bit length:
• search the array for 1
• none
• assume there are symbols of 2-bit length:
• search the array for 2
• found one at position [5]
• this is 'F'
• F: (00) gets the first available 2-bit code
• search the array for 2 further
• none
• assume there are symbols of 3-bit length:
• search the array for 3
• found one at position [0]
• this is 'A'
• A: (010) gets the first available 3-bit code
• search the array for 3 further
• found another one at position [1]
• this is 'B'
• ... etc.
• ...
• ...
• ...
• assume there are symbols of 15-bit length:
• search the array for 15
• ...
• done (the Deflate max code length is 15)

Eventually this also builds the original Symbol-Length-Code table.

Huffman decoding

This is one possible implementation of a Huffman decoder, when reading the compressed data bit-by-bit.

It's based on the JPEG-style HTD above, where an array N (the number of symbols per code-length) and an array S (the symbol values in packed form) are given.

Grouping symbols by code-length:

```
N:     0, | 1, | 5,             | 2,    | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
          |    |                |       |
S:        | F, | A, B, C, D, E, | G, H  |
          |    |                |       |
Sidx:     | 0  | 1  2  3  4  5  | 6  7  |
```

The goal is to find Sidx quickly while reading the input bits.

Let's investigate the Huffman binary tree for this example, extending it with the corresponding symbols and Sidx:

```
                          /                      \
                         /                        \
                       (0)                        (1)
                      /   \                      /   \
                     /     \                    /     \
                    /       \                  /       \
Huffman code --->  (00)     (01)            (10)      (11)
symbol --------->   F       /  \            /  \      /  \
Sidx ----------->  [0]     / 1  \          / 2  \    / 3  \
                        (010)  (011)    (100) (101) (110) (111)
                          A      B        C     D     E    /  \
                         [1]    [2]      [3]   [4]   [5]  / 6  \
                                                         /      \
                                                     (1110)  (1111)
                                                        G       H
                                                       [6]     [7]
```

The idea is to use Sidx itself as the code during Huffman decoding. On each level:

• compare code against B[i], where B[i] = B[i-1] + N[i] (the cumulative sum of N: base indices)
• when code < B[i] we have Sidx (the codes in square brackets in the tree)
• otherwise (the codes without square brackets) read another bit: code = 2*code + Get1bit()
• and 'move' to the next level: code -= B[i]

This routine needs only the B-values, computed from the N-array as a running sum:

```
 N     B
---   ---
 0     0
 1     1
 5     6        B[i] = B[i-1] + N[i]
 2     8
 0     8
 0     8
 0     8
..    ..
```

It doesn't return the Huffman coded symbol, only Sidx. The symbol itself is at S[Sidx]:

```
int ReadHuffman(int *b)          /* b: the B-array (running sum of N) */
{
    int code = Get1bit();
    /* as long as code is not a valid Sidx on this level, shift in
       another bit and move to the next level: 2*code + bit - B[i] */
    while ( code >= *b ) code = 2*code + Get1bit() - *b++;
    return code;                 /* Sidx: the symbol itself is S[code] */
}
```

calling the function in code:

...