© 2015 Attila Tarpai (tarpai76 gmail)

As in JPEG and PNG Deflate.

Huffman coding assigns variable-bit-length codes to symbols (see prefix-free codes).

The more frequent a symbol is in the original data, the shorter the code it gets (thus achieving compression).

Symbols represent the original data and come from an alphabet. E.g. bytes: an **alphabet** of 8-bit values consists of the **symbols** 0 to 255.

The decoder needs a 'table' of Symbol-Length-Code assignments in some form, for example:

```
Symbol  Length  Code
------  ------  ----
  A       3     010
  B       3     011
  C       3     100
  D       3     101
  E       3     110
  F       2     00
  G       4     1110
  H       4     1111
```

(example from RFC 1951)

Using the same table encoders/decoders can compress/decompress the same data. This also means that the Symbol-Length-Code table is part of the compressed data:

```
           encoding
        -------------->
symbol                  Huffman code
        <--------------
           decoding
```

This is because of binary computer storage: Huffman codes in this world are zeroes and ones (bits). We can draw a tree and place the coded symbols on the leaves. Because Huffman codes are prefix-free, it's guaranteed that no symbol sits 'under' another (meaning no code is the prefix, the *beginning*, of another code).

Here is the binary tree for the RFC 1951 example above. Huffman codes are in parentheses. An edge represents a new bit, 0 or 1, which is added to the code 'from the right'. The new code is either a symbol (a leaf) or a branch, in which case we read another bit:

```
                              / \
                             /   \
                           (0)   (1)
                           / \   / \
                          /   \ /   \
                      (00)  (01) (10)  (11)
                       F    / \  / \   / \
                  (010)(011)(100)(101)(110)(111)
                    A    B    C    D    E   / \
                                       (1110) (1111)
                                          G      H
```

During decoding we walk this tree based on bits read from the compressed data to get to a symbol.

There are two issues here: how to *define* and how to *use* Huffman tables efficiently during encoding/decoding.

Symbol-Length-Code assignments will be part of the compressed data. To keep this efficient, there are rules for how codes and symbols are assigned, which make it easier to carry them as part of the compressed data:

- codes of equal bit-length populate consecutive code values *from the left*
- symbols with codes of equal bit-length are assigned in their natural order (this rule is for Deflate)

As in the example above - for those 8 symbols - the encoder has decided that 'F' is the most frequent symbol in the original data and gives it a 2-bit-length code: (00). A, B, C, D and E seemed equally frequent; they get 3-bit-length codes starting from (010), etc. If 'F' got a 1-bit code, (0), that symbol would compress better, but the rest of the symbols would have to share longer codes. So there is a fine balance on the encoder side in how symbol-code assignments are made.

On the decoder side, keeping those *rules*, the only information needed is the number of symbols per code-length and the symbols themselves. The JPEG and the PNG Deflate HTD (Huffman table definition) differ a little from each other, but the decoder (at least mine) builds the same structure for bit-by-bit decoding.

= an array of the number of symbols per code-length, starting from 1 up to 16 (the max code length in JPEG), plus the symbols themselves (JPEG doesn't need rule #2)

For the RFC1951 example above this would be the JPEG-style HTD:

```
0, 1, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
F, A, B, C, D, E, G, H
```

Building the Symbol-Length-Code table:

- there are no codes of 1-bit length
- there is 1 code of 2-bit length, that is 'F'; it gets the first available 2-bit-length code:
    - F: (00)
- there are 5 codes of 3-bit length; the first available 3-bit-length code is (010), so
    - A: (010)
    - B: (011)
    - C: (100)
    - D: (101)
    - E: (110)
- there are 2 codes of 4-bit length; the first available 4-bit-length code is (1110):
    - G: (1110)
    - H: (1111)
- no more codes/symbols (the rest is all zero)

= an array of code-lengths, one for every symbol

For the RFC1951 example above this would be the PNG Deflate style HTD:

```
3, 3, 3, 3, 3, 2, 4, 4
```

In order to make this work, Deflate has to pre-define the alphabet. For the example above this is ABCDEFGH, 8 symbols, giving 8 code-length values in the HTD. (For the *real* Deflate, the alphabet is bytes: 256 code-length values for 0 up to 255.)

Building the Symbol-Length-Code table is a little more work - this is the price we pay for the simpler HTD:

- assume there are symbols of 1-bit length:
    - search the array for 1
    - none

- assume there are symbols of 2-bit length:
    - search the array for 2
    - found one at position [5]
    - this is 'F'
    - F: (00) gets the first available 2-bit-length code
    - search the array for 2 further
    - none

- assume there are symbols of 3-bit length:
    - search the array for 3
    - found one at position [0]
    - this is 'A'
    - A: (010) gets the first available 3-bit-length code
    - search the array for 3 further
    - found another one at position [1]
    - this is 'B'
    - ... etc.

- ...

- assume there are symbols of 15-bit length:
    - search the array for 15
    - ...

- done (the Deflate max code length is 15)

Eventually this also builds the original Symbol-Length-Code table.

This is one possible implementation of a Huffman decoder, when reading the compressed data bit-by-bit.

It's based on the JPEG-style HTD above, where an array N (number of symbols per code-length) and an array S (symbol values in packed form) are given.

Grouping symbols by code-length:

```
N:    0, | 1, | 5,             | 2,   | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
         |    |                |      |
S:       | F, | A, B, C, D, E, | G, H |
         |    |                |      |
Sidx:    | 0  | 1  2  3  4  5  | 6  7 |
```

The goal is to find Sidx quickly reading the input bits.

Let's investigate the Huffman binary tree for this example, extending it with the corresponding symbols and Sidx:

```
                                       / \
                                      /   \
                                    (0)   (1)
                                    / \   / \
                                   /   \ /   \
Huffman code --->               (00)  (01) (10)  (11)
symbol --------->                F    / \  / \   / \
Sidx ----------->               [0]  /1  \/2  \ /3  \
                           (010)(011)(100)(101)(110)(111)
                             A    B    C    D    E   / \
                            [1]  [2]  [3]  [4]  [5] /6  \
                                               (1110) (1111)
                                                  G      H
                                                 [6]    [7]
```

The idea is to use Sidx itself as the code during Huffman decoding. On each level:

- compare code against B[i], where B[i] = B[i-1] + N[i] (the cumulative sum of N, the *base* indices)
- when code < B[i], we have Sidx (a code in square brackets)
- otherwise (a code without square brackets) read another bit: code = 2*code + Get1bit()
- and 'move' to the next level: code -= B[i]

This routine needs only the B-values computed from the N-array as running sum:

```
 N      B
---    ---
 0      0
 1      1
 5      6      B[i] = B[i-1] + N[i]
 2      8
 0      8
 0      8
 0      8
..     ..
```

It doesn't return the Huffman coded symbol, only Sidx. The symbol itself is at S[Sidx]:

```c
int ReadHuffman(int *b)
{
    int code = Get1bit();
    while (code >= *b)
        code = 2*code + Get1bit() - *b++;
    return code;
}
```

calling the function in code:

```c
...
symbol = S[ReadHuffman(B)];
```

I've been using this method in both my JPEG and PNG decoders. It hasn't broken yet..