© A. Tarpai

Source code for this scaled integer IDCT can be found on GitHub, as part of an ITU T.81 JPEG decoder I have written.

Using any of these scaling vectors the number of multiplications can be reduced from 11 to 6 in a modified 1-D transform based on the LLM method:

**v** = (1, η, β, γη, 1, γη, α, η)
**v** = (1, δ, β, γδ, 1, γδ, α, δ)
**v** = (1, θ, β, γθ, 1, γθ, α, θ)
**v** = (1, ε, β, γε, 1, γε, α, ε)

The motivation for these algorithms was to implement an acceptable and fast 8x8 integer IDCT for my JPEG and video decoders, without borrowing code. The focus is therefore on the inverse transform.

The inverse DCT is the very last stage of the image decoding process, which reconstructs the original image sample values. After coefficient decoding each value of the block is multiplied by the quantization table values (*de-quantization*). The transform itself is quite computation intensive, the 2-D formula is an iteration of 8x8x8x8 = 4096 multiplications with irrational numbers, based on the equation

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Q-block . . . . . . . . IDCT . . . . . . . . . . . . . . . . ---------> . . . . . . . . ---------> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . coeff block de-quantized block image samples

By clever *pre-multiplication* (scaling) of the 8x8 input block before the IDCT, some calculations can be eliminated, others gets simpler by using a modified IDCT equation. Scaling is only efficient when it's performed together with the de-quantization step, that is the Q-block is already multiplied by the scaling block before the transform, the scaling matrix *SQ = S x Q*:

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . simple . . . . . . . . . . . . . . . . x SQ-block . . . . . . . . IDCT . . . . . . . . . . . . . . . . ---------> . . . . . . . . ---------> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . coeff block de-quantized image samples and scaled block

The 2-D IDCT above is tremendously slow to implement. Most of the fast DCT algorithms are based on the separability of the 2-D transform into successive 1-D row- and column transforms (or column- and row transform). The 8-point 1-D transform computes 8 outputs from 8 inputs based on and can be implemented as matrix multiplication. An iteration of 8x8 = 64 multiplications with irrational numbers. The transform is linear (this will be important later):

1-D transform X_{0}X_{1}X_{2}X_{3}X_{4}X_{5}X_{6}X_{7}---------------> x_{0}x_{1}x_{2}x_{3}x_{4}x_{5}x_{6}x_{7}

This transform function is applied 16 times for the 8x8 block as 8 row- and 8 column-transforms (16 x 64 = 1024 multiplication). Usually implementations use only one 1-D routine working on input and output arrays. In order to use the same function, the result is written transposed through a temp-block:

0 1 2 3 4 5 6 7 0 . . . . . . . 0 1 2 3 4 5 6 7 . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . 1D IDCT 2 . . . . . . . 1D IDCT . . . . . . . . . . . . . . . . ---------> 3 . . . . . . . ---------> . . . . . . . . . . . . . . . . row 4 . . . . . . . column . . . . . . . . . . . . . . . . 5 . . . . . . . . . . . . . . . . . . . . . . . 6 . . . . . . . . . . . . . . . . . . . . . . . 7 . . . . . . . . . . . . . . .

Among the endless number of fast algorithms I've been fascinated by Löffler's paper from 1989, its clean and simple solution. It gives a method the compute the 8-point 1-D transform with 11 multiplications:

Image source: Reznik, Hindsy, Zhangz, Yuz, and Ni, Efficient Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform

As you can see there are no multiplications needed for X_{0} and X_{4}. This is because LLM *uniformly scales* the orthonormal DCT matrix by √8, i.e. the DCT matrix's vectors are of lenght √8 (instead of 1 as for an orthonormal matrix). This trick makes v_{0} and v_{4} all having coordinates of 1. Applying this DCT matrix twice (row/column method) gives results multiplied by 8 = (√8)^{2} - which is easy to implement as right shift to get the correct values after the transform.

LLM uses these 7 irrational constants:

γ = √2 ≈ 1.414213562

α = √2cos(3π/8) ≈ 0.5411961

β = √2sin(3π/8) ≈ 1.306562965

η = cos(3π/16) ≈ 0.831469612

θ = sin(3π/16) ≈ 0.555570233

δ = cos(π/16) ≈ 0.98078528

ε = sin(π/16) ≈ 0.195090322

After rearranging the order of inputs 2 basic structures reveal:

- an adder (blue)
- and a butterfly-multiplier (red):

The adder takes 4 inputs (a, b, c and d) and computes 4 new outputs according to:

p = a + b n = a - b a' = p + d b' = n + c c' = n - c d' = p - d

The multiplier is the butterfly or rotation. It takes 2 inputs and 2 irrational constants. The constants are cos/sin pairs of α/β, η/θ or δ/ε; denoted here as **K/S**. Two outputs are computed according to the equation:

a' = Ka + Sb b' = Kb - Sa

The equation needs 4 multiplications to evaluate. As in the original paper, it can be reduced to 3 using intermediates. There are 6 possibilities:

c = K(a+b) c = S(a+b) a' = c - b(K-S) a' = a(K-S) + c b' = c - a(K+S) b' = b(K+S) - c c = K(a-b) c = S(a-b) a' = b(K+S) + c a' = a(K+S) - c b' = a(K-S) - c b' = b(K-S) - c c = K(b-a) c = S(b-a) a' = b(K+S) - c a' = a(K+S) + c b' = a(K-S) + c b' = b(K-S) + c

These are mathematically equivalent but chosing one of them have an impact when designing the scaling matrix and the integer IDCT implementation. The first column is *K-type*, where the 3 new constants are K, K+S and K-S, while the second column the *S-type*, there the 3 new constants are S, K+S and K-S. The adder and the multiplier can be implemented as macros (`XADD`

, `KROT/SROT`

).

First lets see how to eliminate the √2 multiplication of F_{3} and F_{5} inside a simplified LLM 1D-transform.

+-------------+ +----------+ F_{0}-|----> -|--> f_{0}F_{0}----|-> -|--> f_{0}F_{1}-|----> -|--> f_{1}F_{1}----|-> -|--> f_{1}F_{2}-|----> -|--> f_{2}F_{2}----|-> -|--> f_{2}F_{3}-|-√2-> -|--> f_{3}---> √2F_{3}----|-> -|--> f_{3}F_{4}-|----> -|--> f_{4}F_{4}----|-> -|--> f_{4}F_{5}-|-√2-> -|--> f_{5}√2F_{5}----|-> -|--> f_{5}F_{6}-|----> -|--> f_{6}F_{6}----|-> -|--> f_{6}F_{7}-|----> -|--> f_{7}F_{7}----|-> -|--> f_{7}+-------------+ +----------+ original modified

The solution uses scaled inputs: that is F_{3} and F_{5} entering the 1-D transform is already multiplied by √2. Because if the 2-D transform is successive row- the column transforms with transposed output (using the same function for both), understanding the scaling matrix is a little complicated. So lets go backwards:

1. The modified 1-D transform of the second block works correctly when F_{3} and F_{5} of each row is already √2-multiplied:

0 1 2 3 4 5 6 7 . . . + . + . . . . . . . . . . . . . + . + . . . . . . . . . . . . . + . + . . 8 x 1D . . . . . . . . -----> . . . + . + . . -----> . . . . . . . . . . . + . + . . tr.p . . . . . . . . . . . + . + . . . . . . . . . . . . . + . + . . . . . . . . . . . . . + . + . . . . . . . . . . output

2. These two columns are the result of the modified 1-D transform of row 3 and 5 of the previous block (transposed output). Because the transform is linear, obtaining all outputs scaled by √2 can be achieved by scaling all inputs by √2 before the transform, using √2f(**x**) = f(√2**x**), where **x** is the (row) vector:

0 1 2 3 4 5 6 7 0 . . . . . . . . . . . + . + . . . . . . . . . . 1 . . . . . . . . . . . + . + . . . . . . . . . . 2 . . . . . . . . 8 x 1D . . . + . + . . 8 x 1D . . . . . . . . 3 + + + + + + + + -----> . . . + . + . . -----> . . . . . . . . 4 . . . . . . . . . . . + . + . . . . . . . . . . 5 + + + + + + + + . . . + . + . . . . . . . . . . 6 . . . . . . . . . . . + . + . . . . . . . . . . 7 . . . . . . . . . . . + . + . . . . . . . . . . output

3. But the first block transform using the modified 1-D transform also expects F_{3} and F_{5} of each row multiplied by √2. In row 3 and 5, where all inputs are already scaled by √2, F_{3} and F_{5} will be *double scaled*, ie. (√2)² = 2, the intersection points marked by **#**:

0 1 2 3 4 5 6 7 0 . . . + . + . . . . . + . + . . . . . . . . . . 1 . . . + . + . . . . . + . + . . . . . . . . . . 2 . . . + . + . . 8 x 1D . . . + . + . . 8 x 1D . . . . . . . . 3 + + + # + # + + -----> . . . + . + . . -----> . . . . . . . . 4 . . . + . + . . . . . + . + . . . . . . . . . . 5 + + + # + # + + . . . + . + . . . . . . . . . . 6 . . . + . + . . . . . + . + . . . . . . . . . . 7 . . . + . + . . . . . + . + . . . . . . . . . . output

This gives our scaling matrix, **S**, to multiply values of the 8x8 coefficient block before applying the modified IDCT twice, first on rows then on colums, thus eliminating the √2 multiplication of F_{3} and F_{5} inside the modified LLM 1-D transform:

1 1 1 √2 1 √2 1 1 1 1 1 √2 1 √2 1 1 1 1 1 √2 1 √2 1 1 S = √2 √2 √2 2 √2 2 √2 √2 1 1 1 √2 1 √2 1 1 √2 √2 √2 2 √2 2 √2 √2 1 1 1 √2 1 √2 1 1 1 1 1 √2 1 √2 1 1

**S** is also the result of a matrix multiplication of the column vector **v** and its transposed vector **v**^{T}:

**S** = **v v**^{T},

where **v** = (1, 1, 1, √2, 1, √2, 1, 1):

| 1 1 1 √2 1 √2 1 1 (v^{T}) ---|--------------------------------------------- | 1 | 1 1 1 √2 1 √2 1 1 1 | 1 1 1 √2 1 √2 1 1 1 | 1 1 1 √2 1 √2 1 1 √2 | √2 √2 √2 2 √2 2 √2 √2 = S 1 | 1 1 1 √2 1 √2 1 1 √2 | √2 √2 √2 2 √2 2 √2 √2 1 | 1 1 1 √2 1 √2 1 1 1 | 1 1 1 √2 1 √2 1 1 (v) |

This essentially means that we have to focus only on finding a *good* scaling vector for a modified LLM transform, then we can simply obtain the final scaling matrix by computing **v v ^{T}** for the 2-D row/column method. Integrating this scaling matrix (

For X_{2} and X_{6} the even part starts with a butterfly multiplication. The 2 irrational constants (K and S) are:

K = √2cos(3π/8) S = √2sin(3π/8)

The outputs are computed according to the equation:

x_{6}= K X_{6}+ S X_{2}x_{2}= K X_{2}- S X_{6}

We can obtain x_{6} without multiplication inside a modified LLM transform when the inputs are scaled: let W_{6} = K X_{6} and W_{2} = S X_{2}. Then x_{6} is simply

x_{6}= W_{6}+ W_{2}

while for x_{2} we get:

x_{2}= K/S W_{2}- S/K W_{6}

The constants of K/S and S/K are trigonometric identities:

K/S = √2cos(3π/8) / √2sin(3π/8) = cot(3π/8) = √2 - 1 S/K = √2sin(3π/8) / √2cos(3π/8) = tan(3π/8) = √2 + 1

then

x_{2}= (√2 - 1) W_{2}- (√2 + 1) W_{6}= √2 W_{2}- W_{2}- √2 W_{6}- W_{6}= √2 ( W_{2}- W_{6}) - W_{2}- W_{6}

**That is, the scaled even part needs only one multiplication of √2.**

I discovered this accidently by looking at values in the Excel sheet I use to calculate factors. Indeed sin(3π/8) = ½√(2+√2) and cos(3π/8) = ½√(2-√2). Then K/S = √( (2-√2)/(2+√2) ) = √( (2-√2)²/(4-2) ) = √ ( (4-4√2+2) / 2 ) = √ ( 2-2√2+1 ) = √ ( (√2-1)² ) = √2-1. S/K is similar.

Now we can include α and β in the scaling vector **v**, and use simplified equations for computing the rotation for the even part:

**v** = (1, 1, β, γ, 1, γ, α, 1) = ( 1, 1, √2sin(3π/8), √2, 1, √2, √2cos(3π/8), 1 )

The odd-part starts with an adder: that is all 4 inputs must be equally scaled. That's a problem for simplifying the two butterfly multiplications for η/θ and δ/ε that follows. The task is to find a common factor, *r*, which may simplify these two butterfly multiplications. When all 4 inputs of X_{1}, X_{7}, X_{5} and X_{3} are pre-scaled by *r*, then the equations become:

x_{1} = η/r X_{1} + θ/r X_{7}

x_{7} = η/r X_{7} - θ/r X_{1}

x_{5} = δ/r X_{5} + ε/r X_{3}

x_{3} = δ/r X_{3} - ε/r X_{5}

The only thing I could do is to chose one of the constants for pre-scaling - thus reducing the number of multiplications in one of the butterfly to 2. F. ex. when r = η we get:

x_{1} = X_{1} + θ/η X_{7}

x_{7} = X_{7} - θ/η X_{1}

x_{5} = δ/η X_{5} + ε/η X_{3}

x_{3} = δ/η X_{3} - ε/η X_{5}

Furthermore, using one of the 3-mul equations for x_{5}/x_{3}, the total number of multiplications will be 5. F. ex. using the S(b-a) intermediate:

c = ε/η (x_{3} - x_{5})

x_{5} = c + ( δ/η + ε/η ) X_{5}

x_{3} = c + ( δ/η - ε/η ) X_{3}

Including r = η in the scaling vector **v** for the odd part leads one of the the final solutions:

**v** = (1, η, β, γη, 1, γη, α, η)

and a modified 1-D LLM transform with 6 multiplications.

The last equations reveal something promising: δ + ε and δ - ε are the sum and difference of cos(x) and sin(x) of the same angle and are trigonometric identities! I was hoping these equations might help the IDCT calculation, somehow, especially in the scaled version. Not really. The complexity remains the same, but it is an interesting investigation:

The odd part *runs* on the π/16 line and there seem to be almost no relationship between sin/cos(π/16) and sin/cos(3π/16). Almost.

I read this somewhere:

sin(x) + cos(x) = √2 sin( x + π/4)

I always wonder about trigonometric identities, how come is this true? It's pretty hard to see this with geometry, drawing triangles, so lets plot these 3 functions using Fooplot for example:

**Plotting sin(x) + cos(x):**

The fig above shows sin(x), cos(x) and sin(x)+cos(x). Ok, periodic, same as sin/cos on 2π. Shifted on the x axis by +/- π/4 compared to sin and cos. The maximum is at π/4, where both sin(π/4) = cos(π/4) = √2/2, ergo max = √2.

So sin(x) + cos(x) is something like:

sin(x) + cos(x) = √2 sin(x + π/4) or

sin(x) + cos(x) = √2 cos(x - π/4) or

sin(x) + cos(x) = √2 sin(x - 7π/4) or

sin(x) + cos(x) = √2 cos(x + 7π/4) or

sin(x) + cos(x) = √2 cos(-x + π/4).. and so on.

Why is this interesting? Because in the LLM transform both δ/ε (π/16) and η/θ (3π/16) are multiplies of π/16 angles. Adding π/4 to π/16 is 5π/16, which is *symmetric* to 3π/16, sin(3π/16)=cos(5π/16) and cos(3π/16)=sin(5π/16). Now using the above equations we can state the following identity for the sum of δ + ε:

**δ + ε** = cos(π/16) + sin(π/16) = √2 sin(π/16 + π/4) = √2 sin(5π/16) = √2 cos(3π/16) = **√2 η**

and similarly for the sum of η + θ:

**η + θ** = cos(3π/16) + sin(3π/16) = √2 sin(3π/16 + π/4) = √2 sin(7π/16) = √2 cos(π/16) = **√2 δ**

**Plotting cos(x) - sin(x):**

For the other multiplier in the 3-mul butterfly version K-S appears, lets make identities for these two in regard of the constants of the LLM transform.

cos(x) - sin(x) = √2 sin(x + 3π/4)

cos(x) - sin(x) = √2 cos(x + π/4)

cos(x) - sin(x) = √2 sin(x - 5π/4)

cos(x) - sin(x) = √2 cos(x - 7π/4)

Now using the above we can state the following:

**δ - ε** = cos(π/16) - sin(π/16) = √2 cos(π/16 + π/4) = √2 cos(5π/16) = √2 sin(3π/16) = **√2 θ**

and

**η - θ** = cos(3π/16) - sin(3π/16) = √2 cos(3π/16 + π/4) = √2 cos(7π/16) = √2 sin(π/16) = **√2 ε**

Giving som useful identities for the LLM transform:

** δ + ε = √2 η **
** δ - ε = √2 θ **
** ε - δ = -√2 θ **

and

** η + θ = √2 δ **
** η - θ = √2 ε **
** θ - η = -√2 ε **

Again, I just could't find a better solution even with these identities for the scaled LLM transform.

Using scaled coefficients before the IDCT can be a significant improvement in speed.

The idea is that scaling happens once for the whole 8x8 block together with dequantization (both are multiplications), simplifying computations for the *modified* 1D transform function. This is called a scaled-quantization table.

For the integer version, only an extra post-scale stage is required in the form of right shift:

Scaled dequantization --> modified IDCT --> right shift

When using integer arithmetics, a good approximation of **v** is used, where **V** = INT (2^{n} **v**). After the second 1D-transform, all values are shifted right by *n*:

- pre-scale (multiplication)
- row-transform
- column-transform
- post-scale (right shift)

And an example integer matrix for scaled x3 and x5 by √2. Here n=7 and √2 × 128 = 181,019336 seems like a good integer approximation:

128 128 128 181 128 181 128 128 128 128 128 181 128 181 128 128 128 128 128 181 128 181 128 128 181 181 181 256 181 256 181 181 128 128 128 181 128 181 128 128 181 181 181 256 181 256 181 181 128 128 128 181 128 181 128 128 128 128 128 181 128 181 128 128

As for the modified LLM transform the final post-scale will be 7 + 3 = 10.

- ^ LLM 1989 (C. Loeffler, A. Ligtenberg, and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications

Fri Jun 21 11:15:00 UTC+0200 2019 © A. Tarpai