
Scaled LLM[1] 1-D IDCT with 6 multiplications

© A. Tarpai

Source code for this scaled integer IDCT can be found on BitBucket, as part of an ITU T.81 JPEG decoder I have written.

Results

Using any of these scaling vectors, the number of multiplications can be reduced from 11 to 6 in a modified 1-D transform based on the LLM method:

v = (1, η, β, γη, 1, γη, α, η)
v = (1, δ, β, γδ, 1, γδ, α, δ)
v = (1, θ, β, γθ, 1, γθ, α, θ)
v = (1, ε, β, γε, 1, γε, α, ε)

The motivation for these algorithms was to implement an acceptable and fast 8x8 integer IDCT for my JPEG and video decoders, without borrowing code. The focus is therefore on the inverse transform.

Principles of scaled IDCT

The inverse DCT is the very last stage of the image decoding process: it reconstructs the original image sample values. After coefficient decoding, each value of the block is multiplied by the corresponding quantization table value (de-quantization). The transform itself is quite computation-intensive; the 2-D formula is an iteration of 8x8x8x8 = 4096 multiplications with irrational numbers, based on the equation

s(y,x) = ¼ Σu Σv C(u) C(v) F(v,u) cos((2x+1)uπ/16) cos((2y+1)vπ/16),     x, y, u, v = 0..7

where C(0) = 1/√2 and C(k) = 1 for k > 0:

.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .     x Q-block     .  .  .  .  .  .  .  .       IDCT        .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
    coeff block                            de-quantized block                          image samples
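
For reference, a direct and deliberately naive implementation of this equation would look something like the sketch below (4096 multiply iterations); the rest of the article is about avoiding exactly this:

#include <math.h>

/* Reference 2-D IDCT, computed directly from the equation above:
   four nested loops, 8x8x8x8 = 4096 multiply iterations. */
void idct_2d_reference(const double F[8][8], double s[8][8])
{
    const double pi = acos(-1.0);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            double sum = 0.0;
            for (int v = 0; v < 8; v++)
                for (int u = 0; u < 8; u++) {
                    double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    sum += cu * cv * F[v][u]
                         * cos((2 * x + 1) * u * pi / 16)
                         * cos((2 * y + 1) * v * pi / 16);
                }
            s[y][x] = sum / 4.0;
        }
}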

By clever pre-multiplication (scaling) of the 8x8 input block before the IDCT, some calculations can be eliminated and others get simpler by using a modified IDCT equation. Scaling is only efficient when it is performed together with the de-quantization step, that is, the Q-block is multiplied by the scaling block once before the transform, giving the combined matrix SQ = S x Q:

.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .      simple       .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    x SQ-block     .  .  .  .  .  .  .  .       IDCT        .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
    coeff block                            de-quantized                                image samples
                                              and scaled block

1-D transform

The 2-D IDCT above is tremendously slow to compute directly. Most of the fast DCT algorithms are based on the separability of the 2-D transform into successive 1-D row and column transforms (or column and row transforms). The 8-point 1-D transform computes 8 outputs from 8 inputs based on

x(n) = ½ Σk C(k) X(k) cos((2n+1)kπ/16),     n, k = 0..7, C(0) = 1/√2, C(k) = 1 otherwise

and can be implemented as a matrix multiplication: an iteration of 8x8 = 64 multiplications with irrational numbers. The transform is linear (this will be important later):

                                   1-D transform
X0  X1  X2  X3  X4  X5  X6  X7     --------------->    x0  x1  x2  x3  x4  x5  x6  x7

This transform function is applied 16 times per 8x8 block, as 8 row and 8 column transforms (16 x 64 = 1024 multiplications). Implementations usually use only one 1-D routine working on input and output arrays. In order to use the same function for both passes, the result is written transposed through a temporary block:

0  1  2  3  4  5  6  7                   0  .  .  .  .  .  .  .                   0  1  2  3  4  5  6  7 
.  .  .  .  .  .  .  .                   1  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .     1D IDCT       2  .  .  .  .  .  .  .     1D IDCT       .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     3  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .       row         4  .  .  .  .  .  .  .     column        .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   5  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   6  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   7  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
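
A sketch of this arrangement, assuming a hypothetical idct_1d(in, out, stride) routine that reads 8 consecutive inputs and writes its 8 outputs stride elements apart, so that a stride of 8 stores the result transposed:

/* idct_1d is a stand-in for the 8-point 1-D transform discussed below:
   it reads in[0..7] and writes out[0], out[stride], ..., out[7*stride]. */
void idct_1d(const int in[8], int *out, int stride);

void idct_2d(int blk[64])
{
    int tmp[64];
    for (int i = 0; i < 8; i++)
        idct_1d(&blk[8 * i], &tmp[i], 8);   /* 8 row transforms, transposed into tmp */
    for (int i = 0; i < 8; i++)
        idct_1d(&tmp[8 * i], &blk[i], 8);   /* 8 column transforms, transposed back  */
}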

Fast 1-D DCT: the LLM method

Among the endless number of fast algorithms, I have been fascinated by Loeffler's paper from 1989 and its clean and simple solution. It gives a method to compute the 8-point 1-D transform with 11 multiplications:


[Figure: LLM flow graph of the 8-point 1-D IDCT with 11 multiplications. Image source: Reznik, Hinds, Zhang, Yu, and Ni, Efficient Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform]

As you can see, no multiplications are needed for X0 and X4. This is because LLM uniformly scales the orthonormal DCT matrix by √8, i.e. the DCT matrix's basis vectors have length √8 (instead of 1 as for an orthonormal matrix). This trick makes all coordinates of v0 and v4 equal to 1. Applying this DCT matrix twice (row/column method) gives results multiplied by 8 = (√8)², which is easy to undo with a right shift after the transform to get the correct values.

LLM uses these 7 irrational constants:

γ = √2 ≈ 1.414213562

α = √2cos(3π/8) ≈ 0.5411961
β = √2sin(3π/8) ≈ 1.306562965

η = cos(3π/16) ≈ 0.831469612
θ = sin(3π/16) ≈ 0.555570233

δ = cos(π/16) ≈ 0.98078528
ε = sin(π/16) ≈ 0.195090322

After rearranging the order of the inputs, two basic structures are revealed:

The adder takes 4 inputs (a, b, c and d) and computes 4 new outputs according to:

p  = a + b
n  = a - b
a' = p + d
b' = n + c
c' = n - c
d' = p - d

The multiplier is the butterfly or rotation. It takes 2 inputs and 2 irrational constants. The constants are the cos/sin pairs α/β, η/θ or δ/ε, denoted here as K/S. The two outputs are computed according to the equation:

a' = Ka + Sb
b' = Kb - Sa

Evaluated directly, the equation needs 4 multiplications. As in the original paper, this can be reduced to 3 by using an intermediate product. There are 6 possibilities:

c  = K(a+b)                 c  = S(a+b)       
a' = c - b(K-S)             a' = a(K-S) + c 
b' = c - a(K+S)             b' = b(K+S) - c 
                                            
c  = K(a-b)                 c  = S(a-b)       
a' = b(K+S) + c             a' = a(K+S) - c 
b' = a(K-S) - c             b' = b(K-S) - c 
                                            
c  = K(b-a)                 c  = S(b-a)       
a' = b(K+S) - c             a' = a(K+S) + c 
b' = a(K-S) + c             b' = b(K-S) + c 

These are mathematically equivalent, but choosing one of them has an impact when designing the scaling matrix and the integer IDCT implementation. The first column is the K-type, where the 3 new constants are K, K+S and K-S, while the second column is the S-type, where the 3 new constants are S, K+S and K-S. The adder and the multiplier can be implemented as macros (XADD, KROT/SROT).
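
As a sketch, the two building blocks could be written as C macros along these lines (the names follow the article, but the exact form in the actual source may differ; fixed-point details such as pre-scaled constants and rounding shifts are omitted):

/* XADD: the 4-input adder, updating a, b, c, d in place. */
#define XADD(a, b, c, d) do {           \
    int p = (a) + (b), n = (a) - (b);   \
    (a) = p + (d);                      \
    (b) = n + (c);                      \
    (c) = n - (c);                      \
    (d) = p - (d);                      \
} while (0)

/* KROT: K-type rotation, c = K(a+b), a' = c - b(K-S), b' = c - a(K+S). */
#define KROT(a, b, K, KpS, KmS) do {    \
    int t = (K) * ((a) + (b));          \
    int a1 = t - (b) * (KmS);           \
    int b1 = t - (a) * (KpS);           \
    (a) = a1; (b) = b1;                 \
} while (0)

/* SROT: S-type rotation, c = S(a+b), a' = a(K-S) + c, b' = b(K+S) - c. */
#define SROT(a, b, S, KpS, KmS) do {    \
    int t = (S) * ((a) + (b));          \
    int a1 = (a) * (KmS) + t;           \
    int b1 = (b) * (KpS) - t;           \
    (a) = a1; (b) = b1;                 \
} while (0)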

How the scaling matrix works

First let's see how to eliminate the √2 multiplication of F3 and F5 inside a simplified LLM 1-D transform.

    +-------------+                                     +----------+         
F0 -|---->       -|-->  f0                       F0 ----|->       -|-->  f0    
F1 -|---->       -|-->  f1                       F1 ----|->       -|-->  f1    
F2 -|---->       -|-->  f2                       F2 ----|->       -|-->  f2    
F3 -|-√2->       -|-->  f3   --->              √2F3 ----|->       -|-->  f3
F4 -|---->       -|-->  f4                       F4 ----|->       -|-->  f4    
F5 -|-√2->       -|-->  f5                     √2F5 ----|->       -|-->  f5
F6 -|---->       -|-->  f6                       F6 ----|->       -|-->  f6    
F7 -|---->       -|-->  f7                       F7 ----|->       -|-->  f7    
    +-------------+                                     +----------+         
        original                                          modified

The solution uses scaled inputs: that is, F3 and F5 entering the 1-D transform are already multiplied by √2. Because the 2-D transform is a succession of row and column transforms with transposed output (using the same function for both), understanding the scaling matrix is a little complicated. So let's go backwards:

1. The modified 1-D transform of the second block works correctly when F3 and F5 of each row are already multiplied by √2:

                                                                 0  1  2  3  4  5  6  7
                                                                 
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                                                         ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .    tr.p   .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

2. These two columns are the result of the modified 1-D transform of rows 3 and 5 of the previous block (transposed output). Because the transform is linear, obtaining all outputs scaled by √2 can be achieved by scaling all inputs by √2 before the transform, using √2·f(x) = f(√2·x), where x is the (row) vector:

                                                                 0  1  2  3  4  5  6  7
                                                                   
                           0    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           1    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           2    .  .  .  .  .  .  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                           3    +  +  +  +  +  +  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                           4    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           5    +  +  +  +  +  +  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           6    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           7    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

3. But the first block transform using the modified 1-D transform also expects F3 and F5 of each row multiplied by √2. In rows 3 and 5, where all inputs are already scaled by √2, F3 and F5 will be scaled twice, i.e. by (√2)² = 2; these intersection points are marked by #:

                                                                 0  1  2  3  4  5  6  7
                                                                   
                           0    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           1    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           2    .  .  .  +  .  +  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                           3    +  +  +  #  +  #  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                           4    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           5    +  +  +  #  +  #  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           6    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           7    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

This gives our scaling matrix, S, used to multiply the values of the 8x8 coefficient block before applying the modified IDCT twice, first on rows then on columns, thus eliminating the √2 multiplication of F3 and F5 inside the modified LLM 1-D transform:

          1   1   1  √2   1  √2   1   1
          1   1   1  √2   1  √2   1   1
          1   1   1  √2   1  √2   1   1
   S =   √2  √2  √2   2  √2   2  √2  √2
          1   1   1  √2   1  √2   1   1
         √2  √2  √2   2  √2   2  √2  √2
          1   1   1  √2   1  √2   1   1
          1   1   1  √2   1  √2   1   1

S is also the result of a matrix multiplication of the column vector v and its transposed vector vT:

S = v vT,

where v = (1, 1, 1, √2, 1, √2, 1, 1):

      |    1   1   1  √2   1  √2   1   1     (vT)
   ---|---------------------------------------------          
      |                                          
   1  |    1   1   1  √2   1  √2   1   1                 
   1  |    1   1   1  √2   1  √2   1   1                 
   1  |    1   1   1  √2   1  √2   1   1               
  √2  |   √2  √2  √2   2  √2   2  √2  √2       =  S
   1  |    1   1   1  √2   1  √2   1   1                 
  √2  |   √2  √2  √2   2  √2   2  √2  √2                 
   1  |    1   1   1  √2   1  √2   1   1                 
   1  |    1   1   1  √2   1  √2   1   1                 
  (v) |                                          

This essentially means that we only have to focus on finding a good scaling vector for a modified LLM transform; then we can simply obtain the final scaling matrix by computing v vT for the 2-D row/column method. Integrating this scaling matrix (S) into the quantization matrix (Q) by multiplying each value pair yields the (real) SQ matrix - and a faster 1-D transform.
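
As a sketch (names and the floating-point representation are illustrative only), the combined table can be built once per quantization table:

/* Build the combined scale-and-quantize table:
   SQ[r][c] = v[r] * v[c] * Q[r][c] = (v vT)[r][c] * Q[r][c]. */
void build_sq(double SQ[8][8], const double v[8], const unsigned short Q[64])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            SQ[r][c] = v[r] * v[c] * Q[8 * r + c];
}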

Scaling the even part butterfly

For X2 and X6 the even part starts with a butterfly multiplication. The 2 irrational constants (K and S) are:

K = √2cos(3π/8)
S = √2sin(3π/8)

The outputs are computed according to the equation:

x6 = K X6 + S X2 
x2 = K X2 - S X6 

We can obtain x6 without multiplication inside a modified LLM transform when the inputs are scaled: let W6 = K X6 and W2 = S X2. Then x6 is simply

x6 = W6 + W2

while for x2 we get:

x2 = K/S W2 - S/K W6

The constants K/S and S/K simplify through trigonometric identities:

K/S = √2cos(3π/8) / √2sin(3π/8) = cot(3π/8) = √2 - 1
S/K = √2sin(3π/8) / √2cos(3π/8) = tan(3π/8) = √2 + 1

then

x2 = (√2 - 1) W2  - (√2 + 1) W6  = √2 W2 - W2 - √2 W6 - W6  = √2 ( W2 - W6 ) - W2 - W6

That is, the scaled even part needs only one multiplication of √2.

I discovered this accidentally by looking at values in the Excel sheet I use to calculate factors. Indeed sin(3π/8) = ½√(2+√2) and cos(3π/8) = ½√(2-√2). Then K/S = √( (2-√2)/(2+√2) ) = √( (2-√2)²/(4-2) ) = √( (4-4√2+2)/2 ) = √( 2-2√2+1 ) = √( (√2-1)² ) = √2-1. S/K is similar.

Now we can include α and β in the scaling vector v, and use simplified equations for computing the rotation for the even part:

v = (1, 1, β, γ, 1, γ, α, 1) = ( 1, 1, √2sin(3π/8), √2, 1, √2, √2cos(3π/8), 1 )
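
As a small sketch (function and variable names are illustrative, not the actual source), the scaled even-part rotation then costs a single multiplication:

/* Scaled even-part rotation for X2/X6, assuming de-quantization already
   delivered w2 = β·X2 and w6 = α·X6. */
static void even_rotation_scaled(double w2, double w6, double *x2, double *x6)
{
    const double sqrt2 = 1.414213562;        /* γ */
    *x6 = w2 + w6;                           /* no multiplication needed    */
    *x2 = sqrt2 * (w2 - w6) - w2 - w6;       /* the single √2 multiplication */
}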

Scaling the odd part butterfly

The odd part starts with an adder: that is, all 4 inputs must be equally scaled. That's a problem for simplifying the two butterfly multiplications for η/θ and δ/ε that follow. The task is to find a common factor, r, which may simplify these two butterfly multiplications. When all 4 inputs X1, X7, X5 and X3 are pre-scaled by r, the equations become:

x1 = η/r X1 + θ/r X7
x7 = η/r X7 - θ/r X1

x5 = δ/r X5 + ε/r X3
x3 = δ/r X3 - ε/r X5

The only thing I could do is to choose one of the constants for pre-scaling - thus reducing the number of multiplications in one of the butterflies to 2. For example, when r = η we get:

x1 = X1 + θ/η X7
x7 = X7 - θ/η X1

x5 = δ/η X5 + ε/η X3
x3 = δ/η X3 - ε/η X5

Furthermore, using one of the 3-mul equations for x5/x3, the total number of multiplications in the odd part will be 5. For example, using the S(b-a) intermediate:

c = ε/η (X3 - X5)
x5 = c + ( δ/η + ε/η ) X5
x3 = c + ( δ/η - ε/η ) X3
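
A sketch of the scaled odd-part rotations under these assumptions (the ratio constants would of course be pre-computed; names are illustrative):

/* Scaled odd-part rotations, assuming X1, X7, X5, X3 were all pre-scaled
   by η during de-quantization. */
static void odd_rotations_scaled(double X1, double X7, double X5, double X3,
                                 double *x1, double *x7, double *x5, double *x3)
{
    const double t_e   = 0.555570233 / 0.831469612;                   /* θ/η     */
    const double e_e   = 0.195090322 / 0.831469612;                   /* ε/η     */
    const double dpe_e = (0.980785280 + 0.195090322) / 0.831469612;   /* (δ+ε)/η */
    const double dme_e = (0.980785280 - 0.195090322) / 0.831469612;   /* (δ-ε)/η */

    *x1 = X1 + t_e * X7;               /* 2 multiplications for the η/θ butterfly */
    *x7 = X7 - t_e * X1;

    double c = e_e * (X3 - X5);        /* 3 multiplications for the δ/ε butterfly */
    *x5 = c + dpe_e * X5;
    *x3 = c + dme_e * X3;
}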

Including r = η in the scaling vector v for the odd part leads to one of the final solutions:

v = (1, η, β, γη, 1, γη, α, η)

and a modified 1-D LLM transform with 6 multiplications.

Detour

The last equations reveal something promising: δ + ε and δ - ε are the sum and difference of cos(x) and sin(x) of the same angle, for which there are trigonometric identities! I was hoping these equations might help the IDCT calculation somehow, especially in the scaled version. Not really. The complexity remains the same, but it is an interesting investigation:

The odd part runs on the π/16 line, and there seems to be almost no relationship between sin/cos(π/16) and sin/cos(3π/16). Almost.

I read this somewhere:

sin(x) + cos(x) = √2 sin( x + π/4)

I always wonder about trigonometric identities: how come this is true? It's pretty hard to see with geometry, drawing triangles, so let's plot these 3 functions, using Fooplot for example:

Plotting sin(x) + cos(x):

The plot shows sin(x), cos(x) and sin(x)+cos(x). OK: it is periodic, with the same 2π period as sin/cos, shifted on the x axis by ±π/4 compared to sin and cos. The maximum is at π/4, where sin(π/4) = cos(π/4) = √2/2, ergo max = √2.

So sin(x) + cos(x) is something like:

sin(x) + cos(x) = √2 sin(x + π/4) or

sin(x) + cos(x) = √2 cos(x - π/4) or

sin(x) + cos(x) = √2 sin(x - 7π/4) or

sin(x) + cos(x) = √2 cos(x + 7π/4) or

sin(x) + cos(x) = √2 cos(-x + π/4).. and so on.

Why is this interesting? Because in the LLM transform both δ/ε (π/16) and η/θ (3π/16) are at multiples of the π/16 angle. Adding π/4 to π/16 gives 5π/16, which is the complement of 3π/16 (they sum to π/2), so sin(3π/16) = cos(5π/16) and cos(3π/16) = sin(5π/16). Now using the above equations we can state the following identity for the sum δ + ε:

δ + ε = cos(π/16) + sin(π/16) = √2 sin(π/16 + π/4) = √2 sin(5π/16) = √2 cos(3π/16) = √2 η

and similarly for the sum of η + θ:

η + θ = cos(3π/16) + sin(3π/16) = √2 sin(3π/16 + π/4) = √2 sin(7π/16) = √2 cos(π/16) = √2 δ

Plotting cos(x) - sin(x):

The other multiplier appearing in the 3-mul butterfly versions is K-S, so let's derive identities for the differences as well, in terms of the constants of the LLM transform.

cos(x) - sin(x) = √2 sin(x + 3π/4)

cos(x) - sin(x) = √2 cos(x + π/4)

cos(x) - sin(x) = √2 sin(x - 5π/4)

cos(x) - sin(x) = √2 cos(x - 7π/4)

Now using the above we can state the following:

δ - ε = cos(π/16) - sin(π/16) = √2 cos(π/16 + π/4) = √2 cos(5π/16) = √2 sin(3π/16) = √2 θ

and

η - θ = cos(3π/16) - sin(3π/16) = √2 cos(3π/16 + π/4) = √2 cos(7π/16) = √2 sin(π/16) = √2 ε

This gives some useful identities for the LLM transform:

δ + ε = √2 η
δ - ε = √2 θ
ε - δ = -√2 θ

and

η + θ = √2 δ
η - θ = √2 ε
θ - η = -√2 ε
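
Just as a quick numeric sanity check of these identities (a throwaway snippet, nothing IDCT-specific):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = acos(-1.0);
    const double d = cos(pi / 16),     e = sin(pi / 16);      /* δ, ε */
    const double h = cos(3 * pi / 16), t = sin(3 * pi / 16);  /* η, θ */
    printf("%f %f\n", d + e, sqrt(2.0) * h);   /* δ + ε  vs  √2 η */
    printf("%f %f\n", d - e, sqrt(2.0) * t);   /* δ - ε  vs  √2 θ */
    printf("%f %f\n", h + t, sqrt(2.0) * d);   /* η + θ  vs  √2 δ */
    printf("%f %f\n", h - t, sqrt(2.0) * e);   /* η - θ  vs  √2 ε */
    return 0;
}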

Again, I just couldn't find a better solution for the scaled LLM transform, even with these identities.

Summary

Using scaled coefficients before the IDCT can significantly improve speed.

The idea is that scaling happens once for the whole 8x8 block together with dequantization (both are multiplications), simplifying computations for the modified 1D transform function. This is called a scaled-quantization table.

For the integer version, only an extra post-scale stage is required, in the form of a right shift:

Scaled dequantization --> modified IDCT --> right shift

Using integer CPU arithmetic

When using integer arithmetic, a good integer approximation of v is used: V = INT(2ⁿ · v). After the second 1-D transform, all values are shifted right by n.

Here is an example integer matrix for X3 and X5 scaled by √2. With n = 7, √2 × 128 = 181.019336, so 181 is a good integer approximation:

128	128	128	181	128	181	128	128
128	128	128	181	128	181	128	128
128	128	128	181	128	181	128	128
181	181	181	256	181	256	181	181
128	128	128	181	128	181	128	128
181	181	181	256	181	256	181	181
128	128	128	181	128	181	128	128
128	128	128	181	128	181	128	128

For the modified LLM transform the final post-scale shift will be 7 + 3 = 10 (the extra 3 undoes the 8 = (√8)² factor of the LLM scaling).
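
A minimal sketch of the resulting integer pipeline with n = 7 (all names are hypothetical stand-ins, not the actual BitBucket code):

/* idct8_rows / idct8_cols stand for the modified LLM 1-D passes. */
void idct8_rows(int blk[64]);
void idct8_cols(int blk[64]);

#define POST_SHIFT (7 + 3)   /* n = 7 plus 3 to undo the LLM (√8)² = 8 factor */

void idct_block(const short coef[64], const int SQ[64], unsigned char out[64])
{
    int blk[64];
    for (int i = 0; i < 64; i++)
        blk[i] = coef[i] * SQ[i];       /* de-quantize and scale in one multiply */

    idct8_rows(blk);
    idct8_cols(blk);

    for (int i = 0; i < 64; i++) {
        int v = (blk[i] + (1 << (POST_SHIFT - 1))) >> POST_SHIFT;  /* rounding shift */
        v += 128;                       /* JPEG level shift */
        out[i] = v < 0 ? 0 : v > 255 ? 255 : (unsigned char)v;
    }
}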

Notes:

  1. LLM 1989: C. Loeffler, A. Ligtenberg, and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, Proc. ICASSP 1989.

Tue Jan 1 20:56:38 UTC+0100 2019 © A. Tarpai