
Scaled LLM[1] 1-D IDCT with 6 multiplications

© A. Tarpai

Source code for this scaled integer IDCT can be found on BitBucket, as part of an ITU T.81 test JPEG decoder I have written.

Results

Using any of these scaling vectors the number of multiplications can be reduced to 6 in a modified 1-D transform based on the LLM method:

v = (1, η, β, γη, 1, γη, α, η)
v = (1, δ, β, γδ, 1, γδ, α, δ)
v = (1, θ, β, γθ, 1, γθ, α, θ)
v = (1, ε, β, γε, 1, γε, α, ε)

The motivation for these algorithms was to implement an acceptable and fast 8x8 integer IDCT for my JPEG and video decoders, without borrowing code. The focus is therefore on the inverse transform.

Principles of scaled IDCT

The inverse DCT is the very last stage of the image decoding process; it reconstructs the original image sample values. After coefficient decoding, each value of the block is multiplied by the corresponding quantization table value (de-quantization). The transform itself is quite computation intensive: the 2-D formula is an iteration of 8x8x8x8 = 4096 multiplications with irrational numbers, based on the equation

f(x,y) = 1/4 Σu Σv C(u) C(v) F(u,v) cos( (2x+1)uπ/16 ) cos( (2y+1)vπ/16 ),   u,v = 0..7,  where C(0) = 1/√2 and C(k) = 1 otherwise:

.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .     x Q-block     .  .  .  .  .  .  .  .       IDCT        .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
    coeff block                            de-quantized block                          image samples
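
For reference, a direct (deliberately naive) implementation of the de-quantization step and the equation above could look like the sketch below; the function name and the types are mine, not taken from the decoder on BitBucket, and a real decoder would never compute it this way:

#include <math.h>

#define PI 3.14159265358979323846

/* De-quantization followed by the direct 2-D IDCT equation:
   8x8x8x8 = 4096 multiplications per block (plus the cosines). */
void idct_2d_direct(const int coeff[8][8], const int Q[8][8], double sample[8][8])
{
    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++) {
            double s = 0.0;
            for (int u = 0; u < 8; u++)
                for (int v = 0; v < 8; v++) {
                    double Cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    double Cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    double F  = (double)(coeff[u][v] * Q[u][v]);   /* de-quantization */
                    s += Cu * Cv * F * cos((2 * x + 1) * u * PI / 16.0)
                                     * cos((2 * y + 1) * v * PI / 16.0);
                }
            sample[x][y] = s / 4.0;
        }
}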

By clever pre-multiplication (scaling) of the 8x8 input block before the IDCT, some calculations can be eliminated and others get simpler in a modified IDCT equation. Scaling is only efficient when it is performed together with the de-quantization step, that is, the Q-block is already multiplied by the scaling block before the transform, giving the combined matrix SQ = S x Q:

.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .      simple      .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    x SQ-block     .  .  .  .  .  .  .  .       IDCT        .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
    coeff block                            de-quantized                                image samples
                                              and scaled block

1-D transform

The 2-D IDCT above is tremendously slow to implement directly. Most of the fast DCT algorithms are based on the separability of the 2-D transform into successive 1-D row- and column transforms (or column- and row transforms). The 8-point 1-D transform computes 8 outputs from 8 inputs based on

x(n) = 1/2 Σu C(u) X(u) cos( (2n+1)uπ/16 ),   u = 0..7,

and can be implemented as a matrix multiplication: an iteration of 8x8 = 64 multiplications with irrational numbers:

                                    1-D transform
X0  X1  X2  X3  X4  X5  X6  X7     --------------->    x0  x1  x2  x3  x4  x5  x6  x7

This transform function is applied 16 times for the 8x8 block, as 8 row- and 8 column-transforms (16 x 64 = 1024 multiplications). Implementations usually use only one 1-D routine working on input and output arrays. In order to use the same function for both passes, the result is written transposed through a temp-block (a code sketch follows the diagram):

0  1  2  3  4  5  6  7                   0  .  .  .  .  .  .  .                   0  1  2  3  4  5  6  7 
.  .  .  .  .  .  .  .                   1  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .     1D IDCT       2  .  .  .  .  .  .  .     1D IDCT       .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     3  .  .  .  .  .  .  .    --------->     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .       row         4  .  .  .  .  .  .  .     column        .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   5  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   6  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   7  .  .  .  .  .  .  .                   .  .  .  .  .  .  .  .
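
A sketch of this organization, with the 1-D routine written in its plain 64-multiplication matrix form and the result written transposed through the temp block (function names and signatures are my own, not from the repository):

#include <math.h>

/* 8-point 1-D IDCT: reads src[0..7], writes dst[0], dst[stride], dst[2*stride], ... */
static void idct_1d_8(double *dst, int stride, const double *src)
{
    const double PI = 3.14159265358979323846;
    for (int n = 0; n < 8; n++) {
        double s = 0.0;
        for (int u = 0; u < 8; u++) {
            double Cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            s += Cu * src[u] * cos((2 * n + 1) * u * PI / 16.0);
        }
        dst[n * stride] = s / 2.0;
    }
}

/* in and out are 8x8 blocks in row-major order */
void idct_2d_separable(const double in[64], double out[64])
{
    double tmp[64];
    for (int i = 0; i < 8; i++)            /* row pass: row i -> column i of tmp   */
        idct_1d_8(&tmp[i], 8, &in[i * 8]);
    for (int i = 0; i < 8; i++)            /* column pass: row i of tmp -> column i of out */
        idct_1d_8(&out[i], 8, &tmp[i * 8]);
}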

Fast 1-D DCT: the LLM method

Among the endless number of fast algorithms, I've been fascinated by Löffler's paper from 1989 and its clean and simple solution. It gives a method to compute the 8-point 1-D transform with 11 multiplications:


Image source: Reznik, Hinds, Zhang, Yu and Ni, Efficient Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform

LLM uses these 7 irrational constants:

γ = √2 ≈ 1.414213562

α = √2cos(3π/8) ≈ 0.5411961
β = √2sin(3π/8) ≈ 1.306562965

η = cos(3π/16) ≈ 0.831469612
θ = sin(3π/16) ≈ 0.555570233

δ = cos(π/16) ≈ 0.98078528
ε = sin(π/16) ≈ 0.195090322

After rearranging the order of the inputs, 2 basic structures reveal themselves:

The adder takes 4 inputs (a, b, c and d) and computes 4 new outputs according to:

p  = a + b
n  = a - b
a' = p + d
b' = n + c
c' = n - c
d' = p - d

The multiplier is the butterfly or rotation. It takes 2 inputs and 2 irrational constants. The constants are cos/sin pairs of α/β, η/θ or δ/ε; denoted here as K/S. Two outputs are computed according to the equation:

a' = Ka + Sb
b' = Kb - Sa

The equation needs 4 multiplications to evaluate. As in the original paper, it can be reduced to 3 using intermediates. There are 6 possibilities:

c  = K(a+b)                 c  = S(a+b)       
a' = c - b(K-S)             a' = a(K-S) + c 
b' = c - a(K+S)             b' = b(K+S) - c 
                                            
c  = K(a-b)                 c  = S(a-b)       
a' = b(K+S) + c             a' = a(K+S) - c 
b' = a(K-S) - c             b' = b(K-S) - c 
                                            
c  = K(b-a)                 c  = S(b-a)       
a' = b(K+S) - c             a' = a(K+S) + c 
b' = a(K-S) + c             b' = b(K-S) + c 

These are mathematically equivalent, but choosing one of them has an impact when designing the scaling matrix and the integer IDCT implementation. The first column is the K-type, where the 3 new constants are K, K+S and K-S, while the second column is the S-type, where the 3 new constants are S, K+S and K-S. The adder and the multiplier can be implemented as macros (XADD, KROT/SROT), as sketched below.
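
One possible shape of these macros in C, using the first K-type and S-type variants from the table above. The names XADD, KROT and SROT come from the text; the floating-point bodies are my own sketch, not the integer versions from the repository:

/* adder: 4 inputs (a,b,c,d), 4 outputs written back in place */
#define XADD(a, b, c, d) do {                  \
    double p_ = (a) + (b);                     \
    double n_ = (a) - (b);                     \
    (a) = p_ + (d);                            \
    (b) = n_ + (c);                            \
    (c) = n_ - (c);                            \
    (d) = p_ - (d);                            \
} while (0)

/* K-type 3-mul rotation: c = K(a+b), a' = c - b(K-S), b' = c - a(K+S) */
#define KROT(a, b, K, S) do {                  \
    double c_ = (K) * ((a) + (b));             \
    double a_ = c_ - (b) * ((K) - (S));        \
    double b_ = c_ - (a) * ((K) + (S));        \
    (a) = a_;  (b) = b_;                       \
} while (0)

/* S-type 3-mul rotation: c = S(a+b), a' = a(K-S) + c, b' = b(K+S) - c */
#define SROT(a, b, K, S) do {                  \
    double c_ = (S) * ((a) + (b));             \
    double a_ = (a) * ((K) - (S)) + c_;        \
    double b_ = (b) * ((K) + (S)) - c_;        \
    (a) = a_;  (b) = b_;                       \
} while (0)

In an integer implementation the factors K, K+S and K-S would of course be precomputed fixed-point constants rather than evaluated in the macro.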

How the scaling matrix works

First, let's see how to eliminate the √2 multiplications of F3 and F5 inside a simplified LLM 1-D transform.

    +-------------+                                     +----------+         
F0 -|---->       -|-->  f0                       F0 ----|->       -|-->  f0    
F1 -|---->       -|-->  f1                       F1 ----|->       -|-->  f1    
F2 -|---->       -|-->  f2                       F2 ----|->       -|-->  f2    
F3 -|-√2->       -|-->  f3   --->              √2F3 ----|->       -|-->  f3
F4 -|---->       -|-->  f4                       F4 ----|->       -|-->  f4    
F5 -|-√2->       -|-->  f5                     √2F5 ----|->       -|-->  f5
F6 -|---->       -|-->  f6                       F6 ----|->       -|-->  f6    
F7 -|---->       -|-->  f7                       F7 ----|->       -|-->  f7    
    +-------------+                                     +----------+         
        original                                          modified

The solution uses scaled inputs: F3 and F5 entering the 1-D transform are already multiplied by √2. Because the 2-D transform is performed as successive row- and column transforms with transposed output (using the same function for both), understanding the scaling matrix is a little complicated. So let's go backwards:

1. The modified 1-D transform of the second block works correctly when F3 and F5 of each row are already √2-multiplied:

                                                                 0  1  2  3  4  5  6  7
                                                                 
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                                                         ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .    tr.p   .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

2. These two columns are the result of the modified 1-D transform of rows 3 and 5 of the previous block (transposed output). Because the transform is linear, obtaining all outputs scaled by √2 can be achieved by scaling all inputs by √2 before the transform, using √2 f(x) = f(√2 x), where x is the (row) vector:

                                                                 0  1  2  3  4  5  6  7
                                                                   
                           0    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           1    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           2    .  .  .  .  .  .  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                           3    +  +  +  +  +  +  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                           4    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           5    +  +  +  +  +  +  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           6    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           7    .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

3. But the first block transform using the modified 1-D transform also expects F3 and F5 of each row multiplied by √2. In rows 3 and 5, where all inputs are already scaled by √2, F3 and F5 will be double scaled, i.e. (√2)² = 2, the intersection points marked by #:

                                                                 0  1  2  3  4  5  6  7
                                                                   
                           0    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           1    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           2    .  .  .  +  .  +  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                           3    +  +  +  #  +  #  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                           4    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           5    +  +  +  #  +  #  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           6    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                           7    .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

This gives our scaling matrix, S, which multiplies the values of the 8x8 coefficient block before the modified IDCT is applied twice, first on rows then on columns, thus eliminating the √2 multiplications of F3 and F5 inside the LLM 1-D transform:

          1   1   1  √2   1  √2   1   1
          1   1   1  √2   1  √2   1   1
          1   1   1  √2   1  √2   1   1
   S =   √2  √2  √2   2  √2   2  √2  √2
          1   1   1  √2   1  √2   1   1
         √2  √2  √2   2  √2   2  √2  √2
          1   1   1  √2   1  √2   1   1
          1   1   1  √2   1  √2   1   1

S is also the result of a matrix multiplication of the column vector v and its transposed vector vT:

S = v vT,

where v = (1, 1, 1, √2, 1, √2, 1, 1):

      |    1   1   1  √2   1  √2   1   1 (vt)
   ---|------------------------------------------------            
      |                                          
   1  |    1   1   1  √2   1  √2   1   1                 
   1  |    1   1   1  √2   1  √2   1   1                 
   1  |    1   1   1  √2   1  √2   1   1               
  √2  |   √2  √2  √2   2  √2   2  √2  √2       =  S
   1  |    1   1   1  √2   1  √2   1   1                 
  √2  |   √2  √2  √2   2  √2   2  √2  √2                 
   1  |    1   1   1  √2   1  √2   1   1                 
   1  |    1   1   1  √2   1  √2   1   1                 
  (v) |                                          

This essentially means that we only have to focus on finding a good scaling vector for a modified LLM transform; the final scaling matrix for the 2-D row/column method is then simply v vT. Integrating this scaling matrix (S) into the quantization matrix (Q) by multiplying each value pair yields the (real) SQ matrix - and a faster 1-D transform.
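
For example, the combined table could be built once at decoder setup, roughly like the sketch below (the function name is mine; an integer decoder would round the products to fixed point instead of keeping doubles):

/* v[] is one of the scaling vectors from the Results section,
   Q[] the 8x8 quantization table in row-major order. */
void build_sq(const double v[8], const unsigned short Q[64], double SQ[64])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)                       /* S[i][j] = v[i] * v[j]    */
            SQ[i * 8 + j] = v[i] * v[j] * Q[i * 8 + j];   /* SQ = S x Q, element-wise */
}

During decoding, each received coefficient is then multiplied by the corresponding SQ entry instead of the plain Q entry, and the modified transform is applied as usual.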

Scaling the even part butterfly

For X2 and X6 the even part starts with a butterfly multiplication. The 2 irrational constants (K and S) are:

K = √2cos(3π/8)
S = √2sin(3π/8)

The outputs are computed according to the equation:

x6 = K X6 + S X2 
x2 = K X2 - S X6 

We can obtain x6 without multiplication inside a modified LLM transform when the inputs are scaled: let W6 = K X6 and W2 = S X2. Then x6 is simply

x6 = W6 + W2

while for x2 we get:

x2 = K/S W2 - S/K W6

The ratios K/S and S/K simplify to nice constants via trigonometric identities:

K/S = √2cos(3π/8) / √2sin(3π/8) = cot(3π/8) = √2 - 1
S/K = √2sin(3π/8) / √2cos(3π/8) = tan(3π/8) = √2 + 1

then

x2 = (√2 - 1) W2  - (√2 + 1) W6  = √2 W2 - W2 - √2 W6 - W6  = √2 ( W2 - W6 ) - W2 - W6

That is, the scaled even part needs only one multiplication.

I discovered this accidentally by looking at values in the Excel sheet I use to calculate factors. Indeed sin(3π/8) = ½√(2+√2) and cos(3π/8) = ½√(2-√2). Then K/S = √( (2-√2)/(2+√2) ) = √( (2-√2)²/(4-2) ) = √( (4-4√2+2)/2 ) = √( 2-2√2+1 ) = √( (√2-1)² ) = √2-1. S/K is similar.

Now we can include α and β in the scaling vector v, and use simplified equations for computing the rotation for the even part:

v = (1, 1, β, γ, 1, γ, α, 1) = ( 1, 1, √2sin(3π/8), √2, 1, √2, √2cos(3π/8), 1 )
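
In code, the scaled even-part rotation then reduces to something like this sketch (W2 and W6 denote X2 and X6 already multiplied by beta and alpha via the SQ table; the function name is mine):

/* Even-part rotation with pre-scaled inputs: only one multiplication remains. */
void even_rotation_scaled(double W2, double W6, double *x2, double *x6)
{
    *x6 = W6 + W2;                               /* no multiplication          */
    *x2 = 1.414213562 * (W2 - W6) - W2 - W6;     /* the single sqrt(2) multiply */
}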

Scaling the odd part butterfly


The odd part runs on the π/16 line, and there seems to be almost no relationship between sin/cos(π/16) and sin/cos(3π/16). Almost. Now let's consider both butterflies:

f1 = η F1 + θ F7
f7 = η F7 - θ F1

f5 = δ F5 + ε F3
f3 = δ F3 - ε F5

Using one of the 3-mul equations for f5/f3, for example the S(b-a) intermediate, we get:

c = ε (F3 - F5)
f5 = c + ( δ + ε ) F5
f3 = c + ( δ - ε ) F3

In the last equations the sum and the difference of sin/cos of the same angle appear, and these are trigonometric identities! Maybe this helps the scaled IDCT somehow and might simplify the odd part of the LLM transform?

I read this somewhere:

sin(x) + cos(x) = √2 sin( x + π/4)

I always wonder about trigonometric identities: how come this is true? It's pretty hard to see with geometry, drawing triangles, so let's plot these 3 functions using Fooplot for example:

Plotting sin(x) + cos(x):

The plot shows sin(x), cos(x) and sin(x)+cos(x): it is periodic, with the same 2π period as sin/cos, and shifted on the x axis by ±π/4 compared to sin and cos. The maximum is at π/4, where sin(π/4) = cos(π/4) = √2/2, ergo max = √2.

So sin(x) + cos(x) is something like:

sin(x) + cos(x) = √2 sin(x + π/4) or

sin(x) + cos(x) = √2 cos(x - π/4) or

sin(x) + cos(x) = √2 sin(x - 7π/4) or

sin(x) + cos(x) = √2 cos(x + 7π/4) or

sin(x) + cos(x) = √2 cos(-x + π/4).. and so on.

Why is this interesting? Because in the LLM transform both δ/ε (π/16) and η/θ (3π/16) are multiples of the π/16 angle. Adding π/4 to π/16 gives 5π/16, which is complementary to 3π/16 (they sum to π/2), so sin(3π/16) = cos(5π/16) and cos(3π/16) = sin(5π/16). Now using the above equations we can state the following identity for the sum δ + ε:

δ + ε = cos(π/16) + sin(π/16) = √2 sin(π/16 + π/4) = √2 sin(5π/16) = √2 cos(3π/16) = √2 η

and similarly for the sum of η + θ:

η + θ = cos(3π/16) + sin(3π/16) = √2 sin(3π/16 + π/4) = √2 sin(7π/16) = √2 cos(π/16) = √2 δ

Plotting cos(x) - sin(x):

In the 3-mul butterfly version the other multiplier, K-S, also appears, so let's derive identities for the differences as well, in terms of the constants of the LLM transform.

cos(x) - sin(x) = √2 sin(x + 3π/4)

cos(x) - sin(x) = √2 cos(x + π/4)

cos(x) - sin(x) = √2 sin(x - 5π/4)

cos(x) - sin(x) = √2 cos(x - 7π/4)

Now using the above we can state the following:

δ - ε = cos(π/16) - sin(π/16) = √2 cos(π/16 + π/4) = √2 cos(5π/16) = √2 sin(3π/16) = √2 θ

and

η - θ = cos(3π/16) - sin(3π/16) = √2 cos(3π/16 + π/4) = √2 cos(7π/16) = √2 sin(π/16) = √2 ε

Giving some useful identities for the LLM transform:

δ + ε = √2 η
δ - ε = √2 θ
ε - δ = -√2 θ
and
η + θ = √2 δ
η - θ = √2 ε
θ - η = -√2 ε
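
A quick numerical sanity check of these identities (a throwaway snippet, not part of any decoder):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    const double delta = cos(PI / 16),     epsilon = sin(PI / 16);
    const double eta   = cos(3 * PI / 16), theta   = sin(3 * PI / 16);
    const double r2    = sqrt(2.0);

    printf("delta + epsilon = %.9f   sqrt2 * eta     = %.9f\n", delta + epsilon, r2 * eta);
    printf("delta - epsilon = %.9f   sqrt2 * theta   = %.9f\n", delta - epsilon, r2 * theta);
    printf("eta   + theta   = %.9f   sqrt2 * delta   = %.9f\n", eta + theta,     r2 * delta);
    printf("eta   - theta   = %.9f   sqrt2 * epsilon = %.9f\n", eta - theta,     r2 * epsilon);
    return 0;
}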

Back to the original problem:

f1 = η F1 + θ F7
f7 = η F7 - θ F1

or

c = θ (F7 - F1)
f1 = c + √2 δ F1
f7 = c + √2 ε F7

and

c = ε (F3 - F5)
f5 = c + √2 η F5
f3 = c + √2 θ F3

I was hoping that one of these equations, any of the combinations, would give a chance to decrease the number of multiplications below 5 for the two odd butterflies, but no success. The only thing I could do was to choose one of the constants for pre-scaling, thus reducing the multiplication count of one of the butterflies to 2. For example, when all 4 inputs F1, F7, F5 and F3 are equally scaled by r = η, we get 5 new irrational constants and equations with 5 multiplications:

f1 = F1 + θ/η F7
f7 = F7 - θ/η F1

c = ε/η (F3 - F5)
f5 = c + √2 F5
f3 = c + √2 θ/η F3

Including r = η in the scaling vector v for the odd part leads to one of the final solutions:

v = (1, η, β, γη, 1, γη, α, η)

and a modified 1-D LLM transform with 6 multiplications.
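
To close, here is a floating-point sketch (mine, not the integer code from the repository) of the scaled odd-part rotations with the 5 multiplications derived above; together with the single multiplication left in the even part this gives the 6 multiplications of the modified transform:

/* F1, F7, F5 and F3 are assumed to arrive already multiplied by r = eta
   (the gamma components of v cover the sqrt(2) factors from the first example). */
void odd_rotations_scaled(double F1, double F7, double F5, double F3,
                          double *f1, double *f7, double *f5, double *f3)
{
    const double TH_ETA   = 0.555570233 / 0.831469612;   /* theta / eta           */
    const double EPS_ETA  = 0.195090322 / 0.831469612;   /* epsilon / eta         */
    const double R2       = 1.414213562;                 /* sqrt(2)               */
    const double R2TH_ETA = R2 * TH_ETA;                 /* sqrt(2) * theta / eta */

    *f1 = F1 + TH_ETA * F7;            /* mul 1 */
    *f7 = F7 - TH_ETA * F1;            /* mul 2 */

    double c = EPS_ETA * (F3 - F5);    /* mul 3 */
    *f5 = c + R2 * F5;                 /* mul 4 */
    *f3 = c + R2TH_ETA * F3;           /* mul 5 */
}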

Notes:

  1. ^ LLM 1989: C. Loeffler, A. Ligtenberg and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, Proc. ICASSP 1989.

Sun Jun 17 17:37:16 UTC+0200 2018 © A. Tarpai