Improving LLM IDCT by scaled coefficients

© Attila Tarpai

Results

Using any of these scaling vectors the number of multiplications can be reduced to 6 in the LLM 1D transform:

v = (1, η, β, γη, 1, γη, α, η)
v = (1, δ, β, γδ, 1, γδ, α, δ)
v = (1, θ, β, γθ, 1, γθ, α, θ)
v = (1, ε, β, γε, 1, γε, α, ε)

LLM method

LLM 1989 (C. Loeffler, A. Ligtenberg, and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications)

LLM factorization uses 7 irrational constants and 11 multiplications for the 8-point 1D transform. The 1D transform is a linear transformation (this will be important later), computes 8 outputs from 8 inputs:

                          linear transformation
0  1  2  3  4  5  6  7     ------------------->       0' 1' 2' 3' 4' 5' 6' 7' 

This transform function is called 16 times for the 8x8 block: 8 row- and 8 column-transforms. To implement the IDCT with only one function(1), transformed results of one row are written transposed through a temp-block (T '). F.ex. elements of the first row of the block wander like this:

0  1  2  3  4  5  6  7                   0  .  .  .  .  .  .  .              0  1  2  3  4  5  6  7
.  .  .  .  .  .  .  .                   1  .  .  .  .  .  .  .              .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .     8 x 1D        2  .  .  .  .  .  .  .   8 x 1D     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .    --------->     3  .  .  .  .  .  .  .  --------->  .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .       row         4  .  .  .  .  .  .  .   column     .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   5  .  .  .  .  .  .  .              .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   6  .  .  .  .  .  .  .              .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .                   7  .  .  .  .  .  .  .              .  .  .  .  .  .  .  .
         T1                                         T'                                   T2

(1) By writing separate row- and column-transform functions there are even more possibilities optimizing the IDCT.

Passing a cleverly scaled block to the IDCT gives opportunities to decrease the complexity of computations inside the 1D transform function.

How the scaling matrix works

I've been thinking how to get rid of the √2 multiplication of x3 and x5 inside the LLM 1D-transform.

    +-------------+                                     +----------+         
x0 -|---->       -|-->                           x0 ----|->       -|-->      
x1 -|---->       -|-->                           x1 ----|->       -|-->      
x2 -|---->       -|-->                           x2 ----|->       -|-->      
x3 -|-√2->       -|-->       --->              √2x3 ----|->       -|-->
x4 -|---->       -|-->                           x4 ----|->       -|-->      
x5 -|-√2->       -|-->                         √2x5 ----|->       -|-->
x6 -|---->       -|-->                           x6 ----|->       -|-->      
x7 -|---->       -|-->                           x7 ----|->       -|-->      
    +-------------+                                     +----------+         
        original                                         simplified

The solution uses scaled inputs: that is x3 and x5 entering the 1D-transform every time are already multiplied by √2.

Because of the one function and transposed output, understanding scaling was a little complicated, so lets go backwards:

1. Inputs of the second block transform: all x3 and x5 are already √2-multiplied:

                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                         8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                                                         ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                 .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

2. Values of + come from the first block transform of row 3 and 5 (transposed output):

                                .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  .  .  .  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                                +  +  +  +  +  +  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                                .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                +  +  +  +  +  +  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  .  .  .  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

To get all 8 outputs multiplied by √2 after the first transform of row 3 and 5, all inputs should be multiplied by √2 before entering the 1D-transform. Because of linear transformation of vector x, f(wx) = wf(x), meaning all outputs will be scaled by a w factor, when all inputs are scaled by the same factor.

3. But the first block transform also expects x3 and x5 values multiplied by √2 (we use the same function for both). For row 3 and 5, where all inputs is already scaled by √2, x3 and x5 will be double scaled, ie. (√2)² = 2, the intersection points marked by #:

                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                                +  +  +  #  +  #  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                +  +  +  #  +  #  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                                                                                           output

This gives the pre-scale matrix, M, to multiply the 8x8 coefficient block before this modified IDCT:

 1   1   1  √2   1  √2   1   1
 1   1   1  √2   1  √2   1   1
 1   1   1  √2   1  √2   1   1
√2  √2  √2   2  √2   2  √2  √2
 1   1   1  √2   1  √2   1   1
√2  √2  √2   2  √2   2  √2  √2
 1   1   1  √2   1  √2   1   1
 1   1   1  √2   1  √2   1   1

Interestingly, M is the result of a matrix multiplication of vector v and its transposed vector vT (it does look symmetrical):

M = v vT

where v = (1, 1, 1, √2, 1, √2, 1, 1)

     |   .  .  .  +  .  +  .  .   (v)
  ---|------------------------------            
     |                                          
  .  |   .  .  .  +  .  +  .  .                 
  .  |   .  .  .  +  .  +  .  .                 
  .  |   .  .  .  +  .  +  .  .               
  +  |   +  +  +  #  +  #  +  +                 
  .  |   .  .  .  +  .  +  .  .                 
  +  |   +  +  +  #  +  #  +  +                 
  .  |   .  .  .  +  .  +  .  .                 
  .  |   .  .  .  +  .  +  .  .                 
 (vT)|                                          
                M

When the 8x8 coefficient block is multiplied by values of M before the IDCT, there is no need for √2 multiplication of x3 and x5 inside the LLM 1D-transform.

.  .  .  .  .  .  .  .          .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .          .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .  scale   .  .  .  +  .  +  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .  ----->  +  +  +  #  +  #  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .          .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .          +  +  +  #  +  #  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .          .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
.  .  .  .  .  .  .  .          .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
dequantized coefficients         scaled coefficients               row-transform                    column transform
       input                                                                                             output

So far, performance is actually worse: we replaced 2 × 16 multiplications with 64 multiplications. But because both dequantization and scaling is multiplication, we can simply use a scaled dequantization matrix instead:

                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .   8 x 1D  .  .  .  +  .  +  .  .   8 x 1D  .  .  .  .  .  .  .  .
                                +  +  +  #  +  #  +  +   ----->  .  .  .  +  .  +  .  .   ----->  .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                +  +  +  #  +  #  +  +           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                .  .  .  +  .  +  .  .           .  .  .  +  .  +  .  .           .  .  .  .  .  .  .  .
                                dequantized and scaled              row-transform                    column transform
                                  coefficient input                                                      output

Thus eliminating 2 × 16 multiplications from the IDCT.

Using integer CPU arithmetics

When using integer arithmetics, a good approximation of v is used, where V = INT (2n v). After the second 1D-transform, all values are shifted right by n:

And an example integer matrix for scaled x3 and x5 by √2. Here n=7 and √2 × 128 = 181,019336 seems like a good integer approximation:

128	128	128	181	128	181	128	128
128	128	128	181	128	181	128	128
128	128	128	181	128	181	128	128
181	181	181	256	181	256	181	181
128	128	128	181	128	181	128	128
181	181	181	256	181	256	181	181
128	128	128	181	128	181	128	128
128	128	128	181	128	181	128	128

OK, this seemed to work, so lets see how we can help other multiplications in the 1D transformusing by using a pre-scale matrix.

Opt for the even part using scaled coefficients

For x2 and x6 the even part starts with a butterfly-mul. What if x6 and x2 is already pre-scaled? There are 2 factors:

K = √2cos(3π/8)
S = √2sin(3π/8)

and the butterfly mul:

y6 = K x6 + S x2
y2 = K x2 - S x6

Lets pre-scale x6 by K and x2 by S to compute y6 without multiplication:

w6= K x6,
w2= S x2,

then y6 is simply:

y6 = w6 + w2.

For y2 the equation we get:

y2 = K/S w2 - S/K w6

The constants K/S and S/K are magical:

K/S = √2cos(3π/8) / √2sin(3π/8) = cot(3π/8) = √2 - 1
S/K = √2sin(3π/8) / √2cos(3π/8) = tan(3π/8) = √2 + 1

then

y2 = (√2 - 1) w2 - (√2 + 1) w6 = √2 w2 - w2 - √2 w6 - w6 = √2 (w2-w6) - w2 - w6

That is, the even part butterfly mul needs only one multiplication!

I discovered this accidently by looking at values in the Excel sheet I use to calculate factors. Indeed π/8 and 3π/8 are nice angles: sin(3π/8) = ½√(2+√2) and cos(3π/8) = ½√(2-√2). Then K/S = √( (2-√2)/(2+√2) ) = √( (2-√2)²/(4-2) ) = √ ( (4-4√2+2) / 2 ) = √ ( 2-2√2+1 ) = √ ( (√2-1)² ) = √2-1. S/K is similar. The fig below might help, using half angles and Pythagoras, where α=π/4. Half angle sin/cos is computed from the red triangle, where sinα=√2 and 1-cosα=1-√2.

Then we just extend our pre-scale vector, v, and use the above equations for y6 and y2 in the even part:

v = ( 1, 1, √2sin(3π/8), √2, 1, √2, √2cos(3π/8), 1 )

Opt for the odd part using scaled coefficients

With pre-scaling, multiplying x3 and x5 by √2 can be eliminated from the Löffler 1D 8-point transform.

The problem is that the odd part continues with a butterly add, i.e all 4 inputs must be equally scaled here. I tried to find a factor for x1, x3, x5, x7 to simplify the 2 butterly mul, but unfortunately I cannot see any for sin/cos(3π/16) and sin/cos(π/16). They are just not so nice angles.

The only thing I could do was to choose one of the constants (η, δ, θ or ε) for scaling the odd part. This eliminates 2 multiplications from one of the butterly mul, but doesn't help the other. Still, better than nothing, and maybe there is some trick somewhere, I just cannot see it.

Extending the pre-scale vector by f.ex. η:

v = (1, η, β, γη, 1, γη, α, η) 

Then for butterly-mul of x1/x7:

y1 = x1 + θ/η x7
y7 = x7 - θ/η x1

For x5/x3 the new constants are:

y5 = δ/η x5 + ε/η x3
y3 = δ/η x3 - ε/η x5


Practicing pre-scale

Here are some float IDCT-s I used for testing scaled coefficient inputs with different scaling vectors and 1D-IDCT functions.

double k62;    //(M_SQRT2*cos(3*M_PI/8))
double s62;    //(M_SQRT2*cos(3*M_PI/8))
               
double k53;    //(cos(M_PI/16))
double s53;    //(sin(M_PI/16))
               
double k17;    //(cos(3*M_PI/16))
double s17;    //(sin(3*M_PI/16))
#define FXADD(a,b,c,d) p=a+b, n=a-b, a=p+d, b=n+c, c=n-c, d=p-d

#define FXMUL(xa,xb,kcos,ksin) n=xa, xa=kcos*xa+ksin*xb, xb=kcos*xb-ksin*n
#define FXMUL(xa,xb,kcos,ksin) n=xa, p=kcos*(xa+xb), xa=p+(ksin-kcos)*xb, xb=p-(ksin+kcos)*n

1. LLM direct implementation using float


static void fidct1D(double *x, double *y)
{
	double p, n;

	x[3]*=M_SQRT2;
	x[5]*=M_SQRT2;
	FXADD(x[1],x[7],x[5],x[3]);
	FXMUL(x[1],x[7],k17,s17);
	FXMUL(x[5],x[3],k53,s53);

	FXMUL(x[6],x[2],k62,s62);
	FXADD(x[0],x[4],x[2],x[6]);

	y[0*8]= x[0]+x[1];
	y[1*8]= x[4]+x[5];
	y[2*8]= x[2]+x[3];
	y[3*8]= x[6]+x[7];
	y[4*8]= x[6]-x[7];
	y[5*8]= x[2]-x[3];
	y[6*8]= x[4]-x[5];
	y[7*8]= x[0]-x[1];
}

#define PRE 0

void fidct(int *b, short *y)
{
	double B[64], B2[64];
	int i;
	for (i=0; i<64; i++) B[i]= b[i];	// to float
	B[0]+= (1<<(3+PRE-1));	// rounding control

	for (i=0; i<8; i++) fidct1D(B+i*8, B2+i);
	for (i=0; i<8; i++) fidct1D(B2+i*8, B+i);
	
	for (i=0; i<64; i++) y[i] = CLIP[ (int)(B[i]/(1<<(3+PRE))) ];
}

2. LLM float pre-scale for √2 x3 and x5

double f2scale [8*8] = {	// gamma
	1,1,1,M_SQRT2,1,M_SQRT2,1,1,
	1,1,1,M_SQRT2,1,M_SQRT2,1,1,
	1,1,1,M_SQRT2,1,M_SQRT2,1,1,
	M_SQRT2,M_SQRT2,M_SQRT2,2,M_SQRT2,2,M_SQRT2,M_SQRT2,
	1,1,1,M_SQRT2,1,M_SQRT2,1,1,
	M_SQRT2,M_SQRT2,M_SQRT2,2,M_SQRT2,2,M_SQRT2,M_SQRT2,
	1,1,1,M_SQRT2,1,M_SQRT2,1,1,
	1,1,1,M_SQRT2,1,M_SQRT2,1,1,
};

static void fidct1D(double *x, double *y)
{
	double p, n;

	FXADD(x[1],x[7],x[5],x[3]);
	FXMUL(x[1],x[7],k17,s17);
	FXMUL(x[5],x[3],k53,s53);

	FXMUL(x[6],x[2],k62,s62);
	FXADD(x[0],x[4],x[2],x[6]);

	y[0*8]= x[0]+x[1];
	y[1*8]= x[4]+x[5];
	y[2*8]= x[2]+x[3];
	y[3*8]= x[6]+x[7];
	y[4*8]= x[6]-x[7];
	y[5*8]= x[2]-x[3];
	y[6*8]= x[4]-x[5];
	y[7*8]= x[0]-x[1];
}

#define PRE 0

void fidct(int *b, short *y)
{
	double B[64], B2[64];
	int i;
	for (i=0; i<64; i++) B[i]= b[i] * f2scale[i];	// scaling
	B[0]+= (1<<(3+PRE-1));	// rounding control

	for (i=0; i<8; i++) fidct1D(B+i*8, B2+i);
	for (i=0; i<8; i++) fidct1D(B2+i*8, B+i);
	
	for (i=0; i<64; i++) y[i] = CLIP[ (int)(B[i]/(1<<(3+PRE))) ];
}

3. LLM integer pre-scale for √2 x3 and x5

This is for preparing the integer version. An integer approximation of the above matrix (multiplied by 27)

int i2scale [8*8] = {
	128, 128, 128, 181, 128, 181, 128, 128, 
	128, 128, 128, 181, 128, 181, 128, 128, 
	128, 128, 128, 181, 128, 181, 128, 128, 
	181, 181, 181, 256, 181, 256, 181, 181, 
	128, 128, 128, 181, 128, 181, 128, 128, 
	181, 181, 181, 256, 181, 256, 181, 181, 
	128, 128, 128, 181, 128, 181, 128, 128, 
	128, 128, 128, 181, 128, 181, 128, 128, 
};

#define PRE 7

void fidct(int *b, short *y)
{
	double B[64], B2[64];
	int i;
	for (i=0; i<64; i++) B[i]= b[i] * i2scale[i];	// scaling
	B[0]+= (1<<(3+PRE-1));	// rounding control

	for (i=0; i<8; i++) fidct1D(B+i*8, B2+i);
	for (i=0; i<8; i++) fidct1D(B2+i*8, B+i);
	
	for (i=0; i<64; i++) y[i] = CLIP[ (int)(B[i]/(1<<(3+PRE))) ];
}

4. Lets include α and β for x6 and x2 in the scaling matrix: v = (1, 1, β, γ, 1, γ, α, 1)

double f3scale [8*8] = {	// alpha, beta, gamma
	1,1,1.306562965,1.414213562,1,1.414213562,0.5411961,1,
	1,1,1.306562965,1.414213562,1,1.414213562,0.5411961,1,
	1.306562965,1.306562965,1.707106781,1.847759065,1.306562965,1.847759065,0.707106781,1.306562965,
	1.414213562,1.414213562,1.847759065,2,1.414213562,2,0.765366865,1.414213562,
	1,1,1.306562965,1.414213562,1,1.414213562,0.5411961,1,
	1.414213562,1.414213562,1.847759065,2,1.414213562,2,0.765366865,1.414213562,
	0.5411961,0.5411961,0.707106781,0.765366865,0.5411961,0.765366865,0.292893219,0.5411961,
	1,1,1.306562965,1.414213562,1,1.414213562,0.5411961,1
};

static void fidct1D(double *x, double *y)
{
	double p, n;

	FXADD(x[1],x[7],x[5],x[3]);
	FXMUL(x[1],x[7],k17,s17);
	FXMUL(x[5],x[3],k53,s53);

	//FXMUL(x[6],x[2],k62,s62); Reduced to one multiplication		
	n=x[2]-x[6];
	x[6]+=x[2];
	x[2]= M_SQRT2*n - x[6];
	FXADD(x[0],x[4],x[2],x[6]);

	y[0*8]= x[0]+x[1];
	y[1*8]= x[4]+x[5];
	y[2*8]= x[2]+x[3];
	y[3*8]= x[6]+x[7];
	y[4*8]= x[6]-x[7];
	y[5*8]= x[2]-x[3];
	y[6*8]= x[4]-x[5];
	y[7*8]= x[0]-x[1];
}

#define PRE 0

void fidct(int *b, short *y)
{
	double B[64], B2[64];
	int i;
	for (i=0; i<64; i++) B[i]= b[i] * f3scale[i];	// scaling
	B[0]+= (1<<(3+PRE-1));	// rounding control

	for (i=0; i<8; i++) fidct1D(B+i*8, B2+i);
	for (i=0; i<8; i++) fidct1D(B2+i*8, B+i);
	
	for (i=0; i<64; i++) y[i] = CLIP[ (int)(B[i]/(1<<(3+PRE))) ];
}

5. Lets do something for the odd part

Eg. scale x1, x3, x5 and x7 by η, so multiplication by η falls out

v = (1, η, β, γη, 1, γη, α, η)

Then we need 3 new constants:

s17k17 = s17/k17
k53k17 = k53/k17
s53k17 = s53/k17

double f4scale [8*8] = {	// alpha, beta, gamma, eta
	1,0.831469612,1.306562965,1.175875602,1,1.175875602,0.5411961,0.831469612,
	0.831469612,0.691341716,1.086367402,0.977704831,0.831469612,0.977704831,0.449988112,0.691341716,
	1.306562965,1.086367402,1.707106781,1.536355513,1.306562965,1.536355513,0.707106781,1.086367402,
	1.175875602,0.977704831,1.536355513,1.382683432,1.175875602,1.382683432,0.63637929,0.977704831,
	1,0.831469612,1.306562965,1.175875602,1,1.175875602,0.5411961,0.831469612,
	1.175875602,0.977704831,1.536355513,1.382683432,1.175875602,1.382683432,0.63637929,0.977704831,
	0.5411961,0.449988112,0.707106781,0.63637929,0.5411961,0.63637929,0.292893219,0.449988112,
	0.831469612,0.691341716,1.086367402,0.977704831,0.831469612,0.977704831,0.449988112,0.691341716,
};

static void fidct1D(double *x, double *y)
{
	double p, n;

	FXADD(x[1],x[7],x[5],x[3]);
	n=x[1], x[1]+= s17k17*x[7], x[7]-= s17k17*n;	//FXMUL eta (k17) included
	FXMUL(x[5],x[3],k53k17,s53k17);
	
	n=x[2]-x[6], x[6]+=x[2], x[2]= M_SQRT2*n - x[6];
	FXADD(x[0],x[4],x[2],x[6]);

	y[0*8]= x[0]+x[1];
	y[1*8]= x[4]+x[5];
	y[2*8]= x[2]+x[3];
	y[3*8]= x[6]+x[7];
	y[4*8]= x[6]-x[7];
	y[5*8]= x[2]-x[3];
	y[6*8]= x[4]-x[5];
	y[7*8]= x[0]-x[1];
}

6. And prepare the integer version, eg. 210:

int iscale [8*8] = {	// <<10
	1024,851,1338,1204,1024,1204,554,851,
	851,708,1112,1001,851,1001,461,708,
	1338,1112,1748,1573,1338,1573,724,1112,
	1204,1001,1573,1416,1204,1416,652,1001,
	1024,851,1338,1204,1024,1204,554,851,
	1204,1001,1573,1416,1204,1416,652,1001,
	554,461,724,652,554,652,300,461,
	851,708,1112,1001,851,1001,461,708
};

#define PRE 10

void fidct8bit(int *b, short *y)
{
	double B[64], B2[64];
	int i;
	for (i=0; i<64; i++) B[i]= b[i] * iscale[i];	// scaling
	B[0]+= (1<<(3+PRE-1));	// rounding control

	for (i=0; i<8; i++) fidct1D(B+i*8, B2+i);
	for (i=0; i<8; i++) fidct1D(B2+i*8, B+i);
	
	for (i=0; i<64; i++) y[i] = CLIP[ (int)(B[i]/(1<<(3+PRE))) ];
}


Summary

Using scaled coefficients before the IDCT can be a significant improvement in speed.

The idea is that scaling happens once for the whole 8x8 block together with dequantization (both are multiplications), simplifying computations for the modified 1D transform function. As if using a scaled-quantization table.

For the integer version, only an extra post-scale stage is required in the form of right shift:

Scaled dequantization --> IDCT --> right shift

Thu May 19 17:40:33 UTC+0200 2016 © Attila Tarpai