|
|
As part of a homebrew jpeg decoder I had to deal with the Inverse Discrete Cosine Transform, IDCT. Pretty interesting and hard topic, especially to implement it in C that is small and fast..
It has a _huge_ literature: it's the heart of the main methods used to compress video and audio data in jpeg/mpeg/audio, both in software and silicon.
Analysis
- Discrete Cosine Transform (DCT) was first introduced in 1974 by Ahmed et al. using real numbers.
- Few fast algorithms for the one-dimensional DCT (1-D DCT), the classical DCT/IDCT algorithms:
- Chen ‘77 the first one
- Wang ‘84
- Lee ‘84
- Vetterli ‘84
- AAN ‘88
- LLM ‘89
- all uses 12 multiplications and 29 additions
- LLM 1989 (C. Loeffler, A. Ligtenberg, and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications)
I’ve been fascinated by Löffler’s paper, its clean and simple solution. After a few but strict additions/multiplications the result appears on the ‘right side’ of this figure:

Image source: Reznik, Hindsy, Zhangz, Yuz, and Ni, Efficient Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform
It uses 7 constants (see below) to do it for 8 input values (as jpeg compresses by 8x8 pixel blocks).
Here is an example implementation in C, integer (fix-point) arithmetics used, we assume that our cpu can multiply too. I know that this routine can be improved on many ways... but this is the shortest I can come up with:
#define xadd3(xa,xb,xc,xd,h) p=xa+xb, n=xa-xb, xa=p+xc+h, xb=n+xd+h, xc=p-xc+h, xd=n-xd+h // triple-butterfly-add (and possible rounding) #define xmul(xa,xb,k1,k2,sh) n=k1*(xa+xb), p=xa, xa=(n+(k2-k1)*xb)>>sh, xb=(n-(k2+k1)*p)>>sh // butterfly-mul equ.(2)
static void idct1(int *x, int *y, int ps, int half) // 1D-IDCT { int p, n; x[0]<<=9, x[1]<<=7, x[3]*=181, x[4]<<=9, x[5]*=181, x[7]<<=7; xmul(x[6],x[2],277,669,0); xadd3(x[0],x[4],x[6],x[2],half); xadd3(x[1],x[7],x[3],x[5],0); xmul(x[5],x[3],251,50,6); xmul(x[1],x[7],213,142,6); y[0*8]=(x[0]+x[1])>>ps; y[1*8]=(x[4]+x[5])>>ps; y[2*8]=(x[2]+x[3])>>ps; y[3*8]=(x[6]+x[7])>>ps; y[4*8]=(x[6]-x[7])>>ps; y[5*8]=(x[2]-x[3])>>ps; y[6*8]=(x[4]-x[5])>>ps; y[7*8]=(x[0]-x[1])>>ps; }
void idct(int *b) // 2D 8x8 IDCT { int i, b2[64]; for (i=0; i<8; i++) idct1(b+i*8, b2+i, 9, 1<<8); // row for (i=0; i<8; i++) idct1(b2+i*8, b+i, 12, 1<<11); // col }
Notes:
- one common 1D-function:
- temp matrix required (transposition)
- probable slower than with 2 separate row/col functions
- more compact and nicer code :)
- two 1D functions:
- could be included in the 2D
- faster
- less storage: no temp matrix required
- little more mess in code
- This is exactly the 1:1 implementation of LLM IDCT
- fix-point arithmetics:
- first left-shift = pre-scale
- add operation: operands requires same pre-scaling
- mul: operands can be of different pre-scale
Little about fix-point multiplication
Lets see eg. how to calculate 15 * 0.2 = 3 on an integer machine:
- pre-scale
- operation
- post-scale
The value 0.2 cannot be represented as an integer number, so we use a multiplied (pre-scaled) value instead for the calculation:
- 0.2 * 10 = 2 (pre-scale non-integer to get integer)
- 15 * 2 = 30 (operation, result is ‘scaled’)
- 30 / 10 = 3 (post-scale, proper result!)
We also assume that an integer divide truncates the result, not rounding, that has to be done ‘by hand’ by adding the half before divide (5 in this case). Here is fixmul that multiplies a by b, where s is the pre-scaled state of one of the operands:
#define fixmul(a,b,s) (((a*b)+s/2) / s)
Eg. to calculate 36 * 0.2 = 7.2 ~7 (the result is integer too), I’d pre-scale by 10 and pass this to fixmul:
fixmul(36,20,10), which returns me 7.
In IDCT we work with constants: we can chose the best binary-scaled value and just watch out scaling during the operations.
In the case of LLM IDCT there is an addition and a multplication.
The addition adds/substracts 4 integer values to/from each other. These operands should be in the same ‘scale’, any, doesn’t matter which one.
In a multiplication the operands can be in different ‘scale’ - only that the post-scale includes both to get the good result. Eg. one operand is scaled by 2, the other by 4:
(12) * (20) = 240 (12*2) * (20*4)= 1920 / 8 = 240 = (12*20)*(2*4)
In the case of LLM IDCT, values of ‘a’ are integer, ‘b’ is a binary-scaled constant, ‘s’ is the scale value.
Binary scaling
Why? Because multiplication/division will be bitwise shift left/right, which operations are part of the basic integer arithmetics of the cpu!
0.2 * 16 = 3.2 ~ 3 (0.2 rounding error unfortunately, 0.2 is not a ‘good’ binary-scale canditate..)
- 0.2 * 16 ~3
- 15 * 3 = 45
- 45 / 16 = 2.8 ~2
- or with ‘proper’ rounding: (45+8) /16=3
Try the same by 64 (when -0.2 rounding error), which gives the same result:
- 0.2 * 64 ~ 12
- 15 * 12 = 180
- (180 + 32 ) / 64 = 3.31 ~ 3
These can cause some ‘error’ along the line (note that the error depends on the actual value):
LLM
A very simple (even I could understand ;) algorithm from 1989 to compute DCT/IDCT, when your cpu can integer add/sub, left/right-shift and multiply.
1-D IDCT
Given an input of 8 integer coefficients..
The IDCT implementation
On both halves there are butterfly-additions and butterfly-multiplications. Luckily with proper arguments the operations are the same, hence 2 #define is enough.
Arguments are properly pre-scaled before add/mul.
The constants
sqr(2)
That looks nice (ie. close to an integer value) on scale 2**7, when sqr(2)*128=181.02 ~ 181. This is a good start for X5 and X3.
Therefore the other 2 variables, X1 and X7 is pre-scaled by 128 to be on the same level.
cos(PI/16) and sin(PI/16)
Because of the butterfly these real constants (0,98078528 and 0,195090322) have to be pre-scaled by the same. On 2**7 they are both close to an integer value: 251 and 50 respectively.
cos(3*PI/16) and sin(3*PI/16)
0,831469612 and 0,555570233 both scale very ‘bad’ binary.. Really doesn’t matter what scaling, there is error.
sqr(2)*cos(6*PI/16) and sqr(2)*sin(6*PI/16)
The even part’s multiply butterfly. Best scaled on 2**9 (512).
Some critisism
The problem is that because of the 11 multiplication instead of 12 (equ. 2), we multiply by k1, (k2+k1) and (k2-k1).. 3 different new constants that have to be scaled same.. looking at my table (excel) it is hard to find a line where all 3 are close to an integer value.
Improvements
With some clever pre-scaling of the input values, these integer constants and integer arithmetic could work with much less error.. I think this is what eg. Reznik did.. and hundreds of others; the DCT/IDCT literature is really endless, the whole jpeg/mpeg is based on it! No wonder an interesting area.
|
|