Unfinished..
It is claimed that inter prediction may result in less error from sub-pixel positions. These sample values has to be computed by some kind of interpolation during encoding/decoding to form the N x M
prediction block.
That is moving the search block fractional pixels and compute the prediction error using interpolated sample values during motion estimation by the encoder. On the decoder side using sub-pixel precision motion vectors (mv), sample values are computed by the same interpolation.
Example: this 4x4 area of the image moving half pixel to the right:
X X X X X o X o X o X o X X X X X o X o X o X o X: sample ---> o: interpolated sample X X X X X o X o X o X o X X X X X o X o X o X o
What are the values of o
?
The simplest way of computing values of o
in this case is by averaging the left- and right neighbours (a special case of linear interpolation where distances from neighbours are the same):
A o B o = ( A + B ) / 2
On the right edge the reference block has to be extended to (N+1)
to provide neighbours for the edge-samples:
X o X o X o X o X X o X o X o X o X 5x4 ref block --> 4x4 linear interpolated X o X o X o X o X X o X o X o X o X
The vertical case is similar, extended to (M+1)
:
X X X X o o o o X X X X 4x5 ref block --> 4x4 linear interpolated o o o o X X X X o o o o X X X X o o o o X X X X
When both the vertical- and horizontal movement is fractional, the reference block is extended to (N+1)x(M+1)
:
X X X X X o o o o X X X X X 5x5 ref block --> 4x4 bilinear interpolated o o o o X X X X X o o o o X X X X X o o o o X X X X X
Bilinear interpolation is used, which needs 4 neighbours:
A B \ / o = ( A + B + C + D ) / 4 o / \ C D
Another way of interpolation is to look beyond the closest neighbours and calulate some weighted average. This is called Finite Impulse Response filtering (FIR) and belongs to DSP. Samples are called tap.
Example 6-tap FIR of AVC. We extend the neighbourhood from 2 to 6 samples to calculate o
using weighted average (weights are constant for AVC):
1 -5 20 20 -5 1 --------------------- R S A o B K L o = ( 1 x M - 5 x N + 20 x A + 20 x B - 5 x Q + 1 x R ) / 32
For AVC 6-tap, the reference block is extended left (by 2) and right (by 3) to provide 3+3 neighbours for the edge-samples:
+2 +3 +-------+-------------+-----------+ | | | | |X X |X o X o X o X|o X X X| | | | | +-------+-------------+-----------+
Example 8-tap FIR of MPEG-4. Extend neighbourhood to 8 samples to calculate o
:
-8 24 -48 160 160 -48 24 -8 ------------------------------- Q R S A o B K L M o = ( -8 x Q + 24 x R - 48 x S + 160 x A + 160 x B - 48 x K + 24 x L - 8 x M ) / 256
MPEG-4 8-tap doesn't extend the reference block beyond bilinear (N+1)x(M+1), it uses boundary mirroring of 3 edge samples to provide 4+4 neighbours:
+1 +-------------+---+ | | | C B A |A o B o C o D|o E| E D C | | | +-------------+---+
Other FIR used in video compression standards:
A B C Integer sample D E F G H I H.261, MPEG-1
A . B . C . . . . . . . D . E . F . Half sample . . . . . . G . H . I . MPEG-1/2, H.263, MPEG-4 bilinear interpolation . . . . . .
A . . . B . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . Quarter sample . . . . . . . . . . . . D . . . E . . . F . . . AVC luma 6-tap FIR . . . . . . . . . . . . MPEG-4 luma 8-tap FIR . . . . . . . . . . . . . . . . . . . . . . . . G . . . H . . . I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A . . . . . . . B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1/8 sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . AVC 4:2:0 sub-sampled chroma bilinear interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D . . . . . . . E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Motion vector component is a 2's-complement integer.
The integer value interpreted as integer- and fractional part.
Here is a quarter precision mv horizontal component of +/-11:
-11 +11 <-------------------------------+-------------------------------> | X . . . X . . . X . . . X . . . X . . . X . . . X . . . X . | | | | | | | curr | | |___________| |___________| -3 -2 -1 0 1 2 3
Moving +/-11 quarter pixels results in different integer left neighbour (start position for sample fetch):
Moving +/-11 quarter pixels results in different fractional position (different interpolation equations):
Fortunately, CPU integer operations asl
and and
gives exactly the correct results and is very fast! This is >>
and &
in C.
mv >> 2 & 3 frac ... 11111110 11 -5 -2 3 3/4 11111111 00 -4 -1 0 0 11111111 01 -3 -1 1 1/4 11111111 10 -2 -1 2 1/2 11111111 11 -1 -1 3 3/4 00000000 00 0 0 0 00000000 01 1 0 1 1/4 00000000 10 2 0 2 1/2 00000000 11 3 0 3 3/4 00000001 00 4 1 0 0 00000001 01 5 1 1 1/4 ...
And it works for all precision:
Int Frac Half sample precision >> 1 & 1 Quarter sample precision >> 2 & 3 Eights sample precision >> 3 & 7
To predict an N x M block by interpolation requires an extended reference sample block, depending on the method.
For all (bi)linear and peculiarly MPEG-4 ASP 8-tap:
(N+1)x(M+1)
from (Int_{x}; Int_{y})
bi h v +1 +1 +-------------------+-+ +-------------------+-+ +-------------------+ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | +-------------------+-+ +-------------------+-+ +-------------------+ +1 +-------------------+-+ +1 +-------------------+
For AVC 6-tap luma:
(N+5)x(M+5)
from (Int_{x} - 2; Int_{y} - 2)
bi h v -2 +3 +-+-+-------------------+-+-+-+ +-------------------+ -2 +-+-+-------------------+-+-+-+ -2 +3 -2 +-------------------+ +-+-+-------------------+-+-+-+ +-+-+-------------------+-+-+-+ +-------------------+ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | +-+-+-------------------+-+-+-+ +-+-+-------------------+-+-+-+ +-------------------+ +-+-+-------------------+-+-+-+ +-------------------+ +3 +-+-+-------------------+-+-+-+ +3 +-------------------+ +-+-+-------------------+-+-+-+ +-------------------+
The horizontal/vertical cases are just optimizations to reduce pixel fetch from the reference buffers.
A true linear interpolation: 1/8 samples are computed by weighted average of 4 neghbours. Weights are distances:
X . . . . . . . X . . . . . . . . X: integer samples . . . . . . . . .: 8th fractional samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X X
To make 8^{th} precision mv from 4^{th} precision luma mv depends on sub-sampling.
When sub-sampled (eg. 4:2:0 vertical): keep the quarter luma mv as-is but with different int/frac:
int frac int frac 100101010 10 ----> 10010101 010
When not sub-sampled (eg. 4:2:2 vertical): either opt it out or to use the same 8^{th} routines, scale the vertical component by <<1, so int is the same but frac is doubled:
int frac int frac 100101010 10 ----> 100101010 100
- 1/2 + 1/2 <-- --> | | | | -2 -1 | 0 | 1 2 ---------------+-----+--------------- X X o X o X X | | | |_____|_____| | |
From closest neighbours:
A - h - B | \ / h = ( A + B ) / 2 v b v = ( A + C ) / 2 | / \ b = ( A + B + C + D ) / 4 C D
or by CPU integer arithmetics:
o = ( A + B + 1) >> 1
Here o
is by averaging 4 neighbours (again, a special case of bilinear interpolation where distance from neighbours are the same):
o = ( A + B + C + D ) / 4 o = ( A + B + C + D + 2 ) >> 2
X - o - X | \ / o o | / \ X X
Reference 4x4 +----------------------+ | X o X o X o X | all possible half positions of interpolated 3x3 | o o o o o o | | X o X o X o X | | o o o o o o | | X o X o X o X | | o o o o o o | | X X X X | +----------------------+
Interpolated values are computed on-the-fly during prediction by the decoder.
Eg. below: xFrac=3, yFrac=2: Standard calls this k
value. How to calculate its value?
k
is there)
A . . . B . . . C . . . . . . . . . . . . . . . . k . . . k . . . . . . . . . . . . . . D . . . E . . . F . . . . . . . . . . . . . . . . k . . . k . . . . . . . . . . . . . . G . . . H . . . I . . . . . . . . . . . . .
k
equals average of j
and m
:
A . . . B . . . C . . . . . . . . . . . . . . . j k m . j k m . . . . . . . . . . . . . D . . . E . . . F . . . . . . . . . . . . . . . j k m . j k m . . . . . . . . . . . . . G . . . H . . . I . . . . . . . . . . . . .
m
is vertical 6-tap of reference samples
j
is the middle value: computed as tap-6-of-tap-6 horizontaly or verticaly
Because we need m
anyway at the end:
m_{1}
for the whole block as 6-tap
j
horizontally from m_{1}
as 6-tap
k
as avarage of j
and m
+---------+ +---------+ | | | | | | | +--|--------|---------|-------------+ | | | +---------+ | +---------+ | | | | | | | | | | | | | | | | | | +---------+ +---------+ +-----------------|---------|------|+ | | | | | | | | | +---------+ +---------+ +---------+ | | | | | | +---------+
.
0 0 0 1 2 0 0 0 1 2 +-----+-----+-----+-- 0 0 | 0 | 1 | 2 | +-----+-----+-----+-- 3 3 | 3 | 4 | 5 | +-----+-----+-----+-- 6 6 | 6 | 7 | 8 | +-----+-----+-----+-- | | | |
.
MPEG-4's quarter luma interpolation for half-position samples to keep the same ref sample numbers as half-sample interpolation, but using 8-tap: