In the first part of this epic we investigated the substance of the problem and promised to build a two-dimensional transform. Now it is time to get down to it.
Remember that we require a mathematical transform that is reversible, efficient, and decorrelating all at once. The decorrelation requirement means that most of the information must be concentrated in a small number of coefficients, and the interdependency among the coefficients must be weak (the coefficients must be decorrelated). We also have to consider the efficiency of the transform in terms of memory footprint, number of arithmetic operations, and hardware implementation complexity.
Since modern video coding standards split frames into blocks, it is logical to require the transform to be block-based as well. That is to say, the transform has to operate on N×N sample blocks.
Actually, both block-based and full-frame transforms exist. Each has its advantages and disadvantages. For example, full-frame transforms require a large amount of memory but yield better results on static content. Block transforms, on the other hand, require little memory but produce a negative visual effect. What kind of effect?
Let us get back to the concept of a transform. The essence of a mathematical transform is a change in the representation of a value, a change of its format, as a result of which it acquires certain useful characteristics. In our particular case, the transform has to convert a set of correlated values into decorrelated, statistically independent values, allowing us to automatically eliminate statistical redundancy.
We will mostly consider orthogonal transforms. Remember that our requirements include efficiency in terms of memory footprint, number of arithmetic operations, and simplicity of hardware implementation. This means that the transform has to be linear. Thereby, each transformed value b(i) has to be produced by a linear combination of the source values a(j) and the transformation coefficients d(i, j):

b(i) = Σj d(i, j) · a(j)
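In code, such a linear transform is nothing more than a matrix–vector product. A minimal sketch (the identity matrix and the sample vector below are illustrative placeholders, not the matrix chosen later in the article):

```python
def transform(D, a):
    """Apply the linear transform b(i) = sum_j d(i, j) * a(j)."""
    return [sum(d_ij * a_j for d_ij, a_j in zip(row, a)) for row in D]

# Sanity check with the identity matrix: the transform must
# return the input unchanged.
I = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
a = [98, 100, 101, 103]
print(transform(I, a))  # -> [98, 100, 101, 103]
```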
However, linearity is merely one of the properties of an orthogonal transform; it does not define it. For a transform to be orthogonal, its transformation matrix has to consist of mutually orthogonal vectors. Why? Let's consider an example.
Consider a 4x4 block of luminance samples. Let’s refer to it as A.
It is accepted practice in video coding to operate on 8×8 blocks. We have chosen a block four times smaller in order to simplify the example. But why 8×8 and not more? Wouldn't it be better to start with a 64×64 block?
Let us listen to signals from space and choose the following transformation matrix:

D = ½ · | 1  1  1  1 |
        | 1  1 -1 -1 |
        | 1 -1 -1  1 |
        | 1 -1  1 -1 |
What’s the deal? Where did the ½ come from? It is required by the energy conservation principle. What?!
Let’s apply the transformation to the current block (A):
Note that the elements in the first row have become dominant, while the other elements have been considerably reduced. So far the transform has affected each column of the source sample matrix, but we want to concentrate the information predominantly in the first coefficient. Therefore we apply the transform one more time, to the rows of the matrix, first transposing it; fortunately, the transformation matrix is symmetric, so this costs nothing.
We have achieved the required result, concentrating virtually all the energy in the first element. Note that the transform is reversible, that is to say, we can always recover the original values.
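The two passes can be sketched end to end: transform the columns, then the rows via the transpose, and invert by applying the transpose in the opposite order. The 4×4 block below is an illustrative stand-in for A (the original article's block is not reproduced here); only its general character, neighboring luminance samples with similar values, matters:

```python
def matmul(X, Y):
    """Multiply two 4x4 matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Orthonormal transform matrix: the basis vectors scaled by 1/2.
D = [[0.5 * v for v in row] for row in [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]]

# Illustrative block of similar luminance samples (a stand-in for A).
A = [[100, 101,  99, 102],
     [101, 100, 100, 101],
     [ 99, 100, 101, 100],
     [102, 101, 100,  99]]

B = matmul(matmul(D, A), transpose(D))           # forward: columns, then rows
A_restored = matmul(matmul(transpose(D), B), D)  # inverse: D is orthonormal

print(B[0][0])  # the top-left coefficient carries almost all the energy
```

The inverse works because for an orthonormal matrix the transpose is the inverse, so Dᵀ·(D·A·Dᵀ)·D returns A exactly.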
Gradient fans will like the visualization (before and after):
So, what consideration determines the choice of matrix?
First and foremost, it is determined by logical prerequisites. We know what we need to get: a matrix of coefficients with a specific distribution of energy, concentrated in the upper left corner of the block.
For this purpose we need the first basis vector of the transformation matrix to sum up, and thereby amplify, the sample values. Since sample luminance values cannot be negative, the basis vector must also contain only positive values. In our case it is the vector (1, 1, 1, 1). Logical, right?
The remaining coefficients of the matrix must carry little additional information. This is achieved by varying the frequency of sign changes in the transformation matrix coefficients. We know that the original samples have very similar values; therefore, if we choose a basis vector with half of its values negative and the other half positive, the resulting linear combination of numbers with different signs and close magnitudes will be a small number.
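A quick numeric illustration of this cancellation effect (the sample values are made up, chosen only to be close to one another):

```python
# Neighboring luminance samples tend to be close in value.
a = [98, 100, 101, 103]

# A basis vector with half positive and half negative entries...
v = [1, 1, -1, -1]

# ...largely cancels them out in the linear combination:
coeff = sum(x * s for x, s in zip(a, v))
print(coeff)  # 98 + 100 - 101 - 103 = -6, small compared to the samples
```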
And the final consideration is that the remaining three basis vectors have to differ from one another to form a basis of the space. In other words, they have to be orthogonal (and therefore linearly independent). Those are the vectors (1, 1, -1, -1), (1, -1, -1, 1) and (1, -1, 1, -1). Their signs alternate at different frequencies, and they are linearly independent.
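The mutual orthogonality of all four basis vectors is easy to verify directly: every pairwise dot product is zero.

```python
basis = [
    (1,  1,  1,  1),
    (1,  1, -1, -1),
    (1, -1, -1,  1),
    (1, -1,  1, -1),
]

# Every distinct pair of basis vectors has a zero dot product.
for i in range(4):
    for j in range(i + 1, 4):
        dot = sum(x * y for x, y in zip(basis[i], basis[j]))
        print(i, j, dot)  # always 0
```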
Thus we have built a transform satisfying all the requirements listed above.
An observant reader has surely long noticed the abbreviation DCT in the headline. What is it? I’ve got no idea myself. We’ll talk about it next time.