# “How Fish Evolve into Squirrels”, now in 3D!

By dmitry.serkin (Intel), Updated

Up to now I’ve tried to put across the fundamental principles of video encoding methods “in hand-waving terms” (I’ve hardly got any more hands to wave now). I have touched on the basics of eliminating the time and space redundancy. Now I suggest we look at the 3D-DCT transform, which is such a great success, as an idea, at uniting both of the above concepts. Here is the mathematical description of the transform:

Those who have read the post on transformations – read it again! And you will see that _the main part_ the transformations works in 2D space. In particular, on rectangular 8x8 blocks.

**Note:**

As I’ve said before, there is a popular belief that larger blocks (for example, 16x16) yield better results. This may or may not be true. Ultimately, the performance depends on the frame structure (the amount of movement in it, presence or absence of static structures, etc.).

But what if I increase the number of dimensions to three? That is to say, in addition to vertical and horizontal dimensions, introduce the time domain? Here is the funny way mathematicians use to describe 2D-DCT:

Comparing the formulas, you’ll notice that in case of the 3D , you need to add one more summation cycle, with a normalizing factor and a cosine under the sum. Every point has three coordinates, which means that by introducing another dimension – the temporal one, we now have not a square, but a cube of values. Imagine a box containing eight 3.5” disks, where each disk is an 8x8 block, and together all the disks form a cube – our new atomic unit. One such box contains 8 blocks of 8x8 with the same location in 8 frames.

The connection between the three-dimensional transform with the two-dimensional one can be explained mathematically. In brief, the three-dimensional transform is subject to the so-called delta and epsilon modifications, accounting for the frame changes in time. We’ve discarded the mathematical presentation, because the local public appears to have no interest in them :).

Now, let’s look at 8 subsequent frames. There is virtually no movement. Only the fir on the squirrel’s tail stirs a little, and the reflections in the water and patches of light on the iceberg change. But there is no strong motion. In this case, it is quite reasonable to apply the three-dimensional Fourier transform.

**Note:**

This issue is never raised in the popular video encoding standards, as they all use the motion estimation unit. Still, for certain formats a considerable advantage in compression can be achieved. For example, the motion jpeg currently favored by numerous web-cameras which doesn’t involve any motion estimation mechanism at all.

Applying the 3D-DCT algorithm to this sequence of frames, we’ll get the following picture (luminance component only):

Eight frames are input into the algorithm, and working with each sample cube it generates coefficient matrices of the same dimensions. I’ve already mentioned the practicality (redundancy elimination) of these coefficients. It’s interesting to note that the first frame looks like the output of a 2D transform with strong blockiness and angular coefficients. The remaining frames have fewer blocks and contain coefficients of lower modules. Blocks appear only where the slightest motion is present. The outline of the motionless squirrel is clearly seen, all filled with zeroes. Thus, the first frame contains common information for the whole sequence, and the remaining seven frames contain the information on movement within the sequence.

The practicality of a 3D transform can be evaluated only in a fully functional codec (such as, for example, MJPEG), by experimenting with various video sequences and applying the common video encoding criteria. However, even now I can say that the algorithm considerably improves the compression level, at the price of a relatively insignificant increase in computational complexity, but at the cost of a significant increase in memory footprint (all the eight frames have to be stored in memory).

Well, now it’s certainly all I want to say about transforms. Well, almost all.