Life in Motion: Estimation Library (Part I, Served with a Clockwork Fish)

Can you admit that you enjoy watching TV series, Hollywood blockbusters, and DVD or Blu-ray bestsellers? All those gruff doctors lost in sunny southern California among desperate housewives? To some degree, perhaps, we are all addicted to television and motion pictures. We should thank civilization for its blessings: multimegabit channels, multi-terabyte drives, multi-inch monitors, and multi-core processors. Today a recording of a TV episode might fit into 500 MB of space; what a trifle! Yet just 10 years ago a system administrator might have torn a whole office apart looking for enough space to store a single file like that. Life is getting better, but, as usual, progress cannot keep up with consumption: terabytes and megabits still cost a lot.

Let’s go back to our TV show season: 0.5 GB per episode, 40 minutes long, at 30 frames per second. Uncompressed, the “pure” video behind such a DVDRip takes about 36 GB (by the way, can you validate or dispute my calculation?). That’s a lot of storage space! You’ll hardly find free disk space for the uncompressed video of even a couple of seasons; go ahead and check right now. And that’s not even counting the download time.

Fortunately, video encoding comes to the rescue. Encoders can turn 36GB of video into 0.5GB. That is, they can compress the information about 72 times.

How is it possible to do this? Let’s try to figure it out.

The principles of video data compression are based on human perception of a video sequence (the Human Visual System, or HVS). These principles are not hard to state in words:

1) Delete all “redundant” information, that is, the details of the image which aren’t ever going to be noticed anyway; and,

2) Find recurrent information in the video stream and minimize duplication.

The second item involves the so-called “statistical redundancy,” which is normally divided into spatial and temporal redundancy. Spatial redundancy covers large areas of the image with virtually identical pixels, such as the sky. Such areas are compressed using principles close to JPEG compression for photos.

The situation with temporal redundancy is somewhat more complicated. Recall that video is typically shown at about 25–30 frames per second. How close are these frames to each other? Unless there is a complete change of scene, they are usually very close. Therefore, there’s no point in coding each frame from scratch. To minimize the volume of data stored (transmitted, processed, and so on), block-based video encoding standards, MPEG-4 in particular, code only the residual difference between a block and its best match in a reference frame, with periodic keyframes coded independently.

From the point of view of encoder internals, there are two main functional units. The first is the Discrete Cosine Transform (DCT), which removes the spatial redundancy of a static image. The second is Motion Estimation (ME), which, as described above, reduces the temporal redundancy between subsequent video frames. When the encoder gets video input, motion estimation and motion compensation remove the information that is identical between two adjacent frames. Then the DCT and quantization minimize redundant information within the frame. The quantized transform coefficients, together with the motion vectors, are fed into the entropy encoder for final compression. The better the motion estimation, the less work for the DCT. That is to say, better motion estimation facilitates better video compression.

It looks simple, doesn’t it? However, there are some hidden barriers and reefs, which we will now try to deal with. Before the deep dive, can you guess how much redundant information is in this video (a cool movie, incidentally)?

Let’s continue our theoretical digression. You can use various methods to do motion estimation. The so-called pixel-recursive methods allow calculation of motion vectors for each separate pixel. It’s also worth mentioning the Phase Plane Correlation technique, which generates motion vectors based on correlation of the current frame with the reference one. Yet the most popular algorithm is the Block Matching Algorithm (BMA).

The BMA calculates the motion vector not for separate pixels but for pixel blocks. Objects in a video are in fact groups of pixels; therefore the pixels within a block are likely to share the same motion vector. This considerably reduces the amount of calculation and yields more robust results.

An illustration of the BMA is shown in the picture above. The current frame is split into pixel blocks, and we estimate the motion of each block individually. The idea is to find, in the reference frame, the block most similar to the given block, and then to compute the motion vector: the shift of the block with respect to its “original” location.

Thus, a motion vector represents the horizontal and vertical shift of the current pixel block, d = (dx, dy).

Block matching then amounts to minimizing the Displaced Frame Difference (DFD):

    DFD(x, d) = I(x, n) - I(x + d, n + 1)

In this equation I(x, n) is the intensity of the pixel at position x in frame n, and d is the shift (motion vector) with respect to the initial position in frame n + 1. Various pixel block matching criteria can be used to minimize this function over a block B.

The two most popular are:

Sum of Squared Errors (SSE):

    SSE(d) = sum over x in B of (I(x, n) - I(x + d, n + 1))^2

Sum of Absolute Differences (SAD):

    SAD(d) = sum over x in B of |I(x, n) - I(x + d, n + 1)|

The SSE function gives more accurate matches but is more computationally intensive. The SAD function is slightly less accurate at matching but requires much less computation, so it is the one most often used in BMA implementations. Other matching criteria include cross-correlation and the maximum number of matching pixels.

The candidate reference blocks are drawn from the so-called search area, which bounds the motion vectors and limits the number of candidates to evaluate. The height and width of the search area depend on many factors, such as motion speed and acceptable computational complexity: a larger search area means more candidate blocks and therefore more computation. The general practice is to choose an area that is wider than it is tall, since horizontal motion occurs more frequently in video. The search area may also be adapted to the type of motion.

The Full Search (FS) algorithm is a BMA-type algorithm that evaluates every possible pixel block within the search area and finds the best possible motion vector. It yields the smallest residual difference for video compression at the cost of extra computational workload. It is the optimal algorithm for identifying matching blocks and, luckily, it maps very well onto multithreaded GPU architectures.

Now it is time to get down to practice! We are going to develop a motion estimation library that is convenient to use and takes into account the specifics of various video encoding standards. We need an interface that allows multiple implementations of the algorithms based on different technologies, such as software-based and GPU-based implementations.

Well, that was complicated! Before we jump into the code, let me ask: when you look at the video above, can you see the handle knob? What do you think about its behavior from the motion estimation perspective? Will the motion vectors trace the handle knob movement trajectory, or do you think it is still too complicated for modern algorithms?

The library interface should take into account various aspects of the algorithms and the data. We know that modern video encoding standards are block-based: each frame is divided into macroblocks, which may in turn be split into smaller power-of-two sub-blocks. We also know that there are several types of motion estimation algorithms, and their implementations differ considerably. Each standard defines its own method for interpolating reference frames at non-integer (sub-pixel) positions. The input consists of a source frame for motion estimation and the reference frames against which it will be estimated. The expected output consists of motion vectors with their associated SAD values.

We are going to create an interface that takes all these details into account. Please note that we will use the Intel® Integrated Performance Primitives library, because it contains many useful and very fast functions, in particular interpolation and SAD calculation functions.

#if !defined(__HARDWARE_ME_H)
#define __HARDWARE_ME_H

/* this implementation uses IPP defined types and structs */

#if defined(__cplusplus)
extern "C" {
#endif /* defined(__cplusplus) */

/* Declare the types used in the accelerated ME object */
typedef struct _HW_ME_HANDLE *hw_me_handle_t;



typedef enum eHWMECodecStandard
{
    HW_ME_MPEG2 = 0,
    HW_ME_MPEG4 = 1,
    HW_ME_VC1 = 2,
    HW_ME_H264 = 3

} eHWMECodecStandard;

typedef enum eHWMEAlgorithm
{
    /* enumerator values were not preserved in the original listing;
       a placeholder keeps the header compilable */
    HW_ME_ALG_UNKNOWN = 0

} eHWMEAlgorithm;

typedef enum eHWMEType
{
    /* enumerator values were not preserved in the original listing */
    HW_ME_TYPE_UNKNOWN = 0

} eHWMEType;

typedef enum eHWMEImageStructure
{
    /* enumerator values were not preserved in the original listing */
    HW_ME_IMG_STRUCT_UNKNOWN = 0

} eHWMEImageStructure;

typedef enum eHWMEDivType
{
    HW_ME_DIV_16X16 = 0,
    HW_ME_DIV_16X8 = 1,
    HW_ME_DIV_8X16 = 2,
    HW_ME_DIV_8X8 = 3,
    HW_ME_DIV_8X4 = 4,
    HW_ME_DIV_4X8 = 5,
    HW_ME_DIV_4X4 = 6

} eHWMEDivType;

typedef enum eHWMEPredictionType
{
    /* enumerator values were not preserved in the original listing */
    HW_ME_PRED_UNKNOWN = 0

} eHWMEPredictionType;

typedef struct HW_ME_RANGE
{
    Ipp32s left;
    Ipp32s right;
    Ipp32s top;
    Ipp32s bottom;

} HW_ME_RANGE;

typedef struct HW_ME_PARAMETERS
{
    Ipp32u structSize; /* (Ipp32u) size of the current struct */
    IppiSize imageSize; /* (IppiSize) size of processed images */
    Ipp32u maxNumRef; /* (Ipp32u) maximum number of available references */
    IppiPoint maxVector; /* (IppiPoint) maximum allowed vector */
    eHWMEDivType minSubblockSize; /* (eHWMEDivType) minimal subblock size */
    IppiPoint maxSubBlockVectorDif; /* (IppiPoint) maximum vector difference between subblocks of a macroblock */
    eHWMECodecStandard codecStandard; /* (eHWMECodecStandard) interpixel interpolation standard */
    eHWMEType meType; /* (eHWMEType) type of interpolation */
    eHWMEAlgorithm algType; /* (eHWMEAlgorithm) algorithm type */

    IppBool strictWithinFrame; /* (IppBool) whether vectors are confined strictly within the frame */

} HW_ME_PARAMETERS;

typedef struct HW_ME_IMAGE
{
    eHWMEImageStructure imageStructure; /* (eHWMEImageStructure) current image structure */
    Ipp8u *pPlanes[3]; /* (Ipp8u *[]) array of pointers to the image's planes */
    Ipp32s imageStep; /* (Ipp32s) image step (distance in bytes between rows) */

} HW_ME_IMAGE;

typedef struct HW_ME_MB_INFO
{
    eHWMEDivType mbDiv; /* (eHWMEDivType) macroblock fragmentation */
    eHWMEDivType subblockDiv[4]; /* (eHWMEDivType) sub-macroblock fragmentation */
    eHWMEPredictionType predType[4]; /* (eHWMEPredictionType) macroblock prediction types */
    Ipp32s minSAD; /* (Ipp32s) minimal found SAD for the macroblock */
    Ipp32u ref[2][4]; /* (Ipp32u [][]) reference list for the macroblock */
    IppiPoint mv[2][16]; /* (IppiPoint [][]) motion vector list */

} HW_ME_MB_INFO;

typedef struct HW_ME_INFO
{
    HW_ME_MB_INFO *pMbInfo; /* (HW_ME_MB_INFO *) pointer to macroblock info store */

    Ipp32u numFrameRefs[2]; /* (Ipp32u) number of frame references per direction */
    Ipp32u refFrameList[2][16]; /* (Ipp32u [][]) frame reference list for each direction */

    Ipp32u numTopFieldRefs[2]; /* (Ipp32u) number of top-field references per direction */
    Ipp32u refTopFieldList[2][32]; /* (Ipp32u [][]) top-field reference list for each direction */

    Ipp32u numBottomFieldRefs[2]; /* (Ipp32u) number of bottom-field references per direction */
    Ipp32u refBottomFieldList[2][32]; /* (Ipp32u [][]) bottom-field reference list for each direction */

} HW_ME_INFO;

/* Initialize the accelerated ME object */
IppStatus InitializeHWME(const HW_ME_PARAMETERS *pHWMEParams, hw_me_handle_t *pHandle);

/* Stop all ME operation and release the accelerated ME object */
IppStatus ReleaseHWME(hw_me_handle_t handle);

/* Add a reference to the accelerated ME object */
IppStatus CopyHWMEReference(hw_me_handle_t handle, const Ipp32u refIdx, const HW_ME_IMAGE *pImage);

/* Add a source image to the accelerated ME object */
IppStatus CopyHWMESource(hw_me_handle_t handle, const HW_ME_IMAGE *pImage);

/* Do an accelerated ME; pass NULL as ppEvent for synchronous operation */
IppStatus DoHWME(hw_me_handle_t handle, HW_ME_INFO *pInfo, void **ppEvent);
IppStatus WaitHWMEDone(void *pEvent);

#if defined(__cplusplus)
} /* extern "C" */
#endif /* defined(__cplusplus) */

#endif /* __HARDWARE_ME_H */


To the question posed in your post: the answer depends on the size of the block used for template matching. At the pixel level, a Kalman filtering technique could be used to predict pixel movement, since the motion is constant and the knob undergoes a rigid transformation. With the methods presented here, the knob can be traced. I think B-frames could also be used to predict the motion. Just a thought.
