By Alex A. Lopez-Estrada
Applications Engineer, Intel Software and Solutions Group
Today's processor computing power makes real-time video encoding and decoding on general-purpose processors a reality. Still, achieving real-time encoding/decoding with minimal CPU utilization remains a challenge for system architects, since most of the processor's cycles should be reserved for richer and more engaging user experiences. Currently, performance of close to 2X real time, or 50% CPU utilization, is achievable. Thus, due to its computational and memory characteristics, video encoding is still considered a "killer application" from a processor micro-architecture design perspective.
This paper presents an analysis of the memory access patterns of typical motion estimation algorithms in order to better understand processor micro-architecture issues that affect the overall performance of video compression algorithms.
A Typical Video Encoder
Figure 1 presents a high level picture of a video encoder, applicable to MPEG-2, MPEG-4 or H.264 standards. As digital “raw” video arrives at the encoding pipeline, a series of transforms are applied with the purpose of removing redundant information (lossless) or irrelevant information (lossy) exploiting characteristics of the human visual system (HVS). The idea is to achieve bit rate reduction without sacrificing the quality of the video signal.
Figure 1. Generic MPEG source coder
One important aspect of a video encoder is the removal of temporal redundancy, which is performed by the Motion Estimation (ME) block in the encoder. The objective is to represent each video frame as a difference in pixel values from a reference frame. The reference can be one or more past or future frames in the video sequence. Frames, or frame portions, that are not predicted from a reference are designated INTRA; frames, or frame portions, predicted from a reference are known as INTER. The choice between INTRA and INTER encoding is based on the temporal predictability of the frame (or frame portion) with respect to the reference.
Typical motion estimation algorithms account for a substantial portion of an application's CPU cycles; depending on the complexity of the algorithm, ME has been known to account for as much as 60% of total CPU cycles. Therefore, this paper targets the analysis of motion estimation algorithms on typical processor micro-architectures with on-chip memory, such as IA-32 architectures.
Motion Estimation Algorithms
A wide variety of ME algorithms exists, offering tradeoffs between encoding speed and video quality. In the ME stage, current and reference frames are broken into macro-blocks (MB), each typically of 16x16 pixels. A search is performed to locate the macro-block in the reference frame that satisfies some minimum error criterion with respect to the current frame macro-block. This is commonly known as block matching. A popular error criterion used in ME is the Sum of Absolute Differences, or SAD, defined as:

SAD = Σ(i=0..15) Σ(j=0..15) |x_ij − y_ij|

where x is the current frame macro-block, y is the reference frame macro-block, and the subscript ij denotes row i, column j of the macro-block. Processor instructions have been introduced to compute a SAD optimally, with the purpose of improving ME performance. In SSE the instruction is called PSADBW, and it computes one row of the SAD.
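As a concrete illustration, here is a scalar C sketch of the 16x16 SAD (the function name and signature are my own, not from a particular codec; PSADBW performs the per-row absolute-difference accumulation of this loop in a single instruction):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of Absolute Differences between a 16x16 current-frame macro-block x
   and a reference-frame macro-block y; stride is the frame width in bytes.
   PSADBW computes the absolute-difference sum of one 16-byte row in a
   single instruction; this scalar loop shows the equivalent arithmetic. */
static unsigned sad_16x16(const uint8_t *x, const uint8_t *y, size_t stride)
{
    unsigned sad = 0;
    for (int i = 0; i < 16; i++)        /* row i */
        for (int j = 0; j < 16; j++)    /* column j */
            sad += (unsigned)abs((int)x[i * stride + j] - (int)y[i * stride + j]);
    return sad;
}
```

For 8-bit pixels the result of a 16x16 SAD is at most 255 x 256, so it fits comfortably in 32 bits.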
Once the best matching macro-block is located, a motion vector is encoded per macro-block, corresponding to the displacement, in pixel coordinates (x, y), from the current frame macro-block; together these vectors form the motion vector field (MVF). The efficient search for the macro-block corresponding to the minimum SAD is the subject of on-going research in the area of video encoding. We explore three popular methods for performing the motion estimation stage: full search, diamond search and PMVFAST.
Full Search Block Matching
In Full Search (FS) ME, all possible overlapping locations (in a sliding fashion) around the co-located macro-block in the reference frame (see Figure 2) are evaluated, and the position that results in the minimum SAD is used as the predictor. The search support area in the reference frame is denoted by X_SEARCH and Y_SEARCH. Therefore, the number of SADs to be computed grows proportionally with the search support area, i.e. N = X_SEARCH x Y_SEARCH. This method is the most computationally expensive but provides the best coverage of the search area. However, this coverage depends on the selection of X_SEARCH and Y_SEARCH, and the algorithm may fail to find the optimum reference macro-block if it falls outside the search support.
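A minimal full-search sketch in C follows, under the assumption that the caller keeps the whole search window inside the frame (names such as full_search and the range parameter are illustrative, not from the paper):

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* 16x16 Sum of Absolute Differences; stride is the frame width in bytes. */
static unsigned sad_16x16(const uint8_t *x, const uint8_t *y, size_t stride)
{
    unsigned sad = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            sad += (unsigned)abs((int)x[i * stride + j] - (int)y[i * stride + j]);
    return sad;
}

/* Exhaustive block matching: evaluate every displacement in a
   (2*range+1)^2 window around the co-located macro-block. cur points at
   the current block, ref at its co-located position in the reference
   frame; the caller must keep the window inside the frame. */
static unsigned full_search(const uint8_t *cur, const uint8_t *ref,
                            size_t stride, int range, int *mvx, int *mvy)
{
    unsigned best = UINT_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            unsigned s = sad_16x16(cur, ref + dy * (long)stride + dx, stride);
            if (s < best) { best = s; *mvx = dx; *mvy = dy; }
        }
    return best;
}
```

Note that the number of SAD evaluations is (2*range+1)^2 per macro-block, which is exactly why FS is the most expensive of the three methods.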
Figure 2. Full-Search Motion Estimation
The Diamond Search (DS) algorithm was first proposed by Shan Zhu and Kai-Kuang Ma, and it is based on the study of motion vector field distributions over a large population of video sequences. The diamond pattern presented in Figure 3 is derived from the probability distribution function: its points correspond to the highest probability of finding the matching block in the reference frame. The algorithm starts at the co-located macro-block in the reference frame and performs eight additional SADs around the diamond center. This contrasts with FS, which computes all possible overlapping SADs. Once the minimum-SAD location is found, the diamond center is displaced to that location and a new diamond search is executed. The new search requires fewer SAD computations, since some points were already evaluated. The search stops once the minimum SAD is located at the center of the diamond. On average the best matching block is found within a few diamond iterations, thus requiring fewer SADs per macro-block than FS. The algorithm offers the advantage of extending the search support area, allowing more reference frame coverage with fewer computations. However, due to the sparse nature of the diamond, an optimum matched block near the center may be missed. Various algorithms based on DS have been proposed to do a hierarchical search, so that once the best matching macro-block is found, an inner diamond search is performed to cover points interior to the diamond.
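The large-diamond iteration can be sketched as follows. This is a simplified version (names are illustrative) that omits the inner-diamond refinement mentioned above; ties are resolved in favor of the center so the loop is guaranteed to terminate:

```c
#include <stdint.h>
#include <stdlib.h>

/* 16x16 Sum of Absolute Differences; stride is the frame width in bytes. */
static unsigned sad_16x16(const uint8_t *x, const uint8_t *y, size_t stride)
{
    unsigned sad = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            sad += (unsigned)abs((int)x[i * stride + j] - (int)y[i * stride + j]);
    return sad;
}

/* Large-diamond iteration: evaluate the 9-point diamond, move its center
   to the minimum-SAD point, and repeat until the minimum stays at the
   center. The caller must keep every evaluated position inside the frame. */
static void diamond_search(const uint8_t *cur, const uint8_t *ref,
                           size_t stride, int *mvx, int *mvy)
{
    /* The 8 points of the large diamond around the center. */
    static const int dx[8] = {  0, -1,  1, -2, 2, -1, 1, 0 };
    static const int dy[8] = { -2, -1, -1,  0, 0,  1, 1, 2 };
    int cx = 0, cy = 0;
    for (;;) {
        /* Evaluate the center first so it wins ties. */
        unsigned best = sad_16x16(cur, ref + (long)cy * (long)stride + cx, stride);
        int bx = cx, by = cy;
        for (int p = 0; p < 8; p++) {
            int px = cx + dx[p], py = cy + dy[p];
            unsigned s = sad_16x16(cur, ref + (long)py * (long)stride + px, stride);
            if (s < best) { best = s; bx = px; by = py; }
        }
        if (bx == cx && by == cy)   /* minimum stayed at the center: stop */
            break;
        cx = bx; cy = by;           /* displace the diamond and repeat */
    }
    *mvx = cx; *mvy = cy;
}
```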
Figure 3. Diamond Search Motion Estimation Algorithm
Predictive algorithms use neighboring blocks, and blocks in adjacent frames with already-computed motion vectors, to estimate the initial position of the search. Usually these algorithms use the prediction result for the first search, and consecutive searches follow a prescribed pattern, e.g. a diamond search pattern. A popular predictive algorithm is PMVFAST, for Predictive Motion Vector Field Adaptive Search Technique. Figure 4 presents an instance of PMVFAST, using a diamond search pattern as the underlying technique after the first search. The algorithm uses the motion vectors estimated from the following 4 macro-blocks: the one directly at the top (Top MV), the one to the left (Left MV), the one to the upper right (Top-Right MV) and the co-located macro-block in the previous frame (MV from frame n-1). Additionally, a median of these four motion vectors is computed to provide the fifth position. A SAD is computed at each position, and the minimum-SAD position is taken as the center for the next search. In Figure 4, say the Top-Right MV produces the minimum SAD; the bottom diagram then shows the position of the diamond with respect to search 1. A regular diamond search is then executed. The general goal of this algorithm is faster convergence than regular DS.
Figure 4. Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) using a Diamond Search pattern.
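The median predictor can be computed component-wise. The variant described above medians four candidates; the sketch below uses the common MPEG-4/H.263-style median of the three spatial candidates (a simplifying assumption on my part, not the paper's exact formulation):

```c
/* Component-wise median of three motion-vector candidates, in the style
   of MPEG-4/H.263 motion-vector prediction. PMVFAST-style algorithms use
   such a median as one of the initial search positions. */
typedef struct { int x, y; } mv_t;

static int median3(int a, int b, int c)
{
    int lo = a < b ? a : b;   /* smaller of a, b */
    int hi = a < b ? b : a;   /* larger of a, b  */
    return c < lo ? lo : (c > hi ? hi : c);  /* clamp c between them */
}

static mv_t median_mv(mv_t left, mv_t top, mv_t topright)
{
    mv_t m = { median3(left.x, top.x, topright.x),
               median3(left.y, top.y, topright.y) };
    return m;
}
```

The component-wise median is robust to a single outlier predictor, which is why it is favored over a plain average in MV prediction.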
Motion estimation algorithms as presented here are computationally and memory intensive. In many implementations these algorithms are known to be load-bound, as multiple frame positions are explored, requiring high bandwidth between the execution units and the cache. Therefore, in the next section I describe the general operation of the caches in a typical IA processor, in an attempt to map performance issues to the micro-architecture.
Figure 5. L1 Cache Operation
Intel® Architecture Caches
The Intel® micro-architecture can support up to three levels of on-chip cache. Only two levels of on-chip caches are implemented in the Pentium® 4 processor. The following table provides the parameters for all cache levels for a Northwood processor with Hyper-Threading Technology.
Table 1. Pentium® 4 Processor Cache Parameters
1st Level Cache Operation
The 8 KB first level cache is arranged into four ways, each 2 KB in size with 32 lines of 64 bytes each (see Figure 5). A tag per cache line per way is also maintained in L1 cache. This tag contains the state of the cache line and a page tag that indicates to which page directory the cache line belongs.
Two problems arise from this L1 cache operation, both examined in the sections that follow: cache line splits and cache thrashing due to limited associativity.
Motion Estimation Memory Access Model
As mentioned earlier, the basic building block in ME is the SAD computation. For a 16x16 block, 16 rows of 16 bytes from both the current and reference frames need to be loaded from 1st level cache to be able to do the computation. For each row in a macro-block there are 2 ALU computations: 1 PSADBW and 1 PADD. Therefore, there are a total of 2 loads (assuming no cache line split or unaligned access penalties) and 2 ALU’s per row. This becomes readily unbalanced if the latency due to 1st level cache access is large. This latency is difficult to mask as all operations depend on the data retrieved. With today’s architectures the performance of the operation is dependent on 1st level cache latency (load bound).
Furthermore, by analyzing the operations of a motion estimation algorithm one can make the following observations:
- While loads from the current frame can be guaranteed to be 16-byte aligned, the positions in the reference frame will be un-aligned more than 99% of the time in a full search algorithm, and 67% of the time in a diamond search algorithm (assuming the first diamond only). This adds further load pressure on the 1st level cache.
- Depending on the width of the frame (the frame stride), each row loaded from memory may incur a cache line split (see the Cache Line Splits section below). Cache line splits can incur larger load latencies, as they require loading from two cache lines and merging the results.
- Again, depending on the width of the frame, cache thrashing may occur as a result of cache-associativity issues. While common video frame widths do not exhibit this problem (see the Cache Associativity section below), it is still worth considering.
- Address aliasing conflicts between the current and reference frames are inevitable and non-deterministic. These incur large penalties on Pentium 4 processors for 64 KB strides; on Prescott, aliasing conflicts are reduced because they occur only at 4 MB strides.
Cache Line Splits
Depending on the width of a frame, the macro-block rows accessed during SAD computations may incur cache line splits. Assume a 64-byte cache line and a 16x16-byte macro-block. Since 64 is a multiple of 16 (four side-by-side macro-block rows fit in one cache line), as long as the width of the frame is a multiple of 16 and the first byte of the macro-block is 64-byte aligned, each row is guaranteed to fall within a single cache line, avoiding the cache split penalty. This holds for typical frame sizes (standard definition is 720x480 and HD uses 1280x720, both widths being multiples of 16 bytes). However, this aligned access can only be guaranteed for the current frame. Say the current frame macro-block is 16-byte aligned; only the access to the co-located macro-block in the reference frame shares this alignment. Once the block matching function moves around the search support area, the macro-block accesses may incur cache line splits.
As an example, take the standard definition width of 720. Each stride spans exactly 11.25 cache lines. This means that if the first row of a macro-block starts at cache line 0, the second row starts at cache line 11.25, the third at 22.5, and so on; every fourth row is cache-aligned. In terms of byte offsets within the cache line, the second row has an offset of 16 bytes, the third 32, the fourth 48, and the fifth is cache-aligned again (offset 0). Table 2 presents the access pattern for the macro-block in the current frame and the co-located macro-block in the reference frame. Note that all the accesses fall within one cache line; thus, no cache-line splits occur.
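The offset cycle described above can be verified with a few lines of C (row_offset is an illustrative helper, not code from the paper):

```c
/* Byte offset, within its 64-byte cache line, of the first pixel of
   macro-block row `row`, assuming the frame starts cache-aligned. */
static int row_offset(int row, int width)
{
    return (row * width) % 64;
}
```

For width 720 the offsets cycle 0, 16, 32, 48, so a 16-byte row load of the aligned current macro-block never crosses a cache-line boundary.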
Table 2. Aligned SAD accesses in the current frame
Now consider accesses to the reference frame. For diamond search, label points 1-9 as shown in Figure 6, where point 5 corresponds to the co-located macro-block in the reference frame. Points 1, 5 and 9 share the cache alignment of the current frame macro-block, which is an offset of 0. Points 2 and 7 are displaced by -1 byte; points 3 and 8 by +1 byte; point 4 by -2 bytes; and point 6 by +2 bytes. We can now extend Table 2 to show the access patterns for a diamond search on the reference frame; Table 3 summarizes this. The shaded cells indicate patterns that incur cache line splits. Thus, in a best-case scenario, 24 out of 160 memory accesses (9 points x 16 reference rows + 16 current rows), or 15% of the loads, incur a cache line split. If more than one diamond search is required, this statistic increases.
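The 24-out-of-160 figure can be reproduced by enumerating the horizontal displacements of the nine diamond points (a sketch; the vertical displacement of a point only rotates the four-offset cycle, so it does not change the count):

```c
/* Count, out of the 160 16-byte loads for one diamond pattern at frame
   width 720 (9 reference points x 16 rows + 16 current-frame rows), how
   many cross a 64-byte cache line. The current block is 16-byte aligned,
   so only the horizontal displacement of each diamond point matters. */
static int count_line_splits(void)
{
    /* Horizontal displacement of diamond points 1..9 (point 5 = center). */
    static const int dx[9] = { 0, -1, 1, -2, 0, 2, -1, 1, 0 };
    int splits = 0;
    for (int p = 0; p < 9; p++)
        for (int row = 0; row < 16; row++) {
            int start = ((row * 720) % 64 + dx[p] + 64) % 64;
            if (start + 16 > 64)    /* the load spans two cache lines */
                splits++;
        }
    return splits;  /* the 16 aligned current-frame rows never split */
}
```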
Table 3. Alignment of SAD accesses for one diamond pattern
Figure 6. Diamond points
Cache Associativity
With the cache architecture described in the previous section, cache associativity may become an issue if the frame width is a power of 2. Take for example a frame width of 1024. There are 32 possible cache sets in 1st level cache. Assuming cache-aligned frames, each new row in a macro-block is 1024 bytes apart from the previous one, which means that the mapping of cache lines to 1st level cache is as presented in Table 5.
As can be observed, due to the lack of ways, row 8 will evict row 0 from 1st level cache. This means that on every access to the current macro-block there will be an added penalty, since the data will come from 2nd level cache. Although for typical video frame widths this will not be a problem, it would not be surprising to encounter implementations where frames are padded to powers of 2 to allow efficient computation of other transforms, e.g. FFTs or other power-of-2 transforms.
In general, the cache set can be computed as:

Set(Row) = floor((Row x Width) / CLS) mod Num_Sets

where floor(x) is the closest integer that is less than or equal to x, Width is the frame width, CLS is the cache line size, Num_Ways is the number of ways, Num_Sets is the number of sets, equal to (cache size) / (Num_Ways x CLS), and Row is the row being indexed; the way within the set is chosen by the replacement policy. Using the above equation and a cache size of 8 KB, any power-of-2 frame width greater than or equal to 1024 will incur cache thrashing within one macro-block.
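The set mapping can be sketched directly from the equation, with constants for the 8 KB, 4-way, 64-byte-line L1 described earlier (cache_set is an illustrative helper):

```c
/* 1st level cache set index of macro-block row `row`, for a cache-aligned
   frame: 8 KB cache, 4 ways, 64-byte lines => 8192 / (4 * 64) = 32 sets. */
static int cache_set(int row, int width)
{
    int num_sets = 8192 / (4 * 64);
    return (row * width / 64) % num_sets;
}
```

For width 1024 every macro-block row lands in set 0 or set 16, so eight of the sixteen rows compete for the four ways of set 0 and thrash; a width of 720 spreads the rows across the sets.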
Table 5. Cache utilization on frame width of size 1024
One solution can be implemented as a software workaround, with significant speed-ups in motion estimation performance. The same solution can be implemented in hardware with even better expected performance. The method is referred to as Search Support Copy (SSC) and is described in detail below.
Search Support Copy – Software Solution
In the proposed method, intermediate software buffers are used to store a copy of one or more of the current and reference frame macro-blocks. The copy to the buffers occurs prior to executing the Motion Estimation step (see Figure 7). Using the intermediate buffers has several advantages: it alleviates the cache-line split penalty and the cache thrashing issues discussed above, since the penalties are incurred only once, during the copy. The method also eliminates the occurrence of aliasing conflicts, as the buffers can be guaranteed to be allocated in areas of memory that avoid the address aliasing strides.
Figure 7. High level overview of Search Support Copy
Figure 8. Search Support buffer size. The diagram shows the center diamond and 4 edge diamonds. Thus, we require rows -6 through 6+16 and columns -6 through 6+16, i.e. 28 rows and 28 columns. We use a 28-row x 32-column buffer to ensure 16-byte alignment.
Using this guidance for the selection of the buffers, a prototype implementation was made. The current frame buffer was selected to be of size 32x16 (2 side-by-side macro-blocks), as copying more than two blocks at once did not show any benefit. The reference frame buffer was selected to benefit the Diamond Search algorithm. In this case we assume that most of the time the algorithm will converge within two diamond iterations; therefore we allow movement of the diamond twice in all directions. This translates into a buffer of 28x28, but to ensure 16-byte alignment we copy an area of 32 columns x 28 rows (Figure 8). Pointers are updated to keep track of the diamond movement within the buffer. Only when a diamond moves outside the buffer is a new copy made into the reference buffer.
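A sketch of the reference-buffer copy follows. The exact left offset of the copied region is my own illustrative choice (8 bytes rather than 6, so the 28 needed columns fit in a 32-byte-wide, alignment-friendly window); in practice the destination buffer would be allocated 16-byte aligned, e.g. with aligned_alloc:

```c
#include <stdint.h>
#include <string.h>

enum { SSC_ROWS = 28, SSC_STRIDE = 32 };

/* Copy the 28-row x 32-byte search-support region surrounding the
   co-located reference macro-block into a small contiguous buffer.
   `ref` points at the co-located block; the copied region starts 6 rows
   above it and 8 bytes to its left. Any cache-line split or thrashing
   penalty is paid once here, instead of on every SAD evaluation. */
static void ssc_copy(uint8_t *buf, const uint8_t *ref, long stride)
{
    const uint8_t *src = ref - 6 * stride - 8;
    for (int r = 0; r < SSC_ROWS; r++)
        memcpy(buf + r * SSC_STRIDE, src + r * stride, SSC_STRIDE);
}
```

After the copy, the SAD loop addresses the buffer with the small SSC_STRIDE instead of the frame stride, which is what removes the associativity and aliasing problems.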
Streaming Buffers – Hardware Solution
The general idea of intermediate buffers can be extended to hardware. Small streaming buffers, or addressable caches consisting of a few consecutive cache lines, would accomplish the same goals. If such a mode were enabled, these buffers could be addressed by explicit registers, and copies to the registers would implicitly populate the buffers. This would avoid the overhead of the software copy, which may otherwise overshadow the performance benefits, especially in high-motion sequences.
Other Micro-architecture Considerations
Aside from the SSC solution, various other micro-architecture modifications would yield more generic and transparent improvements: improving cache line split performance, improving un-aligned access performance, improving latency to 1st level cache, and adding multiple load ports. These modifications would be transparent to the programmer and would benefit all kinds of software. Other micro-architecture capacity improvements, such as increasing cache size or the number of ways, are less likely to improve motion estimation performance.
The method described herein was implemented using a DS ME kernel. Data was collected on Northwood and Prescott using 4 video sequences ranging from low to high motion and at different resolutions (Standard Definition and High Definition).
Figure 9 and Figure 10 depict the results. They illustrate two options described in this paper: copying just the current frame macro-block (labeled Current Only in the figures) and copying both current and reference frame macro-blocks. On average, Northwood shows a 29% cycle improvement, with up to 63% in low-motion sequences (sequence 1). Prescott shows an average improvement of 11%, and up to 40% on low-motion sequences (sequence 1). In some cases the improvement is minimal, and in others the "Current Frame Copy Only" version works better. A key observation that explains the relatively larger improvements on Northwood compared to Prescott is the method's ability to reduce 64 KB aliasing conflicts, which are not present in Prescott.
This paper was intended to examine the memory access characteristics of typical motion estimation algorithms. The diamond search algorithm was used as an example, and it was shown how 15% of the SAD loads can incur cache line splits, resulting in substantial performance degradation. Also, full search algorithms incur un-aligned accesses in the reference frame 99% of the time, while diamond search can incur more than 67% un-aligned accesses.
A software mechanism was presented that overcomes some of these micro-architecture issues and can result in significant speed-ups. However, more generic micro-architecture improvements, such as reducing cache-line split and un-aligned access latencies, are preferred.
Figure 9. Results of the disclosed method gathered on a Pentium® 4 Northwood processor.
Figure 10. Results of the disclosed method gathered on a Pentium® 4 Prescott processor.
 Abel, J. et al. Applications Tuning for Streaming SIMD Extensions. Intel Technology Journal, 1999.
 Zhu, S. et al. A New Diamond Search algorithm for Fast Block Matching Motion Estimation. ICICS, IEEE, Sept. 1997.
 Tourapis, A. et al. Predictive Motion Vector Field Adaptive Search Technique (PMVFAST). Dept. of Electrical Engineering, The Hong Kong University of Science and Technology.
About the Author
Alex A. Lopez-Estrada is a Sr. Software Engineer in Intel's Consumer Electronics Group. He holds an MS in EE with a concentration in DSP algorithm development. Before joining Intel, Alex worked for the Eastman Kodak research labs. At Intel he has developed A/V codecs and media infrastructure components and worked on performance optimization and architecture analysis. He holds 2 patents, has 9 patents pending, and has published several papers in the field of imaging, audio and video technologies.