Multi-core Intermediate


Introduction

The emphasis of this section is to provide the details you need to become proficient in threading and parallelization. You will find code samples and labs to work through. You are expected to have programming experience in a high-level language, preferably C, C++, or Fortran, though many of the recommendations also apply to Java, C#, and Perl.

Objective: The user will have a strong understanding of parallelization and be able to start the threading process of an application.


Multi-threaded Functions in the Intel® Math Kernel Library

A number of key routines within the Intel® Math Kernel Library (MKL) have been threaded to provide increased performance on systems with multiple processors in a shared-memory environment. We will show that this library gives the user an easy way to get high performance on key algorithms, both on single-processor and on multiprocessor systems. The user need only tell the system how many processors to use.

Background

A great deal of scientific code can be parallelized, but not all of it will run faster on multiple processors on an SMP system because there is inadequate memory bandwidth to support the operations. Fortunately, important elements of technical computation in finance, engineering and science rely on arithmetic operations that can effectively use cache, which reduces the demands on the memory system. The basic condition that must be met in order for multiple processors to be effectively used on a task is that the reuse of data in cache must be high enough to free the memory bus for the other processors. Operations such as factorization of dense matrices and matrix multiplication (a key element in factorization) can meet this condition if the operations are structured properly.

It may be possible to get a substantial percentage of peak performance on a processor simply by compiling the code, possibly along with some high-level code optimizations. However, if the resulting code relies heavily on memory bandwidth, then it probably will not scale well when the code is parallelized because there will be inadequate cache usage, and with that, inadequate memory bandwidth to supply all the processors.

Widely used functions such as the level-3 BLAS (basic linear algebra subroutines; all matrix-matrix operations), many of the LAPACK (linear algebra package) functions, and, to a lesser degree, DFTs (discrete Fourier transforms) can reuse data in cache sufficiently that multiple processors can be supported on the memory bus.

Advice

There are really two parts to the advice. First, wherever possible the user should employ the widely used, de facto standard functions from BLAS and LAPACK since these are available in source code form (the user can build them) and many hardware vendors supply optimized versions of these functions for their machines. Just linking to the high performance library may improve the performance of an application substantially, depending on the degree to which the application depends on LAPACK, and by implication, the BLAS (since LAPACK is built on the BLAS).

MKL is Intel’s library containing these functions. The level-3 BLAS have been tuned for high performance on a single processor but have also been threaded to run on multiple processors and to give good scaling when more than one processor is used. Key functions of LAPACK have also been threaded. Good performance on multiple processors is possible just with the threaded BLAS but threading LAPACK improves performance for smaller-sized problems. The LINPACK benchmark, which solves a set of equations, demonstrates well the kind of scaling that threading of these functions can yield. This benchmark employs two high-level functions from LAPACK – a factorization and a solving routine. Most of the time is spent in the factorization. For the largest test problem, MKL achieved a 3.84 speedup on four processors, or 96% parallel efficiency.
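As a minimal sketch (assuming the C interface declared in MKL's mkl_cblas.h header; the Fortran interface is equivalent), a call to the threaded level-3 BLAS looks no different from a serial call. With the MKL releases discussed here, the number of threads is simply taken from the OMP_NUM_THREADS environment variable at run time:

#include <mkl_cblas.h>   /* C interface to the BLAS supplied with MKL (assumed header name) */

/* Compute C = A*B for n-by-n row-major matrices using the threaded dgemm.
   The calling code does not change when more processors are used. */
void matrix_multiply(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                B, n,
                0.0, C, n);
}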

In addition to these threaded routines, the DFTs are also threaded and scale very well. For example, on 1280x1280 single precision complex 2D transforms, the performance on the Itanium 2 processor for one, two, and four processors is respectively 1908, 3225 (1.69 speedup), and 7183 MFLOPS (3.76 speedup).

Usage Guidelines

There are caveats in the use of these functions with the current releases of MKL (up through MKL 6.0 beta update) that have nothing to do with the library directly. Problems can arise depending on the environment.

OpenMP* is used to thread MKL. MKL uses the same OpenMP runtime library as the Intel compilers, so problems can arise when OpenMP applications that use MKL are not compiled with the Intel compilers. Specifically, the application will attempt to use two different OpenMP libraries: one from the non-Intel compiler and the other from MKL. When the OMP_NUM_THREADS environment variable is greater than one, both libraries attempt to create threads and the program will fail. A future version of MKL will provide an alternate means of controlling thread creation; in the meantime, if this problem is encountered, submit the issue to Intel through https://premier.intel.com (requires login) for an interim solution.

A second issue can arise on clusters with symmetric multiprocessor nodes. MPI or PVM applications running on such clusters often create one process for each processor in a node. If these applications use MKL, threads may also be created by each MPI or PVM process, which could oversubscribe the processor resources within a node. For MPI or PVM applications that create one process per processor, it is recommended that OMP_NUM_THREADS be set to one.



Avoiding and Identifying False Sharing Among Threads with the Intel® VTune™ Performance Analyzer

In symmetric multiprocessors (SMP), each processor has a local cache. The memory system must guarantee cache coherence. False sharing occurs when threads on different processors modify different variables that reside on the same cache line. Each write will invalidate the line in other caches, forcing an update and hurting performance. This topic covers methods to detect and correct false sharing using the Intel® VTune™ Performance Analyzer.

Background

False sharing is a well-known performance issue on SMP systems where each processor has a local cache. It occurs when threads on different processors modify variables that reside on the same cache line, as illustrated in Figure 1. The reason this is called false sharing is that each thread is not actually sharing access to the same variable. Access to the same variable, or true sharing, would require programmatic synchronization constructs to ensure ordered data access.

The marked source line in the following example code causes false sharing:

 

double sum = 0.0, sum_local[NUM_THREADS];
#pragma omp parallel num_threads(NUM_THREADS)
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;

    #pragma omp for
    for (i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];   /* <-- false sharing on sum_local */

    #pragma omp atomic
    sum += sum_local[me];
}

 

There is a potential for false sharing on the array sum_local. This array is dimensioned according to the number of threads and is small enough to fit in a single cache line. When executed in parallel, the threads modify different, but adjacent, elements of sum_local (the marked source line), which invalidates the cache line for all processors.

[Insert Figure 1 Here]

Figure 1: False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency. This is illustrated in the diagram (top). Threads 0 and 1 require variables that are adjacent in memory and reside on the same cache line. The cache line is loaded into the caches of CPU 0 and CPU 1 (gray arrows). Even though the threads modify different variables (red and blue arrows), the cache line is invalidated. This forces a memory update to maintain cache coherency.

To ensure data consistency across multiple caches, Intel multiprocessor-capable processors follow the MESI (Modified/Exclusive/Shared/Invalid) protocol. On the first load of a cache line, the processor marks the cache line as 'Exclusive'. As long as the cache line remains marked exclusive, subsequent loads are free to use the existing data in cache. If the processor sees the same cache line loaded by another processor on the bus, it marks its copy as 'Shared'. If the processor then stores to a cache line marked 'Shared', the cache line is marked 'Modified' and all other processors are sent an 'Invalid' cache-line message. If the processor sees the same cache line, now marked 'Modified', being accessed by another processor, it writes the cache line back to memory and marks its copy as 'Shared'; the other processor accessing the same cache line incurs a cache miss.

The frequent coordination required between processors when cache lines are marked 'Invalid' requires the cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.

Advice

The basic advice of this section is to avoid false sharing in multi-threaded applications. However, detecting false sharing when it is already present is another matter. The first method of detection is through code inspection. Look for instances where threads access global or dynamically allocated shared data structures. These are potential sources of false sharing. Note that false sharing can be obscure in that threads are accessing completely different global variables that just happen to be relatively close together in memory. Thread-local storage or local variables can be ruled out as sources of false sharing.

A better detection method is to use the Intel VTune Performance Analyzer. For multiprocessor systems, configure the VTune analyzer to sample the '2nd Level Cache Load Misses Retired' event. For Hyper-Threading enabled processors, configure the VTune analyzer to sample the 'Memory Order Machine Clear' event. If you have a high occurrence and concentration of these events at or near load/store instructions within threads, you likely have false sharing. Inspect the code to determine the likelihood that the memory locations reside on the same cache line.

Once detected, there are several techniques to correct false sharing. The goal is to ensure that variables causing false sharing are spaced far enough apart in memory that they cannot reside on the same cache line. Not all possible techniques are discussed here; three are described below. One technique is to use compiler directives to force individual variable alignment. The following source code demonstrates this technique using __declspec(align(n)), where n is the cache-line size in bytes (128 in this example), to align the individual variables on cache-line boundaries:

__declspec (align(128)) int thread1_global_variable;
__declspec (align(128)) int thread2_global_variable;

 

When using an array of data structures, pad the structure to the end of a cache line to ensure that the array elements begin on a cache-line boundary. If you cannot ensure that the array is aligned on a cache-line boundary, pad the data structure to twice the size of a cache line. The following source code demonstrates padding a data structure to a cache-line boundary and aligning the array with the same __declspec(align(n)) directive, where n is again 128 bytes. If the array is dynamically allocated, you can instead increase the allocation size and adjust the pointer to align it with a cache-line boundary.

 

 

struct ThreadParams
{
    // For the following 4 variables: 4*4 = 16 bytes
    unsigned long thread_id;
    unsigned long v;       // Frequent read/write access variable
    unsigned long start;
    unsigned long end;
    // Expand to 128 bytes to avoid false sharing:
    // (4 unsigned long variables + 28 padding)*4 = 128
    int padding[28];
};
__declspec (align(128)) struct ThreadParams Array[10];

 

It is also possible to reduce the frequency of false sharing by using thread-local copies of data. The thread-local copy can be read and modified frequently, and the result copied back to the data structure only when complete. The following source code demonstrates using a local copy to avoid false sharing.

 

struct ThreadParams
{
    // For the following 4 variables: 4*4 = 16 bytes
    unsigned long thread_id;
    unsigned long v;       // Frequent read/write access variable
    unsigned long start;
    unsigned long end;
};

 

 

void threadFunc(void *parameter)
{
    ThreadParams *p = (ThreadParams*) parameter;
    // Local copy of the frequently accessed variable
    unsigned long local_v = p->v;

    for (local_v = p->start; local_v < p->end; local_v++)
    {
        // Functional computation
    }

    p->v = local_v;   // Update shared data structure only once
}

 

Usage Guidelines

Avoid false sharing but use these techniques sparingly. Overuse of these techniques, where they are not needed, can hinder the effective use of the processor’s available cache. Even with multiprocessor shared-cache designs, it is recommended that you avoid false sharing. The small potential gain for trying to maximize cache utilization on multi-processor shared cache designs does not generally outweigh the software maintenance costs required to support multiple code paths for different cache architectures.



Find Multi-threading Errors with the Intel® Thread Checker

The Intel Thread Checker, one of the Intel Threading Tools, is used to debug multithreading errors in applications that use Win32, PThreads or OpenMP threading models. Thread Checker automatically finds storage conflicts, deadlock or conditions that could lead to deadlock, thread stalls, abandoned locks, and more.

Background

Multi-threaded programs have a temporal component that makes them more difficult to debug than serial programs. Concurrency errors (e.g., data races, deadlock) are difficult to find and reproduce because they are non-deterministic. If the programmer is lucky, the error will always crash or deadlock the program. If the programmer is not so lucky, the program will execute correctly 99% of the time, or the error will result in slight numerical drift that only becomes apparent after long execution times.

Traditional debugging methods are poorly suited to multi-threaded programs. Debugging probes (i.e., print statements) often mask errors by changing the timing of multithreading programs. Executing a multi-threaded program inside a debugger can give some information, provided the bugs can be consistently reproduced. However, the programmer must sift through multiple thread states (i.e., instruction pointer, stack) to diagnose the error.

The Intel Thread Checker is designed specifically for debugging multi-threaded programs. It finds the most common concurrent programming errors and pinpoints their locations in the program. The error examples shown below are drawn from the Win32 application domain:

Storage conflicts – The most common concurrency error involves unsynchronized modification of shared data. For example, multiple threads simultaneously incrementing the same static variable can result in data loss but is not likely to crash the program. The next section shows how to use the Intel Thread Checker to find such errors.
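For instance, a minimal sketch of such a storage conflict (the variable and function names here are hypothetical, not taken from the article): two Win32 threads increment the same counter without synchronization, and increments are lost because the read-modify-write is not atomic.

#include <windows.h>
#include <stdio.h>

static LONG counter = 0;              /* shared, unsynchronized */

DWORD WINAPI AddOneMillion(LPVOID arg)
{
    int i;
    for (i = 0; i < 1000000; i++)
        counter++;                    /* storage conflict: racy read-modify-write */
    return 0;                         /* InterlockedIncrement(&counter) would be safe */
}

int main(void)
{
    HANDLE h[2];
    h[0] = CreateThread(NULL, 0, AddOneMillion, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, AddOneMillion, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}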

Deadlock – When a thread must wait for a resource or event that will never occur, it is deadlocked. Bad locking hierarchies are a common cause. For example, a thread tries to acquire locks A and B, in that order, while another thread tries to acquire the locks in the reverse order.

Table 2.1: A bad locking hierarchy can sometimes execute without deadlock.

Time   Thread 1          Thread 2
T0     Acquire lock A
T1     Acquire lock B
T2     Perform task
T3     Release lock B
T4     Release lock A
T5                       Acquire lock B
T6                       Acquire lock A
T7                       Perform task
T8                       Release lock A
T9                       Release lock B

 

However, this locking hierarchy can also deadlock both threads (Table 2.2). Both threads are waiting for resources that they can never acquire. Thread Checker identifies deadlock and the potential for deadlock, as well as the contested resources.

Table 2.2: Deadlock due to a bad locking hierarchy.

Time   Thread 1          Thread 2
T0     Acquire lock A
T1                       Acquire lock B
T2                       Wait for lock A
T3     Wait for lock B
 

Abandoned locks – Thread Checker detects when a thread terminates while holding a Win32 critical section or mutex variable because this can lead to deadlock or unexpected behavior. Threads waiting on an abandoned critical section are deadlocked. Abandoned mutexes are reset.

Lost signals – Thread Checker detects when a Win32 event variable is pulsed (i.e., the Win32 PulseEvent function) when no threads are waiting on that event because this is a common symptom of deadlock. For example, the programmer expects a thread to be waiting before an event is pulsed. If the event is pulsed before the thread arrives, the thread may wait for a signal that will never come.
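A minimal sketch of a lost signal (hypothetical names; a manual-reset Win32 event is assumed):

#include <windows.h>

HANDLE go;   /* manual-reset event, initially non-signaled */

DWORD WINAPI Worker(LPVOID arg)
{
    /* If PulseEvent fires before this wait is reached, the pulse
       is lost and the thread waits forever. */
    WaitForSingleObject(go, INFINITE);
    return 0;
}

int main(void)
{
    HANDLE h;
    go = CreateEvent(NULL, TRUE, FALSE, NULL);
    h  = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    PulseEvent(go);                      /* may occur before Worker waits */
    WaitForSingleObject(h, INFINITE);    /* possible deadlock */
    return 0;
}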

Thread Checker also finds many other types of errors, including API usage violations, thread stack overflows, and scope violations.

Advice

Use the Intel Thread Checker to facilitate debugging of OpenMP, PThreads and Win32 multi-threaded applications. Errors in multi-threaded programs are harder to find than errors in serial programs not only because of the temporal component mentioned above, but also because such errors are not restricted to a single location. Threads operating in distant parts of the program can cause errors. Thread Checker can save an enormous amount of debugging time, as illustrated by the simple example shown below.

To prepare a program for Thread Checker analysis, compile with optimization disabled and debugging symbols enabled. Link the program with the /fixed:no option so that the executable can be relocated. Thread Checker instruments the resulting executable image when it is run under the VTune Performance Analyzer, Intel’s performance tuning environment. For binary instrumentation, either the Microsoft Visual C++ compiler (version 6.0) or the Intel C++ and Fortran compilers (version 7.0 or later) may be used.

However, the Intel compilers support source-level instrumentation (the /Qtcheck option), which provides more detailed information.
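For example, a build for Thread Checker analysis with the Intel C++ compiler on Windows might look like the following (treat this as a sketch; switch spellings vary by compiler and version):

icl /Od /Zi /Qtcheck myapp.cpp /link /fixed:no

Here /Od disables optimization, /Zi generates debugging symbols, /Qtcheck enables source-level instrumentation, and /fixed:no makes the executable relocatable.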

The following program contains a subtle race condition:

#include "stdio.h"
#include "windows.h"
#define THREADS 4
DWORD WINAPI ReportID (LPVOID my_id)
{
printf (“Thread %d reporting ”, *(int *)my_id);
}
int main (int argc,char*argv[])
{
int id;
HANDLE h[THREADS];
DWORD barrier, thread_id;
for (id=0;id < THREADS; id++)
h[id] = CreateThread (NULL,
0,
&thread_id);
barrier = WaitForMultipleObjects (THREADS, h, TRUE, INFINITE);
}

 

The program is supposed to create four threads that report their identification numbers. Sometimes the program gives the expected output:

Thread 0 reporting
Thread 1 reporting
Thread 2 reporting
Thread 3 reporting

 

Threads do not always report in the order of their identification numbers but all threads print a message. Other times, some threads appear to report more than once, others do not report at all, and a mysterious new thread appears, e.g.:

Thread 2 reporting
Thread 3 reporting
Thread 3 reporting
Thread 4 reporting

 

Thread Checker easily finds the error in this program and shows the statements responsible (Figure 2):

[Insert Figure 2 Here]

Figure 2: The Intel Thread Checker

The error description (see the red box in Figure 2) explains the storage conflict in plain English – a thread is reading variable my_id on line 7 while another thread is simultaneously writing variable id on line 15. The variable my_id in function ReportID is a pointer to variable id, which is changing in the main routine. The programmer mistakenly assumes that a thread begins executing the moment it is created. However, the operating system may schedule threads in any order. The main thread can create all worker threads before any of them begin executing. Correct this error by passing each thread a pointer to a unique location that is not changing.
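A minimal sketch of that fix gives each thread its own slot that never changes after the thread is created:

int id;
int ids[THREADS];                 /* one fixed slot per thread */
HANDLE h[THREADS];
DWORD thread_id;

for (id = 0; id < THREADS; id++)
{
    ids[id] = id;                 /* written once, before the thread can read it */
    h[id] = CreateThread (NULL, 0, ReportID, &ids[id], 0, &thread_id);
}
WaitForMultipleObjects (THREADS, h, TRUE, INFINITE);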

Usage Guidelines

Intel Thread Checker is currently available for the 32-bit versions of the Microsoft Windows 2000 and Windows XP operating systems, and for 32-bit and 64-bit versions of Linux operating systems. Thread Checker supports OpenMP, the Win32 threading API, and the POSIX PThreads threading API. The Intel compilers are required for OpenMP support. They are also required for source-level instrumentation, which is the more detailed mode on 32-bit operating systems and the only mode available on 64-bit Linux operating systems.

Note that the Intel Thread Checker performs dynamic analysis, not static analysis. Thread Checker only analyzes code that is executed. Therefore, multiple analyses exercising different parts of the program may be necessary to ensure adequate code coverage.

Thread Checker instrumentation increases the CPU and memory requirements of an application so choosing a small but representative test problem is very important. Workloads with runtimes of a few seconds are best. Workloads do not have to be realistic. They just have to exercise the relevant sections of multi-threaded code. For example, when debugging an image processing application, a 10 x 10 pixel image is sufficient for Thread Checker analysis. A larger image would take significantly longer to analyze but would not yield additional information. Similarly, when debugging a multi-threaded loop, reduce the number of iterations.



Using Thread Profiler to Evaluate OpenMP* Performance

Thread Profiler is one of the Intel Threading Tools. It is used to evaluate performance of OpenMP threaded codes, identify performance bottlenecks, and gauge scalability of OpenMP applications.

Background

Once an application has been debugged and is running correctly, engineers often turn to performance tuning. Traditional profilers are of limited use for tuning OpenMP codes for a variety of reasons: they are unaware of OpenMP constructs, they cannot report load imbalance, and they do not report contention for synchronization objects.

Thread Profiler is designed to understand OpenMP threading constructs and measure their performance over the whole application run, within each OpenMP region, and down to individual threads. Thread Profiler is able to detect and measure load imbalance (from uneven amounts of computation assigned to threads), time spent waiting for synchronization objects as well as time spent in critical regions, time spent at barriers, and time spent in the Intel OpenMP Runtime Engine (parallel overhead).
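For example, a loop whose iterations perform uneven amounts of work (a hypothetical sketch, not code from the article) shows up in Thread Profiler as imbalance time at the loop's implicit barrier; a schedule(dynamic) clause is one common remedy:

#include <math.h>

double triangular_sum(int n, const double *x)
{
    double sum = 0.0;
    int i, j;
    /* Iteration i does i units of work, so a default static schedule
       gives the later threads far more work than the earlier ones. */
    #pragma omp parallel for reduction(+:sum) private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < i; j++)
            sum += x[i] * sin((double)j);
    return sum;
}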

Advice

To prepare an OpenMP application for use with the Thread Profiler, build an executable that includes the OpenMP profiling library (use the /Qopenmp_profile compiler switch). When setting up a Thread Profiler Activity in the VTune Performance Analyzer, be sure to use a full, production data set running with an appropriate number of threads. Best results for production performance tuning will be obtained with a representative data set that exercises the code as closely to normal operation as possible. Small test data sets may not fully exercise the parallelism of the code or the interaction between threads, which can lead to overlooking serious performance problems. While execution time will be increased by the instrumentation of the OpenMP threads, the increase is minimal.

Once the application has completed execution, summary performance results are displayed in the Thread Profiler window. There are three graphical views of the performance data that can be used. Each is accessible from separate tabs found below the Legend pane. These three views are summarized below:

Summary View

[Insert Figure 3 Here]

This view is the default for the Thread Profiler (Figure 3). The histogram bar is divided into a number of regions indicating the average amount of time the application spent in the observed performance category. These performance categories are:

  • parallel execution (time within OpenMP parallel regions), in green
  • sequential time, in blue
  • idle time due to load imbalance between threads, in red
  • idle time waiting at barriers, in purple
  • idle time spent waiting to gain access to synchronization objects, in orange
  • time spent executing within critical regions, in gray
  • parallel overhead (time spent in the OpenMP Runtime Engine) and sequential overhead (time spent in OpenMP regions that are not executed in parallel), in yellow and olive, respectively

Left-clicking on the bar populates the Legend with numerical details about the total execution time for each category over the entire run of the application.

 


Introduction to Media Transcoding

Media transcoding, which enables media interoperation, plays an important role in the digital home. The Intel Networked Media Product Requirements (INMPR) promotes interoperation between networked devices in the digital home. Optimizing the codec engine (the encoder/decoder, the heart of the transcoder) will make the media transcoding process more efficient, in turn improving the user experience in the digital home. This paper features practical tips and tricks on how to increase the performance of the codec engine. These tips include using Intel® VTune™ Performance Analyzer events, OpenMP for threading, and Prescott New Instructions (Streaming SIMD Extensions 3 (SSE3)). We also discuss when to use faster instructions, how to employ different execution units to improve parallelism, and when to use MMX™ instead of SSE for speed. You will also learn when to take advantage of the Intel compiler's optimization switches.

What is Transcoding?

Since content comes in many different formats, transcoding is necessary to tailor the content, converting one media format to another before it arrives at the target device. The most common way to convert one media format to another is to first decode to raw data and then encode to the target format. Since an MPEG stream consists of audio and video, we need to split the two streams and decode each into raw data before re-encoding them to the desired formats and merging them again.

Transcoding and Codec Optimization: Tips & Tricks

Codec Optimization

A codec is the compression and decompression process; it is the heart, or engine, of the transcoder.

Optimizing the codec can be done by reducing the time to encode and/or decode a file or stream. We can also enhance the engine by reducing CPU utilization, which lets us pack more features or data into the same time frame: for example, more voices to represent more people in a game. Finally, we need to cut down the size for size-sensitive or mobile applications, since media applications exist in desktop, laptop, PDA, and smartphone form factors.

General

The optimization process starts with the following steps:

  • Use better hardware.
  • Use the Intel® VTune™ Performance Analyzer to find hotspots.
  • Look at the functions with the highest clock ticks and the highest clock ticks per instruction retired (CPI).
  • Turn on counters for branch misprediction, store forwarding, 64K aliasing, cache splits, and trace cache misses.
  • Follow general optimization rules: unroll loops, reduce branching, and use SSE2/SSE3.
  • Use the Intel compiler.
  • Use the Intel® Performance Library Suite.

 

Cautions

Observe the following cautions at all times:

  • Avoid all the usual pitfalls (cache splits, branch misprediction, store-forwarding stalls, etc.).
  • Thread at the highest level possible to avoid running out of resources. Since the codec is an engine used by other applications, its functions can be called many times, especially when those applications are themselves threaded.
  • Pay attention when threading applications that make use of the Intel performance libraries, since some of their functions are already threaded.
  • Do not unroll loops too much, to avoid trace cache thrashing.
  • Do not ignore MMX™; it can be faster than SSE/SSE2 when applications make extensive use of 64-bit data and it takes effort to rearrange the data to fit into 128-bit registers.
  • Watch out for battery life in mobile applications.

 

Tips & Tricks

  • Use the Intel compiler: /O3, /QaxW, /QaxN, /QaxP, /Qipo, /Qparallel, /Qopenmp. Often you can gain a significant amount of performance just by using the Intel compiler with the right switches.
  • Use special functions like reciprocal (rcp and rcp_nr) to replace division with multiplication and speed up the application (see the sketch below).
  • Use the SSE3 instruction LDDQU instead of MOVDQU whenever possible.
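A sketch of the reciprocal idea using SSE intrinsics (the extra step is the Newton-Raphson refinement implied by rcp_nr; the function name is illustrative, not a library API):

#include <xmmintrin.h>

/* Approximate a/b for four packed floats: replace the divide with a
   reciprocal estimate plus one Newton-Raphson refinement step. */
static __m128 div_by_rcp_nr(__m128 a, __m128 b)
{
    __m128 r = _mm_rcp_ps(b);                                           /* ~12-bit estimate of 1/b */
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(b, r))); /* r = r*(2 - b*r) */
    return _mm_mul_ps(a, r);                                            /* a * (1/b) */
}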

 

Tips & Tricks Using Assembly Language

  • Use faster instructions.
  • Use different execution units to improve parallelism.
  • Use MOVNTxx to store values with a non-temporal hint, preventing the data from being cached (see the sketch below).
  • Use combined instructions like PMADDWD.
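For instance, a minimal sketch of a non-temporal copy using SSE2 intrinsics (MOVNTDQ underneath); the buffers are assumed to be 16-byte aligned and the function name is illustrative:

#include <emmintrin.h>

/* Copy n 16-byte blocks without polluting the cache with the destination:
   _mm_stream_si128 compiles to MOVNTDQ, a store with a non-temporal hint. */
void copy_nontemporal(__m128i *dst, const __m128i *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        _mm_stream_si128(&dst[i], _mm_load_si128(&src[i]));
    _mm_sfence();   /* make the streaming stores globally visible */
}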

 

Examples

 

When to Use Threads

 

Before:

 

for (i=0;i<4;i++)
EncodeTest(Mem[i], Blk[i],Chunk[i]);

 

After:

 

#pragma omp parallel sections
{
    #pragma omp section
    EncodeTest(Mem[0], Blk[0], Chunk[0]);
    #pragma omp section
    EncodeTest(Mem[1], Blk[1], Chunk[1]);
    #pragma omp section
    EncodeTest(Mem[2], Blk[2], Chunk[2]);
    #pragma omp section
    EncodeTest(Mem[3], Blk[3], Chunk[3]);
}

 

When Not to Use Threads

 

Before:

 

...
for (j=0;j<4;j++)
for (i=0;i<4;i++)
test[i][j] = list[fr]->img[i][j]+t[s];
...

 

After:

 

...
#pragma omp parallel for
for (j=0;j<4;j++)
for (i=0;i<4;i++)
test[i][j] = list[fr]->img[i][j]+t[s];
...

 

At first, this loop seems to be a good candidate for threading, and it will indeed improve performance if it is at the outermost level. However, if the loop is in a function that is buried many call levels deep, threading it may mean running out of resources. In one case, this loop was implemented within a function that accounts for only about 8.8% of the total execution time; after threading just two such loops, the whole application ran about five times slower.

Use the combined instruction PMADDWD
A combined instruction such as PMADDWD (multiply packed words and add adjacent products) saves many clock cycles.

 

Before: (24 clock cycles)

pmullw xmm1, xmm6 (8)
punpcklwd xmm2,xmm1 (2)
punpckhwd xmm3,xmm1 (2)
psrad xmm3, 16 (2)
psrad xmm2, 16 (2)
paddd xmm3, xmm2 (2)
movq xmm2, xmm3 (2)
psrlq xmm2, 32 (2)
paddd xmm3, xmm2 (2)

 

After: (18 clock cycles)

pmaddwd xmm1, xmm6 (8)
movq xmm2, xmm1 (2)
psrlq xmm2, 32 (2)
paddd xmm2, xmm1 (2)
psrldq xmm1, 8 (2)
paddd xmm1, xmm2 (2)

 

By using the instruction PMADDWD, we combine the multiplication and addition operations into one instruction, which saves six clock cycles in this case.
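The same combined operation is available from C through the SSE2 intrinsic _mm_madd_epi16, which compiles to PMADDWD; a sketch of a dot product of eight 16-bit values (the function name is illustrative):

#include <emmintrin.h>

/* Dot product of eight signed 16-bit elements: PMADDWD multiplies the
   pairs and adds adjacent products into four 32-bit partial sums, which
   are then summed horizontally. */
int dot8_epi16(__m128i a, __m128i b)
{
    __m128i sums = _mm_madd_epi16(a, b);                    /* pmaddwd */
    sums = _mm_add_epi32(sums, _mm_srli_si128(sums, 8));
    sums = _mm_add_epi32(sums, _mm_srli_si128(sums, 4));
    return _mm_cvtsi128_si32(sums);                         /* final sum */
}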

Eliminate Branching Conditions

 

Original

 

for (j=0;j<4;j++)
{
for (i=0;i<4;i++)
{
for (result=0,z=-2;z<4;z++)
result +=list[fr]->test[max(0,min(max_y,y+j))]
[max(0,min(max_x,x+i+z))]*COEF[z+2];
block[i][j] = max(0, min(255, (result+16)));
}
}

 

Optimized #1: Eliminate Branches

 

// Interior case: the max/min clamps are unnecessary, so the
// branches they generate inside the inner loop can be eliminated.
if ((x-2 >= 0) && (x+6 <= max_x) && (y >= 0) && (y+3 <= max_y))
{
    for (j=0;j<4;j++)
    {
        for (i=0;i<4;i++)
        {
            for (result=0,z=-2;z<4;z++)
                result += list[fr]->test[y+j][x+i+z]*COEF[z+2];
            block[i][j] = max(0, min(255, (result+16)));
        }
    }
}
else
for (j=0;j<4;j++)
{
for (i=0;i<4;i++)
{
for (result=0,z=-2;z<4;z++)
result += list[fr]->test[max(0,min(max_y,y+j))]
[max(0,min(max_x,x+i+z))]*COEF[z+2];
block[i][j] = max(0, min(255, (result+16)));
}
}

 

Optimized #2: Eliminate Branching and Unroll Inner Loops

 

if ((x-2 >= 0) && (x+6 <= max_x) && (y >= 0) && (y+3 <= max_y))
{
    for (j=0;j<4;j++)
    {
        result  = list[fr]->test[y+j][x-2]*COEF[0];     /* i = 0 */
        result += list[fr]->test[y+j][x-1]*COEF[1];
        result += list[fr]->test[y+j][x]*COEF[2];
        result += list[fr]->test[y+j][x+1]*COEF[3];
        result += list[fr]->test[y+j][x+2]*COEF[4];
        result += list[fr]->test[y+j][x+3];
        block[0][j] = max(0, min(255, (result+16)));

        result  = list[fr]->test[y+j][x-1]*COEF[0];     /* i = 1 */
        result += list[fr]->test[y+j][x]*COEF[1];
        result += list[fr]->test[y+j][x+1]*COEF[2];
        result += list[fr]->test[y+j][x+2]*COEF[3];
        result += list[fr]->test[y+j][x+3]*COEF[4];
        result += list[fr]->test[y+j][x+4];
        block[1][j] = max(0, min(255, (result+16)));

        result  = list[fr]->test[y+j][x]*COEF[0];       /* i = 2 */
        result += list[fr]->test[y+j][x+1]*COEF[1];
        result += list[fr]->test[y+j][x+2]*COEF[2];
        result += list[fr]->test[y+j][x+3]*COEF[3];
        result += list[fr]->test[y+j][x+4]*COEF[4];
        result += list[fr]->test[y+j][x+5];
        block[2][j] = max(0, min(255, (result+16)));

        result  = list[fr]->test[y+j][x+1]*COEF[0];     /* i = 3 */
        result += list[fr]->test[y+j][x+2]*COEF[1];
        result += list[fr]->test[y+j][x+3]*COEF[2];
        result += list[fr]->test[y+j][x+4]*COEF[3];
        result += list[fr]->test[y+j][x+5]*COEF[4];
        result += list[fr]->test[y+j][x+6];
        block[3][j] = max(0, min(255, (result+16)));
    }
}
else
for (j= 0;j<4;j++)
{
for (i=0;i<4;i++)
{
for (result=0,z=-2;z<4;z++)
result += list[fr]->test[max(0,min(max_y,y+j))]
[max(0,min(max_x,x+i+z))]*COEF[z+2];
block[i][j] = max(0, min(255, (result+16)));
}
}

 

Optimized #3: Eliminate Branching and Improving Parallelism

 

 

if ((x-2 >= 0) && (x+6 <= max_x) && (y >= 0) && (y+3 <= max_y))
{
    for (j=0;j<4;j++)
    {
        for (i=0;i<4;i++)
        {
            t0 = list[fr]->test[y+j][x+i-2]*COEF[0];
            t1 = list[fr]->test[y+j][x+i-1]*COEF[1];
            t2 = list[fr]->test[y+j][x+i]*COEF[2];
            t3 = list[fr]->test[y+j][x+i+1]*COEF[3];
            t4 = list[fr]->test[y+j][x+i+2]*COEF[4];
            t5 = list[fr]->test[y+j][x+i+3];
            result = t0 + t1 + t2 + t3 + t4 + t5;
            block[i][j] = max(0, min(255, (result+16)));
        }
    }
}
else
{
    for (j=0;j<4;j++)
    {
        for (i=0;i<4;i++)
        {
            for (result=0,z=-2;z<4;z++)
                result += list[fr]->test[max(0,min(max_y,y+j))]
                                   [max(0,min(max_x,x+i+z))]*COEF[z+2];
            block[i][j] = max(0, min(255, (result+16)));
        }
    }
}

 

By introducing the temporary variables t0 through t5, we eliminate the serial dependency on the variable result, which improves the compiler's chances of vectorizing the code.


Tips & Tricks Conclusion

Consider the following when optimizing a codec engine:

  • Use the Intel compiler as much as possible.
  • Follow general optimization rules.
  • Pay special attention when threading applications that use Intel performance libraries.
  • MMX is faster in some cases.
  • Use different execution units for more parallelism.
  • Use faster instructions.
  • Keep calculations within the CPU as much as possible.
  • Balance between performance and battery life for mobile applications.

 

While there are no fixed rules for codec optimization, the points above should serve as guidance; actual performance gains will vary.

 


 

About the Author

Khang Nguyen is Senior Applications Engineer working with Intel's Software and Solutions Group. He can be reached at khang.t.nguyen@intel.com.

 


 
