Optimization Techniques for Intel® Multi-Core Processors

By Lerie R. Kane

Published: 09/21/2011 | Last Updated: 09/21/2011

Introduction

Intel® Software Development Products Help Developers Program and Optimize for Multi-Core Intel® Architecture Processors.

Intel has long been delivering microprocessor architectures with ever-increasing performance. Now, Intel’s solutions feature a platform architecture that offers increasing processing capability while addressing the constraints of both footprint and power consumption. With innovations that move beyond clock speed to offer integrated platform solutions, Intel leads the industry in enabling new generations of communications and embedded computing applications.

The introduction of multi-core processors from Intel provides developers with an opportunity to scale performance while optimizing power consumption. To fully exploit this opportunity, developers must understand the inherent parallelism in their applications. Intel provides development tools and technical information to assist the developer in maximizing performance on multi-core platforms.

This paper provides an overview of parallelism constructs and programming techniques, focusing on common threading issues and performance tuning.


Taking Advantage of Parallelism

To assist the developer in identifying opportunities for parallelism within the application, Intel offers the Intel® VTune™ Performance Analyzer.

The Intel VTune Performance Analyzer is a performance analysis tool that uses hardware interrupts to give the developer a true picture of how an application is performing. Available on Microsoft Windows* and various flavors of Linux*, it is a great tool for focusing on the performance-intensive sections of an application. Two technologies in this tool are useful when analyzing code for threading opportunities: sampling and call graph.

While the Intel VTune Performance Analyzer can sample on nearly every performance counter available in the processor, focusing on clockticks shows how much time is spent in each function of your program. Once the most time-consuming functions are identified, drill down to the source code to determine whether threading can be effectively implemented. Some resource-intensive functions may not lend themselves to parallel execution. If you find yourself faced with a hot spot that cannot be threaded, the Intel VTune Performance Analyzer call graph technology is the next step. Call graph graphically depicts the call tree through an application. Even when your hot spot is not amenable to threading, this technology may identify a function further up the call tree that can be threaded. Threading a function further up the call tree improves performance by allowing multiple threads to call the hot function simultaneously.
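
For example, suppose sampling identifies a hot function whose internals cannot be threaded. The sketch below (a hypothetical illustration; the function names are invented, not from a real workload) threads the calling loop one level up the call tree so that multiple threads call the hot function concurrently:

#include <stdio.h>
#include <omp.h>

#define NUM_ITEMS 1024

/* Hot function found by sampling; assume its body cannot be threaded. */
static double process_item(int item)
{
    double v = 0.0;
    int k;
    for (k = 1; k <= 1000; k++)
        v += (double)item / k;
    return v;
}

int main(void)
{
    static double results[NUM_ITEMS];
    int i;

    /* Thread one level up the call tree: each thread calls the
       hot function on a different item simultaneously. */
    #pragma omp parallel for
    for (i = 0; i < NUM_ITEMS; i++)
        results[i] = process_item(i);

    printf("%lf\n", results[NUM_ITEMS - 1]);
    return 0;
}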

Effective Parallelism
The implementation of parallelism in a system can take many forms; one commonly used type is shared memory parallelism, which implies the following (illustrated in the sketch after this list):

  • Multiple threads execute concurrently.
  • The threads share the same address space. This contrasts with multiple processes, which can execute in parallel but each within its own address space.
  • Threads coordinate their work.
  • Threads are scheduled by the underlying operating system and require OS support.
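
The following minimal POSIX threads sketch (an assumed illustration, not code from the article) shows these properties in action: two OS-scheduled threads run concurrently in a single address space, share a global counter, and coordinate their work through a mutex:

#include <stdio.h>
#include <pthread.h>

static long counter = 0;                 /* shared: same address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int i;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* coordinate access */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);   /* threads run concurrently */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* reliably 200000 */
    return 0;
}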


To illustrate the keys to effective parallelism, consider an example from the physical world: multiple workers mowing a lawn. The first consideration is how to divide the work evenly; this even division of labor keeps each worker as active as possible. Second, the workers should each have their own lawn mower; sharing one would significantly reduce the effectiveness of the multiple workers. Finally, access to shared items such as the fuel can and clipping container needs to be coordinated. The keys to parallelism illustrated by this example generalize as follows:

  • Identify the concurrent work.
  • Divide the work evenly.
  • Create private copies of commonly used resources.
  • Synchronize access to unique shared resources.


Three classifications of parallel technologies are thread libraries, message passing libraries, and compiler support. Thread libraries, such as POSIX* threads and Windows* API threads, enable very explicit control of threads; if you require fine-grained management of threads, consider using these explicit threading technologies. Message passing libraries such as the Message Passing Interface (MPI) enable one application to take advantage of several machines that do not necessarily share the same memory space; MPI is commonly used in scientific computation. The third technology is the threading support built into the Intel® Compilers in the form of OpenMP* and automatic parallelization.
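
For reference, a minimal MPI program in C looks like the following (an illustrative sketch; it assumes an MPI implementation is installed, with compilation via mpicc and launch via mpirun):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes, which need  */
                                           /* not share one memory space   */
    printf("process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}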


Intel C++ Compiler 9.0 for Linux*

The Intel® C++ Compiler is an optimizing compiler offered on several operating systems, including Microsoft Windows* and Linux*, and on several architectures: IA-32, Intel® Itanium®, and systems with Intel® EM64T. The Intel Compiler conforms to the C and C++ standards and offers binary compatibility with the GNU Compiler Collection (GCC). The strongest advantage of the Intel Compiler is its optimization technology and performance feature support, which includes OpenMP and automatic parallelization. For further information on the Intel Compiler, please consult the product Web page.

OpenMP* is a portable, shared memory multiprocessing application program interface supported by multiple vendors on several operating systems and for the following programming languages: Fortran 77, Fortran 90, C, and C++. OpenMP simplifies parallel application development by hiding many of the details of thread management and thread communication behind a simplified programming interface. Developers specify parallel regions of code by adding pragmas to the source code. These pragmas also communicate other information, such as properties of variables and simple synchronization.

#include "stdio.h&"

#include "omp.h"

static int num_steps = 100000; 

double step;

#define NUM_THREADS 2

int main ()

{
	
int i;

double x, pi, sum = 0.0;

step = 1.0/(double) num_steps;
	
omp_set_num_threads(NUM_THREADS);

#pragma omp parallel for reduction(+:sum) private(x)

for (i=0;i< num_steps; i++){

x = (i+0.5)*step;

sum = sum + 4.0/(1.0+x*x);

}

pi = step * sum;

printf(“%lf”, pi);

}


Figure 1. Sample OpenMP code showing the use of pragma and library function

Figure 1 is a sample OpenMP program that calculates the value of pi by summing the area under a curve. The program is very similar to the original serial version of the code except for the addition of a couple of lines. The key line is the pragma #pragma omp parallel for reduction(+:sum) private(x), which specifies that the following for loop should be executed by a team of threads, that the temporary partial results represented by the sum variable should be combined by addition at the end of the parallel region, and that the variable x is private, meaning each thread gets its own copy. The keys to parallelizing the code are summarized as follows:

  • Identify the concurrent work. The concurrent work is the area calculation encompassing different parts of the curve.
  • Divide the work evenly. The 100,000 rectangle areas to compute are allocated equally among the threads.
  • Create private copies of commonly used resources. The variable x needs to be private as each thread’s copy will be different.
  • Synchronize access to unique shared resources. The only shared resource, step, does not require synchronization in this example because it is only read by the threads, not written.


Automatic parallelization, also called auto-parallelization, analyzes loops and creates threaded code for the loops it determines are beneficial to parallelize. Auto-parallelization is a good first technique to try because the effort required is low. The compiler will only parallelize loops it can prove are safe to parallelize. The following tips may improve the likelihood of successful parallelization (a short example follows the list):

  • Use the optimization reporting option. The parallelization optimization report (-par-report on Linux) provides a summary of the compiler’s analysis of every loop and, in cases where a loop cannot be parallelized, the reason why not. This is useful because even if the compiler cannot parallelize a loop, the developer can use the information in the report to identify regions for manual threading.
  • Expose the trip count of loops whenever possible. The compiler has a greater chance of parallelizing loops whose trip counts are statically determinable.
  • Avoid placing function calls inside loop bodies. Function calls may have effects on the loop that cannot be determined at compile time and may prevent parallelization.
  • Adjust the threshold needed for auto-parallelization. The compiler estimates how much computation occurs inside a loop, and if it determines the amount is too small, it may not parallelize the loop. This can be overridden with the threshold option (-par-threshold on Linux).
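
The sketch below (a hypothetical example, not taken from the article) contrasts a loop the compiler can typically auto-parallelize with one it typically cannot; the second loop calls through a global function pointer, so the callee’s side effects are unknown at compile time:

/* Compile with, e.g.: icc -parallel -par-report3 loops.c */
#include <stdio.h>

#define N 1000000

static double halve(double v) { return v * 0.5; }

/* Externally visible pointer: the loop optimizer cannot see the callee. */
double (*opaque)(double) = halve;

static double a[N], b[N];

int main(void)
{
    int i;

    for (i = 0; i < N; i++)      /* trip count N is known statically,  */
        a[i] = i * 0.5;          /* no calls, no cross-iteration deps  */

    for (i = 0; i < N; i++)
        b[i] = opaque(a[i]);     /* opaque call may block the compiler */

    printf("%lf\n", b[N - 1]);
    return 0;
}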


icc -parallel -par-report3 pi.c

pi.c(11): warning #161: unrecognized #pragma
  #pragma omp parallel for reduction(+:sum) private(x)
  ^

procedure: main
parallel loop: line 12
  shared     : { "num_steps" }
  private    : { "i" "x" }
  first priv.: { "step" }
  reductions : { "sum" }

pi.c(12) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.


Figure 2. Compiler log of compiling with auto-parallelization

Figure 2 shows the results of compiling the code in Figure 1 with the auto-parallelization option. Note the warning: without the OpenMP option, the OpenMP pragma itself is not recognized, yet the loop is still parallelized automatically. Compilation and execution of the OpenMP-enabled code succeed using only auto-parallelization because auto-parallelization in the Intel Compiler uses the same underlying libraries as the OpenMP implementation; for example, the call to omp_set_num_threads resolves correctly with auto-parallelization even though this function is defined by the OpenMP API.


Correctness and Performance of Parallel Code

Once threading has been added to an application, the developer is potentially faced with a new set of programming bugs. Many of these are difficult to detect and require extra time and care to ensure a correctly running program. A few of the more common threading issues will be covered in this paper, including:

  • Data Race
  • Synchronization
  • Thread Stall
  • Deadlock
  • False Sharing


A data race occurs when two or more threads access the same resource at the same time and at least one of them is writing. If the threads are not coordinated, it is impossible to know which thread will access the resource first, which leads to inconsistent results in the running program. For example, in a read/write data race, one thread attempts to write to a variable at the same time another thread is reading it; the reading thread will get a different result depending on whether the write has already occurred. The tricky thing about a data race is that it is non-deterministic. A program could run correctly one hundred times in a row, but when moved onto a customer’s system with slightly different characteristics, the threads no longer line up as they did on the test system and the program fails.
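
The following POSIX threads sketch (an assumed illustration) exhibits a classic read/write data race: both threads perform unsynchronized read-modify-write updates on a shared variable, so updates can be lost and the printed result varies from run to run:

#include <stdio.h>
#include <pthread.h>

static long sum = 0;   /* shared between threads */

static void *adder(void *arg)
{
    int i;
    for (i = 0; i < 100000; i++)
        sum = sum + 1;   /* unsynchronized read-modify-write: a race */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, adder, NULL);
    pthread_create(&t2, NULL, adder, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %ld (expected 200000)\n", sum);  /* often less */
    return 0;
}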

The way to correct a data race is with synchronization. One way to synchronize access to a common resource is through a critical section. Placing a critical section around a block of code tells the threads that only one of them may enter that block at a time, ensuring that threads access the resource in an organized fashion. Synchronization is a necessary and useful technique, but care should be taken to limit unnecessary synchronization because it slows the application down: since only one thread may execute a critical section at a time, any other threads needing to enter that section are forced to wait, leaving precious resources idle and hurting performance.
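
As a sketch of this technique (illustrative code; OpenMP is used here for brevity), the same kind of unsynchronized update can be protected with a critical section. Because only one thread at a time may execute the block, placing it inside a hot loop also demonstrates the performance cost described above:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    long total = 0;
    int i;

    #pragma omp parallel for
    for (i = 0; i < 200000; i++) {
        #pragma omp critical
        {
            total = total + 1;   /* one thread at a time: race removed */
        }
    }
    printf("total = %ld\n", total);   /* reliably 200000, but serialized */
    return 0;
}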

Another method of ensuring shared resources are correctly accessed is a lock. In this case, a thread locks a specific resource while using it, which denies access to other threads. Two common threading errors can occur when using locks. The first is a thread stall, which happens when one thread locks a resource and then moves on to other work in the program without first releasing the lock. When a second thread tries to access that resource, it is forced to wait indefinitely, causing a stall. A developer should ensure that threads release their locks before continuing through the program. A deadlock is similar to a stall but occurs with a locking hierarchy. If, for example, Thread 1 locks variable A and then tries to lock variable B while Thread 2 is simultaneously locking variable B and then trying to lock variable A, the threads deadlock: each is trying to acquire a lock the other holds. In general, avoid complex locking hierarchies where possible and ensure that all threads acquire locks in the same order.
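
The deadlock scenario can be sketched as follows (a hypothetical example): each thread acquires its first lock and then blocks forever waiting for the lock the other thread holds. Acquiring the locks in the same order in both threads removes the deadlock:

#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg)
{
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);   /* waits if thread2 holds lock_b */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread2(void *arg)
{
    pthread_mutex_lock(&lock_b);   /* opposite order: deadlock risk */
    pthread_mutex_lock(&lock_a);
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);   /* may never return if the threads deadlock */
    pthread_join(t2, NULL);
    printf("done\n");
    return 0;
}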

The final issue this paper covers is false sharing. This is not necessarily an error in the program, but it is likely to hurt performance. False sharing occurs when two threads manipulate data that lie on the same cache line. When one thread writes data on that line, the copy of the line held in the other thread’s processor cache is invalidated, and that thread must wait while the line is reloaded from memory. If this happens repeatedly, for example inside a loop, it severely affects performance. One way to detect false sharing is to sample on L2 cache misses using the Intel VTune Performance Analyzer sampling technology; if this event occurs frequently in a threaded program, false sharing is a likely culprit.
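
One common remedy is to pad per-thread data so that each thread’s element occupies its own cache line. The sketch below (illustrative; it assumes a 64-byte cache line) shows per-thread counters padded to avoid false sharing:

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 2
#define CACHE_LINE  64   /* assumed cache line size in bytes */

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* keep counters on separate lines */
};

static struct padded_counter counts[NUM_THREADS];

int main(void)
{
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        long i;
        for (i = 0; i < 10000000; i++)
            counts[id].value++;   /* padding prevents cache-line ping-pong */
    }
    printf("%ld %ld\n", counts[0].value, counts[1].value);
    return 0;
}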

Intel® Thread Checker
Debugging threaded programs may seem to be a large burden, but it can be easier for developers using the Intel Thread Checker. This tool is available as a plug-in to the Intel VTune Performance Analyzer and detects threading errors while your program is running. It then displays the errors and correlates them to the offending lines of source code. One of the great features of the Thread Checker is that an error does not have to occur in order for it to be detected. For example, as mentioned earlier, data races are non-deterministic making them very difficult to detect. The Thread Checker will pinpoint where a data race can possibly occur even if the code happened to execute correctly while the tool was examining it. Key to effectively using Thread Checker is ensuring good code coverage when running the program. Thread Checker cannot detect an error if the region of code containing the error is never executed, so it is important to make sure all functions in the program are exercised. For more information on using the Intel Thread Checker, please visit the product Web page.

Intel® Thread Profiler
Once correctness issues are solved, performance tuning can occur. The Intel Thread Profiler is a tool that leverages the instrumentation technology of the Intel VTune Performance Analyzer to aid in the tuning of applications threaded using OpenMP, Windows API, or POSIX threads. The tool lets you visually inspect the performance of your threads to answer questions such as:

  • Is the work evenly distributed between threads?
  • How much of the program is running in parallel?
  • How does performance increase as the number of processors employed increases?
  • What is the impact of synchronization between threads on execution time?


The answers to these questions can help you optimize your application further. For example, if you determine that the workload is not balanced evenly between threads, you can implement code changes (one common approach is sketched below) and iteratively test the application until the balance is confirmed. If synchronization time is observed to be excessive, you can analyze your code to see how to simplify or safely remove some of the synchronization. Techniques for doing so are outside the scope of this paper; the main point is that Thread Profiler is the tool that lets you monitor the effects of your optimizations as you tune. For further details, please consult the guide Getting Started with the Thread Profiler (PDF 151KB).
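
As one common approach (a hypothetical sketch, not a technique prescribed by this paper), OpenMP’s dynamic scheduling can rebalance a loop whose iterations perform uneven amounts of work by handing out small chunks of iterations to threads as they finish:

#include <stdio.h>
#include <omp.h>

#define N 1000

static double uneven_work(int i)
{
    double v = 0.0;
    int k;
    for (k = 0; k < i * 100; k++)   /* later iterations cost more */
        v += 1.0 / (k + 1);
    return v;
}

int main(void)
{
    double total = 0.0;
    int i;

    /* schedule(dynamic, 16): threads grab 16 iterations at a time,
       so fast threads take on more chunks and the load stays balanced */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (i = 0; i < N; i++)
        total += uneven_work(i);

    printf("%lf\n", total);
    return 0;
}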


Conclusion

The semiconductor industry is moving to multi-core processors to deliver performance headroom. To take advantage of these performance gains, applications must increase their parallelism, and the best way to extract the full potential of a multi-core processor is through threading. The software tools that Intel has created can ease this transition, helping to ensure that your application is optimally tuned for the hardware that powers it.



About the Authors

Max Domeika is a senior staff software engineer in the Software Products Division at Intel, creating software tools targeting the Intel Architecture market. Over the past 9 years, Max has held several positions at Intel in compiler development, including project lead for the C++ front end and developer on the optimizer and the IA-32 code generator. He currently provides technical consulting for a variety of products targeting Embedded Intel® Architecture and also provides software tools training, serving as an instructor with the Intel® Academic Community. He earned a B.S. in Computer Science from the University of Puget Sound and an M.S. in Computer Science from Clemson University.

Lerie Kane is a Technical Marketing Engineer at Intel, specializing in software tools for the Embedded Intel Architecture market. Lerie works directly with software developers on optimizing their applications for Intel® platforms and also provides training to developers on the latest Intel® technologies. She has a B.S. in Computer Science from Portland State University and a Masters in Business from Arizona State University.

