Intel® Tools Make Threading Easier on Multiple Processors

Introduction

Intel® Thread Profiler, Intel® Thread Checker, and the Intel® compiler with support for OpenMP* allow quick performance estimation for threading applications.

As a software developer wanting to take advantage of multi-core processors, you are faced with the challenge of determining whether or not threading will improve performance, be worth the effort, or even be possible.

The Intel® compiler with support for OpenMP*, and the threading tools Intel® Thread Profiler and Intel® Thread Checker, allow you to quickly estimate the performance of threading your application on two, four, or more processors and help you specifically pinpoint places in your code where data needs to be protected in support of threading. All of these evaluations can be performed in your code with straightforward compiler-supported OpenMP pragmas.

These tools can run your code in single-threaded mode and estimate how your code would run on actual multi-core or multiprocessor systems without actually threading the code in advance. This method of evaluation using OpenMP with Intel Thread Profiler and Intel Thread Checker is called “thread count independent mode,” and it can be a quick and powerful technique to help estimate threading performance and implementation tradeoffs.

In addition, parallel code can be developed on a laptop or other system with fewer cores than the target system while still obtaining scalability estimates for the multi-core targets. This article discusses how to use these tools to perform this analysis.


Thread Count Independent Mode in Intel® Thread Profiler and Intel® Thread Checker

Thread count independent mode in Intel® Thread Checker is used when programs compiled with the /Qopenmp /Qtcheck options are analyzed with Intel Thread Checker. An important caveat for thread count independent mode is that programs may not explicitly control (or depend on) the thread count for their operation. Thread count independent mode with Intel® Thread Profiler is a bit trickier: if you are developing on a system with multiple processors or cores, you must limit the OpenMP* thread count to 1 for correct operation. More about this is explained in the following paragraphs.

In the case of Thread Checker with the /Qopenmp /Qtcheck compiler options, code with OpenMP automatic parallelization pragmas runs in serial mode, and potential data conflicts are identified as if the program were running in parallel. This means that actual data races, deadlocks, and other data parallelism issues do not occur during the simulated parallel run, but are instead detected and reported as if the program had run in parallel. In general, this works with parallel for and other data decomposition pragmas, as well as with parallel sections pragmas for functional decomposition, but not with taskq or nested parallelism. See the OpenMP documentation included with the Intel compiler for more details on these pragmas. Even though the program runs in serial mode, you do not need to set the OMP_NUM_THREADS environment variable to 1 with Intel Thread Checker to use thread count independent mode. In fact, observing that your code is running on only a single thread or core on a parallel machine is a good way to verify that you have triggered thread count independent mode.
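For example, consider the following minimal sketch (not taken from this article's sample) of a loop in which multiple threads would update a shared counter. Compiled with /Qopenmp /Qtcheck and analyzed with Intel Thread Checker, the program still executes serially, but the unsynchronized update would be reported as a conflict that would occur in a real parallel run:

#include <stdio.h>

#define N 1000

int main(void)
{
    int i;
    int hits = 0;       /* shared counter with no protection */
    int data[N];

    for (i = 0; i < N; i++)
        data[i] = i % 7;

    /* The loop below still runs serially under /Qopenmp /Qtcheck, but
       Intel Thread Checker reports the unsynchronized update of 'hits'
       as a potential data race if the loop were run in parallel. */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        if (data[i] == 0)
            hits++;     /* flagged: multiple threads would write 'hits' */
    }

    printf("hits = %d\n", hits);
    return 0;
}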

When using Intel Thread Profiler with the /Qopenmp_profile option in thread count independent mode, code with OpenMP automatic parallelization pragmas is simulated as if it were running in parallel, although the application actually runs serially, with a couple of important caveats. You must run Intel Thread Profiler with a single OpenMP thread, set either in the configuration dialog or explicitly in your code using omp_set_num_threads(). As with Intel Thread Checker, simple OpenMP programming constructs are most likely to keep you in thread count independent mode; in particular, taskq and nested parallelism are not supported in this mode. Calling functions such as omp_set_num_threads(), omp_get_num_threads(), omp_get_max_threads(), omp_get_thread_num(), and omp_get_num_procs() can also make the code being evaluated depend on the number of threads, so that it is no longer thread count independent. The best advice is to keep your use of OpenMP simple when using this mode, since it is intended primarily for threading scalability assessment rather than actual threading. The next two sections describe in more detail how you can perform this analysis.
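The following illustrative sketch (again, not from the article's sample) contrasts the two styles. The first loop is thread count independent because the pragma alone expresses the parallelism; the second region partitions the work by hand from omp_get_thread_num() and omp_get_num_threads(), so its behavior depends on how many threads actually run and the tools can no longer treat it as thread count independent:

#include <stdio.h>
#include <omp.h>

#define N 1000

static double a[N];

int main(void)
{
    int i;

    /* Thread count independent: the pragma alone expresses the parallelism,
       so the tools can model this loop for any number of threads. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* Thread count dependent: the work is partitioned by hand from the
       runtime thread number and thread count, so the analysis depends on
       how many threads actually run; avoid this style when relying on
       thread count independent mode. */
    #pragma omp parallel
    {
        int id    = omp_get_thread_num();
        int nthr  = omp_get_num_threads();
        int chunk = (N + nthr - 1) / nthr;
        int lo    = id * chunk;
        int hi    = (lo + chunk > N) ? N : lo + chunk;
        int j;

        for (j = lo; j < hi; j++)
            a[j] += 1.0;
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}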


Estimating Scaling & Performance of Threading Applications Using OpenMP* and Intel® Thread Profiler

With that background, let's first illustrate estimating the scaling of application threading using OpenMP* and Intel® Thread Profiler with a simple example. Again, the power of this method is that the code doesn't actually have to be threaded or thread safe to evaluate it; different threading methods and models can be estimated in a simple, fast, and effective way before the actual work of threading your application. This method essentially evaluates which portions of the code are parallel and which are serial, and then, using Amdahl's law, calculates the potential scaling of the code if it were to run in parallel. Note again that the code must not actually be run in parallel, which can be guaranteed by setting the thread count to 1 in the Advanced Activity Configuration dialog on multi-processor systems, or by evaluating the code on a single logical processor system.
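To make the Amdahl's law estimate concrete: if the parallelizable portion accounts for a fraction p = 0.95 of the serial run time, the predicted speedup on n threads is 1 / ((1 - p) + p/n), or roughly 1 / (0.05 + 0.95/2) ≈ 1.9 on two threads and about 3.5 on four. This is the kind of calculation the tool performs for you from the measured serial and parallel regions.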

To support instrumentation for Thread Profiler in thread count independent mode, you must use version 8.0 or later of the Intel compiler. To begin, place OpenMP automatic parallelization pragmas in the appropriate places to indicate where your code would potentially run in parallel. As a very brief review of basic OpenMP programming: to support data decomposition of a for loop, use the #pragma omp parallel for statement; to support functional decomposition, use parallel sections statements. For information on these programming statements and other general documentation about OpenMP programming with the Intel compiler, refer to the compiler documentation or the articles referenced at the end of this article. Next, build your application with the /fixed:no linker option and the /Qopenmp_profile compiler command line option.
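For reference, the sketch below (a hypothetical example, not part of the prime number sample; read_input and update_stats are placeholder functions) shows both forms: a parallel for pragma for data decomposition and a parallel sections pragma for functional decomposition. It would be built with the same /Qopenmp_profile and /fixed:no options described above:

#include <stdio.h>

static void read_input(void)   { printf("reading input\n"); }
static void update_stats(void) { printf("updating statistics\n"); }

int main(void)
{
    int i;
    double sum = 0.0;

    /* Data decomposition: the loop iterations are divided among threads. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 100; i++)
        sum += 1.0 / i;

    /* Functional decomposition: each section is an independent task. */
    #pragma omp parallel sections
    {
        #pragma omp section
        read_input();

        #pragma omp section
        update_stats();
    }

    printf("sum = %f\n", sum);
    return 0;
}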

Let's now consider a straightforward example. Shown below is a simple program that calculates the prime numbers in a range of integers given as input. The sample is taken from a previous article on OpenMP programming by Clay Breshears and is used here only to illustrate the basic concepts of this article.

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int i, j;
    int start, end;            /* range of numbers to search */
    int number_of_primes = 0;  /* number of primes found */
    int number_of_41primes = 0;/* number of 4n+1 primes found */
    int number_of_43primes = 0;/* number of 4n-1 primes found */
    int prime, limit;          /* is the number prime? */
    int print_primes = 0;      /* should each prime be printed? */

    start = atoi(argv[1]);
    end = atoi(argv[2]);
    if (!(start % 2)) start++;

    if (argc == 4 && atoi(argv[3]) != 0) print_primes = 1;
    printf("Range to check for Primes: %d - %d\n", start, end);

    for (i = start; i <= end; i += 2) {
        limit = (int) sqrt((float)i) + 1;
        prime = 1; /* assume number is prime */
        j = 3;
        while (prime && (j <= limit)) {
            if (i % j == 0) prime = 0;
            j += 2;
        }

        if (prime) {
            if (print_primes) printf("%5d is prime\n", i);
            number_of_primes++;
            if (i % 4 == 1) number_of_41primes++;
            if (i % 4 == 3) number_of_43primes++;
        }
    }

    printf("Program Done. %d primes found\n", number_of_primes);
    printf("Number of 4n+1 primes found: %d\n", number_of_41primes);
    printf("Number of 4n-1 primes found: %d\n", number_of_43primes);
    return 0;
}

 

Since this code contains a for loop, we need only add a simple OpenMP for loop pragma to evaluate the potential performance of the code if it were threaded and run in parallel. Specifically, we only need to add the statement #pragma omp parallel for immediately before the for loop:

#pragma omp parallel for

for(i = start; i <= end; i += 2) {

 

After adding this line of code and compiling with the settings outlined above, we are ready to evaluate potential scaling using Intel Thread Profiler. Create a new Intel Thread Profiler project in Intel® VTune™ using the sample application; in the configuration wizard, enter 1 and 500000 in the command line arguments box and 1 in the Number of Threads box. Since we haven't yet made this program thread safe, we need to ensure that the code doesn't actually run in parallel, so on a multi-processor system (Hyper-Threading, dual-core, or dual processor) it is important to keep the number of threads set to 1 so that the code runs serially. Note that because the scalability graph extends to twice the number of threads, and we have explicitly set Number of Threads to 1, the Whole Program Estimated Speedups graph on a multi-processor system will be limited to scalability estimates for 2 threads. Now run the application with Intel VTune Thread Profiler and click on the Summary tab. Your output should appear something like what is shown in the next screenshot:



Notice the window on the right-hand side titled Whole Program Estimated Speedups. This window indicates the potential scaling of this code with 2 threads on a multi-core system. Note that in this case, the speedup curve in green shows scaling with 2 threads that appears to be near ideal, indicating that our chosen method of threading would be very effective. Your application may have different scalability targets than this admittedly small example, but it effectively illustrates the power of estimating potential scalability before the actual work of threading has begun. Using Thread Profiler in thread count independent mode can be a tremendous help in estimating potential performance gains of threading your application.


Determining Threading Data Errors and Implementation Tradeoffs Using OpenMP* and Intel® Thread Checker

If you have determined that an implementation appears to show good performance potential in terms of scaling, OpenMP* and Intel® Thread Checker can help you find and fix potential threading errors before any of the code actually runs in parallel on multiple threads. First remove the /Qopenmp_profile option and replace it with /Qopenmp /Qtcheck, making sure to keep the /fixed:no linker option. Create a new project in Intel® VTune™ Thread Checker with the sample code as before; this time, however, select a smaller input range (perhaps 1 to 50000), since code instrumented for Thread Checker runs much more slowly than normal. Running the sample application in this manner produces the following result:




The Intel VTune output window indicates several issues that need to be resolved with the variables limit, prime, j, number_of_primes, number_of_43primes, and number_of_41primes. These issues are fairly easy to address: some are fixed by moving the variable declarations inside the for loop, and the variables that accumulate sums at the end of the computation can be handled with an OpenMP reduction clause. See the original article containing this code sample for a detailed discussion of those changes and why they were made. The final code sample at the end of this article contains all of the changes necessary to make the code thread safe. After these changes are made, the code is thread safe and ready to be run on a multi-core, multiprocessor, or Hyper-Threading Technology enabled system for testing. The real power of this method is that you can use it regardless of how many processors your development system has, even when the target system runs many more threads, uncovering threaded programming issues independent of the underlying platform. Used in this manner, Intel Thread Checker can help you identify potential data race conditions and other parallel programming issues at run time, before the code ever runs in parallel, by simulating the execution of the parallel code. Intel Thread Checker gives you another valuable tool to thread applications effectively.


Conclusion

Your threading efforts can benefit greatly from iteratively employing these techniques to investigate potential threading alternatives and from using Intel® tools to find potential data errors related to threading. In this article we demonstrated how you can use OpenMP* in thread count independent mode with Intel® Thread Profiler to estimate threaded application performance and evaluate threading performance tradeoffs. We also demonstrated how to use thread count independent mode with Intel® Thread Checker to identify the data in your code that will need to be protected in your threading implementation, all without actually running the code in parallel or performing any of the actual work of threading. Using these tools as demonstrated in this article can help reduce the burden of evaluating the potential benefits of threading and of determining what data protection a threading implementation will need. These tools can be a great help in taking advantage of concurrency in current and future Intel platforms.

Prime.c – program to calculate all prime numbers in a range of inputs with corrections to allow correct threaded operation

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int i;
    int start, end;            /* range of numbers to search */
    int number_of_primes = 0;  /* number of primes found */
    int number_of_41primes = 0;/* number of 4n+1 primes found */
    int number_of_43primes = 0;/* number of 4n-1 primes found */
    int print_primes = 0;      /* should each prime be printed? */

    start = atoi(argv[1]);
    end = atoi(argv[2]);
    if (!(start % 2)) start++;

    if (argc == 4 && atoi(argv[3]) != 0) print_primes = 1;
    printf("Range to check for Primes: %d - %d\n", start, end);

    #pragma omp parallel for schedule(dynamic,100) \
        reduction(+:number_of_primes,number_of_41primes,number_of_43primes)
    for (i = start; i <= end; i += 2) {
        int prime, limit, j;
        limit = (int) sqrt((float)i) + 1;
        prime = 1; /* assume number is prime */
        j = 3;
        while (prime && (j <= limit)) {
            if (i % j == 0) prime = 0;
            j += 2;
        }

        if (prime) {
            if (print_primes) printf("%5d is prime\n", i);
            number_of_primes++;
            if (i % 4 == 1) number_of_41primes++;
            if (i % 4 == 3) number_of_43primes++;
        }
    }

    printf("Program Done. %d primes found\n", number_of_primes);
    printf("Number of 4n+1 primes found: %d\n", number_of_41primes);
    printf("Number of 4n-1 primes found: %d\n", number_of_43primes);
    return 0;
}

 


Additional Resources

 

