Intel® Tools for Thread-Oriented Development on Linux*



Intel’s support for Linux* threading has expanded significantly due to a pair of acquisitions. The company’s line-up of development products is now one of the most comprehensive toolsets available for the Linux platform.

By Andrew Binstock

During the early years of Linux*, when it was gaining traction through the work of evangelists, the operating system did not offer significant functionality in support of threading. In part, this was due to a unique threading model initially supported by Linus Torvalds, the driving force behind the Linux project. As the need for a more-standardized threading model emerged, Torvalds accepted a new model that more closely resembled commercial server operating systems such as UNIX. During the last few years, Torvalds and the keepers of the Linux flame have added significant improvements to these threading operations. One of these was the adoption of the Pthreads API. This specification was designed by a POSIX committee, which was part of a larger effort to standardize most UNIX interfaces. As a result, Pthreads is the primary API for many versions of UNIX, in addition to Linux. (A version of Pthreads is also available for Windows*, although it’s not a part of the Windows API. See*)
When version 2.4 of the Linux kernel was released in 2001, it added considerable threading functionality. The new features made the operating system multiprocessor-capable for systems with only a handful of processors. Release of version 2.6 of the kernel saw Linux threading take a big step forward by vastly increasing the number of processors it could support and completely redesigning the thread scheduler. The result of these changes is an operating system with greatly expanded scalability that presently has the capability of running truly enterprise-scale applications.
As Linux was improving its support for threads, Intel was expanding its support for threaded Linux applications. Nowhere has Intel’s support been more evident than in the set of thread-oriented development tools the company is releasing. This article discusses those tools as well as existing products and how they benefit Linux developers.

Intel’s Threading Tools for Linux*

Intel® Compilers provide the first tier of support for threading. Naturally, they support Pthreads. But more importantly, the compilers support a portable threading interface called OpenMP. OpenMP is a threading specification designed by a hardware vendor collective known as the OpenMP Architecture Review Board (ARB). Its goal is to provide a simple, portable way to thread programs. The OpenMP standard (*), now in its second major release, consists of library functions, environment variables, and pragmas. While the functions and environment variables can be used to tune the runtime environment of OpenMP programs, the real heart of OpenMP is the set of pragmas (which are called directives in Fortran).

Pragmas, for readers who have not used them for a while, are compiler-specific commands that can be embedded in source code. If a compiler does not recognize a pragma, it simply ignores it. OpenMP uses pragmas to tell the compiler to thread a loop, for example. The code to do this in C/C++ appears in the following code listing.

. . .
#pragma omp parallel for
for ( j = 0; j < 10000; j++ )
	a[j] = b[j] + c[j];


Figure 1. A for-loop parallelized in OpenMP via a pragma

This pragma tells the compiler to generate code that behind the scenes will create a group of threads, distribute the work of the loop across those threads, and then manage the threads once the loop completes. The code generates the “optimal” number of threads for the run-time system, a number that is often equal to the number of available execution pipelines. OpenMP provides similar pragmas to parallelize other parts of programs.

In addition to this radical simplicity, OpenMP has the benefit that it does not require code to be changed to run in single-threaded mode for debugging purposes. If the OpenMP pragmas are simply disabled, the code can run as a single thread. If, instead, a developer wants to debug code running with a specific number of threads, this configuration can be specified using a combination of API calls and environment variables. In sum, OpenMP provides ease of use, portability, and debugging flexibility.
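
As a minimal sketch of this flexibility, the variant of the Figure 1 loop below compiles with or without OpenMP enabled. The function name add_arrays and the requested_threads parameter are hypothetical, chosen only for illustration; omp_set_num_threads and the _OPENMP macro are defined by the OpenMP specification. When the compiler is not run in OpenMP mode, the pragma is ignored and the loop executes on a single thread.

```c
#ifdef _OPENMP
#include <omp.h>   /* only needed (and only used) when OpenMP is enabled */
#endif

/* Hypothetical helper: stores b[j] + c[j] in a[j] for j in [0, n).
 * requested_threads > 0 asks the OpenMP runtime for that many threads;
 * 0 leaves the choice to the runtime (or to OMP_NUM_THREADS). */
void add_arrays(double *a, const double *b, const double *c,
                int n, int requested_threads)
{
    int j;

#ifdef _OPENMP
    if (requested_threads > 0)
        omp_set_num_threads(requested_threads);
#else
    (void)requested_threads;   /* serial build: nothing to configure */
#endif

    /* Ignored by a compiler that is not in OpenMP mode. */
#pragma omp parallel for
    for (j = 0; j < n; j++)
        a[j] = b[j] + c[j];
}
```

Calling add_arrays(a, b, c, n, 1) forces one thread for a debugging session. Setting the OMP_NUM_THREADS environment variable before launching the program is the usual way to get a similar effect without touching the code.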

However, to use OpenMP, a compiler that supports it is necessary. On PCs, only a few OpenMP compilers are available. Intel provides an advanced optimizing compiler that supports OpenMP on Linux (and Windows) for C/C++ and FORTRAN. To ease integration, the C/C++ compilers are command-line compatible with and use the same binary and debug formats as the most popular compilers on the respective platforms (gcc for Linux and Microsoft Visual C++ for Windows).

While OpenMP’s advantages are considerable, it does not provide fine-grained control of threads. To work at the lower levels, Linux developers use the Pthreads API discussed previously. When working with Pthreads, these developers will find Intel’s other threading tools to be especially useful. These tools include program analysis and performance-profiling products.
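
To give a sense of what working at the lower level involves, here is a rough Pthreads counterpart to the loop in Figure 1. It is only a sketch: the names slice, add_slice, add_arrays_pthreads, and the fixed NUM_THREADS value are assumptions made for illustration, while pthread_create and pthread_join are standard Pthreads calls. The developer, not the compiler, decides how to divide the iterations and when to wait for the threads.

```c
#include <pthread.h>

#define NUM_THREADS 4   /* illustrative value, not a recommendation */

/* The part of the arrays that one thread is responsible for. */
struct slice {
    double *a;
    const double *b;
    const double *c;
    int begin;   /* first index, inclusive */
    int end;     /* last index, exclusive */
};

/* Thread entry point: adds b and c over the assigned slice. */
static void *add_slice(void *arg)
{
    struct slice *s = arg;
    int j;

    for (j = s->begin; j < s->end; j++)
        s->a[j] = s->b[j] + s->c[j];
    return NULL;
}

/* Splits the loop across NUM_THREADS threads and waits for all of them.
 * Returns 0 on success, or the pthread_create error code on failure. */
int add_arrays_pthreads(double *a, const double *b, const double *c, int n)
{
    pthread_t threads[NUM_THREADS];
    struct slice slices[NUM_THREADS];
    int chunk = (n + NUM_THREADS - 1) / NUM_THREADS;   /* round up */
    int t, started = 0, rc = 0;

    for (t = 0; t < NUM_THREADS; t++) {
        int begin = t * chunk;
        int end = begin + chunk;

        if (begin > n) begin = n;   /* clamp slices past the end */
        if (end > n) end = n;

        slices[t].a = a;
        slices[t].b = b;
        slices[t].c = c;
        slices[t].begin = begin;
        slices[t].end = end;

        rc = pthread_create(&threads[t], NULL, add_slice, &slices[t]);
        if (rc != 0)
            break;
        started++;
    }

    /* Wait only for the threads that were actually started. */
    for (t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    return rc;
}
```

The extra bookkeeping, compared with the single pragma in Figure 1, is the price of the finer control that Pthreads offers.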

Intel® Thread Checker v. 2.1

The Linux version of Intel® Thread Checker provides useful data about thread operations for Pthreads and OpenMP. Intel Thread Checker detects classic situations that bedevil developers doing parallel programming. These include: deadlocks (in which two threads are each waiting on the other), race conditions (two threads accessing the same data field simultaneously), thread stack overflows, and other difficult-to-reproduce infelicities. Intel Thread Checker uses performance instrumentation to collect the thread run-time information on the Linux system. It then analyzes the data and displays it on a Windows system. For Linux users, the package comes with the Windows product and the remote data collector for deployment on the Linux platform. Intel Thread Checker works with binaries generated by either gcc or the Intel Compilers. A separate version is available for Linux applications running on Intel’s 64-bit enterprise chip, the Itanium® processor.
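
For readers who have not met a race condition in practice, the following sketch shows the kind of code in which one arises and the usual Pthreads remedy. It does not reproduce Intel Thread Checker output; the names counter, counter_lock, increment_counter, and run_counter_demo, and the WORKERS and INCREMENTS values, are hypothetical. Several threads increment one shared counter, and the mutex makes each read-modify-write step exclusive.

```c
#include <pthread.h>

#define WORKERS 4          /* illustrative values */
#define INCREMENTS 100000

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker adds INCREMENTS to the shared counter. */
static void *increment_counter(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;   /* without the lock, concurrent updates could be lost */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs the workers and returns the final counter value,
 * or -1 if not every thread could be started. */
long run_counter_demo(void)
{
    pthread_t threads[WORKERS];
    int t, started = 0;

    counter = 0;
    for (t = 0; t < WORKERS; t++) {
        if (pthread_create(&threads[t], NULL, increment_counter, NULL) != 0)
            break;
        started++;
    }
    for (t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    return (started == WORKERS) ? counter : -1;
}
```

If the lock and unlock calls are removed, the result may fall short of WORKERS * INCREMENTS on some runs and not on others, which is exactly why such defects are hard to reproduce by testing alone.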

Intel® Thread Profiler v. 2.1

This product measures a thread’s execution performance. In particular, it measures thread overhead and the impact of synchronization on program performance. The problem of measuring thread performance is particularly difficult, because threads interact in complex ways and are frequently being stopped and started, depending on the specific task they’re working on. Measurement of synchronization delay is also a thorny task. In our recent book for Intel Press, Programming with Hyper-Threading Technology: How to Write Multithreaded Software for Intel® IA-32 Processors, Rich Gerber and I discuss this problem in some depth. The challenge is recording delays that are very brief. It’s not the measurement that is difficult, but the recording of the data. If not done exactly right, there can be numerous moments when writing out this information will take far longer than the delay itself.

When the recording takes nearly as much time as the event itself, there arises an effect reminiscent of Heisenberg’s uncertainty principle (which loosely articulated says that due to the inherent nature of measurement, the more accurately you know a particle’s location, the less accurately you can know its velocity. Likewise, the more accurately you know its velocity, the less accurately you can know its location). In this case, the more accurately you measure the delay of any single thread, the less accurately you can measure overall program performance. For example, if several threads are waiting for access to a locked portion of the program, the delay caused by recording wait data for the first thread when it’s finally granted access is duplicated in the wait data of all other threads waiting on the same access. These recording delays compound thread latencies and provide a wholly inaccurate profile of performance execution.
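
To make the recording problem concrete, the sketch below times how long a thread waits for a mutex and stores the result in a preallocated in-memory buffer, deferring any output until after the timed region. This is an illustration of the general idea only, not a description of how any Intel tool works. The names wait_ns, timed_lock, recorded_samples, and recorded_wait_ns are hypothetical, and the code assumes that every caller uses the same lock, since that lock is also what protects the sample buffer.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime under strict ISO modes */
#include <pthread.h>
#include <time.h>

#define MAX_SAMPLES 1024   /* illustrative buffer size */

static double wait_ns[MAX_SAMPLES];   /* filled in memory, written out later */
static int sample_count = 0;

/* Nanoseconds elapsed between two timestamps. */
static double elapsed_ns(const struct timespec *start,
                         const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) * 1e9
         + (end->tv_nsec - start->tv_nsec);
}

/* Acquires the lock and records how long the caller waited for it.
 * The sample is stored while the lock is held, so the buffer is safe
 * only if all callers pass the same lock (an assumption of this sketch).
 * No I/O happens here; even so, storing the sample lengthens the time
 * the lock is held, which is the distortion discussed above. */
void timed_lock(pthread_mutex_t *lock)
{
    struct timespec before, after;

    clock_gettime(CLOCK_MONOTONIC, &before);
    pthread_mutex_lock(lock);
    clock_gettime(CLOCK_MONOTONIC, &after);

    if (sample_count < MAX_SAMPLES)
        wait_ns[sample_count++] = elapsed_ns(&before, &after);
}

/* Accessors for reporting once the measured run has finished. */
int recorded_samples(void)
{
    return sample_count;
}

double recorded_wait_ns(int i)
{
    return wait_ns[i];
}
```

Replacing the array store with a write to a file inside timed_lock would add the cost of that write to the wait of every thread queued behind the lock, which is the compounding effect described in the preceding paragraph.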

The Thread Profiler addresses this problem by running as a separate process that performs numerous time samplings of the threads’ execution. From this data, it is able to obtain fairly high-resolution execution profiles of individual threads. Likewise, the product can draw a critical path diagram that shows the principal thread operating at any point in the program’s execution.

As with the Intel Thread Checker, the Linux version of the Thread Profiler performs data collection on the Linux platform and performs analysis and display on a Windows system. Like Thread Checker, it works with either Pthreads or OpenMP and with gcc or Intel Compilers for Linux.

Tracing Tools for Linux

To expand support for high-performance computing (HPC), Intel recently purchased two tools from German vendor Pallas. Formerly marketed as Vampir* and VampirTrace*, Intel® Trace Analyzer and Intel® Trace Collector make welcome additions to the Intel® Software Development Products suite of tools. These products are primarily oriented towards tracking messages passed between processes on clustered systems.

Intel Trace Collector is a low-overhead tracing library that records the execution path and related data for multithreaded processes, especially those using the Message Passing Interface (MPI). MPI is a consortium-developed standard for sharing data between processes. Intel Trace Collector specifically supports the LAM implementation of MPI, which is open source (and available at*). Intel Trace Collector is thread-safe and so it can follow MPI actions on a per-thread basis. In addition, it can automate function profiling for all major platforms that use the gcc tool chain, including Linux on IA-32 processors and on the Itanium processor family.

The second product, Intel Trace Analyzer, analyzes and displays the data gathered by Intel Trace Collector. In this way, it gives developers the ability to identify hotspots where performance can be most improved.

Beyond tracing information, Intel Trace Analyzer also displays detailed information on the application’s runtime behavior. As it does with the event traces, this tool can display execution behavior at various levels of detail and at different levels of system abstraction (process, node, or cluster). Consequently, the effect of any one hotspot on an entire cluster’s performance can be gauged accurately.


During the last two years, Intel has significantly expanded its product offerings for Linux developers. The Intel® C/C++ Compiler has become much more closely mapped to gcc as to command-line switches and emitted code, while emitting much, much faster code. The compiler’s support for both Pthreads and OpenMP is testimony to Linux’s addition of robust threading capability.

In addition, Intel offers products that locate thread-specific problems and hotspots. For HPC systems, such as clusters, new trace-oriented tools monitor MPI activity in thread-aware fashion. As Linux continues to find an expanded role on IT servers and HPC systems, Intel’s support for advanced development tools that make best use of the operating system and the processors below is likely to expand. Meanwhile, links at the end of this article will take you to additional coverage of the products discussed here.



About the Author

Andrew Binstock is the principal analyst at Pacific Data Works LLC. Previously he was the director of PricewaterhouseCoopers’s Global Technology Forecasts. Earlier, he was the editor in chief of UNIX Review, the first mainstream technical publication to provide regular coverage of Linux and UNIX on PCs. Binstock can be reached at
