Intel® Inspector XE, one of the three components within the Intel® Parallel Studio XE suite product, is used to find correctness errors in Windows or Linux applications. Intel Inspector XE automatically finds memory errors, deadlocks and other conditions that could lead to deadlocks, data races, thread stalls, and more.
This article is part of the larger series, "Developing Multithreaded Applications: A Platform Consistent Approach," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.
Debugging threaded applications can be difficult because debuggers change the runtime behavior and timing of the program, which can mask race conditions. Even print statements can mask issues, because they use synchronization and operating system functions. Consider the code sample below from the perspective of what threading errors may lie dormant.
Let's take a close look at some common threading errors. In this code snippet, the global variable col is modified in the function render_one_pixel. If multiple threads write to col, the value in col depends on which thread writes last. This is a classic example of a data race.
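To make the "last writer wins" effect concrete, the following is a minimal sketch in standard C++, not the article's own sample. The global col mirrors the variable named in the text; the function value_seen_by_thread_a and the step handshake are hypothetical and exist only to force the unlucky interleaving that, in real code, would depend on timing. std::atomic is used so that the sketch itself has defined behavior, whereas the sample described above uses a plain global.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Shared "global color", standing in for the global col in the text.
std::atomic<int> col{0};

// Thread A writes 1 to col and later reads it back. Thread B overwrites col
// with 2 in between. Returns the value that thread A reads back.
int value_seen_by_thread_a() {
    std::atomic<int> step{0};  // handshake, used only to force the ordering
    int seen_by_a = 0;

    std::thread a([&] {
        col.store(1);                                        // A writes its color
        step.store(1);                                       // let B proceed
        while (step.load() != 2) std::this_thread::yield();  // wait for B's write
        seen_by_a = col.load();                              // A reads B's value
    });
    std::thread b([&] {
        while (step.load() != 1) std::this_thread::yield();  // wait for A's write
        col.store(2);                                        // B overwrites col
        step.store(2);                                       // let A continue
    });

    a.join();
    b.join();
    assert(col.load() == 2);  // B wrote last
    return seen_by_a;
}
```

Calling value_seen_by_thread_a() shows thread A reading back a value it never wrote. Without the forced handshake, the same outcome would occur only occasionally, which is what makes such bugs hard to reproduce.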
Race conditions are tricky to detect because, in a given run, the threads might "win the race" in the order that happens to make the program function correctly. Just because a program works once doesn't mean that it will always work. Testing your program on various machines, some with Hyper-Threading Technology and some with multiple physical processors, is one approach, but it is cumbersome and unpredictable because the problem is hard to reproduce during testing. Tools such as Intel Inspector XE can help. Traditional debuggers may not be useful for detecting race conditions: they stop one thread in the "race" while the other threads continue, which can significantly change the runtime behavior and obscure the data race.
Use Intel Inspector XE to facilitate debugging of multithreaded applications. Intel Inspector XE provides valuable parallel execution information and debugging hints. Using dynamic binary instrumentation, Intel Inspector XE executes your application and monitors common threading APIs and all memory accesses in an attempt to identify coding errors. It can readily find the infrequent errors that never seem to happen during testing but always seem to happen at a customer's site. These are called intermittent bugs, and they are especially common in multithreaded programming. The tool is designed to detect and locate such notorious errors. The important thing to remember when using the tool is to try to exercise all code paths while accessing the least amount of memory possible; this speeds up the data-collection process. Usually, a small change to the source code or data set is required to reduce the amount of data processed by the application.
To prepare a program for Intel Inspector XE analysis, compile with optimization disabled and debugging symbols enabled. Launch the standalone Intel Inspector XE GUI from the Windows "Start" menu. Create a new project, and specify the application to analyze and the directory in which the application should run. Click the "New Analysis" icon on the toolbar. Under "Threading Error Analysis," select the third level, "Locate Deadlocks and Dataraces." Click "Start" to start the analysis.
After clicking "Start," Intel Inspector XE runs the application using dynamic binary instrumentation. Once the application exits, the tool presents a summary of its findings.
Figure 1. Summary overview in Intel Inspector XE
Figure 2. Source view in the Intel Inspector XE
Once the error report is obtained and the root cause is identified with the help of Intel Inspector XE, a developer should consider how to fix the problems. General considerations for avoiding data races in parallel code are discussed below, followed by advice on how to fix the problem in the examined code.
Modify code to use a local variable instead of a global
In the code sample, the variable col declared at line #80 could instead be declared as a local variable at line #88 (see the comments provided in the sample). If each thread no longer references global data, there is no race condition, because each thread has its own copy of the variable. This is the preferred method for fixing this issue.
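The local-variable fix can be sketched as follows in standard C++. This is an illustration under stated assumptions, not the article's sample: the color_t layout, the per-pixel arithmetic, and the render helper are all hypothetical, and std::thread is used in place of Intel TBB so the block stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical color type; the real sample's definition is not shown here.
struct color_t { int r, g, b; };

// col is a local variable, so every call (and therefore every thread)
// works on its own copy and no shared state is written.
color_t render_one_pixel(int x, int y) {
    color_t col;
    col.r = x;
    col.g = y;
    col.b = x + y;
    return col;
}

// Renders one row per thread. Each thread writes only its own row of the
// pre-sized image, so the threads never touch the same element.
std::vector<color_t> render(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<color_t> image(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    for (int y = 0; y < height; ++y) {
        workers.emplace_back([&image, width, y] {
            for (int x = 0; x < width; ++x)
                image[static_cast<std::size_t>(y) * width + x] = render_one_pixel(x, y);
        });
    }
    for (auto& t : workers) t.join();
    return image;
}
```

Because nothing is shared except disjoint slots of the output image, this version needs no locking at all, which is why the local-variable approach is preferred when the algorithm allows it.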
Use a mutex to control access to global data
There are many algorithmic reasons for accessing global data, and it may not be possible to convert the global variable to a local variable. In these cases controlling access to the global by using a mutex is how threads typically achieve safe global data access.
This sample happens to use Intel® Threading Building Blocks (Intel® TBB) to create and manage threads, but Intel Inspector XE also supports numerous other threading models. Intel TBB provides several mutex patterns that can be used to control access to global data.
In this new snippet, a countMutex is declared as a scoped_lock. The semantics of a scoped_lock are as follows: the lock is acquired by the scoped_lock constructor and released automatically by the destructor when the code leaves the block. Therefore, only one thread can execute render_one_pixel() at a time; additional threads that call render_one_pixel() block at the scoped_lock and are forced to wait until the previous thread completes. The use of a mutex does affect performance, so it is critical to keep the scope of the mutex as small as possible so that threads wait for the shortest interval.
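The scoped-lock pattern described above can be sketched with the C++ standard library. Note the substitution: std::mutex and std::lock_guard are used here as stand-ins for the Intel TBB mutex and its scoped_lock, since both follow the same acquire-in-constructor, release-in-destructor idea. The name countMutex is taken from the text; pixels_rendered and render_all are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex countMutex;     // protects pixels_rendered
long pixels_rendered = 0;  // hypothetical shared global

void render_one_pixel() {
    // Per-pixel work that touches no shared state would go here,
    // outside the lock, so other threads are not held up by it.
    {
        std::lock_guard<std::mutex> lock(countMutex);  // lock acquired here
        ++pixels_rendered;                             // only the shared update is guarded
    }  // lock released here, when 'lock' goes out of scope
}

// Runs render_one_pixel from several threads and returns the final count.
long render_all(int threads, int pixels_per_thread) {
    assert(threads > 0 && pixels_per_thread >= 0);
    pixels_rendered = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([pixels_per_thread] {
            for (int i = 0; i < pixels_per_thread; ++i) render_one_pixel();
        });
    }
    for (auto& w : workers) w.join();
    return pixels_rendered;
}
```

The inner braces keep the critical section down to the single shared update. Locking the whole body of render_one_pixel would also be correct, but it would serialize all of the rendering work.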
Use a concurrent container to control access to global data
In addition to using a mutex to control access to global data, Intel TBB also provides several highly concurrent container classes. A concurrent container allows multiple threads to access and update data in the container. Containers provided by Intel TBB offer a higher level of concurrency by using two methods: fine-grained locking and lock-free algorithms. The use of these containers does come with additional overhead, and tradeoffs need to be considered with regard to whether the concurrency speedup makes up for this overhead.
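The two techniques named above can be illustrated with a small standard C++ sketch. This is not how the Intel TBB containers are implemented and it uses none of their APIs; StripedHistogram and fill are hypothetical names chosen for illustration. The histogram shows fine-grained locking (one mutex per bucket instead of one for the whole structure), and the atomic counter shows an update that is typically performed without taking a lock.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Fine-grained locking: each bucket has its own mutex, so threads that
// update different buckets do not wait for each other.
class StripedHistogram {
public:
    static constexpr std::size_t kBuckets = 8;

    void add(std::size_t value) {
        Bucket& b = buckets_[value % kBuckets];
        std::lock_guard<std::mutex> lock(b.m);
        ++b.count;
    }

    long count(std::size_t bucket) {
        Bucket& b = buckets_[bucket % kBuckets];
        std::lock_guard<std::mutex> lock(b.m);
        return b.count;
    }

private:
    struct Bucket {
        std::mutex m;
        long count = 0;
    };
    std::array<Bucket, kBuckets> buckets_;
};

// Fills the histogram from several threads and also keeps a running total
// in an atomic counter, updated with fetch_add rather than under a mutex.
long fill(StripedHistogram& h, int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&h, &total, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                h.add(static_cast<std::size_t>(i));
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

The per-bucket mutexes and the atomic counter each add some cost per operation, which is the overhead tradeoff mentioned above: they pay off only when enough threads would otherwise contend for a single lock.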
Intel Inspector XE is currently available for the 32-bit and 64-bit versions of the Microsoft* Windows XP, Windows Vista, and Windows 7 operating systems, and it integrates into Microsoft* Visual Studio 2005, 2008, and 2010. It is also available on 32-bit and 64-bit versions of Linux*.
Intel Inspector XE instrumentation increases the CPU and memory requirements of an application, so choosing a small but representative test problem is very important. Workloads with runtimes of a few seconds are best. Workloads do not have to be realistic; they just have to exercise the relevant sections of multithreaded code.
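One way to keep a small analysis workload alongside the real one is to make the problem size selectable at run time. The sketch below assumes a hypothetical flag value passed in by the caller (for example, read from an environment variable of your own choosing); it is not an Intel Inspector XE feature, and the Workload type and the image sizes are invented for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical problem-size description for a rendering workload.
struct Workload {
    int width;
    int height;
};

// Returns a reduced workload when small_run_flag is "1", and the full
// workload otherwise (including when the flag is absent).
Workload choose_workload(const char* small_run_flag) {
    const Workload full{1920, 1080};
    const Workload reduced{64, 48};
    // The reduced run must actually be smaller to be worth having.
    assert(reduced.width * reduced.height < full.width * full.height);

    const bool small_run =
        small_run_flag != nullptr && std::string(small_run_flag) == "1";
    return small_run ? reduced : full;
}
```

A caller could pass the result of std::getenv (from <cstdlib>) with a variable name of its own, so that the same binary runs the reduced image under analysis and the full image otherwise. The reduced size should still drive every threaded code path that the full run exercises.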