Analyzing Thread Performance

By Jeff Andrews


The Intel® VTune™ Performance Analyzer powerfully spotlights performance bottlenecks. While many developers have used the VTune analyzer to sample an application, they may not have used it with threaded applications. It is relatively simple to switch between different threads of a profiled application, but that simplicity may not be immediately apparent. This paper will clarify common questions surrounding use of VTune with threaded applications and show what the user needs to look for and to select in the VTune environment.

This paper assumes the reader knows the basic operations of the VTune Performance Analyzer for sampling an application based on clock ticks. Those who wish to familiarize themselves with the VTune environment can do so from the Intel® VTune™ Performance Analyzer page. From there, you can download an evaluation copy of the software and get additional product and technical information as well as other resources.

While knowledge of OpenMP will help the reader to get the most out of this paper, it is not required.

Sample Application

The following code generated the sampling session for the examples in this paper:

#include <stdio.h>
#include <omp.h>

#define MAXITEMS    10000

int main( void )
{
    int i, j;
    float *pSrc, *pDst, *pMod;

    pSrc = new float[ MAXITEMS ];
    pDst = new float[ MAXITEMS ];
    pMod = new float[ MAXITEMS ];

    // Initialize the source and modifier arrays.
    for ( i=0; i < MAXITEMS; i++ )
    {
        pSrc[ i ] = 1.0f;
        pMod[ i ] = 2.0f;
    }

    printf( "Thread Tester\n" );
    printf( "Num processors: %d\n", omp_get_num_procs() );

    // Repeat the work so the run is long enough to sample.
    for ( j=0; j < 200000; j++ )
    {
        // OpenMP splits the iterations of the inner loop across threads.
        #pragma omp parallel for
        for ( i=0; i < MAXITEMS; i++ )
            pDst[ i ] = (pSrc[ i ] * 10.0f) / pMod[ i ];
    }

    delete [] pMod;
    delete [] pDst;
    delete [] pSrc;

    return 0;
}



For the benefit of readers not familiar with OpenMP, the statement #pragma omp parallel for tells the compiler to divide the iterations of the FOR loop that follows it among a team of threads, typically one thread per processor in the system. This sample case involves two processors.

VTune Analyzer Configuration

The VTune environment was configured to do "Event Based Sampling" with calibration for the "Event Ratio" of "Clockticks per Instructions Retired".

The sampling duration was set to 20 seconds without any delay.

The application to launch is the console application of the source code shown in the previous section. No parameters are needed when launching the application.

Sampling Results

Module View

The initial view shown after the VTune analyzer has finished profiling the application is the "Module" view. As shown in Figure 1, the "Module" button is highlighted. The numbers shown in the "Selection Summary" area are for all threads of the module. Double-clicking the module, which in this case is "Thread.exe", will open the "Hotspots" view, showing the hotspot data for the entire module.

Figure 1: Module View

Process View

Click on the "Process" button (visible at the top of Figure 1) to switch over to the "Process" view, shown in Figure 2. Notice that the numbers in the "Selection Summary" are slightly higher than those of the "Module" view. This increase is due to other functionality that is not part of the module but is called by the "Thread.exe" module. There is not a large increase for this application, since there are no DLLs being called. Applications that call several DLLs would exhibit much larger differences between the "Process" and "Module" numbers.

Double-click the process to open it in the "Thread" view.

Figure 2: Process View

Thread View

The "Thread" view, shown in Figure 3, displays all of the threads of the application's process. Notice that there are three threads in this view. The top (smallest) thread is for the new and delete commands, which are not part of the processing time of the module. The two bottom threads are for the inner FOR loop, which was broken up into two threads by the OpenMP command, as seen in the source code, because this application was run on a dual-processor machine.

Double-click one of the two bottom threads to re-open this application in the "Module" view.

Figure 3: Thread View

Alternatively, if there are multiple threads in your application, and you want to select more than one thread at a time to view, simply select the threads using the "Control" key when clicking. When all the desired threads have been selected, right-click any of the selected threads to bring up the context menu. Select "Open in New Window" from the context menu and it will take you to the "Module" view showing data for only the selected threads.

Module View #2

We have now returned to the "Module" view, shown in Figure 4. Notice that the numbers listed in the "Selection Summary" pane are now about half of what they were when the "Module" view was initially displayed. These numbers are for the thread that was selected in the "Thread" view.

Double-clicking the module will now open it up in the "Hotspots" view, showing only data for the selected thread of the module.

Figure 4: Module View #2

In order to return to viewing all the threads of the profiled module, simply return to the "Thread" view by clicking on the "Thread" button (visible at the top of Figure 4). Once in the "Thread" view, select all the threads of your module and click on the "Module" button. Alternatively, you can switch to the "Process" view (this will automatically select all the threads), select the desired process, and then switch back to the "Module" view.

Related Resources

Related Download

  • Click to download (ZIP 178KB) the MSVC6, CPP and Intel® VTune™ Performance Analyzer pack-and-go project files.



About the Author

Jeff Andrews is an Application Engineer with Intel Corporation specializing in optimizing code for ISVs. He also researches software technologies that enhance performance of applications on Intel processors.

Please refer to the Optimization Notice page for more detailed information regarding performance and optimization in Intel software products.