Evaluate instructions-retired data in conjunction with performance data to examine the correctness of threading methodology. The Instructions Retired processor event in the VTune™ Performance Analyzer is a key performance indicator. Instructions Retired can give you quick insight into possible performance problems in your application.
When an instruction has been completely executed, it is said to be "retired". The number of instructions retired is therefore a measure of the amount of work that has been done.
Gather instructions-retired data from both the serial and the threaded versions of the application; if the two counts differ significantly, you should suspect threading problems. When an application is threaded and run in parallel, the amount of work accomplished is still roughly the same, so you should see roughly the same number of instructions retired. In this sense, any extra instructions in the parallel version of the code constitute overhead; if the overhead is too large, you should suspect a problem.
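The comparison described above can be sketched as a small calculation. This is a minimal sketch and not part of the VTune analyzer: the function names, the 5% threshold, and the sample counts are hypothetical, chosen only for illustration.

```python
def instruction_overhead(serial_count, threaded_count):
    """Return the extra instructions retired by the threaded run,
    as a fraction of the serial run's count."""
    if serial_count <= 0:
        raise ValueError("serial_count must be positive")
    return (threaded_count - serial_count) / serial_count


def looks_suspicious(serial_count, threaded_count, threshold=0.05):
    """Flag a threaded run whose overhead exceeds the threshold.

    The 5% default is an assumed cut-off, not a documented limit.
    """
    return instruction_overhead(serial_count, threaded_count) > threshold


if __name__ == "__main__":
    # Hypothetical counts read from two profiling runs.
    serial, threaded = 1_000_000_000, 1_140_000_000
    print(f"overhead: {instruction_overhead(serial, threaded):.0%}")
    print("suspect threading problem:", looks_suspicious(serial, threaded))
```

What counts as "too large" depends on the application, so the threshold is best treated as a tunable parameter rather than a fixed rule.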
This reasoning holds only for applications whose parallel algorithms resemble their serial counterparts. Where the serial and parallel algorithms are fundamentally different, the instructions-retired data can vary significantly and the comparison is no longer meaningful.
A proper parallel version of the sample code distributes the work into two equal-sized pieces and synchronizes with a "spin-wait" loop that repeatedly checks the status of a synchronization variable. In the figure below, the instructions-retired count of the proper parallel version matches that of the serial version, and the average Hyper-Threading Technology scaling is good (1.20x, a 20% performance improvement):
As an illustration of a situation where the workload is imbalanced, the sample code was altered so that one child thread would do 25% of the work and the other child 75%. The figure below shows information analogous to that given in the previous figure for this intentionally faulty version of the code:
The number of instructions retired increased by a factor of 1.14 (a 14% increase). The change is also reflected in the Hyper-Threading Technology scaling, which dropped below 1.0 (0.86x), corresponding to a 14% performance degradation.
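The percentages quoted for the two versions follow directly from the reported multipliers. A brief sketch of that arithmetic, where `percent_change` is a hypothetical helper name introduced only for this example:

```python
def percent_change(ratio):
    """Convert a multiplier such as 1.14x into a signed percentage change."""
    return round((ratio - 1.0) * 100.0, 1)


if __name__ == "__main__":
    # Multipliers quoted in the text for the two versions of the code.
    for label, ratio in [
        ("balanced scaling", 1.20),
        ("imbalanced instructions retired", 1.14),
        ("imbalanced scaling", 0.86),
    ]:
        print(f"{label}: {ratio:.2f}x -> {percent_change(ratio):+.1f}%")
```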
The workload imbalance caused more time to be spent in the spin-wait loop. The continuous looping and checking of the synchronization variable added significant overhead (instructions). These additional instructions on a Hyper-Threading Technology processor can heavily impact performance. Since the primary source of threading overhead is synchronization, the choice of synchronization types is critical to the performance of the application.
When threads are imbalanced, the type of synchronization can affect performance on a Hyper-Threading Technology-enabled processor. As this second figure shows, when the synchronization is changed to use OS calls, there is once again a performance gain. In this case, however, the instructions-retired data do not reveal the load-balancing problem that gives inferior scaling. The instructions-retired event indicates places where time is being wasted, and OS synchronization prevents threads from wasting work while they wait. These examples point out the dangers of using spin-waits for synchronization. Everything is fine as long as the waits are short, but when a thread must wait a significant amount of time, work is wasted. That inefficiency is discernible in the instructions-retired data.
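The difference between the two synchronization styles can be illustrated with a small sketch. It is not the sample code measured above: the function names are hypothetical, `time.sleep` stands in for the longer share of the work, and the poll count and process CPU time are only rough proxies for instructions retired. Python's global interpreter lock also means this shows the waiting pattern, not Hyper-Threading Technology behavior.

```python
import threading
import time


def spin_wait_demo(work_seconds):
    """Wait for a worker by polling a shared flag; return the poll count."""
    state = {"done": False}

    def worker():
        time.sleep(work_seconds)  # stand-in for the larger share of work
        state["done"] = True

    t = threading.Thread(target=worker)
    t.start()
    polls = 0
    while True:
        polls += 1  # every check costs instructions
        if state["done"]:
            break
    t.join()
    return polls


def blocking_wait_demo(work_seconds):
    """Wait for a worker with threading.Event; the waiter sleeps, not polls."""
    done = threading.Event()

    def worker():
        time.sleep(work_seconds)
        done.set()

    t = threading.Thread(target=worker)
    t.start()
    signaled = done.wait()  # blocks until the worker calls set()
    t.join()
    return signaled


def cpu_seconds(fn, *args):
    """Return (result, CPU time used by the whole process) for fn(*args)."""
    start = time.process_time()
    result = fn(*args)
    return result, time.process_time() - start


if __name__ == "__main__":
    polls, spin_cpu = cpu_seconds(spin_wait_demo, 0.2)
    _, block_cpu = cpu_seconds(blocking_wait_demo, 0.2)
    print(f"spin-wait: {polls} polls, {spin_cpu:.3f} s CPU")
    print(f"blocking wait: {block_cpu:.3f} s CPU")
```

The longer the waiting thread has to wait, the more polls the spin-wait version executes, whereas the blocking version should consume very little CPU time regardless of the wait; exact figures vary by machine.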
One might wonder what would have happened if OS-based synchronization had been used from the beginning. Performing this experiment, one finds that the performance improvement (20%) does not change.
Using the Processor Time counter of the performance object Thread to gauge the correctness of threading methodology is addressed in a separate item:
Processor Time Counter to Evaluate Threading Methodology
"Evaluating Instructions Retired Events on Intel® Processors with Hyper-Threading Technology"