Performance Analysis of Threading Building Blocks

Intel® Threading Building Blocks (TBB) is a popular abstraction for expressing parallelism in C++ software. Threading Building Blocks encourages good decomposition of work for threading. But do you know how to check how well your code is tuned, so that you use Threading Building Blocks most effectively?

Today Douglas Armstrong, Intel® VTune™ Amplifier XE architect, joins me to share tips on using VTune Amplifier XE for tuning TBB software. VTune Amplifier XE has built-in support for finding and tuning the granularity of domain decomposition in TBB. Douglas feels this is an under-appreciated feature and has captured some screenshots to share with us. Douglas created a sample TBB application and analyzed it with the concurrency analysis of VTune Amplifier XE. Below is a screenshot of the summary view. It appears as though the application was almost fully parallel the entire time.



You may think this means we have almost perfect threading, since there are no blocked threads or contention on synchronization. But as we scroll through the full summary display, you'll see that the elapsed time is over 8 1/2 seconds. There is also a metric here called "Overhead Time" showing over 1 1/2 seconds, meaning more than 10% of the total CPU time of the application was spent on synchronization overhead. You can even see this CPU time being spent in the TBB internals in the list of the top hotspots of the application. Amplifier XE even puts up a warning note saying that we may have a problem with synchronization overhead.



Let’s take a look at the results in the “Bottom-up” tab to see what’s going on. If we just look at where the CPU Time is being spent, you’ll see that it is all labeled in green, meaning that we were fully utilizing the system’s processors while the application was running. About 14 seconds of productive work shows up in do_work. But notice that some of the CPU Time is showing up in TBB functions, and this time is also marked as Overhead Time.


If we arrange the data by module, instead of the default arrangement by function, we can see that there is three fourths of a second of overhead time in the TBB DLL, but there is also almost a second of overhead time in the user module. This is because the TBB header files contain many templates, so the resulting inlined functions spend time on TBB overhead inside the user modules. Arranging the data by source file confirms this.





So now that we know we have a lot of overhead from using TBB, how do we fix it? Let’s start by figuring out where that overhead is coming from. This test application uses TBB for a couple of different algorithms, so we need to find out which one is causing the problem. Looking at the function with the most overhead time, we see it is labeled “[TBB parallel_for on class inner_body]”. This is Amplifier XE’s way of telling us that the time was spent processing a TBB parallel_for template instantiated on the class “inner_body”. We could go check how we are using that in our source code. Selecting that row, the call stack pane on the far right shows the call stacks associated with the selected overhead. Make sure that the stack type selector combo box is set to “CPU Time” instead of “Wait Time” when looking for overhead. The third row down refers to the line of source code where we called parallel_for() with this inner_body class. We can double-click on that and see the actual source code.
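For readers who have not written this style of TBB code, here is a minimal sketch of the kind of call site Amplifier XE is pointing at. The class name inner_body matches the label in the screenshots, but everything else, including the helper do_work() and the run() wrapper, is an assumption for illustration, not the actual sample code.

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    // Hypothetical stand-in for the per-element work in the sample application.
    static void do_work(float& x) { x = x * 2.0f + 1.0f; }

    // Body class handed to parallel_for; TBB invokes operator() on the
    // sub-ranges it carves out of the full iteration space.
    class inner_body {
        float* data;
    public:
        inner_body(float* d) : data(d) {}
        void operator()(const tbb::blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                do_work(data[i]);
        }
    };

    void run(float* data, size_t n) {
        // This is the kind of call site Amplifier XE resolves when we
        // double-click the "[TBB parallel_for on class inner_body]" row.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n), inner_body(data));
    }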





By looking further we find that a grainsize of 1 was manually specified for this blocked_range of over 8,000 iterations. The appropriate grainsize comes from balancing two goals: doing enough cycles in the body that the ratio of work to overhead stays low, while still giving TBB the flexibility it needs to schedule work on different threads to keep them busy. This grainsize is too small. A bad setting in the other direction, a grainsize that is too large, could cause imbalance problems, with TBB unable to break up the work and schedule pieces to other threads. That would show up in Amplifier XE as poor utilization of CPUs.
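Using the hypothetical inner_body class and data pointer from the sketch above, the call Amplifier XE led us to would look roughly like this; the 8,000-iteration range matches the article, but the rest is assumed:

    // A manual grainsize of 1 on a large range lets every iteration become
    // its own task, so TBB's splitting and scheduling costs dominate the
    // tiny amount of work done per task.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 8000, /*grainsize=*/1),
                      inner_body(data));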

For now, let’s fix this grainsize to a larger number and see the result (a sketch of the adjusted call appears after this paragraph). From the summary, it now looks like we have removed this significant TBB overhead, and the elapsed time has dropped from 8.7 seconds to 6.9 seconds. So we have used VTune™ Amplifier XE to identify a poor grainsize selection and improve the performance of this software. Let us know if you have other examples you would like to see illustrated with VTune™ Amplifier XE.
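Still assuming the hypothetical call site sketched earlier, one way to apply the fix is simply to raise the grainsize; the value of 1,000 below is an illustrative choice, not the number used in the original experiment. Dropping the explicit grainsize and passing tbb::auto_partitioner is another option.

    // Larger grainsize: each task now covers enough iterations to amortize
    // TBB's scheduling overhead while still leaving enough chunks to balance
    // across threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 8000, /*grainsize=*/1000),
                      inner_body(data));

    // Alternative: omit the grainsize and let auto_partitioner choose chunk sizes.
    // tbb::parallel_for(tbb::blocked_range<size_t>(0, 8000), inner_body(data),
    //                   tbb::auto_partitioner());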



*All data was collected on a laptop with an Intel® Core™ 2 Duo processor running Microsoft Windows* 7, using Microsoft Visual Studio 2008 SP1 and Intel® VTune™ Amplifier XE 2011.
For more complete information about compiler optimizations, see our Optimization Notice.