Has this ever happened to you? You work tirelessly to add threads to your serial code, all your correctness tests pass, and your application is zooming along almost twice as fast as the serial version on your two-core machine. Then a friend sees your results and would love to run your program on his machine, which has four cores, each equipped with Intel® Hyper-Threading Technology (that's 8 "logical" processors). You expect your newly parallelized application to be blazing fast on his machine, maybe even four times faster than it was on yours! But to your dismay… it runs at the same speed as it did on the two-core machine. What's going on? One possibility is that you have a scaling problem.
A scaling problem arises when parallelized software isn't designed to take advantage of additional cores when the hardware provides them. For example, task-level parallelism, where a predetermined number of jobs is assigned to the threads, can never scale beyond the total number of jobs created. There just isn't enough division of labor to keep the extra hardware busy.
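To make that ceiling concrete, here is a minimal Python sketch under simplifying assumptions of my own: every job takes the same time, jobs cannot be split across cores, and threading overhead is ignored. The function name `fixed_job_speedup` is hypothetical and is not part of Advisor XE.

```python
import math


def fixed_job_speedup(num_jobs: int, cores: int) -> float:
    """Ideal speedup when a fixed number of equal-length jobs is spread over cores.

    Assumes jobs are indivisible, equal in duration, and free to schedule.
    The parallel run takes ceil(num_jobs / cores) "rounds" of jobs,
    while the serial run takes num_jobs rounds.
    """
    if num_jobs < 1 or cores < 1:
        raise ValueError("num_jobs and cores must be positive")
    rounds = math.ceil(num_jobs / cores)
    return num_jobs / rounds


# Two hard-coded jobs, like code split into exactly two threads:
for cores in (1, 2, 4, 8):
    print(cores, fixed_job_speedup(2, cores))  # 1.0, then 2.0 for 2, 4 and 8 cores
```

Once the job count is fixed, extra cores only add idle time, which is exactly the two-core-versus-eight-logical-processor surprise described above. A design that creates work in proportion to the data, rather than a hard-coded number of jobs, avoids this ceiling.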
Designing parallel software that scales is essential if your application is to stay relevant and competitive as hardware evolves, without requiring a major redesign. Intel® Advisor XE can give you confidence that your proposed parallel design will scale to higher core counts BEFORE you invest the time in threading your code.
Figure 1 shows the output of the Advisor XE Suitability tool when run on an application containing Advisor XE annotations.
The Scalability of Maximum Site Gain graph shows how the annotated parallel site is predicted to scale on hardware with up to 32 cores. The bullseye marks the target number of cores selected in the Suitability Report, and the white dots show the predicted speedup for each core count (the scaling curve). Here the bullseye shows that on 32 cores the parallelized region will get about a 5.75x speedup. Not bad, but with 32 cores I want more. Notice the "yes" in the "Enable Task Chunking" row. This recommendation tells me that if I thread using a model that supports task chunking (such as Intel® Threading Building Blocks or Intel® Cilk™ Plus), I can get a much better speedup. Figure 2 shows the result of selecting this checkbox.
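Task chunking means grouping many small loop iterations into larger tasks, so the scheduling cost is paid once per chunk instead of once per iteration. In Intel® Threading Building Blocks this idea surfaces as `parallel_for` over a `blocked_range` with a grain size. The Python sketch below only illustrates the mechanics with the standard library: the per-iteration work (a sum of squares) and the helper names `make_chunks`, `process_chunk` and `chunked_sum_of_squares` are hypothetical, and because of CPython's global interpreter lock this thread-based version shows the structure of chunking, not a real CPU speedup.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def make_chunks(n: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split the iteration space [0, n) into half-open [start, stop) chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, n)) for start in range(0, n, chunk_size)]


def process_chunk(bounds: tuple[int, int]) -> int:
    """Run the (hypothetical) per-iteration work for every index in one chunk."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def chunked_sum_of_squares(n: int, chunk_size: int, workers: int | None = None) -> int:
    """Submit one task per chunk instead of one task per iteration."""
    chunks = make_chunks(n, chunk_size)
    # max_workers=None lets the executor pick its default pool size.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))


print(make_chunks(10, 4))                # [(0, 4), (4, 8), (8, 10)]
print(chunked_sum_of_squares(1000, 64))  # 332833500, same as the serial loop
```

With 1,000 iterations and a chunk size of 64, the pool schedules 16 tasks rather than 1,000. The chunk size is the tuning knob: too small and per-task overhead dominates; too large and there are too few chunks to keep every core busy, which brings back the fixed-job ceiling.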
With this box checked, the graph predicts near-linear scaling: a speedup of almost 32x on 32 cores! You can find more information about scalability and task chunking in the Advisor XE documentation, and you can download a fully featured evaluation copy of Advisor XE here.
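Why can chunking move the prediction from about 6x toward 32x? One common reason is per-task overhead. The toy cost model below is my own simplification, not the model Advisor XE uses: `n` iterations each cost `work` time units, every task adds a fixed `overhead`, and tasks are dealt out evenly across cores. The function name and all parameter values are made up for illustration.

```python
import math


def predicted_speedup(n: int, work: float, overhead: float, chunk: int, cores: int) -> float:
    """Toy model: serial time divided by chunked parallel time.

    Each task covers `chunk` iterations and pays `overhead` once.
    The last partial chunk (if any) is treated as full, a deliberate simplification.
    The parallel run takes ceil(num_tasks / cores) rounds of tasks.
    """
    num_tasks = math.ceil(n / chunk)
    rounds = math.ceil(num_tasks / cores)
    parallel_time = rounds * (chunk * work + overhead)
    return (n * work) / parallel_time


# Illustrative numbers only: 1,000,000 iterations, per-task overhead 100x one iteration.
for chunk in (1, 100, 1000):
    print(chunk, round(predicted_speedup(1_000_000, 1.0, 100.0, chunk, 32), 2))
```

With these invented numbers the model gives about 0.32x for one iteration per task (overhead swamps the work), about 16x for chunks of 100, and about 28x for chunks of 1,000 on 32 cores. The exact figures mean nothing; the trend shows why amortizing overhead over larger chunks pushes the curve toward linear.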
Advisor XE offers a complete workflow for adding parallelism to an existing application, and checking scalability is just one of its steps. So if you are going to put in the legwork to add parallelism to your software, don't forget to think about the future and design in plenty of room for growth as core counts continue to increase.
You can find more information about Advisor XE here, and talk with other users in our forum here.