As part of my focus on software performance, I also support and consult on implementing scalable parallelism in applications. There are many reasons to implement parallelism as well as many methods for doing it - but this blog is not about either of those things. This blog is about the performance advantages of one particular way of implementing parallelism - and, luckily, that way is supported by several models available.
I am talking about task-based parallelism, which means that you design your algorithms in terms of "tasks" (work to be done) instead of the specifics of threads and CPU cores. There are several parallelism models currently available that use tasks: Intel® Threading Building Blocks (TBB), Intel® Cilk Plus, Microsoft* Parallel Patterns Library, and OpenMP 3.0, to name a few. With a task-based library (like the ones mentioned) you can use pre-coded functions, pragmas, keywords, or templates to add parallelism to your application. Besides potentially reducing your development effort, there are many performance-related reasons why this is a good thing! Below, I give 4 of the advantages.
- Forward Scaling
With a task-based design, you (the programmer) ensure that the work done by your algorithm can be broken into many tasks. You want the number of tasks to scale with the size of the problem being solved. For image processing, for example - a larger image would mean more tasks. Then the library you choose would be responsible for distributing those tasks in a balanced way across all available CPU resources. This means that for future platforms, your program will automatically utilize the additional cores available.
- Minimal Overhead
All of the libraries mentioned have been designed with scalability in mind. One of the most important ways that overhead is minimized is by using a threadpool. Generally, the library will create one software thread for each CPU core on the system, but this behavior is usually customizable. Then tasks are scheduled on the threads. So instead of creating and deleting a thread to process an item of work, the same threads are used over and over again. Scheduling a task is much less expensive than creating and deleting a thread. Some libraries, such as Intel TBB, also provide a rich set of containers, templates, and synchronization techniques designed by experts to be as lightweight as possible.
- Cache Locality
All of the libraries mentioned exhibit some degree of cache awareness. In the ideal case, you would want tasks to always be scheduled on the thread that is running on the CPU that contains that task's data. This won't always be possible, but each library uses at least some mechanism(s) to try to optimize this case. For Intel TBB and Intel Cilk Plus, for example, the way that tasks are created is cache aware. Over the course of a parallel region, the task space is expanded into a tree. The work of expanding part of the tree will be done in a breadth-first way, so that different parts of the tree are expanded by different threads, resulting in better cache locality.
- Dynamic Scheduling
One aspect of parallelism that can lead to performance issues is scheduling the work on the available threads. A load imbalance, where some threads are working while others are waiting, wastes CPU resources. All of the task-based parallelism models referred to above use a run-time scheduling system. Tasks are put into queues for the available threads, and when a thread has completed its work, it can "steal" from another thread's queue. This automatic load-balancing helps avoid the potentially significant performance issue of unequal distribution of work.