How to Avoid Being Pelted with Grapefruit-sized Hailstones

After building and running most of the Apprentice-level entries for the second problem in Phase 2 of the 2010 Intel Threading Challenge ("Hailstone Numbers"), I thought I would share some of the common themes I've seen and offer some advice on how to improve your parallel code's performance. There are three things that I would like both present and future contestants to know. My hope is that they will be able to apply one or more of them to write stronger threaded entries for the last two problems.

1. Write a correct serial code and check results as you parallelize


I hope that this is obvious to everyone; it may be that the deadline for entries prevented some coders from following this basic procedure. Correctness of application results on small data sets can be quickly checked with pencil and paper. For Threading Challenge problems the data sets can get pretty large and unwieldy to check by hand. Even so, selectively printed results from each computation step can be reviewed to ensure that the correct final results are being computed and printed.

Then, as you add parallelism to the algorithm, be sure to recheck your results and confirm that they match the serial results. If your parallel code disagrees with the serial results on the same inputs, you've likely done something wrong in your last code transformation. Undo the last change and try something else, or look for whatever other error the latest code changes may have introduced.
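For concreteness, here is a minimal serial kernel for the hailstone computation as I understand the problem (count the operations needed to bring a number down to 1); the contest's exact input and output handling is omitted, so treat this as a sketch rather than a reference solution:

#include <cstdio>

// Count the 3n+1 and n/2 operations needed to reduce n to 1.
long hailstone_ops(unsigned long long n)
{
    long ops = 0;
    while (n > 1) {
        if (n & 1)
            n = 3 * n + 1;   // odd case
        else
            n >>= 1;         // even case
        ++ops;
    }
    return ops;
}

int main()
{
    // Spot-checks that are easy to verify by hand:
    // 1 needs 0 operations, 2 needs 1, and 3 needs 7
    // (3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1).
    printf("%ld %ld %ld\n", hailstone_ops(1), hailstone_ops(2), hailstone_ops(3));
    return 0;
}

Once a kernel like this matches your hand calculations, every parallel version you write can be checked against its output.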

If you are ready to explore large data sets (serial or parallel), use the forums to publish your results for given input sets. Ask other participants if they are seeing the same output for the same input. If you get two or three others that agree with you, it's a safe bet you've got the right answer. If others agree on different output results, you need to go back and examine your code for a problem (or prove to yourself that the others are incorrect). In the years that we have been running contests like this, I've always seen good cooperation by the forum participants in answering or debating these kinds of questions.

1.5 Use the Intel Manycore Testing Lab


This platform has the same processors and tools that are used by the judges to build and run your applications for execution scoring. If your application works on your Mac OS X system, you should really verify that it will run on the MTL platform. You can also test the scalability of the application. Does it run well with 32 threads? If not, what number of threads will give you the best performance time? Can the code handle a test case that is close to the maximum possible? It is almost impossible to answer such questions if you only have a dual-core laptop for development and testing.

2. Granularity and Load Balance


The more I reflect on it, the more I realize that this problem is almost perfect as a parallel programming exercise. The bulk of the computation takes place in the core's cache and doesn't require large blocks of results to be stored in main memory. Plus, if you give it a big enough input range, there is a lot of computation to be performed overall and a good chance of scaling performance as you add threads to the solution of the problem. However, the calculation for each individual number in the range is relatively small, and the number of iterations is unpredictable. How the range is divided and assigned to threads will require the programmer to consider the granularity of the computation and the load balance of the work executed by each thread.

More than one program submitted for the Hailstone Numbers problem had the finest granularity possible for this problem: one thread for each number in the range. This is an easy way to parallelize the serial code. Unfortunately, the overhead of creating and terminating a thread easily dwarfs the time it takes to run through a few multiplication, division, and addition operations plus a few conditional expression tests.

A coarse-grained solution will improve the ratio of threading overhead to computation done per thread such that the overhead becomes almost negligible. Thus, for each thread created, you should assign 10 or 100 or 1000 numbers. The exact number of values to collect into a task will depend on the complexity of the computation involved per number. Implicit threading models like OpenMP or Intel Threading Building Blocks (TBB) can divide up the work among threads in a coarse-grained way, while explicit threads require the programmer to create tasks and assign work to those tasks (an example of how to do this with Pthreads is given in Example 7-2 of The Art of Concurrency).
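As a sketch of what coarse-grained division can look like in OpenMP (the chunk size of 1000 is illustrative, and record_ops is a hypothetical stand-in for whatever histogram update your solution uses; hailstone_ops is from the serial sketch above):

#include <omp.h>

void process_range(long long lo, long long hi)
{
    // Hand each thread contiguous chunks of 1000 numbers so that the
    // threading overhead is amortized over many hailstone computations.
    #pragma omp parallel for schedule(static, 1000)
    for (long long n = lo; n <= hi; ++n)
        record_ops(hailstone_ops(n));   // hypothetical bucket update
}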

I've written about granularity, using snow removal as an example, in a past blog.  You can also get some more general tips and advice from the article Granularity and Parallel Performance, which is part of The Intel® Guide for Developing Multithreaded Applications.

The differing amounts of calculation involved for each number within the range mean that you cannot simply divide the range into equal-sized sub-ranges and expect each thread to finish at the same time. Even a cursory glance at the data will tell you that as the value of the numbers increases, more iterations of the computation are needed to reach the terminal value. Thus, a more dynamic scheduling mechanism is needed to achieve good load balance. Such dynamic scheduling is inherent in TBB and is easily specified with a schedule clause in OpenMP. For explicit threads the best advice I can give is to put all the tasks in a shared queue (remember to enforce mutually exclusive access by threads) to distribute the tasks, or use a global counter to control assignment of blocks of numbers to be processed by threads. (I can't recall if there is an example of the global counter in AoC, but I can provide one if anyone is interested.) For more advice on load balance issues in threaded codes, see Load Balance and Parallel Performance.
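In fact, here is one possible sketch of that global-counter scheme, written with C++11 threads and atomics rather than the Pthreads used in AoC (CHUNK is a tunable assumption, and hailstone_ops and record_ops are the routines from the earlier sketches):

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

const unsigned long long CHUNK = 1000;        // numbers claimed per grab; tune it
std::atomic<unsigned long long> next_start;   // the shared global counter

void worker(unsigned long long hi)
{
    for (;;) {
        // Atomically claim the next chunk of numbers; no queue or lock needed.
        unsigned long long start = next_start.fetch_add(CHUNK);
        if (start > hi)
            break;                            // range exhausted
        unsigned long long end = std::min(start + CHUNK - 1, hi);
        for (unsigned long long n = start; n <= end; ++n)
            record_ops(hailstone_ops(n));     // hypothetical, as above
    }
}

void run_range(unsigned long long lo, unsigned long long hi, int nthreads)
{
    next_start = lo;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back(worker, hi);
    for (std::thread &th : pool)
        th.join();
}

Threads that finish a chunk early simply grab another, so the faster threads soak up the extra iterations that cluster near the top of the range.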

3. Synchronization


(with apologies to Sweet)
Synchronization is like oxygen,
you get too much and your execution time gets high,
you get too little and your application dies (from the wrong answers)

The operation count buckets that hold the histogram information are an obvious shared structure that each thread needs to update. After the number of operations for a number is calculated, the appropriate bucket is incremented. Since two threads could be trying to increment different buckets at the same time, it makes sense for each bucket to be protected by a separate synchronization object (lock). This will improve performance since updates to separate buckets can be done in parallel.

If there are at least as many buckets as there are threads, there is only a very small chance that your application will need to wait for a lock to be released on the desired bucket. As the number of buckets decreases, the chance of contending for the same lock on a bucket increases, and the performance of the application goes down as threads spend more time waiting. Then there is the problem of false sharing when you have all those buckets in the same cache line. (See Avoiding and Identifying False Sharing Among Threads to stay out of that trap.)
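One way to sidestep both problems at once, sketched under some assumptions: give each bucket its own lock and pad the pair out to a full cache line. The 64-byte line size, the NUM_BUCKETS value, and the bucket_index binning are all placeholders I've invented for illustration:

#include <mutex>

const int NUM_BUCKETS = 64;   // problem-dependent; an assumption here

// Hypothetical binning: operation counts grouped in bins of width 10.
int bucket_index(long ops)
{
    int b = (int)(ops / 10);
    return b < NUM_BUCKETS ? b : NUM_BUCKETS - 1;
}

// Each bucket occupies its own 64-byte cache line, so updates to
// neighboring buckets cannot falsely share a line.
struct alignas(64) Bucket {
    long       count = 0;
    std::mutex lock;
};

Bucket buckets[NUM_BUCKETS];

void record_ops(long ops)
{
    Bucket &b = buckets[bucket_index(ops)];
    std::lock_guard<std::mutex> guard(b.lock);
    ++b.count;
}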

I hope you recognize that the bucket count updates are part of a reduction operation. That is, if each thread has a private copy of the buckets, each thread can increment its local copy of the buckets for the values it is assigned and, once all that work is done, the local results can be combined into a single set of bucket totals. This can be done explicitly (Example 7-2 from AoC, again) or with the TBB parallel_reduce algorithm. The OpenMP reduction clause can't handle arrays or array items, so you'll need to do something explicit.
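A minimal OpenMP sketch of that explicit reduction, reusing NUM_BUCKETS, bucket_index, and hailstone_ops from the earlier sketches:

#include <string.h>
#include <omp.h>

long global_buckets[NUM_BUCKETS];   // the combined histogram

void histogram(long long lo, long long hi)
{
    #pragma omp parallel
    {
        long local[NUM_BUCKETS];             // private copy per thread
        memset(local, 0, sizeof(local));

        #pragma omp for schedule(dynamic, 1000) nowait
        for (long long n = lo; n <= hi; ++n)
            ++local[bucket_index(hailstone_ops(n))];

        // One combining step per thread, not one per number, so the
        // critical section costs almost nothing.
        #pragma omp critical
        for (int b = 0; b < NUM_BUCKETS; ++b)
            global_buckets[b] += local[b];
    }
}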

If you want to use global buckets with synchronization, be sure to use the lowest overhead synchronization that will accomplish the task. Atomic updates will be the fastest and are available in OpenMP, TBB, and through the Interlocked intrinsics of Win32 threads. For more advice on using the best synchronization see Choosing Appropriate Synchronization Primitives to Minimize Overhead.
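For example, with OpenMP the increment of a shared bucket can be a single atomic operation (again using the hypothetical bucket_index and global_buckets from the sketches above):

void record_ops_atomic(long ops)
{
    int b = bucket_index(ops);   // hypothetical mapping, as above
    // One low-overhead atomic add per number; each update still touches
    // shared memory, though, so the private-copy reduction above will
    // usually scale better.
    #pragma omp atomic
    ++global_buckets[b];
}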

Finally


If you've read down this far, thank you. Please forgive the obvious self-promotion of my book for examples that can be instructive. It's the resource that I'm most familiar with. If you don't have a copy of The Art of Concurrency, I can provide the code from Example 7-2.

I hope you were able to get at least one nugget of useful information or advice from this post.  Good luck in your future parallel programming adventures.