Blowing Snow and the Granularity of Parallel Computation

This winter I got tired of shovelling snow in my driveway and sidewalks. Normally we don't get too much snow all at once. Maybe an inch or less a couple of times per season. However, every once in a while, we get 4 or more inches overnight. Moving all that snow is slow and back breaking business. So, I got an electric power shovel this year. (That's me and my new toy during a recent snowfall.)

Moving snow is now less stressful. The power shovel can process snow up to 6 inches in depth and throws it up to 20 feet. With our often windy conditions, I've got to make sure I'm aiming in a favorable direction to get the whole driveway cleared 2-4X faster than with a traditional snow shovel.

But what if the Big Game is on and I want to get done faster so that I can get back inside to my comfy chair with my feet propped up before halftime is over? I'll get another power shovel and my wife to help do the job in parallel. If I start at the head of the driveway and she starts at the foot, we will meet in the middle and be done twice as fast as if I did it all myself. Could I get it done even faster?

Since the width of my power shovel is 1 foot and my driveway is 30 feet long, I could hire 28 friends from the neighborhood (each with their own power shovel and extension cord) to join my wife and I and we could all be done after executing one pass over the driveway. Since it takes me four "strokes" of the power shovel to get across the width of my driveway, I could use 120 people and their power shovels to do it all in the time it takes to do one swipe with my own power shovel.

Can you see any problems with this? 

Besides being unlikely that I could find 120 people with power shovels and enough outlets to plug them all in, there are bound to be coordination issues in getting too many people to shovel snow in a fixed area. First off, my body is much wider than my shovel's width, so trying to get 30 people--all wider than 1 foot--lined up in a row 30 feet long presents a challenge.  With 120 people involved, if I'm at the outer edge of the driveway, I imagine that the three people in front of me would not appreciate having snow blown at them while they try to blow their own patch of snow.

Assuming that everyone involved would like to remain dry and snow-free, coordinating the actions of 120 snow blowers (finding power outlets, stringing extension cords, getting into position) is going to be more time consuming than it would be for just two (watch out for each other near the end) working on the same driveway. The ratio of work done to time spent coordinating is the granularity of the job. In loose terms, we define the granularity of threaded computations as the ratio of work accomplished before a synchronization is needed. Such synchronization includes the obvious acquisition of locks for critical sections and the not so obvious overheads that are required to allocate work or other instructions needed before the computation is allowed to proceed.

A fine-grained computation has little work to the amount of synchronization, while a coarse-grained computation has a large amount of work done before a required synchronization. For best performance it is more desireable to have coarse-grained parallel solution. This is the thesis of the article Granularity and Parallel Performance, which is part of The Intel® Guide for Developing Multithreaded Applications (an update of the previously published "Developing Multithreaded Applications: A Platform Consistent Approach").

The granularity article asserts that the correct granularity of computation is essential to achieve maximum performance in parallel computations. If the granularity is too fine, the majority of execution time will be spent on synchronization overhead rather than real work. On the other side of the coin, if the granularity is too coarse, you may face load balance issues and your application may not scale well when executed on systems with more cores.

To illustrate these points, the article analyzes an application to count and categorize prime numbers using a brute force factor division algorithm. Parallel execution is implemented with OpenMP. Two approaches to improve the granularity of a parallel computation are put forth and illustrated with the application code. The first is to remove any unnecessary synchronization. The original parallel code uses a critical region to enforce mutually exclusive updates to the shared counters in the code.  The solution is to use a reduction clause, which creates local copies of each counter and then combines the partial values calculated by each thread into the global copy at the end of the computation.

The second approach attempts to increase the amount of work assigned as a task to any one thread. The original code includes an OpenMP schedule clause with a schedule of "dynamic, 1" used. Wehn designing your threaded implementation, to reduce the chance of load imbalance, it is advisable to divide the concurrent computation into as many small tasks as possible. You can't get any smaller than a single number to be tested for primality. Plus, the amount of factors that need to be tested before a proper divisor is found (for composite numbers) varies from one number to the next. Thus, the "dynamic, 1" schedule was chosen.

However, the overhead needed to distribute almost 1,000,000 on demand tasks can be a large part of the overall parallel execution time for the application.  Since the request and allocation of the each value to be tested is a synchronization, the granularity of this scheme is as fine-grained as you can get. The article urges the practice of agglomeration to adjust the granularity to a level that is better for achieving parallel performance. During the design phase of the parallel solution, do divide up the concurrent tasks into the most indivisible units possible. Then, gather up tasks into groups such that those larger groups of tiny tasks can be assigned to a thread. In this way, if the overhead to assign 1 value or 10,000 values to a thread is the same, using the larger task size will yield more work done and less synchronization overhead.

Your parallel application work is probably not going to be as simplistic as the prime number example. The article concludes with some general advice for most any parallel programming situation you will face, at least with regards to identifying and implementing the best granularity for maximum parallel performance. Also, Linux programmers should not be put off by the inclusion of screenshots from Intel Parallel Studio. The written advice in this article is still sound and can be applied to Pthreads or other threading models available on Linux platforms. In fact, most articles in the guide will be useful to anyone working to convert serial code to threaded parallel code.

Whether you are removing snow from a driveway, factoring a million numbers to identify primes, computing ocean temperatures in the Gulf of Oman, or simulating airflow over a helicoptor rotor section, to get the best performance from the whole operation you need to deal with the issue of granularity among workers or threads. Keeping the amount of assigned work high and the need for synchronization and coordination low will ensure that a more coarse-grained solution is executed. A fine-grained execution could leave you covered in snow. Or worse.
如需更全面地了解编译器优化,请参阅优化注意事项