This winter I got tired of shovelling snow in my driveway and sidewalks. Normally we don't get too much snow all at once. Maybe an inch or less a couple of times per season. However, every once in a while, we get 4 or more inches overnight. Moving all that snow is slow and back breaking business. So, I got an electric power shovel this year. (That's me and my new toy during a recent snowfall.)Moving snow is now less stressful. The power shovel can process snow up to 6 inches in depth and throws it up to 20 feet. With our often windy conditions, I've got to make sure I'm aiming in a favorable direction to get the whole driveway cleared 2-4X faster than with a traditional snow shovel.
But what if the Big Game is on and I want to get done faster so that I can get back inside to my comfy chair with my feet propped up before halftime is over? I'll get another power shovel and my wife to help do the job in parallel. If I start at the head of the driveway and she starts at the foot, we will meet in the middle and be done twice as fast as if I did it all myself. Could I get it done even faster?
Since the width of my power shovel is 1 foot and my driveway is 30 feet long, I could hire 28 friends from the neighborhood (each with their own power shovel and extension cord) to join my wife and I and we could all be done after executing one pass over the driveway. Since it takes me four "strokes" of the power shovel to get across the width of my driveway, I could use 120 people and their power shovels to do it all in the time it takes to do one swipe with my own power shovel.
Can you see any problems with this?
Besides being unlikely that I could find 120 people with power shovels and enough outlets to plug them all in, there are bound to be coordination issues in getting too many people to shovel snow in a fixed area. First off, my body is much wider than my shovel's width, so trying to get 30 people--all wider than 1 foot--lined up in a row 30 feet long presents a challenge. With 120 people involved, if I'm at the outer edge of the driveway, I imagine that the three people in front of me would not appreciate having snow blown at them while they try to blow their own patch of snow.
Assuming that everyone involved would like to remain dry and snow-free, coordinating the actions of 120 snow blowers (finding power outlets, stringing extension cords, getting into position) is going to be more time consuming than it would be for just two (watch out for each other near the end) working on the same driveway. The ratio of work done to time spent coordinating is the granularity of the job. In loose terms, we define the granularity of threaded computations as the ratio of work accomplished before a synchronization is needed. Such synchronization includes the obvious acquisition of locks for critical sections and the not so obvious overheads that are required to allocate work or other instructions needed before the computation is allowed to proceed.
A fine-grained computation has little work to the amount of synchronization, while a coarse-grained computation has a large amount of work done before a required synchronization. For best performance it is more desireable to have coarse-grained parallel solution. This is the thesis of the article Granularity and Parallel Performance, which is part of The Intel® Guide for Developing Multithreaded Applications (an update of the previously published "Developing Multithreaded Applications: A Platform Consistent Approach").
The granularity article asserts that the correct granularity of computation is essential to achieve maximum performance in parallel computations. If the granularity is too fine, the majority of execution time will be spent on synchronization overhead rather than real work. On the other side of the coin, if the granularity is too coarse, you may face load balance issues and your application may not scale well when executed on systems with more cores.
To illustrate these points, the article analyzes an application to count and categorize prime numbers using a brute force factor division algorithm. Parallel execution is implemented with OpenMP. Two approaches to improve the granularity of a parallel computation are put forth and illustrated with the application code. The first is to remove any unnecessary synchronization. The original parallel code uses a critical region to enforce mutually exclusive updates to the shared counters in the code. The solution is to use a reduction clause, which creates local copies of each counter and then combines the partial values calculated by each thread into the global copy at the end of the computation.The second approach attempts to increase the amount of work assigned as a task to any one thread. The original code includes an OpenMP schedule clause with a schedule of "dynamic, 1" used. Wehn designing your threaded implementation, to reduce the chance of load imbalance, it is advisable to divide the concurrent computation into as many small tasks as possible. You can't get any smaller than a single number to be tested for primality. Plus, the amount of factors that need to be tested before a proper divisor is found (for composite numbers) varies from one number to the next. Thus, the "dynamic, 1" schedule was chosen.
However, the overhead needed to distribute almost 1,000,000 on demand tasks can be a large part of the overall parallel execution time for the application. Since the request and allocation of the each value to be tested is a synchronization, the granularity of this scheme is as fine-grained as you can get. The article urges the practice of agglomeration to adjust the granularity to a level that is better for achieving parallel performance. During the design phase of the parallel solution, do divide up the concurrent tasks into the most indivisible units possible. Then, gather up tasks into groups such that those larger groups of tiny tasks can be assigned to a thread. In this way, if the overhead to assign 1 value or 10,000 values to a thread is the same, using the larger task size will yield more work done and less synchronization overhead.
Your parallel application work is probably not going to be as simplistic as the prime number example. The article concludes with some general advice for most any parallel programming situation you will face, at least with regards to identifying and implementing the best granularity for maximum parallel performance. Also, Linux programmers should not be put off by the inclusion of screenshots from Intel Parallel Studio. The written advice in this article is still sound and can be applied to Pthreads or other threading models available on Linux platforms. In fact, most articles in the guide will be useful to anyone working to convert serial code to threaded parallel code.
Whether you are removing snow from a driveway, factoring a million numbers to identify primes, computing ocean temperatures in the Gulf of Oman, or simulating airflow over a helicoptor rotor section, to get the best performance from the whole operation you need to deal with the issue of granularity among workers or threads. Keeping the amount of assigned work high and the need for synchronization and coordination low will ensure that a more coarse-grained solution is executed. A fine-grained execution could leave you covered in snow. Or worse.

Comments
Clay - I grew up in Chicago and that has to be the wimpiest snow blower I have ever seen. Don't buy one for your wife! It would be like multi-threading a bunch of Atom processors. ;?)
Bob Davies
Actually, I don't think my wife could wield the snow blower (as wimpy as it is). She gets the shovel and it's up to me to coordinate our parallel removal efforts and be sure I don't blow snow at her.
Clay......where were you last year when we had 4 to 6 feet of snow on my driveway here in Spokane?? I also bought a new electric snow shovel to help clear our roofs before they collapsed!! You look great at the helm of your new toy....keep the wind at your back. Great analogy....tying your driveway project into threads of other parallel programming situations!! You are a terrific teacher!
Excellent analogy, helped me to understand a little more about how does the granularity works in TBB and encouraged me to read the hole The Intel(R) Guide for Developing Multithreaded Applications.
Hi Clay
Good comparison here, and finally it is like management : with one guy doing all, you have all in a brain from instructions to execution data and ouput. issue then is when you have too much to do (too large a task, or a granule) and you start to need someone more or more.... on top of all parrallel computing issues versus synchronisatino and amount of work one can do, we could say that Humans Teams could be considered as multiOS (Operating Systems) Systems and that's why there is so much fun (let's be positive) in driving Teams.
Your analogy majkes me also remember when I was young and we were washing dishes : 3 with mother at washing, 2 kids out of the 5 drying (allocation table on the door of the cupboard). Synchronisation was also a big issue. and we were sometimes choosing to take several plates rather than one by one....
Always the same stories, but good fun
Cheers
JP Bdx
Great way of illustrating the overhead of enabling parallelism and things to consider to best amortize it.
You should write a book. Waaaaiiiit a second, you did. Something like "Confessions of a thread monkey" or "Something Parallel this Way Comes."
No, it was "The Art of Concurrency", which is written in the same clear style as the above article.
Muy interesante encontré este artículo. Me gustaría que identificara los factores y variables que nos permitan optimizar la función dela granularidad en las aplicaciones Multi hilos, un método para identificar los máximos y mínimos, algo así como un punto de inflexión en una curva de una función que relacone los factores que la determinan.
Muy interesante encontré este artículo. Me gustaría que identificara los factores y variables que nos permitan optimizar la función dela granularidad en las aplicaciones Multi hilos, un método para identificar los máximos y mínimos, algo así como un punto de inflexión en una curva de una función que relacione los factores que determinan el desempeño.
Parece que no depende del tamaño del trabajo, porque se puede intuir el comportamiento descrito en la analogía, tan sabiamente enseñada y tan cariñosamente por Clay, que lo mismo sucede en un kilómetro cubierto de nieve, que en cien.
La cantidad de concurrencias semeja los "quitanieves" o hilos que hay al mismo tiempo, por lo que hay una relación implícita entre la cantidad de tareas y la cantidad de hilos planificados en el multiproceso, cantidad de memoria asignada
(en el heap ?) etc. Seguiré investigando para aclarar mis ideas y MUCHAS GRACIAS CLAY. !!!
Gracias por las palabras buenas. Espero que el uso de soplar nieve tuviera sentido. Mis analogías no traducen a veces a una audiencia ancha.
(Translation done via Babel Fish).