# Granularity and Parallel Performance

Published:02/01/2012   Last Updated:02/01/2012

Granularity and Parallel Performance (PDF 270KB)

## Abstract

One key to attaining good parallel performance is choosing the right granularity for the application. Granularity is the amount of real work in the parallel task. If granularity is too fine, then performance can suffer from communication overhead. If granularity is too coarse, then performance can suffer from load imbalance. The goal is to determine the right granularity (usually larger is better) for parallel tasks, while avoiding load imbalance and communication overhead to achieve the best performance.

## Background

The granularity is often related to how balanced the work load is between threads. While it is easier to balance the workload of a large number of smaller tasks, this may cause too much parallel overhead in the form of communication, synchronization, etc. Therefore, one can reduce parallel overhead by increasing the granularity (amount of work) within each task by combining smaller tasks into a single task. Tools such as the Intel® Parallel Amplifier can help identify the right granularity for an application.

The following examples demonstrate how to improve the performance of a parallel program by decreasing the communication overhead and finding the right granularity for the threads. The example used throughout this article is a prime-number counting algorithm that uses a simple brute force test of all dividing each potential prime by all possible factors until a divisor is found or the number is shown to be a prime. Because positive odd numbers can be computed by either (4k+1) or (4k+3), for k = 0, the code will also keep a count of the prime numbers that fall into each form. The examples will count all of the prime numbers between 3 and 1 million.

The first variation of the code shows a parallel version using OpenMP*:

``````#pragma omp parallel
{ int j, limit, prime;
#pragma omp for schedule(dynamic, 1)
for(i = 3; i <= 1000000; i += 2) {
limit = (int) sqrt((float)i) + 1;
prime = 1; // assume number is prime
j = 3;
while (prime && (j <= limit)) {
if (i%j == 0) prime = 0;
j += 2;
}

if (prime) {
#pragma omp critical
{
numPrimes++;
if (i%4 == 1) numP41++;  // 4k+1 primes
if (i%4 == 3) numP43++;  // 4k-1 primes
}
}
}
}
``````

This code has both high communication overhead (in the form of synchronization), and an individual task size that is too small for the threads. Inside the loop, there is a critical region that is used to provide a safe mechanism for incrementing the counting variables. The critical region adds synchronization and lock overhead to the parallel loop as shown by the Intel Parallel Amplifier display in Figure 1. Figure 1. Locks and Waits analysis results demonstrating that the OpenMP* critical region is cause of synchronization overhead.

The incrementing of counter variables based on values within a large dataset is a common expression that is referred to as a reduction. The lock and synchronization overhead can be removed by eliminating the critical region and adding an OpenMP reduction clause:

``````#pragma omp parallel
{
int j, limit, prime;
#pragma omp for schedule(dynamic, 1)
reduction(+:numPrimes,numP41,numP43)
for(i = 3; i &;lt;= 1000000; i += 2) {
limit = (int) sqrt((float)i) + 1;
prime = 1;  // assume number is prime
j = 3;
while (prime && (j &;lt;= limit))
{
if (i%j == 0) prime = 0;
j += 2;
}

if (prime)
{
numPrimes++;
if (i%4 == 1) numP41++;  // 4k+1 primes
if (i%4 == 3) numP43++;  // 4k-1 primes
}
}
}
``````

Depending on how many iterations are executed for a loop, removal of a critical region within the body of the loop can improve the execution speed by orders of magnitude. However, the code above may still have some parallel overhead. This is caused by the work size for each task being too small. The schedule (dynamic, 1) clause specifies that the scheduler distribute one iteration (or chunk) at a time dynamically to each thread. Each worker thread processes one iteration and then returns to the scheduler, and synchronizes to get another iteration. By increasing the chunk size, we increase the work size for each task that is assigned to a thread and therefore reduce the number of times each thread must synchronize with the scheduler.

While this approach can improve performance, one must bear in mind (as mentioned above) that increasing the granularity too much can cause load imbalance. For example, consider increasing the chunk size to 10000, as in the code below:

``````#pragma omp parallel
{
int j, limit, prime;
#pragma omp for schedule(dynamic, 100000)
reduction(+:numPrimes, numP41, numP43)
for(i = 3; i <= 1000000; i += 2)
{
limit = (int) sqrt((float)i) + 1;
prime = 1; // assume number is prime
j = 3;
while (prime && (j <= limit))
{
if (i%j == 0) prime = 0;
j += 2;
}

if (prime)
{
numPrimes++;
if (i%4 == 1) numP41++;  // 4k+1 primes
if (i%4 == 3) numP43++;  // 4k-1 primes
}
}
}
``````

Analysis of the execution of this code within Parallel Amplifier shows an imbalance in the amount of computation done by the four threads used, as shown in Figure 2. The key point for this computation example is that each chunk has a different amount of work and there are too few chunks to be assigned as tasks (ten chunks for four threads), which causes the load imbalance. As the value of the potential primes increases (from the for loop), more iterations are required to test all possible factors for prime numbers (in the while loop). Thus, the total work for each chunk will require more iteration of the while loop than the previous chunks. Figure 2. Concurrency analysis results demonstrating the imbalance of execution time used by each thread.

A more appropriate work size (100) should be used to select the right granularity for the program. Also, since the difference in the amount of work between consecutive tasks will be less severe than the previous chunk size, a further elimination of parallel overhead can be accomplished by using the static schedule rather than dynamic. The code below shows the change in the schedule clause that will virtually eliminate the overhead from this code segment and produce the fastest overall parallel performance.

``````#pragma omp parallel
{
int j, limit, prime;
#pragma for schedule(static, 100)
reduction(+:numPrimes, numP41, numP43)
for(i = 3; i <= 1000000; i += 2)
{
limit = (int) sqrt((float)i) + 1;
prime = 1;  // assume number is prime
j = 3;
while (prime && (j <= limit))
{
if (i%j == 0) prime = 0;
j += 2;
}

if (prime)
{
numPrimes++;
if (i%4 == 1) numP41++;  // 4k+1 primes
if (i%4 == 3) numP43++;  // 4k-1 primes
}
}
}
``````

Parallel performance of multithreaded code depends on granularity: how work is divided among threads and how communication is accomplished between those threads. Following are some guidelines for improving performance by adjusting granularity:

• Understand how much work is being done in various parts of the application that will be executed in parallel.
• Understand the communication requirements of the application. Synchronization is a common form of communication, but also consider the overhead of message passing and data sharing across memory hierarchies (cache, main memory, etc.).
• Know the costs of launching parallel execution and synchronization with the threading model on the target platform.
• Make sure that the application's work per parallel task is much larger than the overheads of threading.
• Use the least amount of synchronization possible and use the lowest-cost synchronization possible.
• Use a partitioner object in Intel® Threading Building Blocks parallel algorithms to allow the task scheduler to choose a good granularity of work per task and load balance on execution threads.
• In Intel Parallel Amplifier "Locks and Waits" analysis, look for significant lock, synchronization, and parallel overheads as a sign of too much communication.
• In Intel Parallel Amplifier "Concurrency" analysis, look for load imbalance as a sign of the granularity being too large or tasks needing a better distribution to threads.