# Implement Threading in a Data-Decomposition Problem

### Challenge

Apply threading to data-decomposition problems in the Implementation Phase of application development. Data-decomposition problems are those in which multiple threads perform the same operation on different portions of the data.

The serial code shown below computes the value of pi.

```c
#include <stdio.h>

int numIterations = 1000000;

int main()
{
    double x, pi, sum = 0.0, step;
    step = 1. / (double)numIterations;
    for (int i = 1; i <= numIterations; i++)
    {
        x = (i - .5) * step;
        sum = sum + 4.0 / (1. + x * x);
    }
    pi = sum * step;
    printf("Pi = %f\n", pi);
    return 0;
}
```

Computation of pi is an example of numerical integration, and the accuracy of the computed value increases as the number of iterations increases. This is an embarrassingly parallel problem that could benefit a great deal from threading.
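For reference, the loop is the midpoint-rule approximation of a standard integral identity (here N is `numIterations`):

```latex
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
    \approx \sum_{i=1}^{N} \frac{4}{1+x_i^2} \cdot \frac{1}{N},
    \qquad x_i = \frac{i - \tfrac{1}{2}}{N}
```

Each loop iteration evaluates the integrand at the midpoint of one sub-interval, which is why the sub-interval sums are independent and the loop parallelizes so cleanly.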

In order to compute the value of pi, this code executes the integration loop numIterations times. Running the VTune™ Performance Analyzer on this program would identify this for loop as the hot spot.

### Solution

Implement OpenMP* as the data-decomposition methodology, expressing the parallelism with directives and pragmas. For the code example used here, enclose the for loop in an OpenMP parallel region. The loop then executes inside a parallel region using the default number of threads that the OpenMP runtime creates. The code sample below shows the modified source with the OpenMP construct added:

```c
#include <stdio.h>
#include <omp.h>

int numIterations = 1000000;

int main()
{
    double x, pi, sum = 0.0, step;
    step = 1. / (double)numIterations;
    #pragma omp parallel
    {
        for (int i = 1; i <= numIterations; i++)
        {
            x = (i - .5) * step;
            sum = sum + 4.0 / (1. + x * x);
        }
    }
    pi = sum * step;
    printf("Pi = %f\n", pi);
    return 0;
}
```

When this modified source is compiled using the Intel® compiler with the /Qopenmp option and run under Intel® Thread Checker, a number of memory conflicts are detected. The errors point to two variables, x and sum, which are shared among the threads and cause conflicts when the parallel version runs. Resolving these conflicts yields an error-free implementation of the program. The following code is the modified source with the memory conflicts resolved:

```c
#include <stdio.h>
#include <omp.h>

int numIterations = 1000000;

int main(int argc, char* argv[])
{
    double x, pi, sum = 0.0, step;
    step = 1. / (double)numIterations;
    #pragma omp parallel private(x) shared(sum)
    {
        #pragma omp for reduction(+: sum)
        for (int i = 1; i <= numIterations; i++)
        {
            x = (i - .5) * step;
            sum = sum + 4.0 / (1. + x * x);
        }
    }
    pi = sum * step;
    printf("Pi = %f\n", pi);
    return 0;
}
```

This methodology is applied until all of the reported errors are addressed. In the final implementation, the omp for work-sharing construct distributes the loop iterations among the threads, and its reduction clause safely accumulates sum without conflicts. Once an application has been threaded, it must be checked for correctness by verifying its results against those reported by the serial version.

The OpenMP runtime environment handles all of the implementation details. This enables faster development of threaded applications wherever OpenMP is applicable, and it also makes the application easier to maintain. Once implementation is under way, different sections of the application can be threaded incrementally using OpenMP.