This article is one of a series illustrating basic practices of concurrent and parallel solution design that should be part of every software developer's toolkit in the 21st century. Another version of this article, with more figures, is attached. Feedback and comments are appreciated.

As a symbol of workload, iterations in an application always draw attention. In the sequential (that is, non-parallel) world, compilers pay close attention to loops, because an iterative workload usually implies optimization potential. The same is true of concurrent/parallel programs. Using a well-known numeric algorithm, linear Least-Squares Estimation (l-LSqE), we are going to see how parallelism can be explored to improve the performance of iterative computation.

l-LSqE is a numeric approximation technique (http://en.wikipedia.org/wiki/Linear_least_squares) used to summarize a series of 2D points (x_i, y_i) with a "best-fit" line, so that the sum of squared vertical distances between each point (x_i, y_i), i = 0, ..., n-1, and the corresponding point (x_i, y(x_i)) on the line y(x) is minimized. The final output of the estimation is the two parameters, b_1 and b_0, that describe the line y = b_1*x + b_0,

where b_1 = (n*sigma(x_i*y_i) - sigma(x_i)*sigma(y_i)) / (n*sigma(x_i^2) - (sigma(x_i))^2),

b_0 = (sigma(y_i) - b_1*sigma(x_i)) / n

We need not derive these two parameters here; but, as the first step of concurrent solution design, we do need to explore the parallelism inside the algorithm. Repetitive work is a commonly seen construct in programs, and it indicates a good opportunity for data decomposition.

If the dependencies allow, the iterations of a loop can be partitioned and assigned as independent work units, so that they can execute in parallel. A quick look at the formulas shows that the major workload is the summations (the sigmas) of products of the variables x_i and y_i, which are typically implemented with loops. Dependency analysis (using dependence graphs, for example) shows that there is no dependency between the multiplications in different iterations of the loop, but there is an output dependency (http://en.wikipedia.org/wiki/Data_dependency) from the product in each iteration to the final summation variables. In the vocabulary of parallel programming, the per-iteration multiplications are "embarrassingly parallel", while forming the summations will need special care in the concurrent design.

With the workload partitioned and the dependencies identified, the next step is to find a proper implementation model. Assume the following C code is the implementation; here the x-coordinate x_i is simplified to the loop index i.

sumx = 0.0;
sumy = 0.0;
sumx2 = 0.0;
sumxy = 0.0;
for (i = 0; i < n; i++) {
    sumx = sumx + i;
    sumy = sumy + y[i];
    sumx2 = sumx2 + pow(i, 2.0);
    sumxy = sumxy + (i * y[i]);
}
b1 = (sumxy - ((sumx * sumy) / (double)n)) / (sumx2 - (pow(sumx, 2.0) / (double)n));
b0 = (sumy - (b1 * sumx)) / (double)n;

To take advantage of this parallelism, threading is an implementation model developers can normally use to divide the iterations and execute them in parallel. Here we'll use the easy-to-use OpenMP constructs, because their incremental threading mechanism gives a quick performance gain. The parallelism is achieved by simply adding a compiler directive such as "#pragma omp parallel for" before the for loop; the compiler then creates threads and assigns different iterations to them.
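Applied to the loop above, the naively threaded version would be just:

```c
#pragma omp parallel for
for (i = 0; i < n; i++) {
    sumx = sumx + i;
    sumy = sumy + y[i];
    sumx2 = sumx2 + pow(i, 2.0);
    sumxy = sumxy + (i * y[i]);
}
```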

After compiling the code with an OpenMP-capable compiler and running it with n = 1000, for example, what do we get? Is the result correct?

Maybe not. The dependency analysis we did, and common OpenMP practice, remind us that in a concurrent solution design, shared/private resources and communication/synchronization need to be taken care of. The variables sumx, sumy, sumx2 and sumxy are shared between threads by default; they should either be guarded by synchronization or made thread-local. For the code here, the reduction construct is a good fit, so the directive is enhanced to "#pragma omp parallel for reduction(+:sumx, sumy, sumx2, sumxy)". The parallelized code would look like the following.

sumx = 0.0;
sumy = 0.0;
sumx2 = 0.0;
sumxy = 0.0;
#pragma omp parallel for reduction(+:sumx, sumy, sumx2, sumxy)
for (i = 0; i < n; i++) {
    sumx = sumx + i;
    sumy = sumy + y[i];
    sumx2 = sumx2 + pow(i, 2.0);
    sumxy = sumxy + (i * y[i]);
}
b1 = (sumxy - ((sumx * sumy) / (double)n)) / (sumx2 - (pow(sumx, 2.0) / (double)n));
b0 = (sumy - (b1 * sumx)) / (double)n;

Now we can compile and run the code. To quantify the performance gain, we should compare the serial and parallel versions of l-LSqE. What do we get? Do we get the gain we expected? Probably not.

Even though the result (the values of b_1 and b_0) is correct, the parallel version doesn't beat the serial one. Why? Not because of any incorrect implementation; rather, we missed checking one important point that should be considered in the first place: the size of the workload.

On modern processors, loops with a small body (as in this case) and a small trip count such as 1000 are not a big enough workload to parallelize; the overhead paid for parallelization costs more than the work itself. So we should keep reminding ourselves what we are parallelizing for. It is performance, which means we should always check whether the workload is large enough to make the parallelism worth exploring, even before we start. So initialize n with a bigger number, say 1000000, and see what we get. This time, the parallel version should show a clear gain over the serial one.

Now we've practiced three of the most basic but extremely important techniques of concurrent/parallel programming: parallelism exploration, dependency analysis, and workload examination, applied to the common program construct of iteration, using the well-known linear Least-Squares Estimation algorithm. Next, we'll examine other frequently seen constructs, such as function calls and recursion, with other algorithms and applications.