# Parallelization of a Polynomial Extrapolation Code

The Problem

Polynomial interpolation is a process that take a set of N points, (xi, yi), and finds a polynomial function, P, such that the line graphing P passes through each of the N points. That is, yi = P(xi), for all 1≤ i ≤ N. Polynomial extrapolation finds the function value of a point, X, along the x-axis that is not within the original data set of N points defining the function. Extrapolating the value of a point within an interpolated polynomial is the goal of the code whose parallelization is described in this article.

Using Lagrange's Formula, you can interpolate the polynomial P and compute a solution for P(X). The computation requires the calculation of N terms that are added together, which describes a polynomial of degree N-1. The computation for each term involves 2N-2 subtractions and 2N-1 multiplications. Plenty of independent computations are available in this formulation, but there is no way to compute an error estimate and it can generate awkward coding that ensures the correct values are used to calculate each term.

An alternative is Neville's algorithm. This formulation uses finite difference calculations as it refines the functional value of increasingly higher degree polynomials. The parallel solution presented here will start from a serial code implementing Neville's algorithm. The serial code is based on code from Numerical Recipes by Press, Flannery, Teukolsky, and Vetterling (Cambridge University Press, 1986). Some of the nitty-gritty mathematical details of the algorithm have been left out of this article.

The Serial Code

Neville's Algorithm starts by computing the function value of X for each of the N degree zero polynomials that are simply the constant function at each of the points within the defining data set. We let P1 be the value of X in P(X) = y1 and P2 be the value of X in P(X) = y2. Define P3 to PN in a likewise manner.

Once we've computed the values of the zero-degree polynomials, we can do the same for the one-degree polynomials. Let P12 be the value of X in the polynomial through (x1,y1) and (x2,y2), P23 be the value of X in the polynomial through (x2,y2) and (x3,y3), and P34, ..., P(N-1)N are defined similarly. The value of X for the set of two-degree polynomials are computed next followed by the three-, four-, and so on up the value of X in the N-1 degree polynomial. From this description, it doesn't look like the algorithm is much different than applying Lagrange's formula, but applying the formula O(N2) times!

The following table shows a progression of computation (from left to right) of the value of X in the succeeding polynomials for four initial points.

 x1: y1 = P1 P12 x2: y2 = P2 P123 P23 P1234 x3: y3 = P3 P234 P34 x4: y4 = P4

The table is "filled in" from left to right as described above. So, you can think of one generation as the parent of the successive generation one step to the right. The crux of Neville's algorithm is the fact that there is a recursive relationship between parents and the children directly related with those parents. This relationship uses the small differences in the computed functional value at X of the two k-degree polynomial extrapolations to the associated (k+1)-degree polynomial extrapolation. This means that there is a relatively simple formulation that uses the values of P1 and P2 to calculate the value of P12. (If you are interested in the derivation of the calculations, please refer to Section 3.1 of the Numerical Recipes.)

The serial code function, polint(), takes two N length vectors (xa and ya) for the points defining the polynomial, an integer with the value of N (n), and value for X (x). There are two output parameters that return the approximation of the functional value at X of the interpolated polynomial (y) and the estimated error of the approximation (dy). The functions vector() and free_vector() allocate and deallocate memory to hold vectors of the given length.

```void polint (double xa[], double ya[], int n,
double x, double *y, double *dy)
{
int i, m, ns=1;
double den,dif,dift,ho,hp,w;
double *c, *d;

dif = fabs(x-xa[1]);
c = vector(1,n);
d = vector(1,n);
for (i = 1; i <= n; i++) {
dift = fabs (x - xa[i]);
if (dift < dif) {
ns = i;
dif = dift;
}
c[i] = ya[i];
d[i] = ya[i];
}
*y = ya[ns--];
for (m = 1; m < n; m++) {
for (i = 1; i <= n-m; i++) {
ho = xa[i] - x;
hp = xa[i+m] - x;
w = c[i+1] - d[i];
den = ho - hp;
den = w / den;
d[i] = hp * den;
c[i] = ho * den;
}
*y += (*dy = (2*ns < (n-m) ? c[ns+1] : d[ns--]);
}
free_vector (d, 1, n);
free_vector (c, 1, n);
}
```

The first for-loop finds that point, from the polynomial data set, whose range value is closest to the point of interest, x. The index of this point is stored in the variable ns. The c and d arrays are also initialized with the domain values of the initial points. This initialization computes the function values of the zero-degree polynomials.

The nested loop then runs through successive generations (the outer loop) and computes all the children polynomial function values (the inner loop). The results of the computation are stored in the elements of the c and d array. After each generation of polynomial function values is calculated, the error estimate of the polynomial extrapolation at the point of interest is computed and stored in dy. This error (difference) is used to update the value of the approximate value of the current generation of interpolated polynomials at X that is held in y.

A First Parallel Attempt

Of the nested loop structure, it would be best if the parallelism could be placed around the outer loop. Unfortunately, this loop runs over successive generations and cannot be executed in parallel. The inner loop would appear to contain the independent computations that correspond to the columns of polynomial values illustrated in the table figure above.

If we have a loop with independent iterations, an OpenMP solution would be quick and easy. (We can think of this as a domain decomposition, too, on the c and d arrays.) Looking over the code, the variables ho, hp, w, and den are work variables, so they should be private to each thread. The xa array elements are read only, so this can be left shared. With that, we might modify the nested loop structure as follows:

```for (m = 1; m < n; m++) {
#pragma omp parallel for private (ho, hp, w, den)
for (i = 1; i <= n-m; i++) {
ho = xa[i] - x;
hp = xa[i+m] - x;
w = c[i+1] - d[i];
den = ho - hp;
den = w / den;
d[i] = hp * den;
c[i] = ho * den;
}
*y += (*dy = (2*ns < (n-m) ? c[ns+1] : d[ns--]);
}```

Even with a small number of points, running the above code with 2, 3, and 4 threads yields different results than when run in serial. Also, observed results with the same number of threads gave different results during separate runs. Thus, there must be a problem in the parallel region.

Intel Thread Checker could be used to identify the data race that is at the heart of the problem. Visual inspection of the 7 lines of the inner loop body is sufficient to detect that there is a forward reference in the line that calculates a value for w. When OpenMP divides the i-loop into separate chunks, threads will eventually need to read the value of a c array element (c[i+1]) that has been assigned to another thread. It is likely that this element was the first value in the C array that was updated by the other thread. While the serial algorithm expects to be reading the "old" value of c[i+1] at the boundary of the chunk assigned to a thread, it is possible that the thread will actually be reading the "new" value. This will lead the application to compute the wrong results.

If we want to run the inner loop iterations concurrently, that old value of the boundary element will need to be preserved. Unfortunately, this can't be done easily in OpenMP. To do it will involve being able to control the chunks assigned to threads, having threads pull out the "over the boundary" element from the c array, and then perform the loop iterations making sure to execute the special copy of the loop body using the pulled c element.

This sounds so much like an explicit threading algorithm that I've chosen to use Pthreads. Using the OpenMP API could be done to identify individual threads and code the algorithmic steps described in the previous paragraph. But, why not just do the same for an explicit threading library? The Pthreads version of the polint() function is given here:

```void *polint (void *pArg)
{
Param_t *myP = (Param_t *)pArg;

int myid   = myP->tID;
double *xa = myP->Xpoints;
double *ya = myP->Ypoints;
int n      = myP->numPoints;
double x   = myP->X;
double *y  = myP->Y;
double *dy = myP->dy;
double *c  = myP->C;
double *d  = myP->D;

int i, m, ns=1;
double den,dif,dift,ho,hp,w;
int start, end;
double ci1_old;

if (myid == NUM_THREADS-1) end = n;

if (myid == 0){
dif = fabs(x-xa[1]);
for (i = 1; i <= n; i++) {
dift = fabs (x - xa[i]);
if (dift < dif) {
ns = i;
dif = dift;
}
}
*y = ya[ns--];
}
for (i = start; i <= end; i++) {
c[i] = ya[i];
d[i] = ya[i];
}

pth_barrier_wait(&Barr);

for (m = 1; m < n; m++) {
if (end >= n-m) end = n-m;
ci1_old = c[end+1];
pth_barrier_wait(&Barr);
for (i = start; i <= end-1; i++) {
ho = xa[i] - x;
hp = xa[i+m] - x;
w = c[i+1] - d[i];
den = ho - hp;
den = w / den;
d[i] = hp * den;
c[i] = ho * den;
}
i = end;
ho = xa[i] - x;
hp = xa[i+m] - x;
w = ci1_old - d[i];
den = ho - hp;
den = w / den;
d[i] = hp * den;
c[i] = ho * den;
pth_barrier_wait(&Barr);

if (myid == 0)
*y += (*dy = (2*ns < (n-m) ? c[ns+1] : d[ns--]));
} // end m-loop
return 0;
}
```

The Param_t type is a user-defined type to hold the parameters required to execute the function. These parameters are packed into a single object in order to be passed to the polint() function. This object holds the thread ID number (tID), pointers to the point vectors (Xpoints, Ypoints), the number of points in each vector (numPoints), and the value of X on which to extrapolate. Pointers to the variable in which to store the result (Y) and the error estimate (dy) are included along with pointers to the work vectors (C and D). Unlike the serial code, these work vectors are allocated before calling the threaded function since these need to be globally accessible.

After the start and end points of the work vectors for each thread are calculated, the zero thread (myid == 0) finds the point within the domain of the polynomial defining points closest to the point of interest and stores the index into the local ns variable. Later in the code, the zero thread will be the only thread to update the result and error estimates. Thus, it is the only thread that needs to know this index value. All threads initialize their assigned chunk of the work vectors. The pth_barrier_wait() is a barrier function with the Barr object (globally available to all threads) that has been adapted from the barrier implementation given in Programming with POSIX Threads by David R. Butenhof. The barrier synchronizes threads at points of the execution to ensure the preceding computations have completed before subsequent computations are started.

The value of the element just outside of the assigned chunk from the c array is stored in a thread's local copy of ci1_old. This value is used in the computations that follow the barrier and the loop over the assigned chunk (except the last assigned item). Once the final update to the assigned elements from the c and d arrays, a barrier is encountered. The zero thread updates the error estimate (dy) and approximate extrapolated answer (y) before going onto the next iteration of the m-loop (next generation of polynomial values).

In the serial code, the number of iterations of the inner i-loop iterations decreases by one for each successive iteration of the outer loop. Rather than recompute the boundaries of assigned chunks each iteration of the outer loop, the code above has opted to reduce the chunk size of the thread that handles the highest indexed portions of the work vectors. This is done in the first statement of the outer loop. When the end index of a chunk assigned is less than the start index, the thread will not execute any iterations of the inner loop and the computations that follow this loop will be inconsequential. However, threads will still participate in all the barrier function calls to ensure the code doesn't deadlock.

As the computation progresses, threads will "drop out" of the computation until a single thread is left to finish the computations on the last chunk of the work vectors. The alternative would be to have the number of iterations from the inner loop assigned to each thread steadily decrease. As the count of the assigned iterations got smaller and smaller, the overhead of the repeated boundary computations would start to erode any performance gain from the parallel execution. Not to mention that when the number of inner loop iterations becomes less than twice the number of threads, a whole new case of distributing iterations would be required due to the need to pull out the element just over the assigned index boundary. Coding this special case was deemed to be too complex. Thus, at the end of the computation, only one thread is charged with the final inner loop iterations (that continue to get fewer and fewer until the final single polynomial value is done).

Performance

It sucks. There is very little in the seven statements within the body of the inner loop. The need for two barriers within each iteration of the outer loop adds a massive amount of overhead. Also, since the number of iterations within the inner loop is monotonically decreasing, it takes a very large number of initial points to see even a small gain from the parallel execution. The table below shows timing results on the quad-core development platform with serial code and the concurrent code described above when run with 2 and 4 threads using the given number of points.

 Numberof points Serial code 2 threads 4 threads 20 0.001 0.003 0.003 200 0.001 0.003 0.006 1000 0.004 0.015 0.029

Summary

Neville's algorithm is probably not a good candidate for parallelization due to the small amount of work available and the overhead involved with threads. The stability of the algorithm is questionable.  Running the code with more than 1000 points can result in NaN results (even when those point fell within the range [1,20]).  So, it seems unlikely that a data set big enough could be used that might give the parallel executions a chance to even get close to the serial execution.

While it may be more awkward to program, a formulation of Lagrange's method could yield better parallel performance since there would appear to be less synchronization needed between individual terms.