Automatic Parallelization: Design Pattern

It is a well known fact that using more CPU cores will result in more processing power. The problem is that most algorithms have some degree of dependency. For example:

X[n] = (X[n-1] * 12) + 11
Y[n] = (Sqrt( X[n] * X[n-1] ) + 15) * 33

See the part in red: X[n-1]. This means that X[n] is dependant on X[n-1] and in other words that you cannot evaluate a given X unless the previous X was already evaluated.

Normally what we do is start with the first iteration and keep going to the last. This way we can make sure that every given X[n] has the previous X[n-1] already prepared and given. This way:
[c-sharp]for (int n = 0; n < 10000; n++)
{
X[n] = (X[n-1] * 12) + 11
Y[n] = (Math.Sqrt( X[n] * X[n-1] ) + 15) * 33
}[/c-sharp]
The code is given in C# for readability and can easily be implemented using C++ with a wrapper layer.

The code above works but it is difficult to make it run in parallel. The following design pattern can help us solve this.

For simplicity instead of using complex data structures for the algorithm we will use a simple structure with only a single double (floating point) as a member:
[c-sharp] public class DoubleValue
{
public double Value;
}[/c-sharp]
The code has two separate functions: One to calculate Y[n] and another to calculate X[n]:
[c-sharp] public static DoubleValue CalcX(int n)
{
DoubleValue retval = new DoubleValue();
if (0 == n) retval.Value = 51;
else retval.Value = Math.Pow(Math.Sqrt((X[n - 1].Value * X[n - 1].Value) + 2), 1.00001);
return (retval);
}

public static DoubleValue CalcY(int n)
{
DoubleValue retval = new DoubleValue();
if (0 == n) retval.Value = 0;
else retval.Value = Math.Pow(Math.Sqrt((X[n].Value * X[n - 1].Value) + 12), 1.001);
return (retval);
}[/c-sharp]
Math.Pow( ) raises A in power of B; Math.Sqrt calculate squair root

We create an array of results so every X[n] and y[n] is evaluated only once.

The design pattern works by consuming data on demand instead of preparing results in advance.  A special kind of array is required for this purpose. The idea is that when calculating any X[n] which requires X[n-1] we start with X[n] and suspend X[n] to resolve X[n-1] if it was not already prepared. This makes the algorithm evaluate data on demand.

Before I show you a simple  implementation of the on-demand evaluation pattern I will start with the 'user' code:
[c-sharp]public static DoubleValue CalcX(int n)
{
DoubleValue retval = new DoubleValue();
if (0 == n) retval.Value = 51;
else
{
for (int i = 0; i < 100; i++)
{
retval.Value = Math.Pow(Math.Sqrt((X[n - 1].Value * X[n - 1].Value) + 2), 1.00001);
}
}
return (retval);
}[/c-sharp]
We are looping 100 times as a pseudo just so we can evaluate times, otherwise the code executes too fast to show the difference.
[c-sharp]public static DoubleValue CalcY(int n)
{
DoubleValue retval = new DoubleValue();
if (0 == n) retval.Value = 0;
else
{
for (int i = 0; i < 5000; i++)
{
retval.Value = Math.Pow(Math.Sqrt((X[n].Value * X[n - 1].Value) + 12), 1.001);
}
}
return (retval);
}[/c-sharp]
I use 5000 iterations for any Y calculation and 100 iterations for any X calculation to demonstrate an algorithm in which the dependency is a small portion of the calculation.

Here is how we use the algorithm to calculate 100000 iterations:
[c-sharp] const int numberOfItems = 10000;

private void button_Serial_Click(object sender, EventArgs e)
{
double val = 0;
for (int i = 0; i < numberOfItems; i += 4)
{
val = Y[i].Value;
}
val = Y[numberOfItems - 1].Value;
}[/c-sharp]
The code is calling for "val = Y[numberOfItems - 1].Value;", Y[99,999] is the last value in the array and would require all X values to evaluate first. This is after we evaluated every 4'th Y value by using i += 4. This is to show you that values which are not really required are completely skipped and thus all X values must evaluate but only a quarter of Y values are evaluated. The code above is the serial version and here is the parallel version of the code:
[c-sharp]private void button_Parallel_Click(object sender, EventArgs e)
{
for (int i = 0; i < numberOfItems; i += 4)
{
Y.Invoke(i);
}
Y.Complete();
double val = Y[numberOfItems - 1].Value;
}[/c-sharp]
The code above is calling Invoke(i) which will set this iteration as pending to a thread pool. After all required iterations are pending we call Complete() which will wait untill all values are evaluated. This time the different iterations of Y are run in parallel. According to the design pattern when Y[n] requires X[n] the code calculating Y[n] will suspend until X[n] is evaluated. Respectively X[n] will suspend until X[n-1] was evaluated. This will ensure that only required calculations are actually performed and that dependencies are respected when needed while still running free when possible.

Finally we get to the demo code.

The user code below mesures accurate times. Results are machine dependant. On my laptop I got 2500ms for the serial code and 1437ms for the parallel code. We don't get a 2 times preformance gain because of iteration overhead in .Net and most importantly because this is what we expected. There are dependencies which prevent fully parallel code. This specific sample is easy to read and works on Visual Studio 2008. Great performance increase is expected when using lambda expression with Visual Studio 2010.

Here is the user code:
[c-sharp]private void button_Go_Click(object sender, EventArgs e)
{
X = new ResultList<DoubleValue>(numberOfItems, CalcX);
Y = new ResultList<DoubleValue>(numberOfItems, CalcY);
DateTime start = DateTime.Now;
double val = 0;
for (int i = 0; i < numberOfItems; i += 10)
{
val = Y[i].Value;
}
val = Y[numberOfItems - 1].Value;
label1.Text = (DateTime.Now - start).TotalMilliseconds.ToString();
for (int i = 0; i < numberOfItems; i += 1000)
{
listBox1.Items.Add(\"Y[\" + i.ToString() + \"] = \" + Y[i].Value.ToString());
}

X = new ResultList<DoubleValue>(numberOfItems, CalcX);
Y = new ResultList<DoubleValue>(numberOfItems, CalcY);
start = DateTime.Now;
for (int i = 0; i < numberOfItems; i += 10)
{
Y.Invoke(i);
}
Y.Complete();
val = Y[numberOfItems - 1].Value;
label1.Text = label1.Text +\"\n\"+ (DateTime.Now - start).TotalMilliseconds.ToString();
for (int i = 0; i < numberOfItems; i += 1000)
{
listBox1.Items.Add(\"Y[\" + i.ToString() + \"] = \" + Y[i].Value.ToString());
}
}[/c-sharp]
The array used in the code above is of type ResultList which is a special type of list object which was designed to accomodate the design pattern. The code was writen for simplicity:
[c-sharp]class ResultList<T> where T: new()
{
public ResultList(int count, Calculate calculator)
{
Calculator = calculator;
Values = new T[count];
LockObjects = new object[count];
for (int i = 0; i < LockObjects.Length; i++) { LockObjects[i] = new object(); }
}
public delegate T Calculate(int iteration);
public Calculate Calculator = null;
protected T[] Values = null;
protected object[] LockObjects = null;
protected List<IAsyncResult> PendingOps = new List<IAsyncResult>();

public T this[int n]
{
get { return (Execute(n)); }
}
protected T Execute(int n)
{
T retval = default(T);
Monitor.Enter(LockObjects[n]);
if (null == Values[n]) Values[n] = Calculator(n);
retval = Values[n];
Monitor.Exit(LockObjects[n]);
return (retval);
}
public T Invoke(int n)
{
T retval = default(T);
Monitor.Enter(LockObjects[n]);
retval = Values[n];
if (null == retval)
{
PendingOps.Add(Calculator.BeginInvoke(n, null, n));
}
return (retval);
}
public void Complete()
{
for (int i = 0; i < PendingOps.Count; i++)
{
int n = ((int)PendingOps[i].AsyncState);
Values[n] = Calculator.EndInvoke(PendingOps[i]);
Monitor.Exit(LockObjects[n]);
}
PendingOps.Clear();
}
}[/c-sharp]
Finally here is the part which is algorithm specific:
[c-sharp]const int numberOfItems = 10000;

public class DoubleValue
{
public double Value;
}

static ResultList<DoubleValue> X;
static ResultList<DoubleValue> Y;

public static DoubleValue CalcX(int n)
{
DoubleValue retval = new DoubleValue();
if (0 == n) retval.Value = 51;
else
{
for (int i = 0; i < 100; i++)
{
retval.Value = Math.Pow(Math.Sqrt((X[n - 1].Value * X[n - 1].Value) + 2), 1.00001);
}
}
return (retval);
}

public static DoubleValue CalcY(int n)
{
DoubleValue retval = new DoubleValue();
if (0 == n) retval.Value = 0;
else
{
for (int i = 0; i < 5000; i++)
{
retval.Value = Math.Pow(Math.Sqrt((X[n].Value * X[n - 1].Value) + 12), 1.001);
}
}
return (retval);
}

private void Form1_Load(object sender, EventArgs e)
{
X = new ResultList<DoubleValue>(numberOfItems, CalcX);
Y = new ResultList<DoubleValue>(numberOfItems, CalcY);
}[/c-sharp]
I would like to hear your opinion about this. Do you find this method useful for your application? Where in the application? Pros and cons are welcome as well.

Generally speaking I see this type of design patterns as the mid-range future of parallel computing and hopefully there are more to come. I am referring to the concept here as a design pattern when it is actually a pattern class which should have specific patterns for many particular cases as suitable for a real design pattern.

如需更全面地了解编译器优化,请参阅优化注意事项