Writing Parallel Programs: a multi-language tutorial introduction


Introduction

Parallel programming was once the sole concern of extreme programmers worried about huge supercomputing problems. With the emergence of multi-core processors for mainstream applications, however, parallel programming is well poised to become a technique every professional software developer must know and master.

Parallel programming can be difficult, but it is better understood as "just different" rather than "difficult." It includes all the characteristics of more traditional, serial programming, but in parallel programming there are three additional, well-defined steps:

  • Identify concurrency: Analyze a problem to identify tasks that can execute concurrently.
  • Expose concurrency: Restructure the problem so the concurrent tasks can be effectively exploited. This often requires finding the dependencies between tasks and organizing the source code so they can be effectively managed.
  • Express concurrency: Express the parallel algorithm in source code using a parallel programming notation.

 

Each of these steps is important. The first two are discussed in detail in a recent book on design patterns in parallel programming [mattson05]. In this paper, we focus on step three: expressing a parallel algorithm in the source code using a parallel programming notation. This notation can be a parallel programming language, an application programming interface (API) implemented through a library interface, or a language extension added to an existing sequential language.

Choosing a particular parallel programming notation can be difficult. The learning curve associated with these notations is steep and time consuming, so it is not practical to master several notations just to choose the one to use. What programmers need is a quick way to learn the "flavor" and high-level characteristics of different notations in sufficient detail to make an intelligent choice of which notation to invest the time to master.

In this paper, we provide a high-level overview of several parallel programming notations, focusing on the major ways they are used, and identify the particular strengths and weaknesses of each notation. In particular, we cover the following notations:

  • OpenMP: compiler directives for simple parallel programming.
  • MPI: library routines for the ultimate in high-performance portability.
  • Java: concurrency in the leading object-based programming language.

 

To make the discussion as concrete as possible, we implement in each case a parallel version of the well-known pi program. This is a simple numerical integration using the trapezoid rule, with the integrand and the limits of integration chosen so that the mathematically correct answer is pi. Many consider this to be the "hello world" program of parallel programming. We then close the paper with a brief discussion of how to choose a parallel programming notation to work with and master.


The pi program: trapezoid integration in parallel

In the study of calculus, we learn that an integral can be represented geometrically as the area under a curve. This suggests an algorithm to approximate the value of an integral. Take the range of the integral and break it up into a large number of steps. At each step, place a rectangle whose height is the value of the integrand at the center of the step. The sum of the areas of the rectangles gives an approximation to the integral.

Figure 1: trapezoid integration - each strip is of fixed width "step". The height of each strip is the value of the integrand. Add all the strips together to approximate the area under the curve, i.e. the value of the integral.

We can select an integrand and limits of integration so that the integral is mathematically equal to pi. Here we use the integrand 4/(1+x*x) over the range 0 to 1, since the integral of 4/(1+x*x) from 0 to 1 is 4*arctan(1), which is exactly pi. This makes checking the correctness of the program straightforward. A simple C program implementing this algorithm follows:

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    for (i=0; i<num_steps; i++){
        x = (i+0.5)*step;            /* midpoint of step i */
        sum = sum + 4.0/(1.0+x*x);   /* height of the strip at x */
    }
    pi = step * sum;
}

 


OpenMP

OpenMP [omp] is an industry-standard API for writing parallel application programs for shared memory computers. The primary goal of OpenMP is to make the loop-oriented programs common in high performance computing easier to write. Constructs were also included in OpenMP to support SPMD, master-worker, pipeline, and most other types of parallel algorithms [mattson05].

OpenMP has been a very successful parallel language. It is available on every shared memory computer on the market. Recently, Intel has created a variation on OpenMP to support clusters as well. OpenMP supports a style of programming in which parallelism is added incrementally, so an existing sequential program evolves into a parallel program. This advantage, however, is also OpenMP's greatest weakness. By using incremental parallelism, a programmer might miss the large-scale restructuring of a program that is often required to get the best performance.

OpenMP is a continuously evolving standard. An industry group called "the OpenMP Architecture Review Board" meets regularly to develop new extensions to the language. The next release of OpenMP (version 3.0) will include a task queue capability. This will allow OpenMP to handle a wider range of control structures as well as more general recursive algorithms.


OpenMP Overview

OpenMP is based on the fork-join programming model. A running OpenMP program starts as a single thread. When the programmer wishes to exploit concurrency in the program, additional threads are forked to create a team of threads. These threads execute in parallel across a region of code called a parallel region. At the end of the parallel region, the threads wait until all of the threads have finished their work, and then they join back together. At that point, the original or “master” thread continues until the next parallel region is encountered (or the end of the program).

The language constructs in OpenMP are defined in terms of compiler directives that tell the compiler what to do in order to implement the desired parallelism. In C and C++ these directives are defined in terms of pragmas.

OpenMP pragmas have the same form in every case:

#pragma omp construct_name one_or_more_clauses

 

The construct_name defines the parallel action desired by the programmer while the clauses modify that action or control the data environment seen by the threads.

OpenMP is an explicitly parallel programming language: if a thread is created or work is mapped onto a thread, it is because the programmer specified the desired action. Therefore, even a simple API such as OpenMP has a wide range of constructs and clauses the programmer must learn. Fortunately, a great deal can be done with OpenMP using only a small subset of the full language.

To create threads in OpenMP, you use the "parallel" construct:

#pragma omp parallel
{
    ... a block of statements
}

 

When used by itself without any modifying clauses, the parallel construct creates a number of threads chosen by the runtime environment (often equal to the number of processors or cores). Each thread executes the block of statements following the parallel pragma. This can be almost any set of legal statements in C, the only exception being that you must not branch into or out of the block of statements. This makes sense if you think about it: if all the threads are going to execute the set of statements and the resulting behavior of the program is to make sense, you can't have arbitrary threads branching into or out of the parallel region. This is a common constraint in OpenMP. We call a block of statements lacking such branches a "structured block".
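
For example, the following minimal program (our own sketch, not part of the pi example) creates a team of threads, each of which executes the same structured block:

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        /* every thread in the team executes this structured block */
        printf("hello, world\n");
    }   /* the threads join here and the master thread continues alone */
    return 0;
}

Run with a team of four threads, this prints the message four times, once per thread.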

You can do a great deal of parallel programming by having each thread execute the same statements, but to experience the full power of OpenMP, we need to do more. We need to share the work of executing the set of statements among the threads. We call this type of behavior "work sharing". The most common work-sharing construct is the loop construct, which in C applies to the for loop:

#pragma omp for

 

This only works for simple loops with the canonical form

for(i=lower_limit; i<upper_limit; inc_exp)

 

The for construct takes the iterations of the loop and parcels them out among a team of threads created earlier with a parallel construct. The loop limits and the expression used to increment the loop index (inc_exp) must be computable before the loop executes, and they must be the same for every thread in the team. This makes sense if you think about it. The system needs to figure out how many iterations of the loop there will be and then map them onto sets that can be handed out to the team of threads. This can only be done in a consistent and well-behaved manner if all the threads compute the same index sets.

Notice that the for construct does not create threads; you can only do that with a parallel construct. As a shortcut, you can put the parallel and for constructs together in one pragma:

#pragma omp parallel for

 

This creates a team of threads to execute the iterations of an immediately following loop.
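
For example, the following sketch (our own, not part of the paper's pi example) scales the entries of an array in parallel; every iteration is independent of every other iteration, so the loop can be safely divided among the threads:

#include <omp.h>

void scale(double *a, int n, double factor)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = a[i] * factor;   /* no iteration reads a value written by another */
    }
}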

The iterations of the loop must be independent, so that the result of the loop is the same regardless of the order in which the iterations are executed or which threads execute which iterations. If one thread writes a variable and another thread reads that variable, we have a loop-carried dependence and the program will generate incorrect results. The programmer must carefully analyze the body of a loop to make sure there are no loop-carried dependencies. In many cases, a loop-carried dependency arises from a variable used to hold intermediate results within a given iteration of the loop. In this case, you can remove the dependency by declaring that each thread is to have its own value for the variable. This can be done with a private clause. For example, if a loop uses a variable named "tmp" to hold a temporary value, you could add the following clause to an OpenMP construct so the variable can be used inside the loop body without causing any loop-carried dependencies:

private(tmp)

 

Another common situation occurs when a variable appears inside a loop and is used to accumulate values from each iteration. For example, you may have a loop that sums the results of a computation into a single value. This is such a common situation in parallel programming that it has a name: a reduction. In OpenMP, we have a reduction clause:

reduction(+:sum)

 

As with the private clause, this is added to an OpenMP construct to tell the compiler to expect a reduction. A temporary private variable is created and used to build a partial result of the accumulation operation for each thread. Then, at the end of the construct, the values from each thread are combined to yield the final answer. The operation used in the reduction is also specified in the clause; in this case, the operation is "+". OpenMP defines the initial value of the private variable used in the reduction based on the identity for the mathematical operation in question. For example, for "+", this value is zero.
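
As a generic sketch of the clause in use (our own example, summing an array rather than computing pi), each thread accumulates into its own copy of "total", initialized to zero, the identity for "+", and the per-thread copies are combined when the loop ends:

#include <omp.h>

double array_sum(double *a, int n)
{
    int i;
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; i++) {
        total = total + a[i];
    }
    return total;
}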

There is much more to OpenMP, but with these two constructs and two clauses, we can explain how to parallelize the pi program.


The OpenMP pi Program

To keep things simple, we will leave the number of steps fixed, and we will work with the default number of threads. In the serial pi program there is a single loop to parallelize. The iterations of the loop are completely independent except for the value of the dependent variable "x" and the accumulation variable "sum". Notice that "x" is used as temporary storage for the computation within a loop iteration. Hence we can deal with this variable by making it local to each thread with a private clause:

private(x)

 

Technically, the loop control index creates a loop-carried dependence. OpenMP, however, understands that the loop control index needs to be local to each thread, so it automatically makes that index private to each thread.

The accumulation variable, “sum”, is used in a summation. This is a classic reduction so we can use the reduction clause:

reduction(+:sum)

 

Adding these clauses to the "parallel for" construct, we have our pi program parallelized with OpenMP:

#include "omp.h"

static long num_steps = 100000; double step;

void main ()

{

int i; 

double x, pi, sum = 0.0;

step = 1.0/(double) num_steps;

#pragma omp parallel for private(x) reduction(+:sum)

for (i=0;i<= num_steps; i++){

x = (i+0.5)*step;

sum = sum + 4.0/(1.0+x*x);

}

pi = step * sum;

}

 

Note that we also included the standard include file for OpenMP

#include "omp.h"

 

This defines the types and runtime library routines that OpenMP programmers sometimes need. Note that in this program we didn't make use of these features of the language, but it's a good idea to get in the habit of including the OpenMP header file in case later modifications of the program require it.
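
As an illustration of the kind of routines declared in omp.h (this fragment is our own sketch, not part of the pi program), the following program sets the team size and has each thread report its ID:

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(4);                /* request a team of four threads    */
    #pragma omp parallel
    {
        int id   = omp_get_thread_num();   /* this thread's ID within the team  */
        int nthr = omp_get_num_threads();  /* actual size of the team           */
        printf("thread %d of %d\n", id, nthr);
    }
    return 0;
}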


MPI

MPI, or the Message Passing Interface, is one of the oldest parallel programming APIs in use today. An MPI program is a set of independent processes that interact by sending and receiving messages. The strength of MPI is that it assumes very little about the hardware of the parallel computer. All it requires is that the processors or cores share a network sufficient to route messages between any pair of processes. This allows MPI to run on any common parallel system, from symmetric multiprocessors to distributed-memory massively parallel supercomputers to clusters.

MPI was born in the early 1990s, when clusters were getting started and massively parallel processors (MPPs) dominated high performance computing. Each MPP vendor had its own message passing notation. While the vendors liked this fact, since it locked users into their product line, it drove programmers to frustration. Software lasts much longer than hardware, and without a portable notation, every time a new machine was acquired, application programmers had to laboriously translate their software from one message passing notation to another.

MPI wasn’t the first portable message passing library. But it was the first created by an industry/national-lab/academic partnership. By including virtually all the key players in its creation, MPI quickly became the standard message passing interface used in high performance computing. And now, almost 15 years after its creation, MPI is still the most commonly used notation for parallel programming in high performance computing.

Most MPI programs use a Single Program Multiple Data or SPMD pattern [mattson05]. The idea is simple. Each processing element (PE) runs the same program. The processing elements each have a unique integer ID defining their rank within the set of processing elements. The program then uses that rank to divide up the work and decide which PE does which work. In other words, there is a single program but because of the choices made based on the ID, the data can be different between PEs; i.e. it’s a single program, multiple data pattern.
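
A schematic fragment of this pattern (a hypothetical helper of our own, not taken from the paper) might look like the following, where rank and num_pes would come from the parallel environment and every processing element runs the same function:

/* every processing element runs this same function; its rank decides
   which block of the work it performs */
void spmd_work(int rank, int num_pes, int total_items)
{
    int chunk = total_items / num_pes;                 /* items per PE      */
    int first = rank * chunk;                          /* this PE's block   */
    int last  = (rank == num_pes - 1) ? total_items    /* last PE picks up  */
                                      : first + chunk; /*   any remainder   */
    for (int i = first; i < last; i++) {
        /* process item i; different ranks process different items */
    }
}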


MPI Overview

MPI is an elegant and robust message passing system. It was designed to support a wide range of hardware. It was also designed to support complex software architectures with careful modular designs.

The key idea behind MPI is the communicator. When a set of processes is created, they define a group. A group of processes can share a context for their communication, and a group of processes combined with a communication context defines a unique communicator. The power of this concept is apparent when you consider the use of libraries in a program. If a programmer isn't careful, messages created by a library developer could interfere with messages used in the program calling the library. But with communicators, a library developer can create their own communication context and be assured that, as far as the messages passed around the system are concerned, what happens inside the library stays in the library.
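
As a sketch of how this works in practice (our own example; it uses MPI_Comm_dup, which is not part of the small subset of MPI covered below), a library can duplicate the communicator it is given, so its internal messages can never be confused with messages sent by the calling program on the original communicator:

#include <mpi.h>

typedef struct {
    MPI_Comm private_comm;   /* hypothetical library state */
} my_lib_state;

int my_lib_init(MPI_Comm user_comm, my_lib_state *state)
{
    /* same group of processes, but a new, private communication context */
    return MPI_Comm_dup(user_comm, &state->private_comm);
}

int my_lib_finalize(my_lib_state *state)
{
    return MPI_Comm_free(&state->private_comm);
}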

When an MPI program starts up, the default communicator, MPI_COMM_WORLD, is created; it contains all of the processes in the program. The communicator is passed as an argument to most MPI routines. The other arguments define things such as the source or destination of a message and the buffers that hold the messages. The MPI routines return an integer error code that can be checked for any problems that occurred during the execution of the routine.

MPI programs usually include, somewhere near the beginning, calls to a trio of routines to set up the use of MPI:

int my_id, numprocs;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

 

The first routine (MPI_Init) takes as input the command line arguments familiar to any C programmer and initializes the MPI environment. The next two routines take as input the MPI communicator (in this case, the default communicator) and return the rank of the calling process and the number of processes in total. The rank is used as a unique identifier of the process and runs from 0 to the number of processes minus one.
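
To make the roles of these routines concrete, here is a minimal, self-contained sketch of our own (not part of the pi example; it also uses the MPI_Finalize call described next) in which every process reports its rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int my_id, numprocs;

    MPI_Init(&argc, &argv);                    /* set up the MPI environment    */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);     /* this process's rank           */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* total number of processes     */

    printf("process %d of %d\n", my_id, numprocs);

    MPI_Finalize();                            /* shut down the MPI environment */
    return 0;
}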

The details of how many processes are created and which processors they run on are external to the MPI application programming interface. Different methods are used depending on the platform supporting MPI. In most cases, there is a host file that lists each of the processors by name. This is passed to a common shell script called mpirun, available on most MPI platforms, to launch an MPI program. The details of this simple procedure vary from one platform to another, so we won't discuss them here.

At the end of every MPI program there should be a routine to close down the environment. This function returns an integer value as an error code.

int MPI_Finalize();

 

In between these routines is the work of the MPI program. Most of the program is regular serial code in the language of your choice. As mentioned before, while every process executes the same code, the behavior of the program differs based on the process rank. At points where communication or some other interaction between processes is required, MPI routines are inserted. The first version of MPI had over 120 routines, and the later version (MPI 2.0) is even larger. Most programs, however, use only a tiny subset of MPI functions. We will talk about only one: a routine to carry out a reduction and return the final reduced result to one of the processes in the group.

int MPI_Reduce(void* sendbuf, void* recvbuf,
               int count, MPI_Datatype datatype, MPI_Op op,
               int root, MPI_Comm comm)

 

This function takes "count" values of type "datatype" in the buffer "sendbuf", accumulates the results from each process using the operation "op", and places the result in the buffer "recvbuf" on the process of rank "root". The MPI_Datatype and MPI_Op arguments take on intuitively named values such as MPI_DOUBLE or MPI_SUM.

The other commonly used routines in MPI broadcast a message (MPI_Bcast), define barrier synchronization points (MPI_Barrier), send a message (MPI_Send), or receive a message (MPI_Recv). You can find more details about MPI online or in [mpi] or [mattson05].
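
To give a flavor of the point-to-point routines (this sketch is our own, not part of the pi example, and it needs at least two processes to run), rank 0 sends a single double to rank 1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double value = 3.14, incoming = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* arguments: buffer, count, datatype, destination, tag, communicator */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* a matching receive: source 0, tag 0, on the same communicator */
        MPI_Recv(&incoming, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %f\n", incoming);
    }

    MPI_Finalize();
    return 0;
}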


The MPI pi Program

The MPI pi program is a straightforward modification of the original serial code. To keep things as simple as possible, we will continue to set the number of steps in the program itself rather than input the value and broadcast it to the other processes.

The program opens with the MPI include file, which defines the datatypes, constants, and routines in MPI. We then call the standard trio of routines to initialize the MPI environment and make the basic parameters (the number of processes and the rank) available to the program.

#include "mpi.h"

static long num_steps = 100000; 

void main (int argc, char *argv[])

{

int i, my_id, numprocs; 

double x, pi, step, sum = 0.0 ;

step = 1.0/(double) num_steps ;

MPI_Init(&argc, &argv) ;

MPI_Comm_Rank(MPI_COMM_WORLD, &my_id) ;

MPI_Comm_Size(MPI_COMM_WORLD, &numprocs) ;

my_steps = num_steps/numprocs ;

for (i=my_id; i<num_steps; i+numprocs)

{

x = (i+0.5)*step;

sum += 4.0/(1.0+x*x);

}

sum *= step ; 

MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,

MPI_COMM_WORLD) ;

MPI_Finalize(ierr);

}

 

We then use a common trick to divide the iterations of the loop among the collection of processes. Notice that the loop limits were changed so the loop runs from the ID of each process to the number of iterations, with the increment equal to the number of processes in the group. This works since the rank, as defined in MPI, is used as the ID and runs from zero to the number of processes minus one. In essence, this simple transformation parcels out the loop iterations in a round-robin manner, as if dealing out a deck of cards to the various processes. For example, with four processes, the process of rank 1 computes iterations 1, 5, 9, and so on.

Once each process completes its share of the summation, putting its partial sum into the variable "sum", the reduction is carried out with the call:

MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, 

MPI_COMM_WORLD) ;

 

The meaning of each argument should be clear from the definition of MPI_Reduce discussed in the previous section. We are using the partial sum "sum" as the send buffer and the variable "pi" as the receive buffer. The result will arrive on the process of rank 0, as given by the sixth argument of the MPI_Reduce routine. The send buffer contains one value of type MPI_DOUBLE, and the accumulation operation is addition (MPI_SUM). Finally, the processes involved in this reduction use the communicator MPI_COMM_WORLD.


Java threads Overview

The Java language was designed with multithreading support built in. Threads are one of the key pieces of Java technology; they are supported at the language (syntactic) level as well as at the Java virtual machine and class library levels. In many ways Java threads are very similar to POSIX pthreads. The Java class libraries provide a Thread class that supports a rich collection of methods to start, run, or stop a thread, and to check on a thread's status.

Java's threading support includes a sophisticated set of synchronization primitives based on monitors and condition variables. At the language level, methods within a class, or blocks of code, that are declared synchronized do not run concurrently on the same object. Such methods or blocks run under the control of monitors that help ensure that the data accessed within them remains in a consistent state. Every Java object has its own monitor, which is instantiated and activated by the JVM the first time it is used. Monitors act much like a condition variable and mutex pair as defined in pthreads. Unlike pthreads, however, it is possible to interrupt a Java thread while it is in a wait state, for example while it is waiting for an event notification or blocked on an I/O call.
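
As a minimal sketch of the synchronized keyword (our own example, not part of the pi program), the following counter serializes access to its state through the object's monitor, so concurrent increments cannot interleave and corrupt the count:

public class SharedCounter {
    private long count = 0;

    public synchronized void increment() {
        count++;              // runs while holding the object's monitor
    }

    public synchronized long get() {
        return count;
    }
}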


The Java threads pi program

In this simple example we show how one would write a parallelized version of the pi program with the help of "plain" Java threads:

public class PI1 {
    static long num_steps = 100000;
    static double step;
    static double sum = 0.0;
    static int part_step;

    static class PITask extends Thread {
        int part_number;
        double x = 0.0;
        double sum = 0.0;

        public PITask(int part_number) {
            this.part_number = part_number;
        }

        public void run() {
            // partial sum over the subrange assigned to this thread
            for (int i = part_number; i < num_steps; i += part_step) {
                x = (i + 0.5) * step;
                sum += 4.0 / (1.0 + x * x);
            }
        }
    }

    public static void main(String[] args) {
        int i;
        double pi;
        step = 1.0 / (double) num_steps;
        part_step = Runtime.getRuntime().availableProcessors();
        PITask[] part_sums = new PITask[part_step];
        for (i = 0; i < part_step; i++) {
            (part_sums[i] = new PITask(i)).start();
        }
        for (i = 0; i < part_step; i++) {
            try {
                part_sums[i].join();
            } catch (InterruptedException e) {
            }
            sum += part_sums[i].sum;
        }
        pi = step * sum;
        System.out.println(pi);
    }
}

 

To start a new thread in Java, one typically subclasses the Thread class and defines a custom run() method that holds the work to be done in parallel. In our example this work is implemented in the run() method of the PITask class. For performance reasons, the whole integration range is split into part_step pieces, so that the number of pieces equals the number of available processors. Each PITask object is parameterized by part_number, which denotes its piece of the integration range, and the body of run() calculates the partial sum over that subrange. The actual thread is started, and executes concurrently, when the start() method is called. We do this in a loop for all subranges. Then we run a second loop in which we wait for each spawned thread to finish by calling its join() method, and we add up the results obtained from each thread. In this example each integration subrange is explicitly mapped onto a separate Java thread.

In this example Java threads were created explicitly, and as a result we had to manually partition the work between threads by splitting the integration range into pieces. If we had instead created as many threads as there are steps in the integration range, we would find the performance of the program unacceptable, because creating a Java thread is, in general, quite an expensive operation.


Java concurrent FJTask framework

While the "plain" Java threads mentioned in the previous section are the bottommost level of Java threading support, there are many higher-level threading libraries that aim to enhance the basic threading functionality and provide solutions to some well-known tasks. One notable example is the java.util.concurrent package, which has been part of the Java standard since version 1.5. This package includes many enhancements over basic Java threading, such as thread pool support, atomic variables, and sophisticated synchronization primitives. However, some pieces of the util.concurrent package did not make it into the J2SE standard and are still available only as a standalone library called EDU.oswego.cs.dl.util.concurrent. The essential missing piece is the FJTask framework, which adapts the concept of fork-join parallelism to Java and is primarily targeted at parallelizing compute-intensive calculations such as numerical integration or matrix multiplication. FJTasks are lightweight, stripped-down analogs of Threads; this is often referred to as "task-based" parallelism, versus "thread-based" parallelism. FJTasks are typically executed on the same pool of Java threads, and they support versions of the most common methods found in the Thread class, including start(), yield() and join().

FJTasks do not support some Java thread features, such as priority control. The main economies of FJTasks stem from the fact that they are not designed around blocking operations of any kind. Nothing prevents one from blocking within a FJTask, and very short waits/blocks are completely well behaved, but FJTasks are not designed to support arbitrary synchronization, since there is no way to suspend and resume individual tasks once they have begun executing. FJTasks should also be finite in duration and should not contain infinite loops; they should simply run to completion without issuing waits or performing blocking I/O. The overhead differences between FJTasks and Threads are substantial: FJTasks can be two or three orders of magnitude faster than Threads, at least when run on JVMs with high-performance garbage collection (every FJTask quickly becomes garbage) and good native thread support.

The Java concurrent FJTask pi program

In the example below we show how one would write the pi program with the help of the FJTask framework:

import EDU.oswego.cs.dl.util.concurrent.FJTask;
import EDU.oswego.cs.dl.util.concurrent.FJTaskRunnerGroup;

public class PI2 {
    static int num_steps = 100000;
    static double step;
    static double sum = 0.0;
    static int part_step;

    static class PITask extends FJTask {
        int i = 0;
        double sum = 0.0;

        public PITask(int i) {
            this.i = i;
        }

        public void run() {
            // each task handles a single step of the integration
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }

    public static void main(String[] args) {
        int i;
        double pi;
        step = 1.0 / (double) num_steps;
        try {
            FJTaskRunnerGroup g = new FJTaskRunnerGroup(
                    Runtime.getRuntime().availableProcessors());
            PITask[] tasks = new PITask[num_steps];
            for (i = 0; i < num_steps; i++) {
                tasks[i] = new PITask(i);
            }
            g.invoke(new FJTask.Par(tasks));   // run all tasks and wait for completion
            for (i = 0; i < num_steps; i++) {
                sum += tasks[i].sum;
            }
            pi = step * sum;
            System.out.println(pi);
            System.out.println(Math.PI);
        } catch (InterruptedException ie) {
        }
    }
}

 

First, we declare a run() method for the PITask class, similar to what we did in the previous example. In this case, however, instead of calculating a partial sum over an integration subrange, the PITask computes only the single value of x that corresponds to the i-th step. Then we create an array of PITask objects, wrap it in an FJTask.Par object, and submit it for execution by calling invoke() on an FJTaskRunnerGroup object. Wrapping the array in an FJTask.Par object instructs the framework to execute the underlying tasks concurrently on a thread pool, whose size we have set equal to the number of processors. The invoke() method in this example blocks until all the tasks in the array are complete, which allows us to calculate the total sum immediately afterward by extracting the individual sum value from each task.

Note that this slightly modified version of the Java pi program doesn't create any threads explicitly and does not do any explicit partitioning of work between threads and tasks. Even so, it performs reasonably well compared to the previous example, where we did explicit work partitioning between threads, because the creation and asynchronous execution of every new FJTask is almost as fast as calling a method. However, for the best performance it is still recommended to assign a reasonably large amount of work to each FJTask object, keeping the number of these objects manageable and reducing the stress on the garbage collector within the JVM; a sketch of one way to do this follows.
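
For instance, a chunked task (our own sketch, assuming the same EDU.oswego.cs.dl.util.concurrent library and the static num_steps and step fields of the PI2 class above) could cover a whole subrange of steps, much as PITask did in the thread-based version:

static class PIRangeTask extends FJTask {
    int start;          // first step handled by this task
    int stride;         // total number of tasks
    double sum = 0.0;   // partial sum over this task's subrange

    PIRangeTask(int start, int stride) {
        this.start = start;
        this.stride = stride;
    }

    public void run() {
        for (int i = start; i < num_steps; i += stride) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }
}

The main() method would then create only as many PIRangeTask objects as there are processors, passing (i, number_of_tasks) to each, and combine their partial sums exactly as before.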


Choosing a parallel programming notation

In this paper, we have considered a wide range of the commonly used notations for parallel programming. The program used for this discussion is very simple; perhaps too simple. The hope, however, is that even with this simplicity, you can gain a good high level appreciation for each of the parallel programming notations.

These parallel programming notations vary in their complexity, how much they require the original serial program to change, and how error prone they are to work with. All of these factors must be considered in light of the types of parallel algorithms you intend to work with. In addition, you need to consider:

Portability: which platforms do you need to support? MPI is so popular in part because it runs everywhere. But if you only plan to support hardware with a shared address space, a notation based on threads may be a better match.

Performance: managed runtimes and high-level runtime environments make a programmer's life much easier. Given the high cost of creating and maintaining software, these benefits are very important to consider. But they come at a cost. A low-level API (such as Windows threads, Pthreads, or MPI), in which the hardware is directly exposed to the programmer, permits more detailed optimization. And if scalability to the last available core is needed, this optimization can be very important.

Serial and parallel product releases: Software has a long lifetime, and a successful software developer has a long legacy to support. Hence, it may be important to maintain serial and parallel versions of your software within a single source code tree. This is difficult to do if the parallel programming notation requires massive rewriting of the software to support parallelism.

Familiarity: Learning a new language can be difficult. When applied across a team of developers, the costs to learn an unfamiliar language can be overwhelming. Hence, a parallel notation that is an extension of a familiar serial language can be very important.

Testing: A software product must be extensively tested. In a professional development environment, the cost of testing easily outstrips the cost of creating the software in the first place. This is where an incremental strategy for parallel programming (common with OpenMP) can be important. With incremental parallelism, as each construct is added, the developer can test the program and confirm that the results are still consistent with the original serial code.


References

[mattson05] T. G. Mattson, B. A. Sanders, and B. L. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2004.
[mpi] The MPI Forum, MPI: A Message-Passing Interface Standard, www.mpi-forum.org.
[omp] The OpenMP Architecture Review Board, OpenMP Application Program Interface, www.openmp.org.


About the Authors


Tim Mattson is a parallel programmer. Over the last 20 years, he has used parallel computers to make chemicals react, shake up proteins, find oil, understand genes, and solve many other scientific problems. Tim is determined to make sequential software rare so application programmers routinely write parallel software. For many years, he believed the answer was to find the right parallel programming environment. He worked with countless parallel programming environments and helped create a few (including OpenMP). After this approach proved less effective than he had hoped, he switched gears and decided that before we worry about languages and software tools, we need to understand how expert programmers think about parallel programming. To answer that question, Tim and his collaborators spent over five years developing a design pattern language for parallel programming ("Patterns for Parallel Programming", Addison-Wesley, 2004). Currently at Intel, Tim continues to work on the parallel application programming problem in the Applications Research Lab of Intel's Corporate Technology Group.

 

Andrey Y Chernyshev is a software engineer within the Enterprise Solutions Software Division of Intel. He is located in Russia and can be reached at andrey.y.chernyshev@intel.com.

 

