Below is a program written solely to demonstrate the advantage of using 4 cores simultaneously.
However, it runs for 90 seconds on a 4-core Xeon (3 GHz) versus 2 seconds on a single-core machine.
Any hints greatly appreciated.
[code section excised for sanity]
Compiled using 'gcc -O3 -fopenmp workshare2.c -o workshare2' with gcc 4.3.2 on openSUSE 11.1 (64-bit).
The core of your problem is probably here:
    #pragma omp parallel for private(i,j,k) schedule (static,chunk)
    for (i=0; i < N; i++) {
        for (j = 0; j<200000; j++) {
            k = rand();
        }
        // c[i] = a[i] + b[i];
    }
Though rand() is not required to be reentrant, and therefore not required to be thread-safe (see http://www.opengroup.org/onlinepubs/000095399/functions/rand.html), some implementations provide thread safety by putting a lock inside the function. If yours does, all those parallel calls to rand() from the various threads are probably being serialized, with every thread contending for the same lock. That could go a long way toward explaining the slowdown you report.
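One way to sidestep any lock inside rand() is to give each thread its own generator state. Here is a minimal sketch using POSIX rand_r(), which keeps its state in a caller-supplied seed. The function name churn_random, the base seed 12345, and the call counter are my own illustration, not part of your program, and I have not measured this on your machine.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for the rand_r() declaration */
#endif
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: runs the same nested loop as the snippet above,
 * but draws numbers with rand_r() and a seed private to each thread.
 * Returns the total number of rand_r() calls made by all threads. */
long churn_random(int n, int iters)
{
    long calls = 0;

    #pragma omp parallel reduction(+:calls)
    {
        /* Declared inside the parallel region, so each thread gets its
         * own copy and no state is shared between threads. */
        unsigned int seed = 12345u;   /* arbitrary base seed (assumption) */
#ifdef _OPENMP
        seed += (unsigned int)omp_get_thread_num();
#endif
        int k = 0;

        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < iters; j++) {
                k ^= rand_r(&seed);   /* touches only this thread's seed */
                calls++;
            }
        }
        (void)k;   /* the values are discarded, as in the original loop */
    }
    return calls;
}
```

Calling churn_random(N, 200000) would stand in for the loop in your program. Note that rand_r() is a fairly weak generator, so this is only meant to show the per-thread-state idea, not to recommend it for serious random number work.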
For future reference, you might consider timing just the code you're testing for parallel performance, rather than including the serial initialization section as part of the timed section as is done in this example.
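As a rough sketch of that timing advice, the following wraps only the parallel loop in timer calls, using the vector add that is commented out in your loop as the workload. The names wall_seconds and timed_vector_add are hypothetical, and the clock() fallback is my own addition so the code still builds without -fopenmp.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: wall-clock seconds from omp_get_wtime() when
 * built with OpenMP, otherwise processor time from clock() (adequate
 * for the serial build, where the two are roughly equivalent). */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Times only the parallel loop. The caller fills a[] and b[] before
 * calling, so the serial initialization is outside the measurement. */
double timed_vector_add(const double *a, const double *b, double *c, int n)
{
    double start = wall_seconds();

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    return wall_seconds() - start;
}
```

Initialize the arrays first, then print the value returned by timed_vector_add for the serial and parallel builds to compare just the part that is supposed to scale.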