So I've implemented a simple divide and conquer algorithn, mergesort to be exact, and added a cilk_spawn to one of the recursive calls and a cilk_sync before doing anything with the results of both recursive calls. Still when I try using two cores, it actually seems to be taking longer than with one core.
My code is just the main function and the mergesort function (Cilk parts in bold):
----------------------------------------------------------------------------------------------------
#include
#include
#include
#include
#define _N 100000000
using namespace std;
void mergesort(int a[], int l, int r);
int main()
{
srand(time(NULL));
int* a = new int[_N];;
for (int i = 0; i < _N; i++)
a[i] = (int)rand();
clock_t start = clock();
mergesort(a, 0, _N-1);
clock_t ticks = clock() - start;
cout << "Finished calculation in " << 1000.0*((double)ticks / (double)CLOCKS_PER_SEC) << " milliseconds!" << endl;
/*
for (int i = 0; i < _N; i++)
cout << a[i] << ", ";
cout << endl;
*/
delete[] a;
return 0;
}
// Sorts a. l and r are the left and right boundaries.
void mergesort(int a[], int l, int r)
{
if (l == r)
return;
// Get the middle.
int m = (l+r)/2;
// Sort left and right.
cilk_spawn mergesort(a, l, m);
mergesort(a, m+1, r);
cilk_sync;
// Merge. The left and right parts are already sorted. r-l is the final size of the merged blocks.
// j is used as an index for the left part, while k will be used for the right. i is for b.
int j = l;
int k = m+1;
int* b = new int[r-l+1];
for (int i = 0; i < r-l+1; i++)
{
// Left part is empty. Simply copy the right.
if (j > m)
b[i] = a[k++];
else
{
// Left part isn't empty, but the right one is. Simply copy the left.
if (k > r)
b[i] = a[j++];
else
{
// Both the left and right aren't empty. Compare the first elements.
if (a[j] <= a[k])
b[i] = a[j++];
else
b[i] = a[k++];
}
}
}
// Copy b into the appropriate part of a.
for (int i = 0; i < r-l+1; i++)
a[l+i] = b[i];
delete[] b;
}
----------------------------------------------------------------------------------------------------
Looking at the code, I assume there is a race condition concerning a if two threads are trying to write into it at the same time...so the output using cilk may be wrong (it works fine without)...still that can't be the cause of the problem?
Here is some sample output:
$ export LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.9.293/compiler/lib/intel64/
$ export CILK_NWORKERS=1
$ ./main
Finished calculation in 39430 milliseconds!
$ export CILK_NWORKERS=2
$ ./main
Finished calculation in 49780 milliseconds!
Anyone have an idea of what I am overseeing here?
Cheers!





