# What am I not seeing here?

## What am I not seeing here?

So I've implemented a simple divide and conquer algorithn, mergesort to be exact, and added a cilk_spawn to one of the recursive calls and a cilk_sync before doing anything with the results of both recursive calls. Still when I try using two cores, it actually seems to be taking longer than with one core.

My code is just the main function and the mergesort function (Cilk parts in bold):

----------------------------------------------------------------------------------------------------
#include
#include
#include
#include

#define _N 100000000

using namespace std;

void mergesort(int a[], int l, int r);

int main()
{
srand(time(NULL));

int* a = new int[_N];;
for (int i = 0; i < _N; i++)
a[i] = (int)rand();

clock_t start = clock();
mergesort(a, 0, _N-1);
clock_t ticks = clock() - start;
cout << "Finished calculation in " << 1000.0*((double)ticks / (double)CLOCKS_PER_SEC) << " milliseconds!" << endl;

/*
for (int i = 0; i < _N; i++)
cout << a[i] << ", ";

cout << endl;
*/

delete[] a;
return 0;
}

// Sorts a. l and r are the left and right boundaries.
void mergesort(int a[], int l, int r)
{
if (l == r)
return;

// Get the middle.
int m = (l+r)/2;

// Sort left and right.
cilk_spawn mergesort(a, l, m);
mergesort(a, m+1, r);
cilk_sync;

// Merge. The left and right parts are already sorted. r-l is the final size of the merged blocks.
// j is used as an index for the left part, while k will be used for the right. i is for b.
int j = l;
int k = m+1;
int* b = new int[r-l+1];
for (int i = 0; i < r-l+1; i++)
{
// Left part is empty. Simply copy the right.
if (j > m)
b[i] = a[k++];
else
{
// Left part isn't empty, but the right one is. Simply copy the left.
if (k > r)
b[i] = a[j++];
else
{
// Both the left and right aren't empty. Compare the first elements.
if (a[j] <= a[k])
b[i] = a[j++];
else
b[i] = a[k++];
}
}
}

// Copy b into the appropriate part of a.
for (int i = 0; i < r-l+1; i++)
a[l+i] = b[i];

delete[] b;
}
----------------------------------------------------------------------------------------------------

Looking at the code, I assume there is a race condition concerning a if two threads are trying to write into it at the same time...so the output using cilk may be wrong (it works fine without)...still that can't be the cause of the problem?

Here is some sample output:

\$ export LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.9.293/compiler/lib/intel64/
\$ export CILK_NWORKERS=1
\$ ./main
Finished calculation in 39430 milliseconds!
\$ export CILK_NWORKERS=2
\$ ./main
Finished calculation in 49780 milliseconds!

Anyone have an idea of what I am overseeing here?
Cheers!

17 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

Have you tried using the Cilkscreen race detector to look for race conditions? How about Cilkview to figure out what the potential parallelism is?

When we ran Cilkview over the quicksort demos, we were shocked at the parallelism numbers it gave us. Much headscratching later,we realized that the partitioning of the data was a serial process and limiting how much speedup we could get.

Another possibility is that you could limit the depth to which you call cilk_spawn. Below a certain depth you're just adding overhead.

- Barry

Hi,

Could the problem be in how you are measuring the time?
On myLinux desktop, I modified your code touse "gettimeofday" to measure time, as follows:

#include

struct timeval start_t, end_t;
gettimeofday(&start_t, 0);

clock_t start = clock();
mergesort(a, 0, _N-1);
clock_t end = clock() - start;

gettimeofday(&end_t, 0);
double wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
(end_t.tv_usec - start_t.tv_usec)/1000.0;
int P = __cilkrts_get_nworkers();
cout << "P = " << P << ", Wall time (ms) = " << wall_time_in_ms << endl;

I seem to be observing speedup:

\$CILK_NWORKERS=1 ./main
P = 1 Wall time (ms): 39262.9
Finished calculation in 39190 milliseconds!

\$ CILK_NWORKERS=2 ./main
P = 2 Wall time (ms): 20661
Finished calculation in 41180 milliseconds!

\$ CILK_NWORKERS=4 ./main
P = 4 Wall time (ms): 11876.1
Finished calculation in 47280 milliseconds!

I'm not too familiar with the std::clock() routine myself, but it seems like something in the timing calculation is off?

Cheers,

Jim

If the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock.

So Jim's correct. std::clock() is the culprit.

- Barry

I've tested your code Jim and I see a performance increse from 40 secs
to around 24/25 secs. Thanks for your help guys. It also works using the
cilktime.h header files from the example.

Barry, I had already thought of the fact that I might be generating to
much overhead. From experience, would you say that it is something that
should definately be kept in mind when using cilk or can it be safely
ignored in the majority of cases?

As for using cilkview and cilkscreen...I'm not sure whats going on but I'm getting a some errors, such as

\$ cilkview \$PWD/main
cilkview: generating scalability data
~/Dokumente/Programming/cilkutil/bin/../lib64/cilkview-tool.so:
libelf.so.0: cannot open shared object file: No such file or directory
cilkview error: command exited with error condition: 255

I had to use \$PWD cause it was telling me that it couldn't find main for
some reason. Any idea if this error is common? I'll try fixing it by

Thanks so far guys, really liking the support in these forums!
Cheers,
Chris

> Barry, I had already thought of the fact that I might be generating to much overhead. From experience,
> would you say that it is something that should definately be kept in mind when using cilk or can it be
> safely ignored in the majority of cases?

You can probably ignore it in the majority of cases. The information displayed by Cilkview can help you understand when the burdened parallelism is out of proportion to the unburdened parallelism. The burdened parallelism includes the overhead of the cilk_spawns.

We're working on the libelf.so.0 issue and believe we have a fix but creating a symbolic link is the easiest workaround.

- Barry

Okay, thanks. Seems to be working now!
Cheers,
Chris

Chris,

One other thing to keep in mind.

It might be advisable to have a cutoff afterwhich point you do not use cilk_spawn.

You might find it to your advantage not to spawn more than (# threads) x N, where N might be on the order of 2:16. But this may vary depending on the order/disorder of the partitions.

Jim Dempsey

Blog: The Parallel Void

More generally, in your mergesort, it is often more efficient (even serially) to not recurse all the way down to a problem size of 1, but a larger granularity. In particular, instead of recursing down to the base case of l == r, recurse down until something like (r - l) < B, for a base-case parameter B, and then switch to a simpler base case method that sorts arrays of length B or less. I noticed this in your original code, but I thought you might be writing a simple test Cilk program, rather than trying to code mergesort as fast as possible?

I'm going to digress slightly and get on my soapbox for a bit. :)
Generally, I have found that the notions of "partitioning" and tuning your code based on the # of threads is often counterproductive in writing efficient and portable Cilk code.

For Cilk, you should try to AVOID choosing your base case parameter "B" based on the P --- the number of worker threads being used at runtime. Instead, you should often try to choose the parameter B such that the overhead all the "spawn" statements (e.g., in a serial execution with CILK_NWORKERS=1) is a sufficiently small fraction compared to the total amount of work in your program. In Cilk, the overhead of a "spawn" statement isafixed overhead that does not depend on the value of P, or grow as P increases.

Intuitively, I would argue the fundamental difference in these choices is the following. By choosing B based on P, you are tuning your code to expose only as much parallelism in your computation as the machine you are currently running on can exploit. By choosing B based on the overhead of a spawn, you are exposing as much parallelism as possible in your computation, with the constraint that the overhead of spawning is not increasing the total work in your computation by too much, compared to the time would normally spend to execute the code serially. The latter choice is often more robust, particularly in a multiprogrammed environment where some of the processors in your machine might be executing other programs / have other work to do.

When you write Cilk code for a function f() that makes decisions based on the value of P,you are attempting to make a decision about runtime scheduling, assuming that you know how many processors you have available to execute f(). In a large program, however, this may or may not be true.
For example, ina large program, sometimes,a program might executing f() by itself, with all P processors available, and other times, it might be executing another function which itself executes f() in parallel with another parallel function g(), and processors might need to be split efficiently between f() and g(). If your code for f() makes decisions based on thevalue ofP, then although it may work ok in the first case, it may be less efficient for the second case.

Said differently, by writing Cilk code that makes decisions based on the value of P, you may be interfering with the ability of Cilk's runtime scheduler to load-balance the computation using randomized work-stealing, a scheduling technique which is known to have provable guarantees on efficiency. Cilk has a runtime scheduler which is designed to do this load-balancing for you, once you have exposed as much parallelism in the computation as is reasonably possible (i.e., before the spawn overhead becomes too great.) Thus, attempting to do your own load-balancingand scheduling on top of Cilk can be counterproductive.

Of course, there are other parallel programming platforms / languages, where it may be appropriate or desirable to write code which explicitly depends on the value of P, in part because thesesystems may expect the programmer todosome of their own load-balancingand scheduling. But I believe it is important to remember that what works well for one platform may not necessarily work well for another.

Cheers,

Jim

Jim, well said.

Blog: The Parallel Void

Hi everybody,

Quoting Jim Sukha (Intel) More generally, in your mergesort, it is often more efficient (even serially) to not recurse all the way down to a problem size of 1, but
a larger granularity. In particular, instead of recursing down to the base case of l == r, recurse down until something like (r - l) < B,
for a base-case parameter B, and then switch to a simpler base case method that sorts arrays of length B or less.
I noticed this in your original code, but I thought you might be writing a simple test Cilk program, rather than trying
to code mergesort as fast as possible?..

To Jim Sukha:

I've spent a significant amount of time on optimization of different sorting algorithms. Based on my
experience MergeSort is a very simple in theory and implementation. However,I had to "fight" a lot with
optimizationsandfinally implemented a version of MergeSortthatoutperforms QuickSort and HeapSort
sorting algorithms on arrays larger than 8MB.

Ihavefour different versions of MergeSortincluding a version that at some threshold switches to a
QuickSort or HeapSort to sort sub-arrays and,based on my tests, results are negative.

Here is a list of genericoptimizations ( not Cilk based )that reallyimproveda performance of MergeSort:

- Using 'malloc-free' CRT-functionswhen sub-arrays are greater than128KB
Note: 'new-delete' C++ operators are decreasing performance

- Using 'alloca' CRT-function when sub-arrays are less than 128KB
Note: 'alloca' dynamically allocates an array on the stack and it will be automatically released as soon as
a recursive call is done

-A smaller part that merges two already sorted sub-arrays of some Level N

- Unrolling loops in a final'for' statement when copying a sub-array of a Level N to an array of Level N-1
with sizes up to 16KB

- Using 'memcpy' instead ofthe final 'for' statementwhen copying a sub-array of a Level N to an array of
a Level N-1 with sizes from 16KBup to 64KB, and using SSE-based'memcpysse' for sub-arrays with
sizes greater than 64KB

I'll provide some performance data for MergeSort, HeapSort and QuickSort algorithmsin a couple of days.

Best regards,
Sergey

alloca is great. However I just dealt with a bug report where a user had used alloca in a loop, something like:

for (int i = 0; i < some_number; i++) { int *block = (int *)alloca(size_of_block); }

Each of those calls to alloca will allocate a new block. In the case I was looking at, it blew out the stack. Never forget that memory allocated by alloca isn't returned until the function it's called in returns, not when you leave scope like a destructor.

- Barry

Yes, I would agree that are many optimizations that you can apply to make a sorting algorithm more efficient. Sorting is definitely one of those problems that can get quite complicated the more you study it. :)

In the context ofCilk, I believe many of the optimizations you are describing are ones that the original post authormight applyto the serial base case, which in turn will make the parallel algorithm more efficient. But as I mentioned, I'm not quite sure what the goal of author of the original post is --- is the mergesort just a toy example, or a real problem that the author needs a solution for?

I'd be curious to see how your various sorting implementations compare. My guess would be that quicksort tends to perform better (particularly on small input sizes) because it operates in-place. But perhaps there are other factors to consider as well? How do your implementations compare to standard library implementations like qsort?

Just as an aside, there is anotheralgorithmic choice to consider when one parallelizes amergesort or quicksort in Cilk. For large enough arrays, you may also want to parallelize the "merge" step (or partition, in the case of quicksort) to expose more parallelism in the algorithm.

• For a merge sort routine that sorts n elements, but performs the merge serially, consider what the work and span (critical path length) of the computation.
The work of mergesort is O(n lg n).In a mergesort which performs a serial merge, the span (i.e., the critical path length) is \Theta(n). [Roughly, span or critical path length is the time your computation would take if you had an infinite number of processors (with no overheads).] In mergesort, the largest merge takes\Theta (n) time because you are merging n elements serially. Conceptually, the parallelism of mergesort with a serial merge is O(n lg n) / \Theta(n) or O(lg n) parallelism.
• On the other hand, it is possible to implement a parallel merge algorithm for mergesort. Mergesort with a parallel merge ends up giving you an algorithm which performs O(n lg n) work and O(n / lg^2 n) parallelism. This algorithm willexpose significantly more parallelism. I'm not going to try to explain the algorithm here, because it is described in Section 27.3 of the 3rd Edition of "Introduction to Algorithms" by CLRS (Cormen, Leiserson, Rivest, and Stein), and if I try to repeat it, I won't do it justice.

It is possible to implement an efficient parallel merge in practice --- Ihave implemented similar algorithms in the past, and know of others who have as well. However, the base cases need to be large on parallel merge (or parallel partition for quicksort) before the gains you get from parallelism in a mergestart to outweigh the overhead compared to a serial implementation. In the past, I had observed that I needed something like at least 10^4, 10^5, or 10^6elements in a parallel merge before paralellizing that step becomes worthwhile. This number may be outdated now -- perhaps youneed even more?

• There is a similar issue with quicksort --- for large enough inputs, you may want to parallelize the partition methodas well. It might beeven harder to come up with a fast parallel partition method though. A serial algorithm can partition in-place, but most all the parallel partitioning algorithms that I'm aware of require extra space.

Cheers,

Jim

Quoting Barry Tannenbaum (Intel) alloca is great. However I just dealt with a bug report where a user had used alloca in a loop, something
like:

for (int i = 0; i < some_number; i++)
{
int *block = (int *)alloca(size_of_block);

}
...

Hi Barry,

Thank you for the example of a wrong application of 'alloca' CRT-function. It is really interesting!

In my version of MergeSort algorithm 'alloca'is used as follows ( pseudo code without many details ):

...
void MergeSort( data, size )
{
...
MergeSort( data[0], size / 2 );
MergeSort( data[size/2], size / 2 );
...
MergeSortHelper( data, low, mid, high );
}

void MergeSortHelper( data, low, mid, high )
{
*datatemp = alloca( high ); //Allocates memory on the stack and it is outside of all the rest 'while' and 'for' statements
...
// Merges sub-arrays ( 'while'-'for'-'for' )
//Copies merged sub-arrays ( final 'for' )
}
...

Best regards,
Sergey

Hi Jim,

Thank you for comments regardingparallelization of sorting algorithms!

Quoting Jim Sukha (Intel)

Yes, I would agree that are many optimizations that you can apply to make a sorting algorithm more efficient. Sorting is definitely one of those problems that can get quite complicated the more you study it. :)

[SergeyK] Data sets has grown significantly andcontinue togrow.I think in a couple of years nobody
will be surprised with a data set that has1GBof elements, ormore...
...
I'd be curious to see how your various sorting implementations compare. My guess would be that quicksort tends to perform better (particularly on small input sizes) because it operates in-place. But perhaps there are other factors to consider as well? How do your implementations compare to standard library implementations like qsort?

[SergeyK] That's a good question and I try to compare.

Best regards,
Sergey

Hi Sergey.

I didn't mean to imply that you were using alloca incorrectly. I just wanted to point out a subtlety of alloca that a lot of people forget. When I pointed it out to the one of the other Cilk developers, he said, " I know that, and I still didn't see the bug."

- Barry

Ascreenshotdemonstrates relative performance of Pigeonhole, Merge and Quick sorting algorithms
on acomputer with one CPU: