# Recursive Quicksort lab using intel openmp taskq

## Recursive Quicksort lab using intel openmp taskq

I wanted to bounce some ideas for a new lab off the folks here in this forum.

Before I do - a little background is in order.

Our new Hands on Integrated architecture module (3.5 hr) is broken down into three major sections - one of which is MC achitecture. The goal of this module is to lay out the architecture but then also show how students can take advantage of it. In the MC section we introduce Openmp as a way of taking advantage of multiple cores after we have laid out MC arch (hi level view) plus memory architecture. One topic we are introducing in Openmp is the concept of work queuing implemted in this lab w intel omp taskq. In the upcoming Openmp 3.0 there will be similar functionality added to the openmp spec - not just an Intel extension as we have now - so I added a lab on work queuing.

I have several labs cooked up already to show how to use work queuing. One is a simple linked list pointer chasing example:

while(p != NULL){
do_work(p->data);
p = p->next;
}

A parallel version of the above looks as follows:

#pragma omp parallel
{
while (p != NULL) {
processwork(p);
p = p->next;
}
}

This example is easy to parallelize using intel work queueing and I think we will keep it as part of the module

However - a slightly more difficult and more interesting example of using work queuing is to parallelize a recursive version of quicksort

I'd like to get feedback on how valuable this example may be. Also - if anyone has a more scalable or better approach to this problem of quicksort - I'd love to see it. As usual, in our labs, we have the serial version presented to student and then a solution folder that has "a solution" to the problem.

Below is the serial version:

void quicksort (int a[], int low, int high)
{
// low = lower index, high = higher index
int i=low, j=high, temp;
int x=a[(low + high) / 2];

// partition
do
{
while (a[i] while (a[j]>x) j--;
if (i<=j)
<
SPAN> {
temp=a[i]; a[i]=a[j]; a[j]=temp;
i++; j--;
}
} while (i<=j);

// recursion
if (low < j) quicksort(a, low, j);
if (i < high) quicksort(a, i, high);
}

Here is the essence of my parallel version - changes to serial version are bold. Note that the quicksort_threaded is very similar to the serial version but with intel omp task clauses added

void quicksort_scheduler(int a[],int low, int high, int taskq_depth);

void quicksort_wrapper(int a[], int low, int high, int taskq_depth);

void quicksort (int a[], int low, int high)

{ //serial version

// low = lower index, high = higher index

int i=low, j=high, temp;

int x=a[(low + high) / 2];

// partition

do

{

while (a[i]

while (a[j]>x) j--;

if (i<=j)

{

temp=a[i]; a[i]=a[j]; a[j]=temp;

i++; j--;

}

} while (i<=j);

// recursion

if (low < j) quicksort(a, low, j);

if (i < high) quicksort(a, i, high);

}

{

// low = lower index, high = higher index

int i=low, j=high, temp;

int x;

{

x=a[(low + high) / 2];

// partition

do

{

while (a[i]

while (a[j]>x) j--;

if (i<=j)

{

temp=a[i]; a[i]=a[j]; a[j]=temp;

i++; j--;

&
nbsp; }

} while (i<=j);

// recursion

{

if (low < j) quicksort_wrapper(a, low, j, taskq_depth + 1);

}

{

if (i < high) quicksort_wrapper(a, i, high, taskq_depth + 1);

}

}

}

void quicksort_wrapper(int a[], int low, int high, int taskq_depth) {

} else {

quicksort(a, low, high);

}

}

int main(int argc, char *argv[]) {

double start, end;

init_array(a);

start = wcgettimeofday();

#pragma omp parallel

{

&nbs

quicksort_wrapper(a, 0, N-1, 0); // exgtra parameter here intialized to 0 keep track of task depth

}

end = wcgettimeofday();

print_array(a);

printf("Compute Time: %f seconds
", end - start);

return 0;

}

The algorithm, as I coded it, does NOT scale well - but it will show a speedup on a dual core box. I tested on our windows dual core laptops used for classroom labs and got 30 seconds wall clock time for serial code versus 21 seconds for parallel version on two cores. I also tested on our 2x4 core Kentsfields and saw the serial execution of 23 seconds go to 7 seconds for parallel version using 8 cores.

So - performance improves on both 2 core and 8 core systems - but it does NOT scale - do you have a better solution using openmp that does (any approach even not demonstrating work queuing)? If so - please submit it to this post.

bob c

2 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

Bob,

Interesting problem. Hopefully someone out there has a better solution for you!

Jerry