Recursive Quicksort lab using intel openmp taskq

Recursive Quicksort lab using intel openmp taskq

I wanted to bounce some ideas for a new lab off the folks here in this forum.

Before I do - a little background is in order.

Our new Hands on Integrated architecture module (3.5 hr) is broken down into three major sections - one of which is MC achitecture. The goal of this module is to lay out the architecture but then also show how students can take advantage of it. In the MC section we introduce Openmp as a way of taking advantage of multiple cores after we have laid out MC arch (hi level view) plus memory architecture. One topic we are introducing in Openmp is the concept of work queuing implemted in this lab w intel omp taskq. In the upcoming Openmp 3.0 there will be similar functionality added to the openmp spec - not just an Intel extension as we have now - so I added a lab on work queuing.

I have several labs cooked up already to show how to use work queuing. One is a simple linked list pointer chasing example:

while(p != NULL){
do_work(p->data);
p = p->next;
}

A parallel version of the above looks as follows:

#pragma omp parallel
#pragma intel omp taskq
{
while (p != NULL) {
#pragma intel omp task captureprivate(p)
processwork(p);
p = p->next;
}
}

This example is easy to parallelize using intel work queueing and I think we will keep it as part of the module

However - a slightly more difficult and more interesting example of using work queuing is to parallelize a recursive version of quicksort

I'd like to get feedback on how valuable this example may be. Also - if anyone has a more scalable or better approach to this problem of quicksort - I'd love to see it. As usual, in our labs, we have the serial version presented to student and then a solution folder that has "a solution" to the problem.

Below is the serial version:

void quicksort (int a[], int low, int high)
{
// low = lower index, high = higher index
int i=low, j=high, temp;
int x=a[(low + high) / 2];

// partition
do
{
while (a[i] while (a[j]>x) j--;
if (i<=j)
<
SPAN> {
temp=a[i]; a[i]=a[j]; a[j]=temp;
i++; j--;
}
} while (i<=j);

// recursion
if (low < j) quicksort(a, low, j);
if (i < high) quicksort(a, i, high);
}

The approach I took to parallelize this with taskq was to create a wrapper to quicksort that keeps track of how many levels of recursion or task depth that quicksort has been invoked for. The wrapper also decides, based on task depth, whether to call the parallel orserial version of quicksort. I use this task depth to govern a simple application level thread scheduler that creates a threadedtask (on a separate thread) if the task depth is shallow, but uses serial code oncethe task depth goes above a threshold.This way I create threaded tasks early in the life of the app and each of these tasks get a fair bit of work that can be done in parallel - and we avoid creating one a separate threaded task to simply swap two numbers.

Here is the essence of my parallel version - changes to serial version are bold. Note that the quicksort_threaded is very similar to the serial version but with intel omp task clauses added

void quicksort_scheduler(int a[],int low, int high, int taskq_depth);

void quicksort_wrapper(int a[], int low, int high, int taskq_depth);

void quicksort (int a[], int low, int high)

{ //serial version

// low = lower index, high = higher index

int i=low, j=high, temp;

int x=a[(low + high) / 2];

// partition

do

{

while (a[i]

while (a[j]>x) j--;

if (i<=j)

{

temp=a[i]; a[i]=a[j]; a[j]=temp;

i++; j--;

}

} while (i<=j);

// recursion

if (low < j) quicksort(a, low, j);

if (i < high) quicksort(a, i, high);

}

void quicksort_threaded (int a[], int low, int high, int taskq_depth)

{

// low = lower index, high = higher index

int i=low, j=high, temp;

int x;

#pragma intel omp taskq

{

x=a[(low + high) / 2];

// partition

do

{

while (a[i]

while (a[j]>x) j--;

if (i<=j)

{

temp=a[i]; a[i]=a[j]; a[j]=temp;

i++; j--;

&
nbsp; }

} while (i<=j);

// recursion

#pragma intel omp task

{

if (low < j) quicksort_wrapper(a, low, j, taskq_depth + 1);

}

#pragma intel omp task

{

if (i < high) quicksort_wrapper(a, i, high, taskq_depth + 1);

}

}

}

void quicksort_wrapper(int a[], int low, int high, int taskq_depth) {

if (taskq_depth < TASKQ_DEPTH) {

quicksort_threaded (a, low, high, taskq_depth+1);

} else {

quicksort(a, low, high);

}

}

int main(int argc, char *argv[]) {

double start, end;

init_array(a);

start = wcgettimeofday();

#pragma omp parallel

#pragma intel omp taskq

{

&nbs
p; #pragma intel omp task

quicksort_wrapper(a, 0, N-1, 0); // exgtra parameter here intialized to 0 keep track of task depth

}

end = wcgettimeofday();

print_array(a);

printf("Compute Time: %f seconds
", end - start);

return 0;

}

The algorithm, as I coded it, does NOT scale well - but it will show a speedup on a dual core box. I tested on our windows dual core laptops used for classroom labs and got 30 seconds wall clock time for serial code versus 21 seconds for parallel version on two cores. I also tested on our 2x4 core Kentsfields and saw the serial execution of 23 seconds go to 7 seconds for parallel version using 8 cores.

So - performance improves on both 2 core and 8 core systems - but it does NOT scale - do you have a better solution using openmp that does (any approach even not demonstrating work queuing)? If so - please submit it to this post.

bob c

2 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

Bob,

Interesting problem. Hopefully someone out there has a better solution for you!

Jerry

Connectez-vous pour laisser un commentaire.