OpenMP generates large overhead in kernel32.dll (SleepEx)

I'm working on an image-processing project using OpenMP.
My code is simple and is shown below.
The program runs smoothly on my Linux platform with gcc 4.3.3,
but it runs incredibly slowly on Windows XP (Visual Studio 2005 with Parallel Studio 2011).
A hotspot analysis shows that the bottleneck is SleepEx in kernel32.dll.

Any ideas?




unsigned char **a_data,
              **b_data,
              **c_data,
              *p,
              *p_a,
              *p_b,
              *p_c;
unsigned long nr,
              nc;
nr = nc = 64;

a_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++){
    a_data[i] = p + i*nr;
}
b_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++){
    b_data[i] = p + i*nr;
}
c_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++){
    c_data[i] = p + i*nr;
}

for(int i=0; i<nr; i++){
    p_a = a_data[i];
    p_b = b_data[i];
    p_c = c_data[i];
    #pragma omp parallel for
    for(int j=0; j<nc; j++) {
        p_a[j] = p_b[j] + p_c[j];
    }
}

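Your parallel for loop is too small to do anything useful in parallel. The iteration space is nc=64, and the work performed per iteration is the addition of two char values, so the cost of starting and synchronizing the threads for each parallel region outweighs the work itself.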
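As a minimal sketch of a coarser-grained version, the parallel region can be moved to the row loop so that the thread team is started once per image rather than once per row. The function name add_images and its signature are assumptions made for illustration; they are not part of the posted code.

```c
#include <assert.h>

/* Hypothetical helper (not from the original post): computes
   a = b + c for nr x nc byte images addressed through row pointers. */
void add_images(unsigned char **a, unsigned char **b, unsigned char **c,
                int nr, int nc)
{
    int i;

    assert(nr >= 0 && nc >= 0);

    /* One parallel region for the whole image instead of one per row. */
    #pragma omp parallel for
    for (i = 0; i < nr; i++) {
        /* Declared inside the loop body, so each thread has its own copies. */
        unsigned char *p_a = a[i];
        const unsigned char *p_b = b[i];
        const unsigned char *p_c = c[i];
        int j;

        for (j = 0; j < nc; j++) {
            /* The sum wraps modulo 256, as in the original code. */
            p_a[j] = (unsigned char)(p_b[j] + p_c[j]);
        }
    }
}
```

Even in this form, a 64x64 image is only 4096 byte additions, so it may still be faster to run it serially; it is worth measuring both.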

If you enclose the posted code in a subroutine and then time many calls to that subroutine, the preponderance of the time will be spent in malloc (preceding your loop).
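To see where the time actually goes, one option is to allocate once and time only the repeated kernel. The sketch below is hypothetical: time_kernel, now_seconds, the flat buffer layout, and the fill values are assumptions made for illustration. It uses omp_get_wtime when built with OpenMP and falls back to clock otherwise.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Timer in seconds: wall clock with OpenMP, clock() otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical harness: allocates the three images once, then times
   `reps` runs of the addition only. Returns the elapsed seconds, or
   -1.0 if an allocation fails. *checksum receives the sum of the result. */
double time_kernel(int nr, int nc, int reps, unsigned long *checksum)
{
    int n, k, r;
    unsigned char *a, *b, *c;
    double t0, t1;
    unsigned long sum = 0;

    assert(nr > 0 && nc > 0 && reps > 0);
    n = nr * nc;

    /* Allocation happens outside the timed region. */
    a = (unsigned char *) malloc((size_t)n);
    b = (unsigned char *) malloc((size_t)n);
    c = (unsigned char *) malloc((size_t)n);
    if (a == NULL || b == NULL || c == NULL) {
        free(a); free(b); free(c);
        return -1.0;
    }
    memset(a, 0, (size_t)n);
    memset(b, 1, (size_t)n);
    memset(c, 2, (size_t)n);

    t0 = now_seconds();
    for (r = 0; r < reps; r++) {
        /* k is the loop variable of the parallel for, so it is private. */
        #pragma omp parallel for
        for (k = 0; k < n; k++) {
            a[k] = (unsigned char)(b[k] + c[k]);
        }
    }
    t1 = now_seconds();

    /* Use the result so the compiler cannot discard the kernel. */
    for (k = 0; k < n; k++) {
        sum += a[k];
    }
    *checksum = sum;

    free(a); free(b); free(c);
    return t1 - t0;
}
```

Comparing the value returned for a build with OpenMP enabled against one without it, for example with nr = nc = 64 and a large reps, gives a rough idea of whether the threading overhead pays off at this image size.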

Jim Dempsey
