Scalability issue with fully parallel code

Hi there!

I've been running some experiments on Xeon Phi recently and I'm hitting a serious scalability issue. The idea is to run fully parallel code, with no memory accesses, on a growing number of processors, each thread executing the exact same number of operations. The per-thread workload is fixed, so we should expect a constant execution time. Here is the code we are running:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int i = 0;
    struct timespec start, end;
    uint64_t total;
    FILE *fd = stderr;

    // Make sure the OpenMP threads are started before we start our time calculation
    #pragma omp parallel
    {
        fprintf(stdout, "parallel %d\n", omp_get_thread_num());
    }

    // Start the experiment
    clock_gettime(CLOCK_REALTIME, &start);
    #pragma omp parallel
    {
        for (i = 0; i < 1024*1024*128; i++)
        {
            asm volatile(
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                "addl $1,%%eax;\n\t"
                :
                :
                : "%eax"
            );
        }
    }
    // Stop the experiment
    clock_gettime(CLOCK_REALTIME, &end);

    total = (end.tv_sec * 1000000000ull + end.tv_nsec)
          - (start.tv_sec * 1000000000ull + start.tv_nsec);

    fprintf(fd, "time = %llu\n", (unsigned long long) total);
    return 0;
}

Here is some info about our environment:

  • MPSS version: mpss_gold_update_3-2.1.6720-19 (released September 10, 2013)
  • KMP_AFFINITY=scatter: this avoids possible hardware stalls due to HyperThreading contention; we want our code to be fully parallel.
  • Number of OpenMP threads set via the OMP_NUM_THREADS environment variable

Here are the results we got:

For 4 cores, avg = 21.09s (25 runs)
For 8 cores, avg = 43.29s (25 runs)
For 12 cores, avg = 65.37s (25 runs)
For 16 cores, avg = 87.24s (25 runs)
For 20 cores, avg = 109.95s (25 runs)
For 24 cores, avg = 132.18s (25 runs)
For 28 cores, avg = 152.79s (25 runs)
For 32 cores, avg = 175.32s (24 runs)
For 36 cores, avg = 196.47s (24 runs)
For 40 cores, avg = 218.72s (24 runs)
For 44 cores, avg = 241.10s (24 runs)
For 48 cores, avg = 263.49s (24 runs)
For 52 cores, avg = 285.33s (24 runs)
For 56 cores, avg = 307.35s (24 runs)

We clearly see the lack of scalability here. So my question is: do these numbers look normal to you?

Jp

Best Reply

I think you should declare the variable "i" inside the parallel region. As your code stands, "i" is shared between all the threads, which is not what you want...

It was indeed the problem. I thank you for your time and apologize for the inconvenience.

No problem at all. I'm glad the fix was that simple!

I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.

This mistake is common and it is unfortunate that the OpenMP syntax allows it.

John D. McCalpin, PhD
"Dr. Bandwidth"

Since we are discussing "unfortunate" features: Cilk(tm) Plus allows a default shared cilk_for loop index in a .c file but not in a .cpp source file. This is probably considered too obvious to document, but it is still a point on which mistakes are easily made.

"I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private." 

Indeed true; however, the semantics would also have been completely different! ("omp parallel for" shares a fixed amount of work between the threads, whereas "omp parallel" has every thread do all of the work.)

My personal preference (even in non-OpenMP code) is to declare and initialise variables in C/C++ where they are first required, unless they need a wider scope. That normally avoids the need to specify that they are "private" if/when you add OpenMP directives.

Quote:

John D. McCalpin wrote:

I believe that if the pragma had been "omp parallel for" instead of just "omp parallel", the for loop index would have automatically been treated as private.

This mistake is common and it is unfortunate that the OpenMP syntax allows it.

This is exactly what happened in this case. I use the "parallel for" directive a lot, and I automatically assumed the induction variable would be private here as well. I won't make this mistake twice! Again, I am deeply sorry for this unfortunate, simple mistake.

Another reason for using for(int i=... is that the compiler then knows that "i" goes out of scope after the for statement, meaning better opportunities for optimization (registerization) of i.

Jim Dempsey

www.quickthreadprogramming.com
