poor performance computing

poor performance computing

Hello,

Recently, I started learning TBB. I tried to implement RBF calculation using the following program. Unfortunately, but working on 4 cores obtained performance similar to or worse than serial programs. I will be grateful for a hint on how to improve the program.

Thanks,
Stan

#include <iostream>
#include "parallel_1.h"
#include <tbb\tick_count.h>
#include <tbb\parallel_reduce.h>
#include <tbb\task_group.h>
#include <mkl_vml.h>
#include <mkl_blas.h>

#define GRAIN_SIZE 1000
#define N 100
#define M 200
const int DIM = 200*100;

class Norm2 {
      float *_a, *_b;
      double _ss;
   public:     
      Norm2(float *a, float *b) : _a(a), _b(b), _ss(0) {}
      Norm2( Norm2& x, split ) : _a(x._a), _b(x._b), _ss(0) {}     
      void operator( )( const blocked_range<size_t>& r )
      {
         for( size_t i=r.begin(); i!=r.end( ); i++ ){
            _ss += (double)(_b[i]-_a[i])*(_b[i]-_a[i]);
         }  
      }
      void join( const Norm2& y ) { _ss += y._ss; }
      double resut(){ return sqrt(_ss); }
};

class Gauss{
   int _n;
   float *_x, *_y;
   float *_a;
public:
   Gauss(int n, float *x, float *a, float *y)
      : _n(n), _x(x), _a(a), _y(y) { }

   void operator()(const blocked_range<size_t>& r) const {  
      for(size_t i =r.begin(); i<r.end(); i++) {
         Norm2 d(_x, &_a[_n*i]);
         parallel_reduce(blocked_range<size_t>(0, _n, GRAIN_SIZE), d, auto_partitioner());
         _y[i] = (float)exp(-d.resut()*d.resut()/2.0);
      }     
   }
};

void serialGauss(float* x, float *a, float *y, int m, int n )
 {
   double ss;
    for(int i=0; i<m; i++){
       ss = 0;
       for(int j=0; j<n; j++)
         ss +=(a[n*i+j]-x[j])*(a[n*i+j]-x[j]);
       ss = sqrt(ss);
       y[i] = (float)exp(-ss*ss/2.0);
    }
 }
 
void mklGauss(float* x, float *a, float *y, int m, int n )
 {
   float ss;
   int one = 1;
   float *v = (float*)malloc(n*sizeof(float));
   for(int i=0; i<m; i++) {
      vsSub(n, x, &a[m*i], v);
      ss = snrm2(&n, v, &one);
      y[i] = (float)exp(-ss*ss/2.0);
    }
   free(v);
 }

int main(int argc, char *argv[])
{
   int i, j;
   float a[DIM];
   float x[N];
   float y[M];
   int m = M;
   int n = N;

    for(i=0; i<M; i++) {
      for(j=0; j<N; j++)  a[N*i+j] = 2.5f;
    }  
   for(i=0; i<N; i++) x[i] =2.2;

   tick_count t0, t1;
   t0 = tick_count::now();  
   serialGauss(x, a, y, M, N);
   t1 = tick_count::now();
   printf("serial:   %.3g msec\n", (t1-t0).seconds()*1000);
   for(i=0; i<10; i++) printf("%8.4f", y[i]);
   printf("\n");
   //   
   for(i=0; i<M; i++) y[i] =0;
   t0 = tick_count::now();  
   mklGauss(x, a, y, M, N);
   t1 = tick_count::now();
   printf("mkl:     %.3f msec\n", (t1-t0).seconds()*1000);
   for(i=0; i<10; i++) printf("%8.4f", y[i]);
   printf("\n");
   //   
    t0 = tick_count::now();  
   parallel_for(blocked_range<size_t>(0, m, GRAIN_SIZE),  Gauss(n, x, a, y), auto_partitioner());
    t1 = tick_count::now();
   printf("parallel: %.3f msec\n", (t1-t0).seconds()*1000);
   for(i=0; i<10; i++) printf("%8.4f", y[i]);
   printf("\n");
  
   return 0;
}

serial:   0.899 msec  
0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111
mkl:     19.049 msec  
0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111
parallel: 0.967 msec  
0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111  0.0111

 

18 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

Hello, Stansy,

I am not as familiar with the various partitioners and their effects as some, but I can give you part of the answer.

The section on chunking in the Threading Building Blocks Documentation gives this caveat:

Caution

Typically a loop needs to take at least a million clock cycles to make it worth using parallel_for. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.

The test you are running takes about 90 microseconds serially, so the amount of work in the loop makes using a parallel loop construct not very effective.

The version of the loop you made in the example has two dimensions of parallelism (the outer parallel_for and the inner parallel_reduce.)  This will break the work up into smaller chunks, increasing the parallel overhead.

The question I do not know the answer to is the combination of specifying the chunk-size in the ranges and specifying auto_partitioner in the parallel_for and parallel_reduce.  I have the feeling the grainsize is ignored by the auto_partitioner, but I'd have to look at the code to make sure.

I am attaching a version of the program with some changes.  I don't have MKL installed on the machine I'm using, so I commented that loop out.  I am also running each loop five times and averaging the time spent; this reduces the variance caused by machine load.

par1_gauss is a loop body that does the simplest decomposition of the problem; it only uses parallel_for and does the inner reduction serially.  This has reasonable performance even for the problem size you originally specified.  The test program runs it with auto_partitioner and with a series of chunksizes from 1 to 64.

The last set of loops runs the TBB version you wrote, but with a grainsize passed in for the parallel_reduce, as well as grainsizes specified for the outer parallel_for.  (It also runs both the inner and outer construct with the auto_partitioner.)

I am not taking into account cache effects (the tests are mostly, if not all, run with hot caches).

I hope this answer helps you; if you have any questions, please don't hesistate to ask.

Regards,
Chris

Fichiers joints: 

Fichier attachéTaille
Télécharger test-gaussian.cpp6.12 Ko

Building on Christopher's analysis:

auto_partitioner obeys is_divisible(), which obeys grainsize, which shouldn't be called chunksize because a chunk is a subrange that is submitted by the parallel construct to a Body for execution. See elsewhere for recommendations about good grainsize values: I always forget if it's 10000 or 100000 machine instructions, which then has a complicated relation to number of elements that essentially depends on the particular algorithm, and you must also take into account that grainsize is a lower limit on the double of chunk size (given a large-enough initial range).

Nesting auto_partitioner-based parallel constructs defeats the purpose, but unfortunately it is currently the responsibility of the programmer to avoid such a situation. It is not specified how much hyperthreading exists on those 4 cores, but the more parallel a machine the more overhead auto_partitioner causes on insufficiently large initial ranges, which will almost certainly be the case in the inner construct even if not in the outer one. Quite likely you have 8 hardware threads, and if I'm not mistaken that will make auto_partitioner try to execute about 128 chunks in the outer construct, and 16384 in the inner construct, clearly not an ideal situation. With, e.g., 48 hardware threads, not unrealistic these days, that becomes 768 and 589824.

So the advice is twofold: never ever nest auto_partitioner, and, if you do and/or if your initial range may not be sufficiently large (also think about highly parallel machines), choose grainsize carefully.

Hi Chris and Raf,

Thank you very much for the interesting comments and suggestions. It follows that it is very important to choose the right particle size. For simple procedures, it is easy even for novice programmers. In the case of nested parallelism is not so obvious and easy selection. I think so, although I do not practice. Interesting observations include on this topic, I found in the book "Intel Threading Building Blocks" on page 37: "Set the grainsize parameter higher than Necessary. Setting it to 10,000 is usually a good starting point."  However, in the case of my first tests this rule does not work completely.
I attach a simple program, and I have a question regarding the use of lambda expressions:
Do I need to declare a partitioner and where to do it in a lambda expression?

Regards,
Stan

Fichiers joints: 

Fichier attachéTaille
Télécharger rbf-test.cpp633 octets
Télécharger rbfnet.h4.73 Ko

Apparently the grainsize advice in the online User Guide is for about 100,000 instructions (section "Controlling Chunking"), but I couldn't get the Search function to work to see whether the advice about "a good starting point" is still there somewhere.

The default auto_partitioner is usually a good choice, but you can always specify another one as an additional argument to the template function, as specified in the Reference Manual. In this case your need is really for a grainsize parameter, which might as well be infinite in inner loops, i.e., using serial execution instead.

It would seem to be beneficial for an auto_partitioner to automatically detect whether it is being used in an inner loop. Maybe the TBB team could comment whether this has been tried yet? It would probably involve a context-specific variable, with context managed on the program's logical stack (as opposed to a specific worker's stack), which seems an interesting topic in and of itself (unless I missed something).

Stan,

Please review mklGauss function. You have two calls to Heap memory management functions, that is malloc and free. So, your performance evaluations are inaccurate and call overhead to these functions is significant.

Thanks.

Calling malloc and free can be very expensivein terms of cpu cycles.Beside the call/ret instructions overhead there is also OS dependent implementation which can be linked to heap allocation/deallocation routines.Moreover a block of memory must be allocated from the heap and marked as beign in use.This operation can take significant time to complete.

I have read somewhere that calling malloc 100k times and allocating and freeing 512 slots of memory of length 1-163840 bytes took on average 480 cycles to complete(it was    measured with rdtsc).

Building on Sergey's analysis (sorry for not having done my own research about the program):

Please consult the documentation about TBB's scalable memory allocator.

Thank you all once again for the time spent and interesting discussions. I can not agree with all the comments. In this example, the allocation vector has no significant impact on the performance of the method with the functions of MKL. A few months ago I read dissertations work in which the author used the MKL functions for calculations using a feedforward NNs - I do not remember the details. I'm not a fan of this idea but I wanted to check it out. I think, such algorithm is inefficient because each step requires the use of a separate MKL function.
In our deliberations we missed the most important fact - the result :). Unfortunately, I wrote a program that returns different values depending on grain size. Because I could not solve this problem yet I will be grateful for any help.
The attached example shows that the resulting speed-up for the MKL, OpenMP, and TBB is as follows: 1, 5, 13.

Regards,
Stan

The problem was solved.

Fichiers joints: 

Fichier attachéTaille
Télécharger rbf-test.cpp969 octets
Télécharger rbfnet.h8.9 Ko

The problem was solved, but still do not know how to fully use parallelism.

>>...Because I could not solve this problem yet I will be grateful for any help.
>>The attached example shows that the resulting speed-up for the MKL, OpenMP, and TBB is as follows: 1, 5, 13.

One more question. I'd like to confirm that MKL version of your processing 13x slower than TBB version, is that correct?

Yes, it is true. Of course, only used a neural network architecture: inputs 1000, outputs 4, neurons 2000.

Stan,

I see that in mklTrain you call serialForwardPropagate, as well as in in serialTrain, while mklForwardPropagate seems not called. Is it expected?

Thank you very much Alexandr. You have absolutely right. Fixed the bug in the program and correct speed-up for MKL, OpenMP and TBB is as follows: ~x70, x5, x13.

Stansy,

You are trying to use too much parallelism. I would strongly suggest that you .NOT. use nested parallelism. Example for omp (I will let you modify remainder)

void ompForwardPropagate(const float *p)
{
  int i, j, k;
  float sum;
  memcpy(_p, p, _np*sizeof(float));
  #pragma omp parallel default (none) private(i, j, k, sum)
  {
    // remove "parallel" on following statement (you are inside your parallel region)
    #pragma omp for 
    for(j = 0; j < _nn; j++) {
      sum = 0.f;
      // removed #pragma omp parallel for...
      for(k=0; k<_np; k++)
        sum += (_ct[_np*j+k]-_p[k])*(_ct[_np*j+k]-_p[k]);
      sum = sqrt(sum);
      _z[j] = sum;
      _s[j] = gauss(sum, _sigma[j]);    
    }          
   
    // remove "parallel" on following statement (you are inside your parallel region)
    #pragma omp for 
    for(i=0; i<_nt; i++){
      sum = 0;
      // removed #pragma omp parallel for...
      for(j = 0; j < _nn; j++)     
        sum += _wt[_nn*i+j]*_s[j];
      _t[i] = purelin(sum + _b[i]);
    }
  } // #pragma omp parallel
} // ompForwardPropagate

Your original code had 3 nest levels:

#pragma omp parallel default (none) private(i, j, k, sum)
{
  // 1st level
  //          vvvvvvvv Note parallel
  #pragma omp parallel for 
  for(j = 0; j < _nn; j++) {
    // 2nd level
    //          vvvvvvvv Note parallel
    #pragma omp parallel for reduction(+: sum)
    for(k=0; k<_np; k++)
 sum += ... // 3rd level
    // back to 2nd level
  }
  // back to 1st level
  //          vvvvvvvv Note parallel
  #pragma omp parallel for 
  for(i=0; i<_nt; i++){
    // 2nd level
    //          vvvvvvvv Note parallel
    #pragma omp parallel for reduction(+: sum)
    for(j = 0; j < _nn; j++)     
        sum += ... // 3rd level
    // back to 2nd level
  }
  // back to 1st level
}
// back to main level

Note, the proper way to nest this is (two methods)

a) keep outer region, remove "parallel" from next region

#pragma omp parallel default (none) private(i, j, k, sum)
{
  // 1st level (all threads running here)
  //          vvvvvvvv Note lack of parallel == partition using threads of current region
  #pragma omp for 
  for(j = 0; j < _nn; j++) {
    // still 1st level (but in context of 1 thread of 1st level)
    //          vvvvvvvv Note parallel kept to create new nest level
    #pragma omp parallel for reduction(+: sum)
    for(k=0; k<_np; k++)
 sum += ... // 2rd level
    // back to 1st level
  }
  // continue in 1st level
  //          vvvvvvvv Note lack of parallel
  #pragma omp for 
  for(i=0; i<_nt; i++){
    // still 1st level (but in context of 1 thread of 1st level)
    //          vvvvvvvv Note parallel kept to create new nest level
    #pragma omp parallel for reduction(+: sum)
    for(j = 0; j < _nn; j++)     
        sum += ... // 2nd level
    // back to 1st level
  }
  // continue in 1st level
}
// back to main level

or b) Remove outer region, keep "parallel" on next

// remove outer parallel region
{
  // main level
  //          vvvvvvvv Note parallel
  #pragma omp parallel for 
  for(j = 0; j < _nn; j++) {
    // 1st level
    //          vvvvvvvv Note parallel
    #pragma omp parallel for reduction(+: sum)
    for(k=0; k<_np; k++)
 sum += ... // 2nd level
    // back to 1st level
  }
  // back to main level
  //          vvvvvvvv Note parallel
  #pragma omp parallel for 
  for(i=0; i<_nt; i++){
    // 2nd level
    //          vvvvvvvv Note parallel
    #pragma omp parallel for reduction(+: sum)
    for(j = 0; j < _nn; j++)     
        sum += ... // 3rd level
    // back to 2nd level
  }
  // back to 1st level
}
// back to main level

**** Note,
Use of nested regions in this program is likely counter productive.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,

Thank you very much for taking the time and you present a clear basis for the use of parallelism. Using your guidance makes it easy to increase 4x speed of computation for OpenMP (speedup ~20). One note: I want to implement nested parallelism because I do not know what network architecture will be used. For example, a simple regression requires one output while filtering one input. For this program, we can skip the reduction without loss of efficiency but for the other networks, it may be significant.
In the second method, you missed declarations of private variables for instructions #pragma omp parallel for

Regards,
Stan

Thanks for the catch on the omission of the private. You could also fix by scoping the loop control variable

for(int i=0;...

In the case where _nt is very small (1, 2, 3) and _nn is reasonably large consider inverting the loops:

[/cpp]

...
switch(_nt)
{
case 1:
  {
    // _nt = 1
    sum = 0; 
    #pragma omp for reduction(+: sum)
    for(int j = 0; j < _nn; j++)      
      sum += _wt[j]*_s[j]; 
    _t[0] = purelin(sum + _b[0]); 
  }
  break;
case 2:
  {
    // _nt = 2
    sum = 0; 
    #pragma omp for reduction(+: sum)
    for(int j = 0; j < _nn; j++)      
      sum += _wt[j]*_s[j] + _wt[_nn+j]*_s[j]; 
    _t[0] = purelin(sum + _b[0]); 
    _t[1] = purelin(sum + _b[1]); 
  }
  break;
case 3:
  {
    // _nt = 3
    sum = 0; 
    #pragma omp for reduction(+: sum)
    for(int j = 0; j < _nn; j++)      
      sum += _wt[j]*_s[j] + _wt[_nn+j]*_s[j] + _wt[_nn*2+j]*_s[j]; 
    _t[0] = purelin(sum + _b[0]); 
    _t[1] = purelin(sum + _b[1]); 
    _t[2] = purelin(sum + _b[2]); 
  }
  break;
default:
  {
    // _nt = 4, 5, 6, ...
    #pragma omp for 
    for(int i=0; i<_nt; i++){ 
      sum = 0; 
      // removed #pragma omp parallel for... 
      for(j = 0; j < _nn; j++)      
        sum += _wt[_nn*i+j]*_s[j]; 
      _t[i] = purelin(sum + _b[i]); 
     } 
  }
} // switch(_nt)

[/cpp]

Jim Dempsey

www.quickthreadprogramming.com

In the case where _nt is very small (1, 2, 3) and _nn is reasonably large consider inverting the loops:
...
switch(_nt)
{
case 1:
  {
    // _nt = 1
    sum = 0;  
    #pragma omp for reduction(+: sum)
    for(int j = 0; j < _nn; j++)       
      sum += _wt[j]*_s[j];  
    _t[0] = purelin(sum + _b[0]);  
  }
  break;
case 2:
  {
    // _nt = 2
    sum = 0;  
    #pragma omp for reduction(+: sum)
    for(int j = 0; j < _nn; j++)       
      sum += _wt[j]*_s[j] + _wt[_nn+j]*_s[j];  
    _t[0] = purelin(sum + _b[0]);  
    _t[1] = purelin(sum + _b[1]);  
  }
  break;
case 3:
  {
    // _nt = 3
    sum = 0;  
    #pragma omp for reduction(+: sum)
    for(int j = 0; j < _nn; j++)       
      sum += _wt[j]*_s[j] + _wt[_nn+j]*_s[j] + _wt[_nn*2+j]*_s[j];  
    _t[0] = purelin(sum + _b[0]);  
    _t[1] = purelin(sum + _b[1]);  
    _t[2] = purelin(sum + _b[2]);  
  }
  break;
default:
  {
    // _nt = 4, 5, 6, ...
    #pragma omp for  
    for(int i=0; i<_nt; i++){  
      sum = 0;  
      // removed #pragma omp parallel for...  
      for(j = 0; j < _nn; j++)       
        sum += _wt[_nn*i+j]*_s[j];  
      _t[i] = purelin(sum + _b[i]);  
     }  
  }
} // switch(_nt)

Jim Dempsey

www.quickthreadprogramming.com

Laisser un commentaire

Veuillez ouvrir une session pour ajouter un commentaire. Pas encore membre ? Rejoignez-nous dès aujourd’hui