parallel_for speedup question

Posted by vishketan

I am trying to use parallel_for to speed up the following simple code:

#include <cassert>
#include <cstdlib>   // atoi
#include <iostream>
#include <vector>
#include "tbb/tbb.h"
#include "tbb/scalable_allocator.h"
#include "tbb/parallel_for.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;
// vectors
typedef vector<double, scalable_allocator<double> > LatentVecType;
// matrices are vectors of vectors
typedef vector<LatentVecType, scalable_allocator<LatentVecType> > LatentMatType;
class Updater{
  int dim;
  int num_mat1_rows;
  int num_mat2_rows;
  LatentMatType *p_mat1;
  LatentMatType *p_mat2;
public:
  Updater(int _dim,
          int _num_mat1_rows,
          int _num_mat2_rows,
          LatentMatType *_p_mat1,
          LatentMatType *_p_mat2):dim(_dim),
    num_mat1_rows(_num_mat1_rows),
    num_mat2_rows(_num_mat2_rows),
    p_mat1(_p_mat1),
    p_mat2(_p_mat2){ }
  // body executed by parallel_for on each sub-range of rows of mat1
  void operator()(const blocked_range<int>& r) const {
    for(int i=r.begin(); i!=r.end(); i++){
      int index=i;
      LatentVecType& vec1=(*p_mat1)[i];
      for(int j=0; j<500; j++){
        // NOTE: as posted, index is never advanced or wrapped, so with
        // index == i it runs past num_mat2_rows; the statement that
        // updated it did not survive in the post.
        LatentVecType& vec2=(*p_mat2)[index];
        double dot=0.0;
        // first inner loop: dot product of vec1 and vec2
        for(int t=0; t<dim; t++)
          dot+=vec1[t]*vec2[t];
        // second inner loop: update of the vector
        for(int t=0; t<dim; t++){
          // [update statement not preserved in the post]
        }
      }
    }
  }
};
int main(int argc, char **argv) {
  task_scheduler_init init;
  if( argc < 2 ) {
    cout << "Usage: " << argv[0] << " <serial or parallel> (0 or 1) <grainsize> (int, default 1)" << endl;
    return 1;
  }
  int par = atoi(argv[1]);
  assert(par==0 || par==1);
  int grain = 1;
  if( argc > 2 )
    grain = atoi(argv[2]);
  assert(grain > 0);
  int dim=100;
  int num_mat1_rows=100000;
  int num_mat2_rows=20000;
  int iter_num=1;
  // allocate raw storage with the scalable allocator, then construct in place
  LatentMatType *p_mat1 = scalable_allocator<LatentMatType>().allocate(1);
  LatentMatType *p_mat2 = scalable_allocator<LatentMatType>().allocate(1);
  new (p_mat1) LatentMatType(num_mat1_rows, LatentVecType(dim));
  new (p_mat2) LatentMatType(num_mat2_rows, LatentVecType(dim));
  Updater U(dim, num_mat1_rows, num_mat2_rows, p_mat1, p_mat2);
  tick_count start_time = tick_count::now();
  for(int iter=0; iter<iter_num; iter++){
    cout << "iter: " << iter << endl;
    if( par )
      parallel_for(blocked_range<int>(0, num_mat1_rows, grain), U);
    else
      U(blocked_range<int>(0, num_mat1_rows));
  }
  double elapsed_seconds = (tick_count::now() - start_time).seconds();
  cout << "elapsed seconds: " << elapsed_seconds << endl;
  return 0;
}

On a 24-core machine (top shows that all cores are occupied) I only get a speedup of 2x to 4x. I am clearly missing something obvious. Please help.

On a side note, I noticed that computing the dot product is not very expensive, but the line that writes the update to the vector in the second inner loop is very expensive. Is it because, in general, writing to a memory location is more expensive than reading from it?
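To put the reported numbers in perspective, here is a minimal sketch that turns a measured speedup into parallel efficiency (speedup divided by core count). The function name `parallel_efficiency` is made up for this illustration and is not part of TBB.

```cpp
#include <cassert>

// Parallel efficiency = measured speedup / number of cores.
// 1.0 would mean perfect linear scaling.
// (parallel_efficiency is a hypothetical helper, not a TBB function.)
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    return speedup / cores;
}
```

With the figures from the question, a 2x to 4x speedup on 24 cores works out to roughly 8% to 17% efficiency, which is the kind of result one would expect when the cores are busy but mostly waiting on something shared.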



Posted by jimdempseyatthecove

The innermost loop is memory-bandwidth limited (very little computation can be performed in registers or kept local to the L1 cache).
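A rough way to see why the loop is bandwidth bound is to count floating-point operations per byte loaded. The sketch below does that for a single dot product, under the simplifying assumption that both vectors are streamed from memory once with no cache reuse (in the posted code vec1 is reused across the j loop, so real traffic is somewhat lower). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Flops per byte for one dot product of length dim.
// Assumption: each element of both vectors is loaded from memory exactly once.
// (dot_flops_per_byte is a hypothetical helper for illustration.)
double dot_flops_per_byte(std::size_t dim) {
    assert(dim > 0);
    const double flops = 2.0 * dim;                   // one multiply and one add per element
    const double bytes = 2.0 * dim * sizeof(double);  // two doubles loaded per element
    return flops / bytes;
}
```

For dim = 100 with 8-byte doubles this gives 200 flops over 1600 bytes, or 0.125 flops per byte. At that ratio a few cores can typically saturate the memory system, so adding more threads may not add much throughput.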

The compiler's optimizer should have fused the two inner loops; verify that it did. Also check the vectorization report to make sure the fused loop is vectorized. If not, then try the following code:

double dot=0.0;
for(int t=0; t < dim; t++) {
  dot += vec1[t]*vec2[t];
  // [update statement from the second loop; not preserved in the post]
}

Verify that the above gets vectorized (SSE or AVX)

Jim Dempsey
Posted by vishketan

Thanks for the vectorization tip. The simplified example computes dot and then does some updates to the vector in the second loop. In my real code I compute some quantity in the first loop and use it to update the vector in the second loop, so I need two loops. The first loop was getting vectorized but the second was not (strange). I added a pragma directive to vectorize the second loop, and it improved single-thread execution time by 20%. However, the multi-threaded version did not show much difference. Are there any general guidelines on how to deal with such memory-bandwidth-limited programs in a multi-threaded setting?
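The dependence described here, where the second loop needs the finished result of the first, can be sketched as follows. The update rule `vec1[t] += step * dot * vec2[t]`, the `step` parameter, and the function name are all hypothetical, since the real update is not shown in the thread; plain `std::vector<double>` is used instead of the TBB-allocated types to keep the sketch dependency-free.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass kernel: pass 1 reduces to a scalar, pass 2 uses that scalar to
// update vec1. Pass 2 cannot start until pass 1 has finished, so the loops
// cannot be fused. The update rule below is a made-up placeholder.
double dot_then_update(std::vector<double>& vec1,
                       const std::vector<double>& vec2,
                       double step) {
    assert(vec1.size() == vec2.size());
    double dot = 0.0;
    for (std::size_t t = 0; t < vec1.size(); ++t)   // pass 1: reads only
        dot += vec1[t] * vec2[t];
    for (std::size_t t = 0; t < vec1.size(); ++t)   // pass 2: reads and writes vec1
        vec1[t] += step * dot * vec2[t];
    return dot;
}
```

Both passes are simple unit-stride loops, so each is a candidate for vectorization on its own even though they have to stay separate.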



Posted by jimdempseyatthecove

Use VTune to aid in finding out what is going on.

Jim Dempsey
