parallel_for speedup question

vishketan

I am trying to use parallel_for to speed up the following simple code:

#include <iostream>
#include <vector>
#include <cassert>
#include <cstdlib>
#include "tbb/tbb.h"
#include "tbb/scalable_allocator.h"
#include "tbb/parallel_for.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;
// vectors
typedef vector<double, scalable_allocator<double> > LatentVecType;
// matrices are vectors of vectors
typedef vector<LatentVecType, scalable_allocator<LatentVecType> > LatentMatType;
class Updater{
 int dim;
 int num_mat1_rows;
 int num_mat2_rows;
 LatentMatType *p_mat1;
 LatentMatType *p_mat2;
public:
 Updater(int _dim, 
 int _num_mat1_rows, 
 int _num_mat2_rows, 
 LatentMatType *_p_mat1,
 LatentMatType *_p_mat2):dim(_dim), 
 num_mat1_rows(_num_mat1_rows),
 num_mat2_rows(_num_mat2_rows),
 p_mat1(_p_mat1),
 p_mat2(_p_mat2){ } 
 void operator()(const blocked_range<int>& r) const {
 for(int i=r.begin(); i!=r.end(); i++){
 int index=i;
 LatentVecType& vec1=(*p_mat1)[i]; 
 for(int j=0; j<500; j++){
 LatentVecType& vec2=(*p_mat2)[index];
 double dot=0.0;
 for(int t=0; t < dim; t++)
 for(int t=0; t<dim; t++){
int main(int argc, char **argv) {
 task_scheduler_init init;
 if( argc < 2 ) {
  cout << "Usage: " << argv[0] << " <serial or parallel> (0 or 1) <grainsize> (int, default 1)" << endl;
  return 1;
 }
 int par = atoi(argv[1]);
 assert(par==0 || par==1);
 int grain = 1;
 if( argc > 2 ) grain = atoi(argv[2]);
 assert(grain > 0);
 int dim=100;
 int num_mat1_rows=100000; 
 int num_mat2_rows=20000;
 int iter_num=1;
 LatentMatType *p_mat1 = scalable_allocator<LatentMatType>().allocate(1);
 LatentMatType *p_mat2 = scalable_allocator<LatentMatType>().allocate(1);
 new (p_mat1) LatentMatType(num_mat1_rows, LatentVecType(dim));
 new (p_mat2) LatentMatType(num_mat2_rows, LatentVecType(dim));
 Updater U(dim, num_mat1_rows, num_mat2_rows, p_mat1, p_mat2);
 tick_count start_time = tick_count::now();
 for(int iter=0; iter<iter_num; iter++){
  cout << "iter: " << iter << endl;
  if( par )
   parallel_for(blocked_range<int>(0, num_mat1_rows, grain), U); 
  else
   U(blocked_range<int>(0, num_mat1_rows));
 }
 double elapsed_seconds = (tick_count::now() - start_time).seconds();
 cout << "elapsed seconds: " << elapsed_seconds << endl;
 return 0;
}

On a 24-core machine (top shows that all cores are occupied) I only get a 2-4x speedup. I am clearly missing something obvious. Please help.

On a side note, I noticed that computing the dot product is not very expensive but this line 


is very expensive. Is it because, in general, writing to a memory location is more expensive than reading from it?



jimdempseyatthecove

The innermost loop is memory-bandwidth limited (very little computation can be performed with registers and data localized to the L1 cache).

The compiler's optimizer should have fused the two inner loops; verify that it did. Also get the vectorization report to ensure that the fused loop is vectorized. If not, then try the following code:

double dot=0.0;
for(int t=0; t < dim; t++) {

Verify that the above gets vectorized (SSE or AVX)

Jim Dempsey
vishketan

Thanks for the vectorization tip. The simplified example computes dot and then does some updates to the vector in the second loop. In my real code I compute some quantity and use it to update the vector in the second loop, hence I need two loops. The first loop was getting vectorized but the second was not (strange). I added a pragma directive to vectorize the loop, and it improved single-threaded execution time by 20%. However, the multi-threaded version did not show much difference. Are there any general guidelines on how to deal with such memory-bandwidth-limited programs in a multi-threaded setting?



jimdempseyatthecove

Use VTune to aid in finding out what is going on.

Jim Dempsey