Slow memory access

Hi! In the code below, I have to write each calculated value into the array p[k] inside the loop over the k index, but accessing the array takes a long time. I measured the run time on my machine (VS2008, release mode, i7 (6 cores), Windows 7):

- empty loop (no calculation and no array access): 17 ms
- calculation part (1) only: 17 ms
- no calculation part, just "p[k] = 1000" in the loop: 21 ms
- calculation part (1) + "p[k] = 1000": 57 ms
- calculation part (1) + "p[k] = val": 123 ms

I cannot understand why it takes so long to access the array "p" in the for loop.

=======================================
#include "tbb/tbb.h"
#include "tbb/blocked_range.h"
#include "tbb/blocked_range2d.h"
#include "tbb/parallel_for.h"
#include "tbb/tick_count.h"
#include "tbb/scalable_allocator.h"

using namespace std;
using namespace tbb;

/////
class Average : public CPartLoop
{
public:
    ...
    int hh, ww;
    void operator() (const blocked_range2d& range) const;
    void ParallelProcessing(int hh, int ww);
};

void Average::operator() (const blocked_range2d& range) const
{
    register int i,j,k;
    float *p = scalable_allocator().allocate(100);
    for(i=range.rows().begin(); i != range.rows().end(); ++i)
    {
        int ii = (int)(i/50);
        int jj = i-ii*50;
        for(j=range.cols().begin(); j != range.cols().end(); ++j)
        {
            int dd = (int)(j/36);
            int ss = j-dd*36;
            for(k=0; k {
                //// (1) math calculation part ////
                int nr = (int)kp.r[k];
                int nc = (int)kp.c[k];
                if( ii+nr>hh-1 || ii+nr<0 || jj+nc > ww-1 || jj+nc<0)
                    continue;
                int index1 = (ii+nr)*ww;
                int tr = A[index1+(jj+nc)];
                int tc = B[index1+(jj+nc)];
                float te = C[index1+(jj+nc)];
                float aa = dd + (kp.d[k]*r2d);
                float bb = kp.e[k];
                float val = (tr+tc)*cos(aa*d2r)*(bb * te);
                ////// end of (1) ///////
                p[k] = val; // 1000 // (2) array access part
            }
        }
    }
    tbb::scalable_allocator().deallocate(p,100);
}

void Average::ParallelProcessing(int hh, int ww)
{
    this->hh = hh;
    this->ww = ww;
    parallel_for(blocked_range2d(0,50*50,4,0,360,4), *this);
}

Before we look at anything else, try to "hoist" the range boundary out of the loop (doing this for just the inner loop may well be enough): assign the value returned by end() to a local variable and use that variable in the loop condition instead of calling end() each time. The compiler may then be able to perform better optimisations.

(Added 2012-04-03) Well, maybe not here...
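
For reference, here is a minimal sketch of what hoisting the boundary means, kept independent of TBB so it compiles on its own. IndexRange, sum_unhoisted and sum_hoisted are hypothetical names made up for this illustration; they are not part of TBB or of the posted code. Whether the change makes any measurable difference depends on the compiler, and it may well not be the cause of the slowdown here.

```cpp
#include <vector>

// Hypothetical stand-in for a range object with begin()/end() accessors.
// This is NOT a TBB class; it only mimics the shape of the calls.
struct IndexRange {
    int first;
    int last;
    int begin() const { return first; }
    int end() const { return last; }
};

// Boundary queried in the loop condition on every iteration.
float sum_unhoisted(const IndexRange& range, const std::vector<float>& data) {
    float total = 0.0f;
    for (int j = range.begin(); j != range.end(); ++j) {
        total += data[j];
    }
    return total;
}

// Boundary read once into a local variable ("hoisted") before the loop,
// so the condition compares against a plain local value.
float sum_hoisted(const IndexRange& range, const std::vector<float>& data) {
    const int end = range.end();
    float total = 0.0f;
    for (int j = range.begin(); j != end; ++j) {
        total += data[j];
    }
    return total;
}
```

Both functions compute the same result; the only difference is where the boundary is read. In the posted code the same idea would apply to the conditions of the i and j loops.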

Could you provide a complete test case? The posted code does not compile as it stands. Here is the list of compilation errors:

error C2504: 'CPartLoop' : base class undefined
error C2065: 'kp' : undeclared identifier
error C2065: 'A' : undeclared identifier
error C2065: 'B' : undeclared identifier
error C2065: 'C' : undeclared identifier
error C2065: 'r2d' : undeclared identifier
error C2065: 'd2r' : undeclared identifier