I cannot seem to speed-up further -- could you please criticize my code?

Hi,

I tried with the following compiler options, and then I spent a few days tweaking the code for further speedup. No matter how I experimented with the settings, I couldn't get any further speedup. I decided to jump out of my old paradigm and learn how other people write high-performance code. Could anybody please feel free to criticize my code and give me some suggestions? The more you criticize, the more I can learn. Thank you very much for your help!

It's very confusing; for example, see the following quote from the build log. Is the loop parallelized or not, at the end of the day? I have two Xeon 2.4 GHz CPUs supporting SSE2. Thanks!

. ry2.cpp(100): (col. 2) remark: loop was not parallelized: insufficient computational work.
. ry2.cpp(108): (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
. ry2.cpp(110): (col. 3) remark: loop was not parallelized: insufficient inner loop.

----------------------------------------

#include <math.h>
#include "imsl.h"
#include "mex.h"

#define MaxNumEqns 1000000

//Intel C++ compiler options: /O3 /Qopt-report /Qopt-report-phase hlo /QxW /Qopt-report3 /Qparallel /Qpar-report3

#pragma comment(linker,"/NODEFAULTLIB:libc.lib")

static double kappa, theta, sigma2, f, v, abserr, x0, T0;
static double *pps, *pys, *pV, *pT;
static int nps, nys, nV, nT;
static double *res_Re, *res_Im;
static double y[MaxNumEqns];

void fcn (int neq, double t, double y[], double yprime[])
{
//y: 0 -- alpha_real,
// 1 -- alpha_imag,
// 2 -- beta_real,
// 3 -- beta_imag.

for (int i=0; i<neq/4; i++) {
double sI=0, sR=0;

for (int k=0; k<nys; k++) {
double tmp=1+y[i*4+2]*(-pys[k]);
double tmp2=tmp*tmp;
double tmp3=y[i*4+3]*(-pys[k]);
double tmp4=tmp3*tmp3;
sR=sR+pps[k]*tmp/(tmp2+tmp4);
sI=sI+pps[k]*y[i*4+3]*(-pys[k])/(tmp2+tmp4);
}

yprime[i*4+0] = kappa*theta*y[i*4+2]-f+f*sR;
yprime[i*4+1] = kappa*theta*y[i*4+3]-f*sI;
yprime[i*4+2] = -kappa*y[i*4+2]+0.5*sigma2*(y[i*4+2]*y[i*4+2]-y[i*4+3]*y[i*4+3]);
yprime[i*4+3] = pV[i] - kappa*y[i*4+3]+sigma2*y[i*4+2]*y[i*4+3];

}
}

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray*prhs[] )
{
//v, T, kappa, theta, sigma2, f, ys, ps, x0, T0, abserr

//v and T can be vectors, ys and ps are vectors.

kappa=*(mxGetPr(prhs[2]));
theta=*(mxGetPr(prhs[3]));
sigma2=*(mxGetPr(prhs[4]));
f=*(mxGetPr(prhs[5]));

nV=mxGetNumberOfElements(prhs[0]); //Number of "v"'s

pV=mxGetPr(prhs[0]);

nT=mxGetNumberOfElements(prhs[1]); //Number of "T"'s

pT=mxGetPr(prhs[1]);

nps=mxGetNumberOfElements(prhs[7]); //Number of "p"'s

pps=mxGetPr(prhs[7]);

nys=mxGetNumberOfElements(prhs[6]); //Number of "y"'s
//08/10/2007: Note: due to a typo, the y's have to be negated before passing into this function.
//08/11/2007: Note: the above restriction has been removed and the negation is no longer needed.

pys=mxGetPr(prhs[6]);

x0=*(mxGetPr(prhs[8]));

T0=*(mxGetPr(prhs[9]));

abserr=*(mxGetPr(prhs[10]));

plhs[0]=mxCreateDoubleMatrix(nV, nT, mxCOMPLEX); //The solutions of A, "v" is the row, "T" is the col.

plhs[1]=mxCreateDoubleMatrix(nV, nT, mxCOMPLEX); //The solutions of B, "v" is the row, "T" is the col.

res_Re=mxGetPr(plhs[0]); //These are the outputs to be passed back to Matlab

res_Im=mxGetPi(plhs[0]);

int nstep;
char *state;

double t = 0.0; /* Initial time */

for (int i=0; i<4*nV; i++) y[i]=0;

imsl_d_ode_runge_kutta_mgr(IMSL_ODE_INITIALIZE, &state, IMSL_TOL, abserr, 0);

for (int j=0; j<nT; j++)
imsl_d_ode_runge_kutta(4*nV, &t, pT[j], y, state, fcn); //advance the solution to T = pT[j]

imsl_d_ode_runge_kutta_mgr(IMSL_ODE_RESET, &state, 0);

for (int j=0; j<nT; j++) {
for (int i=0; i<nV; i++) {
res_Re[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*cos(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
res_Im[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*sin(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
}
}

return;
}

Build log:
---------------------------------------

------ Rebuild All started: Project: try2, Configuration: Release Win32 ------

Deleting intermediate files and output files for project 'try2', configuration 'Release|Win32'.
Compiling with Intel C++ 10.0.025 [IA-32]... (Intel C++ Environment)
icl: command line warning #10120: overriding '/O2' with '/O3'
icl: command line warning #10121: overriding '/Qvc8' with '/Qvc7.1'
stdafx.cpp
DllMain.cpp
procedure: DllMain
procedure: DllMain
try2.cpp
. ry2.cpp(95): warning #177: variable "nstep" was declared but never referenced
int nstep;
^

. ry2.cpp(12): warning #177: variable "v" was declared but never referenced
static double kappa, theta, sigma2, f, v, abserr, x0, T0;
                                       ^
procedure: mexFunction
procedure: mexFunction

HLO REPORT LOG OPENED ON Sat Aug 11 16:45:23 2007

<. ry2.cpp;-1:-1;hlo;_mexFunction;0>
High Level Optimizer Report (_mexFunction)
Adjacent Loops: 3 at line 100
Unknown loop at line #104
QLOOPS 4/4 ENODE LOOPS 4 unknown 1 multi_exit_do 0 do 3 linear_do 3
LINEAR HLO EXPRESSIONS: 38 / 61
------------------------------------------------------------------------------

. ry2.cpp(100): (col. 2) remark: loop was not parallelized: insufficient computational work.
. ry2.cpp(108): (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
. ry2.cpp(110): (col. 3) remark: loop was not parallelized: insufficient inner loop.
. ry2.cpp(100): (col. 2) remark: LOOP WAS VECTORIZED.
. ry2.cpp(110): (col. 3) remark: LOOP WAS VECTORIZED.
. ry2.cpp(110): (col. 3) remark: LOOP WAS VECTORIZED
<. ry2.cpp;104:104;hlo_scalar_replacement;in _mexFunction;0>
#of Array Refs Scalar Replaced in _mexFunction at line 104=1

<. ry2.cpp;108:108;hlo_linear_trans;_mexFunction;0>
Loop Interchange not done due to: User Function Inside Loop Nest
Advice: Loop Interchange, if possible, might help Loopnest at lines: 108 110
: Suggested Permutation: (1 2 ) --> ( 2 1 )
procedure: fcn
procedure: fcn

<. ry2.cpp;-1:-1;hlo;?fcn@@YAXHNQAN0@Z;0>
High Level Optimizer Report (?fcn@@YAXHNQAN0@Z)
QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2
LINEAR HLO EXPRESSIONS: 69 / 101
------------------------------------------------------------------------------
.
. ry2.cpp(26): (col. 2) remark: loop was not parallelized: existence of parallel dependence.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 41.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 41, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 40.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and (unknown) line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and (unknown) line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and (unknown) line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 37.
. ry2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 37.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and (unknown) line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and (unknown) line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and (unknown) line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 34.
. ry2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 34.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and (unknown) line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and (unknown) line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and (unknown) line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 40.
. ry2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 41.
. ry2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 42.
. ry2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 32.
. ry2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 43.
. ry2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 32.
. ry2.cpp(30): (col. 3) remark: loop was not parallelized: insufficient computational work.
. ry2.cpp(30): (col. 3) remark: LOOP WAS VECTORIZED

<. ry2.cpp;26:26;hlo_linear_trans;?fcn@@YAXHNQAN0@Z;0>
Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)
Advice: Loop Interchange, if possible, might help Loopnest at lines: 26 30
: Suggested Permutation: (1 2 ) --> ( 2 1 )

...

---------------------- Done ----------------------

Rebuild All: 1 succeeded, 0 failed, 0 skipped

Oops, I got into huge trouble when I ran it from within Matlab. It is a DLL that is callable from within Matlab.

Please see the screen-shots below,

http://img522.imageshack.us/img522/9917/uuuty2.jpg

It even terminates my Matlab, and I've lost all my data.

What's wrong? What can I do now?

Thanks a lot!

I noticed that if I remove "hlo" in the ICC compiler setting, then it can run without problem...

What's wrong?

Thanks!

It seems that I have to further remove the "QxW" and/or "QxN"...

That is to say, I can't use "QxW", "QxN", "hlo", etc. at all.

They are like forbidden keywords. Otherwise the error window will pop up...

Help!

It sounds like you have a library clash. Could it be because of a binary compatibility issue between Matlab's libraries and the binary ICC produced?

I presume that by hlo you mean /Qopt-report-hlo, as that's the only switch I can find with "hlo" in it. That shouldn't do anything in terms of binary compatibility, it's just a compile time diagnostic output.

I don't think you can use more than one /Qx option at a time. Try just /QxW. What version of the compiler are you using? If you're using v10.x, try the latest v9.1. I've found v10 to still be quite buggy and I get better results with v9.1.

Consider using memset() instead of this (line 100):
for (int i=0; i<4*nV; i++) y[i]=0;
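
A minimal sketch of that change, assuming the line-100 loop is the one that zeroes your y[] state vector and that 4*nV elements are in use:

#include <string.h>   /* for memset */

/* instead of:  for (int i=0; i<4*nV; i++) y[i]=0; */
memset(y, 0, 4 * nV * sizeof(double));   /* all-zero bytes are 0.0 for IEEE doubles */
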
Loop on line 108 was parallelised, but the one on line 110 wasn't. Enable /Qvec-report3 and see what it says.
Did the loop on line 110 vectorize?

Move your declaration of tmp doubles on lines 32-35 to the top of the function and make them static. Otherwise they have to get malloc()-ed every time the loop goes around, and that's expensive.

On lines 40 and 41 you are calculating kappa*theta twice needlessly.

Similarly you're calculating kappa*y[i*4+3] on lines 41 and 43. Have a look through your expressions and analyze what you are working out multiple times, and cache it using a tmp variable instead.
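
A sketch of what that caching might look like inside fcn() (the names kt, b_re and b_im are just for illustration):

double kt = kappa * theta;                    /* loop-invariant, computed once */
for (int i = 0; i < neq/4; i++)
{
    double b_re = y[i*4+2], b_im = y[i*4+3];  /* each y[] element loaded once */
    double sR = 0, sI = 0;
    /* inner k-loop accumulating sR and sI stays exactly as before */

    yprime[i*4+0] = kt*b_re - f + f*sR;
    yprime[i*4+1] = kt*b_im - f*sI;
    yprime[i*4+2] = -kappa*b_re + 0.5*sigma2*(b_re*b_re - b_im*b_im);
    yprime[i*4+3] = pV[i] - kappa*b_im + sigma2*b_re*b_im;
}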

You are also using y[i*4+(whatever)] before you set it. Are you sure that's what you want to be doing? It looks like you're using a lot of this array uninitialized.

Just in case it wasn't implied strongly enough in my previous posts - consider what you're doing in your innermost loop:

res_Re[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*cos(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
res_Im[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*sin(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);

First of all, this part is the same in both:
exp(y[i*4+0]+y[i*4+2]*x0)

Further, the parameter to sin() is the same as parameter to cos():
y[i*4+1]+y[i*4+3]*x0+T0*pV[i]

Lastly, your array index is the same for both res_Re[] and res_Im[]

That means you could re-write this as follows:

static double exp_param;
static double trig_param;
static unsigned int indexj;
static unsigned int indexi;

for (int j = 0; j < nT; j++)
{
    indexj = j * nV;

    for (int i = 0; i < nV; i++)
    {
        exp_param = exp(y[i*4+0] + y[i*4+2] * x0);
        trig_param = y[i*4+1] + y[i*4+3] * x0 + T0 * pV[i];
        indexi = indexj + i;

        res_Re[indexi] = exp_param * cos(trig_param);
        res_Im[indexi] = exp_param * sin(trig_param);
    }
}

This will speed up your innermost loop by about 2x. Look at the vectorizer report and make sure it vectorizes. Some of these expressions are quite nasty. You may find that you have alignment issues that prevent vectorizing in y[]. If that is the case, split y[] into 4 arrays. In the example below I will call them y0-y3. This would give you:

exp_param = exp(y0[i*4] + y2[i*4] * x0);
trig_param = y1[i*4] + y3[i*4] * x0 + T0 * pV[i];

If you do that, you can also break i*4 out as a constant term and only work it out once. Finally, you are working out sin() and cos() of the same value. The two are tied together by the identity sin(x)^2 + cos(x)^2 = 1 and can be computed together more cheaply than separately, which saves you some more time. Luckily, there is a builtin function that does this for you already. From the GCC man page (I'd guess that the ICC implementation is compatible):

void sincos(double x, double *sin, double *cos);

So, for the final optimized version:

static double exp_param;
static double trig_param;
static unsigned int indexj;
static unsigned int indexi;
static double my_sin;
static double my_cos;

for (int j = 0; j < nT; j++)
{
    indexj = j * nV;

    for (int i = 0; i < nV; i++)
    {
        exp_param = exp(y0[i] + y2[i] * x0);
        trig_param = y1[i] + y3[i] * x0 + T0 * pV[i];
        indexi = indexj + i;

        sincos(trig_param, &my_sin, &my_cos);

        res_Re[indexi] = exp_param * my_cos;
        res_Im[indexi] = exp_param * my_sin;
    }
}

This should give you a very noticeable speed-up.

Just out of interest, how'd you get on with optimizing your last bit of code with hundreds of temp variables?

Thanks! I did find it's a library clash between Matlab and Intel.

I've sent SOSes to both sides, without getting further responses.

Thanks, but is it too much trouble to uninstall the V10 and reinstall V9.1?

wordy79@hotmail.com: Consider using memset() instead of this (line 100):
for (int i=0; i<4*nV; i++) y[i]=0;

I thought that's the C way of doing things, and I find it neater to do it the C++ way.

Is it true that for the sake of speed I really have to sacrifice C++?

I've already found that if I use vector in C++, it is dynamically allocated, and in order to gain speedup I have to switch to C-style arrays.

So now my program is terrible -- some parts are in C++, and some parts are in C.

Move your declaration of tmp doubles on lines 32-35 to the top of the function and make them static. Otherwise they have to get malloc()-ed every time the loop goes around, and that's expensive.

I thought a modern compiler can handle this in an efficient way for me.

wordy79@hotmail.com:Just in case it wasn't implied strongly enough in my previous posts - consider what you're doing in your innermost loop:

res_Re[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*cos(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
res_Im[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*sin(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);

First of all, this part is the same in both:
exp(y[i*4+0]+y[i*4+2]*x0)

Further, the parameter to sin() is the same as parameter to cos():
y[i*4+1]+y[i*4+3]*x0+T0*pV[i]

Lastly, your array index is the same for both res_Re[] and res_Im[]

That means you could re-write this as follows:

static double exp_param;
static double trig_param;
static unsigned int indexj;
static unsigned int indexi;

for (int j = 0; j < nT; j++)
{
    indexj = j * nV;

    for (int i = 0; i < nV; i++)
    {
        exp_param = exp(y[i*4+0] + y[i*4+2] * x0);
        trig_param = y[i*4+1] + y[i*4+3] * x0 + T0 * pV[i];
        indexi = indexj + i;

        res_Re[indexi] = exp_param * cos(trig_param);
        res_Im[indexi] = exp_param * sin(trig_param);
    }
}

This will speed up your innermost loop by about 2x. Look at the vectorizer report and make sure it vectorizes. Some of these expressions are quite nasty.

Thanks a lot! You really have good points above!

You may find that you have alignment issues that prevent vectorizing in y[]. If that is the case, split y[] into 4 arrays.
In the example below I will call them y0-y3. This would give you:

exp_param = exp(y0[i*4] + y2[i*4] * x0);
trig_param = y1[i*4] + y3[i*4] * x0 + T0 * pV[i];

If you do that, you can also break i*4 out as a constant term and only work it out once.

I cannot -- that's the way it is for solving an ODE numerically. I am using a 3rd-party library here. That function only accepts y as a whole.

Finally, you are working out sin() and cos() of the same value. The two are tied together by the identity sin(x)^2 + cos(x)^2 = 1 and can be computed together more cheaply than separately, which saves you some more time. Luckily, there is a builtin function that does this for you already. From the GCC man page (I'd guess that the ICC implementation is compatible):

void sincos(double x, double *sin, double *cos);

Very good idea! I thought there was no such function in C++? Or is it a compiler extension?

So, for the final optimized version:

static double exp_param;
static double trig_param;
static unsigned int indexj;
static unsigned int indexi;
static double my_sin;
static double my_cos;

for (int j = 0; j < nT; j++)
{
    indexj = j * nV;

    for (int i = 0; i < nV; i++)
    {
        exp_param = exp(y0[i] + y2[i] * x0);
        trig_param = y1[i] + y3[i] * x0 + T0 * pV[i];
        indexi = indexj + i;

        sincos(trig_param, &my_sin, &my_cos);

        res_Re[indexi] = exp_param * my_cos;
        res_Im[indexi] = exp_param * my_sin;
    }
}

This should give you a very noticeable speed-up.

Very nice! Thanks a lot!

Just out of interest, how'd you get on with optimizing your last bit of code with hundreds of temp variables?

What do you mean?

If you do that, you can also break i*4 out as a constant term and only work it out once.

I thought the modern compiler should really take care of these things for me...

Thanks, but is it too much trouble to uninstall the V10 and reinstall V9.1?

I have no idea how difficult it is on Windows, but on Linux I have both installed. I just set the paths to whichever one I want to use.

wordy79@hotmail.com: Consider using memset() instead of this (line 100):
for (int i=0; i<4*nV; i++) y[i]=0;

I thought that's the C way of doing things and I found it is neater to do the C++ way.

Can you explain to me where the "C++ way" is in:

for (int i=0; i<4*nV; i++) y[i]=0;

? That'll work in C too, and it'll be slower than memset() in C just like it is in C++. If it compiles as plain C, it isn't really "the C++ way".

Is that true that for the sake of speed I really have to sacrifice C++?

Not at all. I write C++. I like classes, because they allow me to wrap up data and associated methods nicely. I have yet to see one of my libraries slow down when being re-written from procedural C to a class-based implementation. At worst you'll get a speed penalty of a pointer de-reference on method and property calls, but that's typically negligible. You can even do pseudo-classes in C by having structs with function pointers in them. In the old days, C++ compilers would pre-process C++ to straight C, and then fall back on the C compiler for the actual machine code production.

I thought that's the C way of doing things, and I find it neater to do it the C++ way.

Can you please explain how a one-line loop is neater than a one-line function call? Sorry, but I can't see it. C++ is a superset of C. C++ is not C, but C is C++.

Is that true that for the sake of speed I really have to sacrifice C++?

Your previous post with 500+ temp variables was straight C as far as I remember, and I didn't see that going particularly fast. It has nothing to do with the question of C vs. C++. You can write slow code in either.

Move your declaration of tmp doubles on lines 32-35 to the top of the function and make them static. Otherwise they have to get malloc()-ed every time the loop goes around, and that's expensive.

I thought a modern compiler can handle this in an efficient way for me.

You seem to have a fairly thorough lack of understanding of what compilers will and will not do for you. There isn't, and never will be, a compiler-based replacement for good coding. If you were taught to leave all the optimization to the compiler, then I suggest you go to the institution that issued your programming qualification and demand your tuition fees back, because they clearly taught you things that are 100% wrong. The compiler will produce machine code according to the higher-level code you write. If you declare/allocate variables in a particular place, that's where they'll get allocated. The compiler is there to help you, not to override you.

You may find that you have alignment issues that prevent vectorizing in y[]. If that is the case, split y[] into 4 arrays.


I cannot -- that's the way it is for solving an ODE numerically. I am using a 3rd-party library here. That function only accepts y as a whole.

So break it up inside the function. You may find that the speed-up is worth it. Try both and see what performance you get.
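
Something along these lines, as a sketch (the y0..y3 scratch arrays and the copy loop are mine, sized to the MaxNumEqns limit your code already assumes):

static double y0[MaxNumEqns/4], y1[MaxNumEqns/4],
              y2[MaxNumEqns/4], y3[MaxNumEqns/4];

void fcn (int neq, double t, double y[], double yprime[])
{
    int n = neq / 4;
    for (int i = 0; i < n; i++)      /* de-interleave once per call */
    {
        y0[i] = y[i*4+0];
        y1[i] = y[i*4+1];
        y2[i] = y[i*4+2];
        y3[i] = y[i*4+3];
    }
    /* do the real work on the contiguous y0..y3 arrays here,
       writing the results back into yprime[] in the interleaved layout */
}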

void sincos(double x, double *sin, double *cos);

Very good idea! I thought there was no such function in C++? Or is it a compiler extension?

No idea, but GCC, ICC and most other compilers have it built in.

If you do that, you can also break i*4 out as a constant term and only work it out once.

I thought the modern compiler should really take care of these things for me...

Absolutely not. And I don't see it happening in our lifetimes, either.


Just out of interest, how'd you get on with optimizing your last bit of code with hundreds of temp variables?


What do you mean?

Most techniques I used above to speed your code up are also applicable to the code in your previous question. I was wondering if you managed to get the speed-up you wanted out of it.

On a separate note, in code that vectorizes, floats will be about 2x faster than doubles. If you don't absolutely need the extra precision, but could do with the extra speed, use floats.
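
The reason is register width: an SSE register is 128 bits, so a vectorized loop can process 4 floats per instruction but only 2 doubles. A trivial sketch of the difference:

float  af[1024], bf[1024];
double ad[1024], bd[1024];

for (int i = 0; i < 1024; i++)
    af[i] += bf[i];     /* 4 additions per SSE instruction when vectorized */

for (int i = 0; i < 1024; i++)
    ad[i] += bd[i];     /* only 2 additions per SSE2 instruction */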

>Consider using memset() instead of this (line 100)

>for (int i=0; i<4*nV; i++) y[i]=0;

Actually, the compiler converts this to a memset already, no need to recode it. It worked for me in a small test case. I was not able to compile what was posted because there were several missing definitions, but this should work fine.

>Move your declaration of tmp doubles on lines 32-35 to the top of the function and make them static. Otherwise they have to get malloc()-ed every time the loop goes around, and that's expensive.

No they don't. Try it and see. Making things static when they don't need to be is a good way to prevent optimizations. There's no need to do this, it's something compilers are quite good at.

>I thought a modern compiler can handle this in an efficient way for me.

Yes it can.

>There isn't, and never will be, a compiler-based replacement for good coding.

Certainly, but pulling temporary variables out of loops and making them static is not good coding, it's more akin to premature optimization, AKA "the root of all evil" :-).

Also, while using sincos may be a fine idea, again it is something that the compiler will figure out in many cases. I know it worked in a small test case. If you can provide a standalone working test case of your code I can verify whether it works in your case.

Compilers can do common subexpressions and all sorts of cool things just fine, don't complicate your code just because you think it will run faster. At the very least, measure it to see if it makes a difference, or check the asm to see if the compiler is already doing it for you.

I can assure you that a modern compiler should in general really take care of most of these things for you. You should write your code first to be clear and clean, then optimize based on real data as needed.

Dale

Case in point, with our current (10.0) compiler, this loop vectorizes:

for (int j = 0; j < nT; j++)
{
    for (int i = 0; i < nV; i++)
    {
        res_Re[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*cos(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
        res_Im[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*sin(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
    }
}

but when you pull out some common subexpressions into static vars, it no longer vectorizes. It's perfectly reasonable to pull out common expressions into sensibly named variables for clarity, but don't make them static, and don't think you're making things faster just by doing that.

Dale

Dale, is there any technical reading on what exactly current versions of Intel Compiler can and can't do for you?

I often get to talk with developers who still believe that compilers are brain-dead (just to clarify — I am not talking about people involved here) and I need to show them how wrong they are. It would help a lot to link them to some good article on the subject, preferably written by compiler writers because they know the best what works and what doesn't.

I really believe that it is the right time to show the world how much compilers have advanced in the last few years.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
Making things static when they don't need to be is a good way to prevent optimizations.

I'm afraid I'll have to disagree until you can produce an example to demonstrate this. Can you explain a single situation where re-allocating a variable every time will not be slower than not re-allocating it? If nothing else, static arrays can be aligned at compile time so they vectorize - something that is not always the case with non-static arrays. In the case I was having with an anonymous union, even using declspec(align(16)) didn't work.
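
For example, the alignment can be stated explicitly on a static (the first form is the Windows ICC/MSVC syntax, the second the GCC/ICC syntax on Linux):

__declspec(align(16)) static double buf[1024];          /* Windows ICC/MSVC: 16-byte aligned for SSE2 */
static double buf2[1024] __attribute__((aligned(16)));  /* Linux GCC/ICC equivalent */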

There's no need to do this, it's something compilers are quite good at.

A piece of example code to demonstrate this bold statement would be appreciated.

Certainly, but pulling temporary variables out of loops and making them static is not good coding, it's more akin to premature optimization, AKA "the root of all evil" :-).

Can you explain what you mean by "premature optimization"?

I can assure you that a modern compiler should in general really take care of most of these things for you. You should write your code first to be clear and clean, then optimize based on real data as needed.

I am very much unconvinced, but I could be persuaded by a piece of non-trivial sample code. I'm not going to hold my breath for it, though. While I agree that writing code to be clear and clean first is the way to go, I don't agree with the implication that optimized code is any less clear and clean. I am not talking about inline assembly here.


Case in point, with our current (10.0) compiler, this loop vectorizes:

...
but when you pull out some common subexpressions into static vars, it no longer vectorizes.

Except that v10 is demonstrably broken. It certainly doesn't vectorize some pretty simple cases that v9 vectorizes, and v9.1 has produced faster code for me on everything from a P3 to a Core 2. I have yet to see a single real code case where v10 outperforms v9.1. Have you tried it on v9.1? I'll bet it vectorizes just fine. And what is the cause of it not vectorizing, anyway? A mistaken aliasing check? v10 makes those quite often. I've seen cases where it is convinced there is a proven dependence where there isn't one, and even #pragma ivdep doesn't persuade it to vectorize the loop.

Dale, is there any technical reading on what exactly current versions of Intel Compiler can and can't do for you?

You mean what they try to do for you. Can is possibly too strong a statement. ;-)

I often get to talk with developers who still believe that compilers are brain-dead

Compilers aren't brain dead, merely simple-minded. That doesn't make ICC any less awesome - in most of my number crunching apps it improves performance by 3-7x over all other competing compilers in x86. But even if it can optimize out multiple calculations of the same expression for you, it's ill advised to rely on that in all cases. If for no other reason, because x86 is not the only platform, and ICC is not the only compiler. C/C++ is supposed to be portable, and portability should include the performance of the compiled code.

I really believe that it is the right time to show the world how much compilers have advanced in the last few years.

I don't think they have. ICC has been very good for a very long time, and credit to it for that. Out of the 5 compilers I tried, the one in 2nd place was a whopping 5x slower for me. But I guess I may not have noticed the common sub-expression optimizations because I have always written my code to avoid those, so I was never particularly relying on the compiler to fix up bad code for me. If you really think that compilers have advanced much in the past few years, perhaps you can explain why it is that no compiler other than ICC has SSE vectorization support that works even moderately well? The Pentium III has been around for the best part of 10 years. By this measure, most compilers are about 10 years behind the processors, at least on x86. (Note: GCC has some support, but it is very limited, as demonstrated by the 5x+ performance gap - it doesn't vectorize even half of the things that ICC can.)

>> Can you explain a single situation where re-allocating a variable every time will not be slower than not re-allocating it?

Wordy,

In the sample code given, the variables were allocated on the stack. All the stack variables are "allocated" in one instruction by way of subtracting the number of bytes required (for all stack variables in that scope) from the stack pointer: "sub esp, nnnn". So allocation is not a problem. Alignment may be a problem if the function prolog does not align the stack to a 16-byte interval. There should be an option switch to enable this and then enhance alignment of local variables and stack arrays, and in the process eliminate the requirement for testing alignment. I cannot confirm Intel C++ does this (I use VC++); Intel Fortran does the stack alignment.

Jim Dempsey

www.quickthreadprogramming.com

Jim, Wordy obviously got a bit confused about those variables. What he probably meant to say is that when you write:

void some_fun(void)
{
	int	a[9] = { 0, 3, 5, 11, 33, 48, 213, 88, 69 };
}

The variable a[] may be reinitialized on the stack each time (by coding a bunch of MOV instructions) instead of being allocated in the data segment at compile time.

Of course, if you make a[] static or global that doesn't happen.
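
I.e. something like this (sketch):

void some_fun(void)
{
	static const int a[9] = { 0, 3, 5, 11, 33, 48, 213, 88, 69 };
	/* a[] now lives in the data segment, written once at compile/link time,
	   so nothing is re-initialized on each call */
}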

What I am not sure about is whether the Intel Compiler would automatically allocate a[] in the data segment. That is why I asked Dale to comment on what compilers can and can't do.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

>I'm afraid I'll have to disagree until you can produce an example to demonstrate this.

I just did. See above.

>Can you explain a single situation where re-allocating a variable every time will not be slower than not re-allocating it?

What makes you think there's any 're-allocating' going on for a simple local scalar variable? You set up the stack frame on entry to the function and you're done, regardless of how many local vars you have. Why do you think that making it static helps in any way?

>A piece of example code to demonstrate this bold statement would be appreciated.

Again, see above for my previously posted example.

>Can you explain what you mean by "premature optimization"?

I'm referring to the well-known maxim from Donald Knuth. Google reveals lots of relevant links including:

http://en.wikipedia.org/wiki/Optimization_(computer_science)

> I have yet to see a single real code case where v10 outperforms v9.1. Have you tried it on v9.1? I'll bet it vectorizes just fine.

Actually, from what I saw, it's the other way round. It vectorizes (when you eliminate the pointless static variable) on 10.0 but not on 9.1.

Dale

>Dale, is there any technical reading on what exactly current versions of Intel Compiler can and can't do for you?

I should have some references at my fingertips but I confess I'm not real organized about that. I think we have some whitepapers and such about Vectorization, Parallelizing and threading, but I suspect we don't have much that covers the basic well-established opts. I'll see what I can dig up.

Dale

Actually, from what I saw, it's the other way round. It vectorizes (when you eliminate the pointless static variable) on 10.0 but not on 9.1.

Since you seem to be insisting:

foo.cxx:

#include <math.h>
#include "foo.h"

void foo (float *bar, float *res_Re, float *res_Im)
{
unsigned int nT = 100;
unsigned int nV = 100;

unsigned int indexi;
unsigned int indexj;

float *y;
float *pV;

float x0 = 12.34f;
float T0 = 23.45f;

y = bar;
pV = bar;

for (int j = 0; j < nT; j++)
{
for (int i = 0; i < nV; i++)
{ res_Re[j*nV+i]=expf(y[i*4+0]+y[i*4+2]*x0)*cosf(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
res_Im[j*nV+i]=expf(y[i*4+0]+y[i*4+2]*x0)*sinf(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
}
}
}

test.cxx:

#include "data.h"
#include "foo.h"

using namespace std;

int main ()
{
float res_Re[10000];
float res_Im[10000];

unsigned int tmp;

for (tmp = 0; tmp < 1000; tmp++)
{
foo (TestDataV, res_Re, res_Im);
}
}
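
(data.h is not shown here; a hypothetical stand-in of the right shape, for anyone wanting to reproduce the timings, is just a float array with at least 4*nV = 400 elements:)

/* data.h -- hypothetical stand-in, NOT the original test data */
static float TestDataV[400] = { 0.5f, 1.0f, 1.5f, 2.0f };  /* remaining elements are zero-initialized */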

Using ICC v10.0.025:

foo.cxx(26): (col. 3) remark: loop was not vectorized: existence of vector dependence.
foo.cxx(28): (col. 4) remark: vector dependence: proven FLOW dependence between res_Re line 28, and bar line 29.
(insert another 20-30 similar messages about the same non-existent but "proven" dependence)

Using ICC v9.1.051, we get a much more reasonable:

test.cxx(14) : (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 16.
foo.cxx(24) : (col. 2) remark: loop was not vectorized: not inner loop.
foo.cxx(26) : (col. 3) remark: loop was not vectorized: vectorization possible but seems inefficient.

Difference being that with v9 at least there's the option of setting #pragma vector always, whereas v10 will ignore #pragma ivdep because it considers the vector dependence proven.

So much for v10 vectorizing when v9 doesn't.
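
For reference, the pragma goes immediately before the loop it applies to; a sketch of the inner loop in foo.cxx with it added:

for (int j = 0; j < nT; j++)
{
    #pragma vector always   /* ICC: override the "seems inefficient" heuristic for the next loop */
    for (int i = 0; i < nV; i++)
    {
        res_Re[j*nV+i]=expf(y[i*4+0]+y[i*4+2]*x0)*cosf(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
        res_Im[j*nV+i]=expf(y[i*4+0]+y[i*4+2]*x0)*sinf(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);
    }
}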

With v9: 40.53 seconds on a P3/1.5

Taking out the common sub-expressions manually and replacing the inner loop in foo.cxx:
float exp_cache;
float trig_cache;

for (j = 0; j < nT; j++)
{
    indexj = j * nV;

    for (i = 0; i < nV; i++)
    {
        indexi = indexj + i;
        exp_cache = y[i*4+0]+y[i*4+2]*x0;
        trig_cache = y[i*4+1]+y[i*4+3]*x0+T0*pV[i];

        res_Re[indexi]=expf(exp_cache)*cosf(trig_cache);
        res_Im[indexi]=expf(exp_cache)*sinf(trig_cache);
    }
}

Still no vectorization, but the run time drops down to 23.05 seconds.

Both cases compiled with:
icpc -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -msse -xK -O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -fomit-frame-pointer -funroll-loops -fpic -w1 -vec-report3 -o test test.cxx foo.cxx

Note that neither example vectorizes, but taking out the common sub-expressions reduces the run time by nearly half.

Perhaps in the face of facts you'd care to re-consider your stated opinion on how good compilers are at figuring out the common sub-expression optimizations?

I was under the impression that, yes, for local variables the stack is decremented by the number of bytes required for all variables, with alignment factored in (i.e. padded space where required so that arrays, ints and doubles are word/doubleword aligned), and that on termination of the function this is popped. There will probably also be initialisation of that stack memory on each function call (a series of MOVs), although depending on the assignments in the function this initialisation may be optimised out.

Likewise, I would have thought that a static would silently resolve to a global variable, with memory set aside at program start or on the first call, and that it would be initialised once, not on every call; the overhead of using a static in a function should just be a memory dereference rather than stacking, initialising and unstacking.

So I would conclude that in theory the static should be marginally faster on all calls after the first. Likewise, I see no reason why either the static or the local variable shouldn't vectorise; there's no reason the memory shouldn't be aligned for either.

If the static vectorises in 9.x but not in 10, that's a broken bit of compiler in 10; likewise, if the local vectorises in 10 but not the static, that's a broken bit of compiler in 9.x. Considering 10 is the latest and, I'd hope, still being worked on, I'd hope that would be resolved in time.

With a simple function that is called an awful lot, a static should be fractionally quicker just from not having initialisation, allocation, and deallocation each time. However, if the majority of time is spent in loops within the function, this really is insignificant and you'll lose it in the noise. With a static you have the disadvantage that its memory won't be deallocated until program termination.

Static or local seems to me to be horses for courses: if it's a tiny function it may make a difference having it static; if not, then local may be better. It also depends on how you initialise it - if you're reusing the old values, fine, make it static; if you need it re-initialised to 0 there's little point in a non-trivial function.

All of that, however, is irrelevant if the compiler can't vectorise it right. I'd be interested to see benchmarks of a static in 9.x vs a local in 10.0.
[Edit: Wordy appears to have posted his findings before I finished this post.]

For the record, the stacking/unstacking of variables is based on what the non-ICC compiler I'm using in my embedded code appears to do (assuming I've read the assembly right), so it may not be the same for x86 ICC.

The implication of just 'rely on the optimiser' in the earlier post irks me as an embedded engineer. In my experience the optimiser cannot replace intelligent design and structuring; you need to help it to an extent by putting the code in a near-optimal form and having a basic understanding of the instructions things may resolve to and of what can vectorise. This can stop you accidentally pushing the compiler toward more complex solutions. The compiler can't always know what data will come from an external source, so it must ensure it copes with all possibilities, which is where dirty hacks can speed things up (OK, not the case in the sin/cos code, but certainly something to be considered).

$ icc -no-alias-args -xK foo.cxx test.cxx

foo.cxx(25): (col. 9) remark: LOOP WAS VECTORIZED.

$ icc -V

Intel C Compiler for applications running on IA-32, Version 10.0 Build 20070614 Package ID: l_cc_c_10.0.025

Copyright (C) 1985-2007 Intel Corporation. All rights reserved.

$

There is no initialization of local stack variables unless those variables have an explicit constructor:

void foo()
{
double buffer[100];
...

has no initialization other than for adjusting the stack size. At code entry of foo buffer[100] contains leftover "junk". Note, if the function has any other variables, even compiler temporary variables, then the allocation for buffer[100] is "free".

Static variables are either a) uninitialized, b) initialized at compile time, or c) initialized at program startup (via a constructor).

Static variables are global with respect to multi-threaded programming; stack variables (defined in the scope of a thread) are thread-local.

The compiler should be capable of forcing alignment of stack local variables. If it is not capable then the compiler writers need to fix this. SSE3 prefers paragraph alignment (16 bytes); later extensions may prefer 32-byte alignment. Obtaining 16-byte stack alignment is relatively trivial (reserve what is required + 15 bytes, then AND esp with ~15). This is only 1 extra instruction of overhead.

I will have to agree with you that some embedded processor architectures have sizeable overhead in mucking with their "stack". They really do not have a stack in the traditional sense; some instead have a rolling set of registers which are swapped with RAM. On x86 (32 and 64 bit) there is a stack pointer register to RAM and optionally a stack frame pointer register. Not much to muck with.

Jim Dempsey

www.quickthreadprogramming.com

OK, so -no-alias-args makes v10.0.025 vectorize. So does #pragma vector always with v9.1.051. And v9.1 is still faster:

13.78s with v9.1.051
15.22s with v10.0.025

(best of 3 runs, but they were all within 100ms of each other for both compilers)

Jim,

While I accept the theory of what you are saying, I have seen (and mentioned in this forum) code that doesn't vectorize efficiently when using class property arrays instead of static function scoped arrays. Granted, that may be a compiler bug, but it is nevertheless the case. declspec() didn't help, either, for some reason.

Oops - simply reserve what is required, then AND with ~15 to fall down to the next paragraph address:

sub esp, nBytesRequired
and esp, ~15 ; round down to the next paragraph (16-byte) boundary

Jim

www.quickthreadprogramming.com

Wordy,

Perhaps the uncertainty of alignment for class (struct) property arrays could be resolved if you explicitly state #pragma pack(1) immediately prior to the class. Then align the members inside by hand and add the appropriate number of bytes at the end of the class/struct such that, if you were to have an array of this class/struct, it meets the alignment requirements of the members within the class/struct. Hopefully the compiler will then be able to determine that an array of this class will meet the alignment requirements (if the 1st element of the array meets the alignment requirement). I know your example did not have an array of the class/struct where you experienced the non-vectorization problem. But the compiler does not know that your class/struct is never used in an array of such class/struct (and thus potentially violating the alignment requirements). If sufficient information is available to the compiler to determine the alignment of arrays of the class/struct, then it should be able to vectorize, provided that determination indicates the 1st element in the array of class/struct can be vectorized.
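
A sketch of the layout I mean (the class name and member sizes are made up for illustration):

#pragma pack(1)
class Signal
{
public:
    double samples[256];  /* the vectorizable array first; 256*8 = 2048 bytes, a multiple of 16 */
    int    count;         /* 4 bytes */
    char   pad[12];       /* hand-added padding so sizeof(Signal) = 2064, again a multiple of 16 */
};
#pragma pack()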

If the compiler will not vectorize then it is not necessarily a bug in the compiler, but it certainly would be a shortcoming of the compiler.

Jim Dempsey

www.quickthreadprogramming.com

Sorry for the double post. This !@#$% forum posting thing needs some serious rework.

When compiling a long response the forum usually discards your post (timeout thing) and then you have to recompose....

I usually Ctrl-A, Ctrl-C to copy the post into the clipboard prior to clicking on Post. Then if the post fails, re-reply and paste, Post. In this case the reply went to page 2 of the message thread but upon exit of Post it returned to page 1 of the message thread. Not seeing the post at the bottom of the page, I thought it dropped the message - so I reposted....

Will try to be more careful next time.

Jim Dempsey

www.quickthreadprogramming.com

$ icc -V

Intel C Compiler for 32-bit applications, Version 9.1 Build 20070512Z Package ID: l_cc_c_9.1.051

Copyright (C) 1985-2007 Intel Corporation. All rights reserved.

$ icc -no-alias-args -xK foo.cxx test.cxx

$ time ./a.out

real 0m2.304s

user 0m2.296s

sys 0m0.005s

$ time ./a.out

real 0m2.303s

user 0m2.295s

sys 0m0.005s

$ time ./a.out

real 0m2.304s

user 0m2.296s

sys 0m0.003s

$

#Switching to 10.0 compiler:

$ icc -V

Intel C Compiler for applications running on IA-32, Version 10.0 Build 20070614 Package ID: l_cc_c_10.0.025

Copyright (C) 1985-2007 Intel Corporation. All rights reserved.

$ icc -no-alias-args -xK foo.cxx test.cxx

foo.cxx(26): (col. 9) remark: LOOP WAS VECTORIZED.

$ time ./a.out

real 0m0.804s

user 0m0.795s

sys 0m0.003s

$ time ./a.out

real 0m0.712s

user 0m0.698s

sys 0m0.007s

$ time ./a.out

real 0m0.710s

user 0m0.697s

sys 0m0.004s

$

Unless you omitted a part of the v9 compiler output, v9 didn't vectorize. That makes it an apples and oranges comparison. Did you add #pragma vector always before the innermost loop? If we compare the peak achieved performance given the same code (including the mentioned pragma statement), then v9 wins. That's not a good thing.

More to the point, none of this takes away the demonstrated fact that common sub-expression elimination optimization simply does not work - and I think that was the main point discussed in this thread.

Part of my point was that the 10.0 compiler vectorizes the code where the 9.1 compiler fails to, in an apples to apples comparison, same code, same opts. I haven't tried every possible combination of options and pragmas, but so far in investigating this particular chunk of code I haven't found 9.1 to be faster on the machines I use.

As far as common subexpressions go, you're missing several subtleties. First off, my main point initially was that simply declaring local vars as static was at best pointless, at worst inhibits performance, and in any case bad style. I notice you didn't use statics in your example, and in fact when I tried them as statics it made no difference in performance. I believe the reason you're really seeing a performance improvement is from eliminating array references and function calls, not strictly just eliminating common subexpressions. I'm looking into why we don't also eliminate the redundant calls and references (or it might be just one or the other that we're missing, I'm still investigating). The thing is, it's important to understand what's really making the difference, not just assume that eliminating CSEs by hand is always worth the cost.
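
To separate the two effects, one could also cache the call itself and not just its argument; a sketch of the inner loop with the redundant expf() call removed:

unsigned int indexj = j * nV;
for (unsigned int i = 0; i < nV; i++)
{
    unsigned int indexi = indexj + i;
    float e = expf(y[i*4+0] + y[i*4+2]*x0);   /* the second, identical expf() call is gone */
    float t = y[i*4+1] + y[i*4+3]*x0 + T0*pV[i];

    res_Re[indexi] = e * cosf(t);
    res_Im[indexi] = e * sinf(t);
}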

I'm not saying you should never rewrite code to make it run faster, but you shouldn't make things static out of habit because you think they're faster, and you should avoid "premature optimization". You should write code that is clean and easy to maintain, and modify performance-critical parts as needed based on real data, not on faulty reasoning about how you think the compiler works (or doesn't).

Dale

>> common sub-expression elimination optimization simply does not work

Not having Intel C++ I cannot comment specifically on common sub-expression elimination as performed by ICPP.

In general, the effectiveness of common sub-expression elimination depends upon where in the compile process the common sub-expressions are detected and eliminated.

If you perform the detection deep into the compilation, then the particular common sub-expression may have acquired use of a different set of registers at different points in the program. In this case not all of the common sub-expressions might be detected.

On the other hand, if common sub-expressions are looked for higher up in the abstraction level (prior to register allocation), then you may have more effective detection of common sub-expressions. Note, this does not preclude the compiler from looking for common sub-expressions at both levels.

In one of my "past lives" I wrote compiler optimization utilities. I was surprised that after you make a pass of optimizations, there is a good probability that a new set of detectable optimizations is present in the code. Thus in many cases, optimization should be repeated until done: while (anyOptimizations(codeBlock)) continue;
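A minimal sketch of that fixed-point idea (CodeBlock and anyOptimizations are placeholders standing in for a real pass manager, not an actual compiler API):

struct CodeBlock { /* intermediate representation (placeholder) */ };

// Returns true if this sweep of the optimization passes changed anything;
// in a real compiler this is supplied by the individual passes.
bool anyOptimizations(CodeBlock& block);

// Each successful sweep (say, constant folding) can expose new
// opportunities (say, dead-code elimination), so repeat the sweeps
// until one of them makes no changes, i.e. until a fixed point.
void optimizeToFixedPoint(CodeBlock& block)
{
    while (anyOptimizations(block))
        ; // all the work happens inside the passes
}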

Jim Dempsey

www.quickthreadprogramming.com
Part of my point was that the 10.0 compiler vectorizes the code where the 9.1 compiler fails to, in an apples to apples comparison, same code, same opts.

Sure - and I just gave you an apples to apples comparison where v9.1 outperforms v10. But you don't seem to recognize that for some reason.

I haven't tried every possible combination of options and pragmas, but so far in investigating this particular chunk of code I haven't found 9.1 to be faster on the machines I use.

Seems like that has something to do with the fact that you haven't bothered adding #pragma vector always above the inner loop so that both v10 and v9.1 vectorize it.
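For reference, the pragma has to sit immediately above the innermost loop, roughly like this (a sketch with invented names, not the kernel being benchmarked):

// Sketch: #pragma vector always applies only to the loop that directly
// follows it, so it goes right above the inner loop you want vectorized.
void accumulate(const double* p, const double* w, double* out, int n, int m)
{
    for (int i = 0; i < n; ++i)
    {
        double s = 0.0;
        #pragma vector always  // Intel-specific: override the vectorizer's cost heuristics
        for (int k = 0; k < m; ++k)
            s += p[k] * w[i * m + k];
        out[i] = s;
    }
}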

I haven't tried every possible combination of options and pragmas, but so far in investigating this particular chunk of code I haven't found 9.1 to be faster on the machines I use.

I have found 9.1 faster on pretty much every piece of code I have bothered to investigate to any significant extent. The benchmark times I gave were for a 1.4 GHz P3 Tualatin. But since you think this is an unusual result, here it is on a 1.7 GHz P4 Celeron:

v9.1.051: 3.58s
v10.0.025: 4.19s
So the difference remains when both versions vectorize. Compiled with:
icpc -no-alias-args -march=pentium4 -mcpu=pentium4 -mtune=pentium4 -msse2 -xW -O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -fomit-frame-pointer -funroll-loops -fpic -w1 -vec-report3 -static -o test test.cxx foo.cxx

I believe the reason you're really seeing a performance improvement is from eliminating array references and function calls, not strictly just eliminating common subexpressions.

In the specific example I made, I removed no function calls. There are still two identical calls to expf(), and one call each to sinf() and cosf(). The performance improvement is traceable directly to caching the parameters to those functions. Calculating those parameters once rather than twice DOES make a difference. Whether that is because it helps the compiler optimize something else better is debatable.
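Roughly what that looks like (a sketch with invented names, to show what caching the parameters means, not the actual benchmark source):

#include <math.h>

// Sketch: the argument expressions are computed once into locals; the two
// expf() calls and the sinf()/cosf() calls remain, but their inputs are no
// longer spelled out and recomputed at each call site.
float evaluate(float a, float b, float t)
{
    const float arg   = -a * t;   // shared argument, computed once
    const float phase =  b * t;   // shared argument, computed once
    return expf(arg) * sinf(phase) + expf(arg) * cosf(phase);
}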

In general, the effectiveness of common sub-expression elimination depends upon where in the compile process the common sub-expressions are detected and eliminated.

More like whether, rather than where.

Thus in many cases, optimizations should be repetitive until done.

Agreed. But leaving the compiler any scope to screw up an optimization as simple and as readability-improving as this one seems ill advised. Good compilers aren't a replacement for good programming when it comes to optimization.

Even when I convinced the 9.1 compiler to vectorize the loop by adding "#pragma vector always", I did not see the behavior you describe. I tried two different machines, and adding the pragma brings 9.1 up to the 10.0 performance but does not exceed it.

Dale

Part of my point was that the 10.0 compiler vectorizes the code where the 9.1 compiler fails to, in an apples to apples comparison, same code, same opts.

Dale, that is not true. You have added an option -no-alias-args which Wordy didn't use, and you also left out #pragma ivdep (or #pragma vector always, I am not sure which one he used) thus intentionally disabling vectorization on 9.1 and allowing it in 10.0.

Wordy's point was that #pragma allows vectorization of that loop in 9.1 but not in 10.0. Your suggestion for changing command line switches to make more code vectorize with 10.0 is flawed because sometimes the code is part of a larger build and option changing is not permitted to individual developers.

Moreover, there is a thing called regression testing. In development, it is used to make sure that the new version of whatever still does all the good things that the previous version of whatever was able to do.

Frankly, I am disappointed by the number of bugs in 10.0 (I alone have had at least three of them confirmed). True, in some cases it generates better-performing code, but there are others where you ask yourself whether the 10-15% speedup you are seeing is worth so much time wasted on troubleshooting.

As for premature optimization, perhaps the father of MMX Donald Knuth was right for popularizing the phrase "Premature optimization is the root of all evil".

However, let us not forget that the original saying came from Sir Charles Antony Richard Hoare — "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil".

His words were ripped out of context by Donald Knuth and simply misunderstood by many others. Those words have all too often led software designers into serious mistakes, because they have been applied to a different problem domain than the one intended.

Here is what Charles Cook has to say about it (and I completely agree with him):

It's usually not worth spending a lot of time micro-optimizing code before it's obvious where the performance bottlenecks are. But, conversely, when designing software at a system level, performance issues should always be considered from the beginning. A good software developer will do this automatically, having developed a feel for where performance issues will cause problems. An inexperienced developer will not bother, misguidedly believing that a bit of fine tuning at a later stage will fix any problems.

I've worked on systems where the architects adhered to "Hoare's Dictum". All goes well until realistic large-scale testing is performed and it becomes painfully obvious that the system is never going to scale upwards. Unfortunately by then it can be very difficult to fix the problem without a large amount of re-design and re-coding.

I also suggest reading this article written by Randall Hyde. Again, let me say that I agree with him more than I will ever agree with Hoare and Knuth.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
IgorLevicki:

Dale, that is not true. You have added an option -no-alias-args which Wordy didn't use, and you also left out #pragma ivdep (or #pragma vector always, I am not sure which one he used) thus intentionally disabling vectorization on 9.1 and allowing it in 10.0.

I said I used the same code and the same options for both 9.1 and 10.0. I used the code that was posted. Without knowing that the parameters do not alias, the compiler cannot safely vectorize it; once that is guaranteed by the option (one of several ways to do so), the 10.0 compiler vectorizes it automatically, while the 9.1 compiler still fails to. The 9.1 compiler requires an additional pragma to get it to vectorize.

IgorLevicki:

Wordy's point was that #pragma allows vectorization of that loop in 9.1 but not in 10.0. Your suggestion for changing command line switches to make more code vectorize with 10.0 is flawed because sometimes the code is part of a larger build and option changing is not permitted to individual developers.

My point is that the 10.0 compiler doesn't need a pragma to force vectorization. If the option is not appropriate for the file, you can use pragmas to indicate that the parameters do not alias, which gives the compiler more useful information than just that a particular loop should be forced to vectorize in all cases. Such information is more general and more generally usable by other parts of the compiler to improve other optimizations.
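Two common ways to express that at file or function scope with the Intel compiler, without changing project-wide options, look roughly like this (a sketch with invented names; the restrict keyword needs the -restrict switch on icpc):

// Option 1: restrict-qualified parameters promise the compiler that the
// pointers do not alias (enable with -restrict, or use __restrict).
void axpy(double* restrict y, const double* restrict x, double a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Option 2: #pragma ivdep asserts there are no loop-carried dependences
// in the loop that follows, which also removes the aliasing obstacle.
void axpy_ivdep(double* y, const double* x, double a, int n)
{
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}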

IgorLevicki:

. . . the father of MMX Donald Knuth

I trust you mean MMIX, which I believe has no relation to MMX, just to be clear. I wouldn't want any confusion on that (nor, I'm sure, would Dr. Knuth :-)

IgorLevicki:

Here is what Charles Cook has to say about it (and I completely agree with him):

It's usually not worth spending a lot of time micro-optimizing code before it's obvious where the performance bottlenecks are. But, conversely, when designing software at a system level, performance issues should always be considered from the beginning. A good software developer will do this automatically, having developed a feel for where performance issues will cause problems. An inexperienced developer will not bother, misguidedly believing that a bit of fine tuning at a later stage will fix any problems.

I've worked on systems where the architects adhered to "Hoare's Dictum". All goes well until realistic large-scale testing is performed and it becomes painfully obvious that the system is never going to scale upwards. Unfortunately by then it can be very difficult to fix the problem without a large amount of re-design and re-coding.

I also suggest reading this article written by Randall Hyde. Again, let me say that I agree with him more than I will ever agree with Hoare and Knuth.

That summary sounds right on target. I'll look into that article; it sounds promising.

My point is that the 10.0 compiler doesn't need a pragma to force vectorization.

And my points are:

  1. 10.0 needs an additional option instead of a #pragma, which is about the same (in the sense that both ways require human intervention and thus are not fully automatic), except that the #pragma is local to the loop while the option is global to the module or to the project and can possibly break things. That is why I prefer the #pragma over an option.
  2. Using the same options and the same #pragma in this particular case leads to 9.1 vectorizing the code and 10.0 not vectorizing it, which is clearly a regression in 10.0, and it should be fixed.

As for Donald, you are right — I meant MMIX. Syntax error.

That summary sounds right on target. I'll look into that article, it sounds promising.

If you like the article, you might want to check his book — Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
I tried two different machines and adding the pragma brings 9.1 up to the 10.0 performance but does not exceed it.

Since you seem to be insisting on this:
On Core2/x86-64:

32-bit binary for P3:
v9.1.051: 2.59s
v10.0.025: 2.90s

64-bit binary for Core:
v9.1.051: 2.24s
v10.0.025: 5.42s - couldn't vectorize (?! 32-bit compiler did!). And this is with -no-alias-args AND #pragma vector always.
remark: loop was not vectorized: dereference too complex.

Both v9.1 and v10 compiled with:
icpc -no-alias-args -msse3 -xP -O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -fomit-frame-pointer -funroll-loops -fpic -w1 -vec-report3 -static -o test test.cxx foo.cxx
(-xP deliberately used instead of -xT because both v9.1 and v10 support -xP to allow for a fair test)

What _exactly_ did you test it on? I have so far produced test results for P3/RedHat9, P4/RedHat9 and Core2/CentOS5-x86-64. The results range from v10 being only slower all the way to v10 being broken. That's 3 out of 3, as tested on the last 3 generations of CPUs that Intel produced.

Jim:
>I will have to agree with you that some embedded processor architectures have sizeable overhead in mucking with their "stack". They really do not have a stack, in the traditional sense, but some instead have a rolling set of registers where these are swapped with RAM. On x86 (32 and 64 bit) they use a stack pointer register to RAM and optionally a stack frame pointer register. Not much to muck with.
>
>Jim Dempsey

Pretty much, though I think you misunderstood what I meant slightly - I was agreeing about how variables are taken off the stack and about alignment.
With regard to stacks and micros: most micros (even embedded ones) do have a stack pointer and limit register - certainly the ARMs I'm using do - and even tiny micros like MSPs, PICs and 8051s either have one, or there are ways you can implement a stack by using a spare register or a fixed memory address to store a stack pointer.
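For example, a software stack on a micro without a usable hardware one can be as small as this (a hypothetical sketch, not code for any of the parts mentioned):

// Sketch: a software stack kept behind a pointer stored at a fixed
// (here: static) location; push/pop are just pointer arithmetic.
static unsigned char stackArea[64];
static unsigned char* stackPtr = stackArea + sizeof(stackArea); // grows downward

static void push(unsigned char v) { *--stackPtr = v; }
static unsigned char pop(void)    { return *stackPtr++; }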

Re swapping registers with memory directly (I'm assuming you're referring to a fixed address where a function/interrupt puts register contents when the function is entered): there's really little gain over maintaining a stack unless you've got to the level of needing to shave half a dozen clock cycles (and a few micros do context-switch registers internally for interrupts, though most don't), though needing to shave those clock cycles is not unheard of.

I was merely saying that on a trivial function called a lot, the extra 1 or 2 cycles could add up; on a nontrivial function there's little gain, though there is no loss from writing it that way, and no THEORETICAL vectorisation problems either.

IgorLevicki: Good articles! I'd appreciate more links & posts like that :)

Dale, wordy, others: though I agree that the test wasn't entirely fair, you're optimising for very platform-specific behaviour at this point, so please list the platforms when quoting figures - if nothing else it helps identify which architectures are flawed and which are more optimal in which areas.

This discussion is starting to get out of hand.

Wordy, I suggest you submit an issue for the 10.0.x compiler on Premier Support. Do not forget to include a reproducible test case in your problem report.

I would personally submit two issues, one for 32-bit 10.0 not vectorizing with #pragma where 9.1 does, and another one for 64-bit 10.0 not vectorizing at all. In my opinion those are two different issues, never mind that the code is the same.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
Wordy, I suggest you submit an issue for the 10.0.x compiler on Premier Support. Do not forget to include a reproducible test case in your problem report.

Will do. I just noticed that 10.0.026 is finally available for Linux. Before I file a bug report I'll make sure that it hasn't already been fixed.

Will do. I just noticed that 10.0.026 is finally available for Linux. Before I file a bug report I'll make sure that it hasn't already been fixed.

Sure, but don't get your hopes up too high when it comes to 10.0.026.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Actually as of today we're up to 10.0.027. I'd try it out for you but who knows where that would lead :-)

Seriously, though, I doubt that this update addresses your particular concerns so you should go ahead and submit the issues if you haven't already.

Dale

Actually as of today we're up to 10.0.027. I'd try it out for you but who knows where that would lead :-)

Dale, when will we finally get an "Issues fixed in this release" list with each new compiler version?

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
IgorLevicki:

Dale, when we will finally get "Issues fixed in this release" list with each new compiler version?

Your wish is my command :-) It appears to be there in the latest release, in the readme.txt file (w_cc_c_10.0.027_README.txt). The current one (10.0.027) also notes the 10.0.026 fixes.

Is that what you're looking for?

Dale
