Hi,

I tried the following compiler options ... and then spent a few days tweaking the code for further speedup. No matter how I experimented with the settings, I couldn't get any further speedup. I decided to step out of my old paradigm and learn how other people write high-performance code. Could anybody please criticize my code and give me some suggestions? The more you criticize, the more I can learn. Thank you very much for your help!

The build log is very confusing. For example, see the following quote from it: is the loop parallelized or not at the end of the day? I have two Xeon 2.4 GHz CPUs supporting SSE2. Thanks!

.\try2.cpp(100): (col. 2) remark: loop was not parallelized: insufficient computational work.

.\try2.cpp(108): (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.

.\try2.cpp(110): (col. 3) remark: loop was not parallelized: insufficient inner loop.

----------------------------------------

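#include <math.h> //header name was lost in the post; math.h is assumed because exp, cos and sin are used below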

#include "imsl.h"

#include "mex.h"

#define MaxNumEqns 1000000

//Intel C++ compiler options: /O3 /Qopt-report /Qopt-report-phase hlo /QxW /Qopt-report3 /Qparallel /Qpar-report3

#pragma comment(linker,"/NODEFAULTLIB:libc.lib")

static double kappa, theta, sigma2, f, v, abserr, x0, T0;

static double *pps, *pys, *pV, *pT;

static int nps, nys, nV, nT;

static double *res_Re, *res_Im;

static double y[MaxNumEqns];

void fcn (int neq, double t, double y[], double yprime[])

{

//y: 0 -- alpha_real,

// 1 -- alpha_imag,

// 2 -- beta_real,

// 3 -- beta_imag.

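for (int i=0; i<nV; i++) { //loop bound was lost in the post; nV assumed because pV[i] is indexed by i

double sI=0, sR=0;

for (int k=0; k<nps; k++) { //loop bound was lost in the post; nps assumed (nps and nys are presumably equal)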

double tmp=1+y[i*4+2]*(-pys[k]);

double tmp2=tmp*tmp;

double tmp3=y[i*4+3]*(-pys[k]);

double tmp4=tmp3*tmp3;

sR=sR+pps[k]*tmp/(tmp2+tmp4);

sI=sI+pps[k]*y[i*4+3]*(-pys[k])/(tmp2+tmp4);

}

yprime[i*4+0] = kappa*theta*y[i*4+2]-f+f*sR;

yprime[i*4+1] = kappa*theta*y[i*4+3]-f*sI;

yprime[i*4+2] = -kappa*y[i*4+2]+0.5*sigma2*(y[i*4+2]*y[i*4+2]-y[i*4+3]*y[i*4+3]);

yprime[i*4+3] = pV[i] - kappa*y[i*4+3]+sigma2*y[i*4+2]*y[i*4+3];

}

}
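
A side note on what the inner loop computes. Writing z_k = 1 - pys[k]*(y[i*4+2] + i*y[i*4+3]), the two sums appear to be the real part and the negated imaginary part of the sum of pps[k]/z_k. The standalone sketch below is not part of try2.cpp: the function names `sum_expanded` and `sum_complex` and the use of `std::vector` are assumptions made for illustration. It checks that reading against `std::complex` and also folds the two divisions per iteration into one reciprocal, which can change the last bits of the result.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hand-expanded sums, following the inner loop of fcn.
// bR and bI stand for y[i*4+2] and y[i*4+3]; ps and ys stand for pps and pys.
void sum_expanded(double bR, double bI,
                  const std::vector<double>& ps, const std::vector<double>& ys,
                  double& sR, double& sI)
{
    sR = 0.0;
    sI = 0.0;
    for (std::size_t k = 0; k < ps.size(); ++k) {
        const double re  = 1.0 - bR * ys[k];             // "tmp" in fcn
        const double im  = -bI * ys[k];                  // "tmp3" in fcn
        const double inv = 1.0 / (re * re + im * im);    // one division instead of two
        sR += ps[k] * re * inv;
        sI += ps[k] * im * inv;
    }
}

// The same quantity as a complex sum: sum over k of p_k / (1 - y_k * beta).
// Its real part should equal sR and its imaginary part should equal -sI.
std::complex<double> sum_complex(double bR, double bI,
                                 const std::vector<double>& ps,
                                 const std::vector<double>& ys)
{
    const std::complex<double> beta(bR, bI);
    std::complex<double> s(0.0, 0.0);
    for (std::size_t k = 0; k < ps.size(); ++k) {
        s += ps[k] / (1.0 - ys[k] * beta);
    }
    return s;
}
```

Whether the single reciprocal is actually faster depends on the compiler and CPU. With /O3 the compiler may already share the division, so this would need to be timed.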

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray*prhs[] )

{

//v, T, kappa, theta, sigma2, f, ys, ps, x0, T0, abserr

//v and T can be vectors, ys and ps are vectors.

kappa=*(mxGetPr(prhs[2]));

theta=*(mxGetPr(prhs[3]));

sigma2=*(mxGetPr(prhs[4]));

f=*(mxGetPr(prhs[5]));

nV=mxGetNumberOfElements(prhs[0]); //Number of "v"'s

pV=mxGetPr(prhs[0]);

nT=mxGetNumberOfElements(prhs[1]); //Number of "T"'s

pT=mxGetPr(prhs[1]);

nps=mxGetNumberOfElements(prhs[7]); //Number of "p"'s

pps=mxGetPr(prhs[7]);

nys=mxGetNumberOfElements(prhs[6]); //Number of "y"'s

//08/10/2007: Note: due to a typo, the y's have to be negated before passing into this function.

//08/11/2007: Note: the above restriction has been removed and the negation is no longer needed.

pys=mxGetPr(prhs[6]);

x0=*(mxGetPr(prhs[8]));

T0=*(mxGetPr(prhs[9]));

abserr=*(mxGetPr(prhs[10]));

plhs[0]=mxCreateDoubleMatrix(nV, nT, mxCOMPLEX); //The solutions of A, "v" is the row, "T" is the col.

plhs[1]=mxCreateDoubleMatrix(nV, nT, mxCOMPLEX); //The solutions of B, "v" is the row, "T" is the col.

res_Re=mxGetPr(plhs[0]); //These are the outputs to be passed back to Matlab

res_Im=mxGetPi(plhs[0]);

int nstep;

char *state;

double t = 0.0; /* Initial time */

for (int i=0; i

imsl_d_ode_runge_kutta_mgr(IMSL_ODE_INITIALIZE, &state, IMSL_TOL, abserr, 0);

for (int j=0; j

imsl_d_ode_runge_kutta_mgr(IMSL_ODE_RESET, &state, 0);

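for (int j=0; j<nT; j++) { //loop bounds were lost in the post; nT columns and nV rows assumed from the res_Re[j*nV+i] indexing

for (int i=0; i<nV; i++) {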

res_Re[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*cos(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);

res_Im[j*nV+i]=exp(y[i*4+0]+y[i*4+2]*x0)*sin(y[i*4+1]+y[i*4+3]*x0+T0*pV[i]);

}

}

return;

}
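
The two output lines evaluate the same exp(...) twice per element. Since exp(a)*cos(b) and exp(a)*sin(b) are the real and imaginary parts of a complex number with modulus exp(a) and argument b, one possible simplification is sketched below. This is a standalone illustration, not part of try2.cpp: the helper name `char_value` and its argument list are hypothetical.

```cpp
#include <cmath>
#include <complex>

// y4 points at the four state values of one equation:
// {alpha_real, alpha_imag, beta_real, beta_imag}.
// Returns exp(re) * (cos(im) + i*sin(im)) with exp evaluated once.
std::complex<double> char_value(const double* y4, double x0, double T0, double v)
{
    const double re = y4[0] + y4[2] * x0;
    const double im = y4[1] + y4[3] * x0 + T0 * v;
    return std::polar(std::exp(re), im);
}
```

In the output loop this would be called as `char_value(&y[i*4], x0, T0, pV[i])`, with the real and imaginary parts stored to `res_Re[j*nV+i]` and `res_Im[j*nV+i]`. The compiler may already merge the two exp calls on its own, so any gain is not guaranteed.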

Build log:

---------------------------------------

------ Rebuild All started: Project: try2, Configuration: Release Win32 ------

Deleting intermediate files and output files for project 'try2', configuration 'Release|Win32'.

Compiling with Intel C++ 10.0.025 [IA-32]... (Intel C++ Environment)

icl: command line warning #10120: overriding '/O2' with '/O3'

icl: command line warning #10121: overriding '/Qvc8' with '/Qvc7.1'

stdafx.cpp

DllMain.cpp

procedure: DllMain

procedure: DllMain

try2.cpp

.\try2.cpp(95): warning #177: variable "nstep" was declared but never referenced

int nstep;

^

.\try2.cpp(12): warning #177: variable "v" was declared but never referenced

static double kappa, theta, sigma2, f, v, abserr, x0, T0;

^

procedure: mexFunction

procedure: mexFunction

HLO REPORT LOG OPENED ON Sat Aug 11 16:45:23 2007

<.\try2.cpp;-1:-1;hlo;_mexFunction;0>

High Level Optimizer Report (_mexFunction)

Adjacent Loops: 3 at line 100

Unknown loop at line #104

QLOOPS 4/4 ENODE LOOPS 4 unknown 1 multi_exit_do 0 do 3 linear_do 3

LINEAR HLO EXPRESSIONS: 38 / 61

------------------------------------------------------------------------------

.\try2.cpp(100): (col. 2) remark: loop was not parallelized: insufficient computational work.

.\try2.cpp(108): (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.

.\try2.cpp(110): (col. 3) remark: loop was not parallelized: insufficient inner loop.

.\try2.cpp(100): (col. 2) remark: LOOP WAS VECTORIZED.

.\try2.cpp(110): (col. 3) remark: LOOP WAS VECTORIZED.

.\try2.cpp(110): (col. 3) remark: LOOP WAS VECTORIZED

<.\try2.cpp;104:104;hlo_scalar_replacement;in _mexFunction;0>

#of Array Refs Scalar Replaced in _mexFunction at line 104=1

<.\try2.cpp;108:108;hlo_linear_trans;_mexFunction;0>

Loop Interchange not done due to: User Function Inside Loop Nest

Advice: Loop Interchange, if possible, might help Loopnest at lines: 108 110

: Suggested Permutation: (1 2 ) --> ( 2 1 )

procedure: fcn

procedure: fcn

<.\try2.cpp;-1:-1;hlo;?fcn@@YAXHNQAN0@Z;0>

High Level Optimizer Report (?fcn@@YAXHNQAN0@Z)

QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2

LINEAR HLO EXPRESSIONS: 69 / 101

------------------------------------------------------------------------------

.\try2.cpp(26): (col. 2) remark: loop was not parallelized: existence of parallel dependence.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 41.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 40, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 41, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 42, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven ANTI dependence between y line 43, and yprime line 40.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and (unknown) line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and (unknown) line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between (unknown) line 37, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and (unknown) line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 37.

.\try2.cpp(37): (col. 4) remark: parallel dependence: proven ANTI dependence between y line 37, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 37.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and (unknown) line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and (unknown) line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between (unknown) line 34, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and (unknown) line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 34.

.\try2.cpp(34): (col. 27) remark: parallel dependence: proven ANTI dependence between y line 34, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 34.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and (unknown) line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and (unknown) line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and (unknown) line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between (unknown) line 32, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and (unknown) line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 40.

.\try2.cpp(40): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 40, and y line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 41.

.\try2.cpp(41): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 41, and y line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 42.

.\try2.cpp(42): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 42, and y line 32.

.\try2.cpp(32): (col. 28) remark: parallel dependence: proven ANTI dependence between y line 32, and yprime line 43.

.\try2.cpp(43): (col. 3) remark: parallel dependence: proven FLOW dependence between yprime line 43, and y line 32.

.\try2.cpp(30): (col. 3) remark: loop was not parallelized: insufficient computational work.

.\try2.cpp(30): (col. 3) remark: LOOP WAS VECTORIZED

<.\try2.cpp;26:26;hlo_linear_trans;?fcn@@YAXHNQAN0@Z;0>

Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)

Advice: Loop Interchange, if possible, might help Loopnest at lines: 26 30

: Suggested Permutation: (1 2 ) --> ( 2 1 )

...

---------------------- Done ----------------------

Rebuild All: 1 succeeded, 0 failed, 0 skipped