Hi all,

I've run into a very strange compiler issue with some very basic HPC code (see .cpp file). The code calculates the norm of a matrix and measures the performance of this calculation:

    double norm1_row_wise(double** const A, const int n)
    {
        // Three ways to obtain the rowsums buffer; only one is enabled per build.
        double *rowsums = new double[n];
        // double rowsums[30000];
        // double *rowsums = (double *)calloc( n, sizeof(double) );
        double max_norm = -1.;

        // Clear the per-row accumulators.
        for (int i=0; i<n; ++i)
            rowsums[i] = 0;

        // Sum the absolute values of each row.
        for (int i=0; i<n; ++i)
            for (int j=0; j<n; ++j)
                rowsums[i] += abs(A[i][j]);

        // The norm is the largest row sum.
        for (int i=0; i<n; ++i)
            if (rowsums[i] > max_norm)
                max_norm = rowsums[i];

        return max_norm;
    }

This function is called 5 times during the program, so it performs either 5 new's, 5 stack allocations or 5 callocs.
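To make the three variants concrete, here is a minimal sketch of how they differ. The names row_sum_max, norm1_new, norm1_calloc and norm1_stack are made up for this illustration and are not taken from the attached .cpp file. The sketch also releases the buffer again, which the snippet above does not show, and the stack variant assumes n <= 30000.

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical helper: the row-sum loop shared by all three variants.
static double row_sum_max(double** const A, const int n, double* rowsums)
{
    double max_norm = -1.;
    for (int i = 0; i < n; ++i)
        rowsums[i] = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            rowsums[i] += std::fabs(A[i][j]);
    for (int i = 0; i < n; ++i)
        if (rowsums[i] > max_norm)
            max_norm = rowsums[i];
    return max_norm;
}

// Variant 1: buffer from operator new[].
double norm1_new(double** const A, const int n)
{
    double* rowsums = new double[n];
    const double result = row_sum_max(A, n, rowsums);
    delete[] rowsums;
    return result;
}

// Variant 2: buffer from calloc.
double norm1_calloc(double** const A, const int n)
{
    double* rowsums = static_cast<double*>(std::calloc(n, sizeof(double)));
    if (!rowsums)
        return -1.;  // allocation failed
    const double result = row_sum_max(A, n, rowsums);
    std::free(rowsums);
    return result;
}

// Variant 3: fixed-size buffer on the stack (assumes n <= 30000).
double norm1_stack(double** const A, const int n)
{
    if (n > 30000)
        return -1.;  // would overflow the fixed buffer
    double rowsums[30000];
    return row_sum_max(A, n, rowsums);
}
```

Note that moving the loops into a shared helper may change what the compiler can optimize, so this sketch only illustrates the allocation variants; it is not meant to reproduce the timings.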

With the Intel v13 C++ compiler there isn't a notable difference between the different methods (new vs stack vs calloc).

With the Intel v14 C++ compiler, however, the 'new' method is **10 times** slower on a Xeon Phi 5110P:

    # ./new_vs_malloc.icc13.mic
    ICC: 13.1
    Using default values: matrix dimension 5000
    Allocate memory 190.735 MB for storing the matrix
    Compute maximum norm
    Compute 1-norm (rowsums, stackvar)
    Compute 1-norm (rowsums, newvar)
    Compute 1-norm (rowsums, calloc)
    The norm are:
    - maximum norm 1.25075e+07 in 0.056958 sec (438.92 MFlops)
    - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0571316 sec (437.586 MFlops)
    - 1-norm (rowsums, newvar) 1.25075e+07 in 0.0569432 sec (439.034 MFlops)
    - 1-norm (rowsums, calloc) 1.25075e+07 in 0.0573016 sec (436.288 MFlops)

    # ./new_vs_malloc.icc14.mic
    ICC: 14
    Using default values: matrix dimension 5000
    Allocate memory 190.735 MB for storing the matrix
    Compute maximum norm
    Compute 1-norm (rowsums, stackvar)
    Compute 1-norm (rowsums, newvar)
    Compute 1-norm (rowsums, calloc)
    The norm are:
    - maximum norm 1.25075e+07 in 0.0569964 sec (438.624 MFlops)
    - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0570482 sec (438.226 MFlops)
    - 1-norm (rowsums, newvar) 1.25075e+07 in 0.562447 sec (44.4486 MFlops)
    - 1-norm (rowsums, calloc) 1.25075e+07 in 0.0571578 sec (437.386 MFlops)

Notice the 439.034 MFlops (v13) vs 44.4486 MFlops (v14) for the 'newvar' case.
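The reported MFlops figures appear consistent with counting one operation per matrix element, i.e. n*n divided by the elapsed time. The benchmark's actual formula is not shown in this post, so treat the following as a sketch under that assumption; the names mflops and slowdown are hypothetical.

```cpp
// Assumption: one floating-point operation is counted per matrix element,
// so an n x n matrix contributes n*n operations.
double mflops(const int n, const double seconds)
{
    return static_cast<double>(n) * n / seconds / 1.0e6;
}

// Ratio of two timings, e.g. the v14 'new' run against the v13 'new' run.
double slowdown(const double slow_seconds, const double fast_seconds)
{
    return slow_seconds / fast_seconds;
}
```

With the timings above, mflops(5000, 0.0569432) gives about 439.03 and mflops(5000, 0.562447) gives about 44.45, and slowdown(0.562447, 0.0569432) comes out at roughly 9.9, which is the "10 times" figure quoted earlier.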

There is also a performance penalty for this code when run on a regular CPU, but the size of the penalty depends greatly on the CPU model: Xeon E5's show a higher penalty than, for example, the ancient Core2 Duo E6550 inside my trusty old desktop.

What is going on here? It seems that simply using 'new' causes a performance penalty, and that the penalty also depends on the matrix size (this can be seen when running ./new_vs_malloc 5000 vs ./new_vs_malloc 20000).

The High Energy Physics code that I'm working with uses 'new' left and right throughout, so this could be (is) a major concern for us.

Thanks in advance for any info/pointers,

JJK / Jan Just Keijser