Intel C++ v13 vs v14, new vs malloc performance

Hi all,

I've run into a very strange compiler issue with some very basic HPC code (see the attached .cpp file). The code calculates the norm of a matrix and measures the performance of this calculation:

double norm1_row_wise(double** const A, const int n)
{
    double *rowsums= new double[n];
//   double rowsums[30000];
//   double *rowsums= (double *)calloc( n, sizeof(double ) );
    double max_norm=-1.;

    for (int i=0; i<n; ++i)
        rowsums[i]=0;
    
    for (int i=0; i<n; ++i)
        for (int j=0; j<n; ++j)
            rowsums[i] += abs(A[i][j]);

    for (int i=0; i<n; ++i)
        if (rowsums[i]>max_norm)
            max_norm= rowsums[i];

    return max_norm;
}

 

This function is called 5 times during the program, so it performs either 5 new's, 5 stack allocations or 5 callocs.

With the Intel v13 C++ compiler there is no notable difference between the different methods (new vs stack vs calloc).

With the Intel v14 C++ compiler, however, the 'new' method is 10 times slower on a Xeon Phi 5110P:

# ./new_vs_malloc.icc13.mic
ICC: 13.1
Using default values: matrix dimension 5000
Allocate memory 190.735 MB for storing the matrix
Compute maximum norm
Compute 1-norm (rowsums, stackvar)
Compute 1-norm (rowsums, newvar)
Compute 1-norm (rowsums, calloc)
The norm are:
 - maximum norm               1.25075e+07 in 0.056958 sec (438.92 MFlops)
 - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0571316 sec (437.586 MFlops)
 - 1-norm (rowsums, newvar)   1.25075e+07 in 0.0569432 sec (439.034 MFlops)
 - 1-norm (rowsums, calloc)   1.25075e+07 in 0.0573016 sec (436.288 MFlops)

# ./new_vs_malloc.icc14.mic
ICC: 14
Using default values: matrix dimension 5000
Allocate memory 190.735 MB for storing the matrix
Compute maximum norm
Compute 1-norm (rowsums, stackvar)
Compute 1-norm (rowsums, newvar)
Compute 1-norm (rowsums, calloc)
The norm are:
 - maximum norm               1.25075e+07 in 0.0569964 sec (438.624 MFlops)
 - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0570482 sec (438.226 MFlops)
 - 1-norm (rowsums, newvar)   1.25075e+07 in 0.562447 sec (44.4486 MFlops)
 - 1-norm (rowsums, calloc)   1.25075e+07 in 0.0571578 sec (437.386 MFlops)

 

Notice the 439.034 MFlops vs 44.4486 MFlops for the 'newvar' case.

There is also a performance penalty for this code when run on a regular CPU, but the difference depends greatly on the CPU model: Xeon E5's show a higher penalty than, for example, the ancient Core2 Duo E6550 inside my trusty old desktop.

What is going on here? It seems that simply using 'new' is causing a perf penalty that actually also depends on the matrix size (this can be seen when running ./new_vs_malloc 5000 vs ./new_vs_malloc 20000).

The High Energy Physics code that I'm working with uses 'new' left and right all throughout the code, so this could be (is) a major concern for us.

Thanks in advance for any info/pointers,

JJK / Jan Just Keijser

 

Attachments:
new_vs_malloc.cpp (9.75 KB)
new_vs_malloc.tar.gz (43.05 KB)

It may well be completely irrelevant, but there's a huge memory leak in the code that uses "new", since it never does the corresponding "delete". You're also not checking the obvious version that simply has

double rowsums[n];

Also, the "calloc" case doesn't need to explicitly zero the memory, since calloc already does that according to the calloc man page:

 calloc() allocates memory for an array of nmemb elements of size bytes each and returns a pointer to the allocated memory.  The memory is set to zero.
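To illustrate the point about zeroing, here is a small sketch of the two allocation styles. The helper names are invented for this example. Note that calloc yields all-bits-zero memory, which reads as 0.0 for IEEE 754 doubles (an assumption that holds on the platforms discussed here), while a plain `new double[n]` leaves the elements uninitialized, so the explicit zeroing loop is still needed in that case unless the array is value-initialized with `new double[n]()`.

```cpp
#include <cstddef>
#include <cstdlib>

// calloc hands back zero-filled memory; release it with std::free.
double* alloc_with_calloc(std::size_t n)
{
    return static_cast<double*>(std::calloc(n, sizeof(double)));
}

// The trailing () value-initializes every element to 0.0;
// release it with delete[].
double* alloc_with_new_zeroed(std::size_t n)
{
    return new double[n]();
}

// Returns true if the first n elements of p all compare equal to 0.0.
bool all_zero(const double* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] != 0.0)
            return false;
    return true;
}
```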

Hi James,

thanks for the reply. In my post I've been a bit quick - if you read the full new_vs_malloc.cpp code you'll see that the 'new' version does have a 'delete' at the end.
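For readers worried about the new/delete pairing, one common way to make a leak impossible is to let a std::vector own the helper array. The following is only a sketch: the function name `norm1_row_wise_vec` is made up, it assumes n >= 0, and this variant was not part of the measurements in this thread, so nothing is claimed about how it performs with any of the compilers discussed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Same computation as norm1_row_wise, but the helper array is owned by a
// std::vector, which zero-fills it here and frees it on return.
// Assumption: n >= 0 and A points to n rows of n doubles each.
double norm1_row_wise_vec(double** const A, const int n)
{
    std::vector<double> rowsums(static_cast<std::size_t>(n), 0.0);
    double max_norm = -1.;

    // accumulate the absolute values of each row
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            rowsums[i] += std::fabs(A[i][j]);

    // pick the largest row sum
    for (int i = 0; i < n; ++i)
        if (rowsums[i] > max_norm)
            max_norm = rowsums[i];

    return max_norm;
}
```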

As for a

double rowsums[n];

version: it shows how much of a C programmer I am, instead of a C++ programmer ;) (the double rowsums[30000] version was the equivalent of this)

I've created a 'double rowsums[n]' version with the Intel C++ v14 compiler:

ICC: 14
Using default values: matrix dimension 5000
Allocate memory 190.735 MB for storing the matrix
Compute maximum norm
Compute 1-norm (rowsums, stackvar)
Compute 1-norm (rowsums, newvar)
Compute 1-norm (rowsums, calloc)
The norm are: 
 - maximum norm               1.25075e+07 in 0.0568136 sec (440.036 MFlops)
 - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0569584 sec (438.917 MFlops)
 - 1-norm (rowsums, newvar)   1.25075e+07 in 0.585085 sec (42.7288 MFlops)
 - 1-norm (rowsums, calloc)   1.25075e+07 in 0.0569826 sec (438.73 MFlops)

 

"it shows how much of a C programmer I am, instead of a C++ programmer ;)"

That's not a very good excuse :-), variable length arrays like this have been in C since the C99 standard! (Though, admittedly, C11 seems to have made them an optional feature of an implementation, so for ultimate portability you're justified in sticking with alloca to achieve the same effect of on-stack allocation).

None of which addresses your real question, I admit :-(

What you may be observing is the latency (overhead) of "First Touch". This is the mapping of the virtual memory address to either physical address or page file address on the first time the address is touched (read or written) since the process was created.

A "simple" verification of this is to switch the order in which you perform the newvar and calloc tests. Or...

Place a loop around all three tests, and run them 5 times. Often the first pass will encounter such overhead.

Jim Dempsey

www.quickthreadprogramming.com
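The suggestion of looping over the tests several times could be sketched as follows. `time_passes` is a hypothetical helper, not something from new_vs_malloc.cpp; it assumes passes >= 0 and uses std::chrono::steady_clock for wall-clock timing. If first-touch overhead were the cause, the first entry of the returned vector would be expected to stand out from the later ones.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: call fn() `passes` times and return the wall-clock
// time of each individual pass in seconds, in the order they were run.
template <typename F>
std::vector<double> time_passes(F&& fn, int passes)
{
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(passes));

    for (int p = 0; p < passes; ++p) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

A call such as `time_passes([&] { result = norm1_row_wise(A, n); }, 5)` would then give five separate timings for one variant, where `result`, `A` and `n` are whatever the surrounding benchmark already defines.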

I'd already checked 'first touch' and alignment issues. It still wouldn't explain to me why icc v13 would be so much quicker than icc v14.

To make things stranger, I added a 'printf' line to the function to print out the memory address of the allocated variable, and the performance difference is gone:

double norm1_row_wise2(double** const A, const int n)
{
    double *rowsums= new double[n];
    double max_norm=-1.;

    printf("*rowsums = %p\n", rowsums);

    for (int i=0; i<n; ++i)
        rowsums[i]=0;

    //calculate the rowsums using the helper array
    for (int i=0; i<n; ++i)
        for (int j=0; j<n; ++j)
            rowsums[i] += abs(A[i][j]);

    for (int i=0; i<n; ++i)
        if (rowsums[i]>max_norm)
            max_norm= rowsums[i];

    delete[] rowsums;

    return max_norm;
}

 

Yes, that is the only change:

ICC: 14
Using default values: matrix dimension 5000
Allocate memory 190.735 MB for storing the matrix
Compute maximum norm
Compute 1-norm (rowsums, stackvar)
Compute 1-norm (rowsums, newvar)
*rowsums = 0xd65d6e0
*rowsums = 0xd65d6e0
*rowsums = 0xd65d6e0
*rowsums = 0xd65d6e0
*rowsums = 0xd65d6e0
Compute 1-norm (rowsums, calloc)
The norm are: 
 - maximum norm               1.25075e+07 in 0.056805 sec (440.102 MFlops)
 - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0572608 sec (436.599 MFlops)
 - 1-norm (rowsums, newvar)   1.25075e+07 in 0.0571996 sec (437.066 MFlops)
 - 1-norm (rowsums, calloc)   1.25075e+07 in 0.0573366 sec (436.022 MFlops)

 

I'm tempted to dissect the assembly code that the compiler produces. Back in the old days I could read i486 assembly code just fine, but unfortunately I cannot read the Xeon Phi code.

 

No more responses.

Hmmmm, can I then assume that this is a (confirmed) compiler bug that hopefully will be fixed in an upcoming release?

 

I will try reproducing your findings and post another update after I know more. Sorry for the delay.
 

I reproduced your findings using 13.1 (13.1.3.192) and 14.1 (14.0.4.211). This 14.1 version I tested is the latest Composer XE 2013 SP1 Update 4 and final planned update for that release.

I also found the issue appears to already be fixed in our newest major release, Intel Parallel Studio XE 2015 (with the 15.0 compiler - 15.0.0.090 Build 20140723), released yesterday.

$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.090 Build 20140723
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

$ icc -mmic new_vs_malloc.cpp -o icc15.exe


[root@mic0]# ./icc15.exe
ICC: 15
Using default values: matrix dimension 5000
Allocate memory 190.735 MB for storing the matrix
Compute maximum norm
Compute 1-norm (rowsums, stackvar)
Compute 1-norm (rowsums, newvar)
Compute 1-norm (rowsums, calloc)
The norm are:
 - maximum norm               1.25075e+07 in 0.0527574 sec (473.868 MFlops)
 - 1-norm (rowsums, stackvar) 1.25075e+07 in 0.0528054 sec (473.437 MFlops)
 - 1-norm (rowsums, newvar)   1.25075e+07 in 0.0529492 sec (472.15 MFlops)
 - 1-norm (rowsums, calloc)   1.25075e+07 in 0.0531192 sec (470.639 MFlops)

I will see whether a specific fix in the 15.0 compiler can be attributed to the issue. Since Update 4 is the final planned CXE 2013 SP1 update, there is no possibility of a fix in that release. Perhaps you can use the workaround you noted until it's convenient for you to upgrade to Intel Parallel Studio XE 2015.

 
