Memset/cpy slower in icpc-17 than icpc-12

Hi all,

   We've noticed some strange behaviour with the Intel C++ compiler that we cannot explain. Our project is currently compiled with the Intel Compiler 12.1.0258 and we are looking at making performance improvements. One area we have identified is memory intensive parts of the code, where memset and memcpy can become performance bottlenecks. We develop for Mac, Linux and Windows and are aware that due to the Mac's system architecture memcpy is typically slower on this platform than Linux. I have noticed some improvements in Mac OS with Intel 17, however some very odd behaviour with Linux.

Below is the C++ example I have been using for timing tests:

#include <iostream>
#include <sys/time.h>
#include <cstdlib>
#include <cstring>

#define N_BYTES 1073741824

typedef long long ll;

long long current_timestamp() {
    struct timeval tv;
    gettimeofday(&tv, NULL); // get current time
    ll milliseconds = tv.tv_sec * 1000LL + tv.tv_usec / 1000;
    return milliseconds;
}

int main(int argc, char** argv) {
	//Alloc some memory
	char* mem_location = (char*) malloc(N_BYTES);

	//'Warm' the memory so the pages are actually mapped in
	memset(mem_location, 0, N_BYTES);

	//Now time how long it takes to set it to something else
	ll memset_begin = current_timestamp();
	memset(mem_location, 1, N_BYTES);
	ll memset_time = current_timestamp() - memset_begin;
	std::cout << "Memset of " << N_BYTES << " took " << memset_time << "ms\n";

	free(mem_location);
	return 0;
}

The output is as follows:

(after sourcing the relevant compilervars for Intel 17)

$ icpc -v

icpc version 17.0.1 (gcc version 4.4.7 compatibility)

$ icpc MemsetTime.cpp -o MemsetTime-icpc17

$ ./MemsetTime-icpc17

Memset of 1073741824 took 143ms

$ g++ -v

gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)

$ g++ MemsetTime.cpp -o MemsetTime-gpp4.4.7

$ ./MemsetTime-gpp4.4.7

Memset of 1073741824 took 50ms


...and in a fresh shell, with Intel 12 (again sourcing the relevant compilervars):

$ icpc -v

icpc version 12.1.0 (gcc version 4.4.7 compatibility)

$ icpc MemsetTime.cpp -o MemsetTime-icpc12

$ ./MemsetTime-icpc12

Memset of 1073741824 took 50ms


So it would appear that on Linux, the Intel 17 compiler uses an implementation of memset that takes nearly 3 times as long to run as gcc's and Intel 12's! After looking in VTune, I see that Intel 12 is using '__intel_fast_memset' while Intel 17 is using '__intel_avx_rep_memset'. Stranger still, when I compile under icpc 17 with -g, the timing goes down to 50ms and it uses '__GI_memset'.


Has anyone else experienced this? Is there something in my timing logic that does not accurately represent the time taken by memset? These examples are all on Linux; on macOS the timings are consistently around 50ms.





>>// Alloc some memory
>>char* mem_location = (char*) malloc(N_BYTES);

1. It is not clear what compiler options you've used.

2. Did you try to allocate the memory with the _mm_malloc intrinsic function, which supports allocation aligned on a given boundary?

3. For very precise performance measurements it is better to use the RDTSC or RDTSCP instructions.

>>...when I compile under icpc17 with -g, the timing goes down to 50ms, and uses _GI_memset...

Many things are different in a Debug configuration, and you should spend your time on performance measurements in a Release configuration. (Note that with icc, passing -g without an explicit -O level typically lowers the default optimization, which would explain the fall-back to the plain memset call.)

In a Debug build the compiler will emit buffer-overflow checking code after the function prolog, so take that into account when planning performance testing of function execution time.
