~10% performance degradation on string conversion routines versus gcc (several versions)

Greetings,

I am trying to compare the performance of a set of very simple
string to number (integer/double) conversion routines compiled
with ICC and GCC. For example, take the string to integer conversion
code in minisat.

I tried several flag settings for both ICC and GCC and found
that the code produced with ICC consistently ran from 5%
to 10% slower than the same code compiled with GCC.

For ICC I have tried -O3, -O1, -Os.
For GCC (several versions up to 4.3.3) I used -O3.

What flag settings would you suggest for ICC for this type
of code?

Thanks in advance,

Carlos

Several factors you haven't mentioned would make a lot of difference. 32-bit gcc typically uses in-lined rep string moves, which may be faster for certain moderate string lengths. glibc 2.8 (and partial back-ports) improved the performance of certain non-inlined string moves, particularly for 64-bit Linux. If there are optimizations to be gained by run-time analysis of the code, PGO may help.

Could you please share the test case source to try it out?

Quoting - tim18
Several factors you haven't mentioned would make a lot of difference. 32-bit gcc typically uses in-lined rep string moves, which may be faster for certain moderate string lengths. glibc 2.8 (and partial back-ports) improved the performance of certain non-inlined string moves, particularly for 64-bit Linux. If there are optimizations to be gained by run-time analysis of the code, PGO may help.

Hi Tim,

I am using 64-bit Linux and 64-bit gcc, and my strings are relatively short (~10 characters long).

Regarding glibc 2.8, I think that explains my previous post. However, this was not a strcpy test but a
string to integer conversion test, i.e. take a char, check whether it is between '0' and '9', subtract '0', and add it to acc*10.

I need to strip out some stuff and will post the example and compilation options for both builds later tonight.

Thanks again,

Carlos

If icc is auto-vectorizing a loop which you have written, on the mistaken assumption that it has a typical length of 100, this could be an undesirable "optimization."
icc 9.1 had a loop count directive to alert the compiler to the loop length[s] for which you wish code to be optimized. Also, the -O1 option enabled auto-vectorization with minimum unrolling. Since 10.0, -O1 implies -vec- (no auto-vectorization). In all versions, #pragma novector is available to prevent vectorization of a loop.
I think I mentioned the profile guided optimization as a possible alternative to persuade icc to optimize for actual loop lengths. I'm not fanatic enough to recommend it, but it seems necessary to consider if you have a version of icc where #pragma loop count isn't effective.
If you used gcc and didn't set any of the options for loop unrolling, you are optimizing by default for a loop length much less than 100. For x86_64, I try to make gcc unrolling more consistent with the icc preference for loop length 100 by setting
-funroll-loops --param max-unroll-times=4 -O3 (using a current version of gcc, where -O3 implies -ftree-vectorize).
I mention this only to emphasize that you may have chosen gcc options which default to optimization for a short loop, while icc auto-vectorization requires a long loop length, such as 24 (longer, if it involves sum accumulation), before it performs well.
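
To make the pragma advice above concrete, here is a minimal sketch of a short digit-accumulation loop annotated for icc. It assumes the Intel spellings `#pragma loop_count` and `#pragma novector`; the accepted forms vary between icc versions, so check the documentation for yours. The function name `parse_digits` is hypothetical, and the pragmas are guarded so that other compilers skip them.

```cpp
#include <cstddef>

// Hypothetical helper: accumulate the leading decimal digits of `s` into *out.
// Returns the number of digits consumed. Overflow is not checked.
std::size_t parse_digits(const char* s, long* out) {
    long val = 0;
    std::size_t n = 0;
#if defined(__INTEL_COMPILER)
    // Assumed Intel-specific hints: expect roughly 10 iterations,
    // and do not auto-vectorize this loop.
#pragma loop_count(10)
#pragma novector
#endif
    while (s[n] >= '0' && s[n] <= '9') {
        val = val * 10 + (s[n] - '0');
        ++n;
    }
    *out = val;
    return n;
}
```

Comparing the generated assembly with and without the hints is the quickest way to see whether your icc version honors them for this loop.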

Jennifer J. (Intel)'s picture

Which version of icc are you using? Did you try the /Qx option?

Jennifer

Hi Jennifer,

What is the equivalent linux option?

Thanks,

Carlos

Dale Schouten (Intel)'s picture

-x is the equivalent of /Qx on Windows. You can look at "icc -help" to see some of the options (e.g. -xSSE4.1), but the easiest thing to try is -xHost, which will "generate specialized code" for the processor you're actually compiling on.

Dale

#include <stdio.h>
#include <time.h>

inline double current_time()   // wall-clock seconds from the POSIX real-time clock
{
    struct timespec tp ;
    clock_gettime(CLOCK_REALTIME, &tp) ;
    return ( (double) tp.tv_sec + (double) tp.tv_nsec * 1e-9 ) ;
}

template <typename A>
int atoi_core(char const * str, A * num, char const ** endptr) {   // returns 0 on success, -1 on bad input
    if (str == 0) return -1;

    char const * in = str ;
    A    val = 0;
    int  neg = 0;

    if      (*in == '-') neg = 1, ++in;
    else if (*in == '+') ++in;
    if (*in < '0' || *in > '9') return -1 ;

#pragma no vector   // as posted; icc does not recognize this spelling (its pragma is "novector")
    while (*in >= '0' && *in <= '9') {
       val = val*10 + (*in - '0') ;   // accumulate one decimal digit
       ++in ;
    }

    *num =  neg ? -val : val ;

    if (endptr)
        *endptr = in ;   // first character after the digits

    return 0 ;
}

template <typename A>
void test_atoi(char const * num, long const N, char const * name) {
    long acc = 0 ;
    double const start_time = current_time() ;
    A value = 0 ;
    char const * end = 0 ;

    for (int n = 0 ; n < N ; ++n) {
        atoi_core(num, &value, &end) ;
        acc += value ;   // keep the result live so the loop is not removed
    }
    const double elapsed = current_time() - start_time ;
    printf("%ld %s %gs\n", acc, name, elapsed) ;
}

int main() {
    char num[] = "1235139";
    long const N = 10000000l ;
    test_atoi<int>(num, N, "atoi_core") ;    // template arguments were lost in posting;
    test_atoi<long>(num, N, "atoi_core") ;   // int and long are inferred from the timings reported
}


Here is a stripped version of the code. I compiled the version above with Cygwin gcc on my laptop, so hopefully it should compile for you (I am doing this over a slow VNC connection while on vacation).

The times I reported were obtained on 64bit Linux on AMD and Intel boxes.
I compiled this with:

ICCFLAGS = -O3 -Wall -g -nolib-inline
GCCFLAGS = -O3 -Wall -g

and tried several other settings, but gcc was faster for either the int or the long version. Here are some results:

gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9): atoi 0.110044s, atol 0.0955102s
GCC 4.3.3: atoi 0.102144s, atof 0.0918192s
ICC 11.081: atoi 0.105865s, atol 0.10049s

In the original code the atoi_core implementation is in a different cpp as I was hoping to avoid having the
compiler "cheat" by using the fixed test string length etc.

Tim: What timer do you suggest to use instead?

Thanks

I'm not enough of a C++ and unistd guru to guess what you omitted here; a compilable example would be needed to make progress.
Considering the inevitable differences due to alignments and cache warmth, I'm not prepared to be convinced by differences of less than 10 milliseconds, even if a more repeatable timer were used.
You could play games as to whether one compiler or the other generated faster code for your unrepresentative test case when written with & or &&. You might argue that the compiler can see the length of your string and so should not "optimize" for a longer one, but that doesn't fall far short of arguing that the compiler could "cheat" by shortcutting your example.

I could not compile the testcase even with gcc. Could you please fix the code and post again?

Hi Om, Hi Tim,

I have updated the code. Sorry about the first post; I was trying to copy/paste and edit the code over
a very slow connection that kept acting up quite badly :-(.

In my test I have the atoi_core stuff in a different .cpp to try to prevent the compiler from "cheating".

Thanks again,

Carlos

Hi Dale/Jennifer,

Sadly I can only use up to SSE2 for this.

Best regards,

Carlos

Dale Schouten (Intel)'s picture

Well, I just tried the corrected test case above with gcc and icc 11.0 (note that you need to add "-lrt" to the end of the command line to get it to link) and my numbers are the opposite of yours, if I'm understanding correctly:

$ g++ -O3  -g bug2.cpp -lrt -o bug2_gcc
$ icpc bug2.cpp -O3 -g -nolib-inline -lrt -o bug2_icc
bug2.cpp(23): warning #161: unrecognized #pragma
  #pragma no vector        
          ^

$ ./bug2_gcc
-935943296 atoi_core 0.713818s
-935943296 atoi_core 0.733898s
$ ./bug2_gcc
-935943296 atoi_core 0.681072s
-935943296 atoi_core 0.75437s
$ ./bug2_gcc
-935943296 atoi_core 0.639332s
-935943296 atoi_core 0.700725s
$ ./bug2_icc
-935943296 atoi_core 0.551402s
-935943296 atoi_core 0.614372s
$ ./bug2_icc
-935943296 atoi_core 0.539228s
-935943296 atoi_core 0.596264s
$ ./bug2_icc
-935943296 atoi_core 0.528656s
-935943296 atoi_core 0.617727s
$ gcc --version
gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.1)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ icc -V
Intel C Compiler Professional for applications running on IA-32, Version 11.0    Build 20090609 Package ID: l_cproc_p_11.0.084
Copyright (C) 1985-2009 Intel Corporation.  All rights reserved.

I'm not sure why my results are different, but Tim's point about the timing mechanism may be relevant. You could also increase the amount of work done so that the timer resolution is more appropriate.

Dale
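
One way to act on the timer-resolution point is to repeat the measurement and keep the fastest run, so that each timed interval is long relative to the clock's granularity and one-off disturbances are filtered out. The following is only a sketch under stated assumptions: it uses C++11 `<chrono>` (newer than the compilers discussed in this thread), and `best_of` and `sum_work` are hypothetical names, not part of the posted test case.

```cpp
#include <chrono>

// Hypothetical harness: call work(iterations) `repeats` times and return the
// fastest wall-clock time in seconds. Returns -1.0 if repeats <= 0.
// Taking the minimum filters out runs disturbed by cold caches or scheduling.
template <typename F>
double best_of(int repeats, long iterations, F work) {
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work(iterations);
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}

// Placeholder workload: sum 0..n-1 through a volatile sink so that the
// compiler cannot delete the loop.
inline void sum_work(long n) {
    volatile long sink = 0;
    for (long i = 0; i < n; ++i) sink = sink + i;
}
```

A call such as `best_of(5, 100000000L, sum_work)` would time five runs of the placeholder; in practice the conversion loop would take the place of `sum_work`, with N raised until a single run lasts well above the timer resolution.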

Hi Dale,

Can you try this on a 64 bit system? All my experiments were on 64 bit.

I will try running my tests on 32 bit.

The amount of work can be increased by increasing N.

I am also not seeing as much variation from run to run.

Thanks again,

Carlos

bustaf's picture

Hi Carlos,
Your observation may well be true, but you should build a complete project before
drawing conclusions about the real value of the ICC compiler.
I like GNU GCC, and I hope that in the future it can match ICC's performance.
For analysis, trace the server load, or use or insert appropriate loops (ones that can be vectorized) in your sources. Also test on an IBM blade server (14 processors) or on smaller machines; I think you may be surprised by the results.
All potential leaks are reduced, and if you want, it is not obligatory to use the slow string class members.
Remark: use the latest GCC snapshot; its speed has increased considerably. (GCC is also a very good product.)

Context: for fellow Linux users.
Warning: recent Linux distributions whose kernels use udev modules have shown several problems with some old drivers
(makenode) and with superficially imposed added rules.
It is better not to use an old machine;
problems of this type can be serious.
Read carefully what you see during the boot sequence.

I think it would be good if you reported the difference you observed
between 64-bit and 32-bit (GCC or ICC), along with the exact hardware configuration
used.
Best regards

Replying to 2 of the recent posts:
gettimeofday() (for Linux), Windows QueryPerformanceCounter(), omp_get_wtime() (the likely equivalent when using OpenMP), or the less portable but higher-resolution rdtsc are more frequently chosen timers. Intel and Microsoft compilers support a built-in __rdtsc() intrinsic, which relieves you from considering the 32- vs 64-bit dependencies. Under gcc, you must observe the differing inline-asm sequences for 32- or 64-bit rdtsc.
As Bustaf said, the most recent gcc-4.5 has shown improved performance, particularly with -mtune=barcelona, for the current AMD and Intel CPUs.
SunStudio compilers will show best performance of all compilers in a few examples, but less performance than icc or gcc in others.
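
As an illustration of the first timer mentioned above, here is a minimal sketch of a gettimeofday()-based replacement for the current_time() helper in the test case. The name `wall_seconds` is hypothetical, and the sketch assumes a POSIX system that provides `<sys/time.h>`.

```cpp
#include <sys/time.h>

// Wall-clock time in seconds with microsecond granularity (POSIX).
// Hypothetical drop-in for current_time(); no -lrt is needed at link time.
double wall_seconds() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return static_cast<double>(tv.tv_sec)
         + static_cast<double>(tv.tv_usec) * 1e-6;
}
```

Because gettimeofday() reports calendar time, a clock adjustment during the run can distort a measurement, which is one more reason to repeat the timing several times.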

jimdempseyatthecove's picture

You might try this revision of your code

template <typename A>
int atoi_core(char const * str, A * num, char const ** endptr) {
    if (str == 0) return -1;

    char const * in = str ;
    A    val = 0;
    int  sign = 1;

    if      (*in == '-') sign = -1, ++in;
    else if (*in == '+') ++in;
    if (*in < '0' || *in > '9') return -1 ;
    val = (*in - '0') ;   // first digit: no multiply needed
    ++in ;

#pragma no vector   // as posted; icc does not recognize this spelling (its pragma is "novector")
    while (*in >= '0' && *in <= '9') {
       val = val*10 + (*in - '0') ;
       ++in ;
    }

    *num =  val * sign;   // multiply by the sign instead of branching

    if (endptr)
        *endptr = in ;

    return 0 ;
}

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove's picture

I forgot to ask

Would "-1" be a valid input? If so then "junk"=="-1".

You might want to consider a different value for invalid input

e.g. const int InvalidInput = 0x80000000; // (-0)

Jim Dempsey

www.quickthreadprogramming.com
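
A sentinel of the kind suggested above only matters for a variant of the routine that returns the parsed value directly, rather than through the `num` pointer with a separate status code. The sketch below shows one such variant. It is an illustration, not the posted code: `parse_int_or_invalid` is a hypothetical name, and the sentinel is written as `std::numeric_limits<int>::min()` instead of the literal 0x80000000.

```cpp
#include <limits>

// Sentinel for "not a number": the most negative int. With this choice the
// most negative int itself can no longer be returned as a valid result.
const int InvalidInput = std::numeric_limits<int>::min();

// Hypothetical value-returning variant: parse an optionally signed decimal
// integer and return InvalidInput on bad input. Overflow is not checked,
// as in the code posted above.
int parse_int_or_invalid(const char* str) {
    if (str == 0) return InvalidInput;
    const char* in = str;
    int sign = 1;
    if (*in == '-') { sign = -1; ++in; }
    else if (*in == '+') { ++in; }
    if (*in < '0' || *in > '9') return InvalidInput;
    int val = 0;
    while (*in >= '0' && *in <= '9') {
        val = val * 10 + (*in - '0');
        ++in;
    }
    return val * sign;
}
```

With this variant, the input "-1" yields -1 while "junk" yields InvalidInput, so the two cases stay distinguishable.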
