support of gcc vector extensions required

Hello,

I have a code that uses gcc vector extensions and achieves 80% of the nominal peak performance with AVX vectors.
The gcc vector extensions allow me to write explicitly vectorized code with only very few intrinsics (sums, products and the like are simply written a+b, a*b, etc.).
The performance is awesome, but when compiled with icc, the code falls back to scalar data types, and the performance is horrible: nearly four times slower (reaching 20% of the nominal peak performance).
Clearly, despite trying hard, icc is not able to vectorize my inner loops correctly.

It would not bother me that much, because gcc is available almost everywhere. However, I'm currently trying to run this on the MIC (Xeon Phi), where I have to go through icc, which leads to poor vectorization and poor performance. (20% of the peak of the MIC will be less than 80% of the peak of a 16-core AVX machine...)

Please, support gcc vector extensions in icc !

http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
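
For readers unfamiliar with the extension, here is a minimal sketch of the coding style described in this post, assuming GCC or a compatible compiler. The type name v4d and the function axpy_v4d are illustrative only and are not taken from SHTns.

```c
#include <stddef.h>
#include <string.h>

/* Four doubles per vector: 32 bytes, the width of one AVX register. */
typedef double v4d __attribute__ ((vector_size (32)));

/* y[i] = a*x[i] + y[i] for n elements; n is assumed to be a multiple of 4. */
void axpy_v4d(size_t n, double a, const double *x, double *y)
{
    const v4d va = {a, a, a, a};
    for (size_t i = 0; i < n; i += 4) {
        v4d vx, vy;
        memcpy(&vx, x + i, sizeof vx);   /* load without alignment assumptions */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;               /* plain operators, applied element-wise */
        memcpy(y + i, &vy, sizeof vy);
    }
}
```

The arithmetic line contains no intrinsic at all; the compiler picks the vector instructions available on the target.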


Hi,

Thank you for submitting the issue. I will file a feature request for you.

Thanks,

Shenghong

>>...The performance is awsome, but when compiled with icc, it falls back to scalar data-types, and it turns out that
>>the performance is horrible, nearly four times slower (reaching 20% of the nominal peak performance).
>>Clearly despite trying hard, icc is not able to vectorize my inner loops correctly.

I don't think that you've used all the capabilities of the Intel C++ compiler, and you have not provided any test cases with performance numbers. At the same time, I admit that GCC-like C++ compilers, for example MinGW for Windows, do a good job when it comes to performance.

I just completed a verification with version 12 (using the latest update) of the Intel C++ compiler, and here are the technical details with real performance numbers:

Vec_samples.zip was used from ..\Composer XE\Samples\en_US\C++ folder ( for a Windows platform )

[ Test 1 - Generic settings - Release ]

ROW:256 COL: 256
Execution time is 12.750 seconds
GigaFlops = 0.673720
Sum of result = 1279224.000000

[ Test 2 - Vectorization & Alignment & Inlining & IPO & /O3 are used - Release ]

ROW:256 COL: 256
Execution time is 4.734 seconds
GigaFlops = 1.814519
Sum of result = 1279224.000000

As you can see, Test 2 is ~2.7 times faster than Test 1.

Hello Sergey,

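I don't see how your test case relates to icc's lack of gcc vector extensions. Anyway, you're right, I have not provided a test case. So here we go with SHTns, a real-world library used in high-performance computing.

- download the latest version : https://bitbucket.org/nschaeff/shtns/get/default.zip
- ./configure --enable-openmp --enable-mkl CC=gcc
- make
- make time_SHT
- ./time_SHT 1023 -fly -iter=5 -oop

This last line will run a timing program which will display several execution times.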

Repeat by replacing CC=icc when running the configure script. Feel free to modify the Makefile with any options you'd like.
See how the executable produced by gcc is about 3 times faster than icc (assuming you have an AVX cpu).

Cheers.

>>So here we go with SHTns, a real-world library used in high-performance computing.
>>
>>- download the latest version : https://bitbucket.org/nschaeff/shtns/get/default.zip
>>- ./configure --enable-openmp --enable-mkl CC=gcc
>>- make
>>- make time_SHT
>>- ./time_SHT 1023 -fly -iter=5 -oop
>>
>>This last line will run a timing program which will display several execution times.
>>
>>Repeat by replacing CC=icc when running the configure script. Feel free to modify the Makefile with any options you'd like.
>>See how the executable produced by gcc is about 3 times faster than icc (assuming you have an AVX cpu).

Thanks for the information about SHTns. I think the engineers on the Intel C++ compiler team should look at the library and its test cases, because a 3x difference is impressive.

And if you did some measurements, please post the results in order to demonstrate to everybody that there are significant improvements in the code generation of the GCC C++ compiler.

Hi Nathanael,

Thank you for your update. The Intel compiler actually does support some vector operations.
For example:

-bash-4.1$ cat > gnu.c

typedef int v4si __attribute__ ((vector_size (16)));
v4si a, b, c;

void func()
{
    c = a + b;
}
^C
-bash-4.1$ icc -S gnu.c

ICC generates the following code:

func:
..B1.1:                        # Preds ..B1.0
..___tag_value_func.1:                                          #6.1
        movdqa    a(%rip), %xmm0                                #7.6
        paddd    b(%rip), %xmm0                                #7.14
        movdqa    %xmm0, c(%rip)                                #7.6
        ret                                     

The list of supported expressions grows from older ICC versions to newer ones.

The 13.0 compiler has only initial support for vector operations: +, -, *, unary -, /
The 14.0 compiler supports more operations (such as comparisons) and more types of vectors (vector_length x vector_element).
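
As an illustration of the five operators listed for the 13.0 compiler, here is a small sketch that exercises each of them on the v4si type from the example above. The function name basic_ops is hypothetical, and the sketch only shows the GCC syntax; it makes no claim about which ICC version accepts which line.

```c
typedef int v4si __attribute__ ((vector_size (16)));

/* Apply each of the five operators element-wise.
   Every lane of b must be nonzero because of the division. */
void basic_ops(v4si a, v4si b, v4si out[5])
{
    out[0] = a + b;
    out[1] = a - b;
    out[2] = a * b;
    out[3] = a / b;   /* integer division, lane by lane */
    out[4] = -a;      /* unary minus */
}
```

A file like this can serve as a quick compile check when trying a new compiler version.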

If you can provide a list of the vector operations you want supported first (or vector operations for which poor code is generated), that would help, as we can implement them according to the priorities in your list.

Regarding the SHTns test case, I will give it a try to see why ICC generates bad code.

Thanks,
Shenghong

Hi Nathanael,

I get the errors below while building your test case (.hg not found):
gcc -march=native -O2 -I/usr/local/include -L/usr/local/lib -ffast-math -fomit-frame-pointer -std=gnu99 -fopenmp  -D_GNU_SOURCE -L/usr/local/lib -I/usr/local/include -O2 -D_HGID_="\"`hg id -ti`\"" -c sht_init.c -o sht_init.o
abort: there is no Mercurial repository here (.hg not found)
In file included from sht_init.c:31:
sht_private.h:29:19: error: fftw3.h: No such file or directory

Any suggestions for this?

By the way, can this test case prove that the bad performance is caused by vector operations? Again, please list the vector operations that are used in your test case and cause bad performance. I want to check whether the code generated for these vector operations is correct and, if it is not, report the issues to the developers. First of all, I need to know whether the bad performance of ICC is due to the vector operations support or to other possible reasons.

Thanks,
Shenghong

Hi Shenghong,

>>...sht_private.h:29:19: error: fftw3.h: No such file or directory

If you have MKL on your development / test computer try to use fftw3.h from:

[ ICCInstallDir ]\Composer XE\MKL\Include\fftw

folder.

Hello Shenghong,

This is good news. I actually need only simple arithmetic operations, so the support in version 13 should be enough.
Is there documentation somewhere for this vector support ?

When changing your example file into :

#include <pmmintrin.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d a, b, c;
void func()
{
    a = _mm_set1_pd(1.0);
    c = a + b;
    b = c + _mm_set1_pd(1.0);
}

The last line " b = c + _mm_set1_pd(1.0);" reports an error, while the first line a = _mm_set1_pd(1.0); works well. Why is it so ?
What is the preferred way to set all elements of a vector to the same value ?

PS: for the reported error, Sergey gave the right answer, thank you Sergey.

Quote:

Sergey Kostrov wrote:

>>So here we go with SHTns, a real-world library used in high-performance computing.
>>
>>- download the latest version : https://bitbucket.org/nschaeff/shtns/get/default.zip
>>- ./configure --enable-openmp --enable-mkl CC=gcc
>>- make
>>- make time_SHT
>>- ./time_SHT 1023 -fly -iter=5 -oop
>>
>>This last line will run a timing program which will display several execution times.
>>
>>Repeat by replacing CC=icc when running the configure script. Feel free to modify the Makefile with any options you'd like.
>>See how the executable produced by gcc is about 3 times faster than icc (assuming you have an AVX cpu).

Thanks for the information about SHTns. I think the engineers on the Intel C++ compiler team should look at the library and its test cases, because a 3x difference is impressive.

And if you did some measurements, please post the results in order to demonstrate to everybody that there are significant improvements in the code generation of the GCC C++ compiler.

Here are the results I obtain with 16 threads on a 16-core SandyBridge machine (2.6 GHz if I recall correctly) :

  • ICC (auto-vectorization)
    ./time_SHT 1023 -fly -iter=5 -oop
    synthesis = 29ms, analysis = 27ms
  • GCC (vectorization using gcc vector extensions)
    ./time_SHT 1023 -fly -iter=5 -oop
    synthesis = 8.7ms, analysis = 8.4ms  [which corresponds to more than 80% of the peak flops]

With icc 13, I still cannot compile the vector extensions in SHTns due to a strange "internal error" reported by icc :

internal error: 0_1000_3

which does not help a lot...

Hi Shenghong,

As mentioned by Sergey, you have to add the FFTW include path, PATH_FFTW=.../composer_version/mkl/include/fftw, as -I$PATH_FFTW
You also need to comment out the #undef _GCC_VEC_ in sht_private.h in order to compile with the gcc vector extensions (the program detects whether it is compiled with icc or gcc and disables the gcc vector extensions for icc).

Errors still occur during compilation with icc and the gcc vector extensions.

Sincerely,

Vincent

Hi Vincent,

Thank you for this information. I will check according to your suggestions and post an update here.

Thanks

Shenghong

You have a compilation error because _mm_set1_pd doesn't return a value of type double. That intrinsic function is declared as follows:

[ emmintrin.h ]
...
extern __m128d __ICL_INTRINCC _mm_set1_pd( double );
...
and there is no declared / implemented C++ operator +.

Please try to use API from dvec.h header file because this is:
...
Definition of a C++ class interface to Intel(R) Pentium(R) 4 processor SSE2 intrinsics.
...

Quote:

Nathanael S. wrote:

Hello Shenghong,

This is good news. I actually need only simple arithmetic operations, so the support in version 13 should be enough.
Is there documentation somewhere for this vector support ?

When changing your example file into :

#include <pmmintrin.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d a, b, c;
void func()
{
    a = _mm_set1_pd(1.0);
    c = a + b;
    b = c + _mm_set1_pd(1.0);
}

The last line " b = c + _mm_set1_pd(1.0);" reports an error, while the first line a = _mm_set1_pd(1.0); works well. Why is it so ?
What is the preferred way to set all elements of a vector to the same value ?

PS: for the reported error, Sergey gave the right answer, thank you Sergey.

Please try something like: b = a + (v2d)_mm_set1_ps(200.0);

I have verified that GCC accepts the conversion without the cast, but ICC does not. I will report it to the developers to see whether it can be supported/fixed.

Note: I have verified that the conversion does not cause any error (I printed the results and all are as expected).

void func() {
// a=(v2d){1.0f,2.0f,3.0f,4.0f}; // works
    a = _mm_set1_ps(100.0);   // works
//  b = a + _mm_set1_ps(200.0);  // gcc works, icc not work
    b = a + (v2d)_mm_set1_ps(200.0);  // works
}

Feel free to let me know about all the unsupported cases of vector operations you encounter, as the developers may need specific test cases to fix them. It would save me some time finding the unsupported code in your test case, as your project has a lot of code. Much appreciated.

Thanks,

Shenghong

Hi Sergey,

Quote:

Sergey Kostrov wrote:

You have a compilation error because _mm_set1_pd doesn't return a value of type double. That intrinsic function is declared as follows:

I don't want a double, I want a vector of doubles. An explicit cast to my vector type (v2d) solves the compilation problem of this tiny example and generates the correct machine instructions. It is strange that this explicit cast is needed, though. All this shows that icc does indeed support vector extensions, which is great news. However, in a full-scale application (SHTns), the strange "internal errors" reported above make the compilation fail, and I have no clue what to do with these! So something seems broken...
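
To make the two ways of filling a vector concrete, here is a small sketch, assuming GCC-style vector extensions. The names vdup_ext, vdup_sse2 and lane_sum are hypothetical. The first variant uses no intrinsic and therefore needs no cast; the second wraps _mm_set1_pd in the explicit (v2d) cast discussed above and is only compiled when SSE2 is available.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef double v2d __attribute__ ((vector_size (16)));

/* Broadcast using only the vector extension: no intrinsic, no cast. */
static inline v2d vdup_ext(double x)
{
    return (v2d){x, x};
}

#if defined(__SSE2__)
/* Broadcast through the SSE2 intrinsic; the cast makes the
   __m128d -> v2d conversion explicit. */
static inline v2d vdup_sse2(double x)
{
    return (v2d)_mm_set1_pd(x);
}
#endif

/* Sum of the two lanes, read back through a union. */
double lane_sum(v2d v)
{
    union { v2d vec; double d[2]; } u = { v };
    return u.d[0] + u.d[1];
}
```

With either helper, an expression such as b = c + vdup_ext(1.0) has operands of the same vector type, so no implicit conversion between the intrinsic type and the user-defined type is involved.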

Quote:

Sergey Kostrov wrote:

and there is no declared / implemented C++ operator +.

You do not seem to know the vector extensions that are the heart of this discussion. Vector extensions DO define the natural behaviour of arithmetic operations. It is not C++, just an extension of C (you could compare it to what OpenCL does with vector types)

Quote:

Sergey Kostrov wrote:

Please try to use API from dvec.h header file because this is:
...
Definition of a C++ class interface to Intel(R) Pentium(R) 4 processor SSE2 intrinsics.

Thanks for the suggestion, but it is not suitable for at least two reasons:

  • it is a C++ class (my code is in plain C)
  • the header found in dvec.h says : "Speed and accuracy are sacrificed for utility." which completely defeats my purpose.

Quote:

shenghong-geng (Intel) wrote:

Please try something like: b = a + (v2d)_mm_set1_ps(200.0);

I have verified that GCC accepts the conversion without the cast, but ICC does not. I will report it to the developers to see whether it can be supported/fixed.

Feel free to let me know about all the unsupported cases of vector operations you encounter, as the developers may need specific test cases to fix them. It would save me some time finding the unsupported code in your test case, as your project has a lot of code. Much appreciated.

Thanks,

Shenghong

Thank you very much for your help.

Yes, using the explicit cast to (v2d) works in your example. I have added explicit conversions in my code, which suppresses all the errors.
However, the compilation still fails with "internal error 0_1000_3". No clue what that is... can you get more information on this ?

Hi Nathanael,

Great that you have added that to your code. I am doing the same work, but I have to update the casts one by one and recompile again and again... Is it possible for you to share the updated code with the explicit casts? I will then investigate the internal error you have.

By the way, below are the places where I have added explicit casts, but there are still some similar errors:

in sht_private.h:
 #define vdup(x) (v2d)_mm_set1_pd(x)

in sht_func.c(135, 137):
 ((v2d*)q0)[(ntheta-1-k)*lmax +(l-1)] += (v2d)_mm_xor_pd(sgnt, qc);
 ((v2d*)q0)[(ntheta+k)*lmax +(l-1)] += (v2d)_mm_xor_pd( sgnt, qc );

in sht_private.h:
 #define vxchg(a) (s2d)_mm_shuffle_pd(a,a,1)
 #define vall(x) (rnd)_mm256_set1_pd(x)
 _mm256_loadu_pd
 _mm256_set1_pd

Thanks,

Shenghong

Quote:

shenghong-geng (Intel) wrote:

Hi Nathanael,

Great that you have added that to your code. I am doing the same work, but I have to update the casts one by one and recompile again and again... Is it possible for you to share the updated code with the explicit casts? I will then investigate the internal error you have.

Sure, please find attached:

  • SHT.tar.gz should replace the content of the SHT sub-directory
  • sht_private.h should replace the file of the corresponding name.

The whole library is not converted yet, but if you type

"make sht_ltr.o"

you should read :

SHT/spat_to_SHst_fly.c(124) (col. 6): internal error: 0_1000_3

Your help is much appreciated.

And here are the files ...

Attachments:

  • sht.tar.gz (application/octet-stream, 57.93 KB)
  • sht-private.h (text/x-chdr, 13.8 KB)

Hi Nathanael,

I can reproduce the internal error with the 13.1 compiler. It seems to be related to the piece of code below (I expanded the macros):

v2d reo[2*NLAT_2];
double *zl;

q0  += ((double*)reo)[2*(i)] * vdup(zl[i]);

As I understand it, it is easy to build a small test case from just this code. But when I checked the issue with the 14.0 compiler (in beta now), it is fixed. Would it be possible for you to try the 14.0 beta and wait for the 14.0 release?
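
For reference, here is a sketch of what a reduced test built from that fragment might look like. It is an assumption-laden reconstruction: vdup is redefined here as a compound literal (the thread defines it with _mm_set1_pd), the function name accumulate is hypothetical, and it has not been verified to trigger the internal error in any ICC version. Mixing a scalar with a vector in one expression also requires a reasonably recent GCC.

```c
typedef double v2d __attribute__ ((vector_size (16)));

/* Stand-in for the SHTns vdup() macro, written as a compound
   literal so that no intrinsic header is needed. */
#define vdup(x) ((v2d){(x), (x)})

/* Accumulate (first lane of reo[i]) * zl[i] into both lanes of q0. */
v2d accumulate(const v2d *reo, const double *zl, int n)
{
    v2d q0 = {0.0, 0.0};
    for (int i = 0; i < n; i++) {
        /* scalar * vector: the scalar is applied to every lane */
        q0 += ((const double *)reo)[2*i] * vdup(zl[i]);
    }
    return q0;
}
```

The key ingredients of the original line are kept: a vector array read through a double pointer, a scalar-times-vector product, and a compound assignment into a vector accumulator.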

Thanks,

Shenghong 

Quote:

shenghong-geng (Intel) wrote:

the issue with 14.0 compiler (in beta now), it is fixed. is it possible you try the 14.0 beta, and wait for 14.0 release?

Ok, so I've made a few changes so that icc 14 can compile my code with the gcc vector extensions, on a 16 core SandyBridge machine.
Here are the timing results for a test case:  ./time_SHT 1023 -iter=5


GCC 4.4.6 :   synthesis = 7.6 ms     analysis = 7.3 ms
ICC 14.0  :   synthesis = 9.9 ms     analysis = 7.5 ms

As you can see from these numbers, ICC 14 with vector extensions is now almost on par with GCC for the analysis, but it is still 30% slower for the synthesis.

Anyway, compared to the auto-vectorized code (without vector extensions), we gained a factor 3 in performance.
We are now trying to port it to MIC and its 512 bit vectors, and we will report our findings.
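
The percentages quoted here follow from simple ratios of the timings. A tiny sketch of that arithmetic, with hypothetical helper names:

```c
/* How much slower t is than ref, in percent (positive means slower). */
double percent_slower(double t, double ref)
{
    return 100.0 * (t - ref) / ref;
}

/* Speedup factor obtained when going from t_old to t_new. */
double speedup(double t_old, double t_new)
{
    return t_old / t_new;
}
```

With the numbers above, percent_slower(9.9, 7.6) is a little over 30 for the synthesis and percent_slower(7.5, 7.3) is a little under 3 for the analysis. Taking the 29 ms synthesis time of the earlier auto-vectorized ICC run as the baseline, speedup(29.0, 9.9) is about 2.9.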

Two suggestions for you guys at intel:

  • the support for vector extensions should be mentioned somewhere on this page,
  • the need for explicit cast between vector types and _mmXXX types should be considered for removal.

Thank you for your help !

Thanks for these numbers.

>>GCC 4.4.6 : synthesis = 7.6 ms analysis = 7.3 ms
>>ICC 14.0 : synthesis = 9.9 ms analysis = 7.5 ms

For the analysis case, GCC outperforms ICC by 2.66% (almost 3%), and based on my statistics I would consider that acceptable. If the difference were greater than 5%, then improvements in code generation would really be needed.

Quote:

Nathanael S. wrote:

Two suggestions for you guys at intel:

  • the support for vector extensions should be mentioned somewhere on this page,
  • the need for explicit cast between vector types and _mmXXX types should be considered for removal.

Hi Nathanael,

Regarding your suggestions, I will submit a request to add a description to the documentation. Regarding the 2nd suggestion, I have already submitted it to the developers to fix. I will update you when it is fixed.

Also, for the performance issue, I will do some investigation, but if you can help (because you are more familiar with the code) by providing some smaller test cases that show the bad performance of ICC (related to a specific vector extension? or to a specific loop? etc.), it will be helpful. I will update you when I make some progress.

Thanks,

Shenghong

Hello,

An update on the matter: the previous results were obtained using -O3 optimization. When using -O2, the synthesis comes on par with gcc while the analysis slows down a little :

  • GCC -O3 : synthesis = 7.6ms, analysis = 7.3ms
  • ICC -O3 : synthesis = 9.9ms,  analysis = 7.5ms
  • ICC -O2 : synthesis = 7.6ms,  analysis = 8.6 ms

I suspect that with -O3 option, icc merges loops that should not be merged. If you want to look at the code generated, the relevant function name for synthesis is SH_to_spat_fly2_l

>>...I suspect that with -O3 option, icc merges loops that should not be merged...

You could compare the generated code. The Intel C++ compiler's /O3 option performs more optimizations than /O2, and in one case I reviewed recently, using /O3 did not improve performance at all.

Quote:

Nathanael S. wrote:

Hello,

An update on the matter: the previous results were obtained using -O3 optimization. When using -O2, the synthesis comes on par with gcc while the analysis slows down a little :

  • GCC -O3 : synthesis = 7.6ms, analysis = 7.3ms
  • ICC -O3 : synthesis = 9.9ms,  analysis = 7.5ms
  • ICC -O2 : synthesis = 7.6ms,  analysis = 8.6 ms

I suspect that with -O3 option, icc merges loops that should not be merged. If you want to look at the code generated, the relevant function name for synthesis is SH_to_spat_fly2_l

Beginning with 12.0 release, there is no apparent cost/benefit analysis for loop fusion, and "#pragma nofusion" was introduced as the option to place a barrier against fusion (replacing prior usage of "#pragma distribute point" for this purpose).

Fusion should be reported in opt-report.  Common cases where you should test with #pragma nofusion may be:

a) 2nd loop uses result of 1st loop with alignment offset.  Best is to adjust loops for alignment by explicit peeling.

b) fusion suppresses partial vectorization

c) .....
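
A minimal sketch of case (a), with a hypothetical function name. The placement of the pragma immediately before the second loop is an assumption based on the usual convention for Intel loop pragmas; the guard keeps other compilers from seeing it.

```c
#include <stddef.h>

/* Two adjacent loops; the second reads the first loop's result
   with an offset of one element. */
void scale_then_pair(size_t n, const double *in, double *tmp, double *out)
{
    if (n == 0)
        return;

    for (size_t i = 0; i < n; i++)
        tmp[i] = 2.0 * in[i];

    out[0] = tmp[0];
#if defined(__INTEL_COMPILER)
#pragma nofusion   /* ask icc not to fuse the following loop with its neighbor */
#endif
    for (size_t i = 1; i < n; i++)
        out[i] = tmp[i] + tmp[i - 1];
}
```

Comparing timings and the optimization report with and without the pragma shows whether fusion is what costs performance.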

Discussion of the beta compiler is discouraged on this forum.  You should report issues with it on premier.intel.com in hope they may be taken up later and perhaps bring them up when they appear in a release.

Hello,

Is there an mkl support for MIC in the latest compiler version (14 beta) ?
I can't find the path /opt/intel/composerxxx/mkl/lib/mic/ which was present in previous compiler versions.
So for the moment, the code compiles and works fine on CPU but can't be compiled with latest version for MIC.

What's funny is that apparently previous icc versions do compile the gcc vector extensions code for MIC since no errors are returned and executable is created and runs on MIC !

So earlier versions than 14 compile gcc vector extensions for MIC but not for CPU ? What's the trick ?

And 14 beta version doesn't have mkl support for MIC?

Thank you,

Try /opt/intel/beta/composerxxx/mkl/lib/mic/ for the library path in the beta compilers ;).

 

>>...Is there an mkl support for MIC in the latest compiler version (14 beta) ?

If you're a Beta tester of Intel C++ compiler version 14, then you could ask that question on the Intel Premier Support website. Regarding MKL support for MIC, review the latest Release Notes.

Quote:

vincent b. wrote:

Is there an mkl support for MIC in the latest compiler version (14 beta) ?
I can't find the path /opt/intel/composerxxx/mkl/lib/mic/ which was present in previous compiler versions.

As we've said, the place to report this is premier.intel.com, and I've done so.  Looks like an oversight, in my opinion one which must be fixed before release.

The stated intention for that beta compiler is that MKL for MIC should be installed by re-entering the menu after initial installation and using the modify option to add these libraries. This is planned to change before release.   I'm still falling back on the 13.1 released compiler installation for those libraries.

Hi Nathanael,

I have some issues building the code with ICC, although it works with GCC. I am not sure whether there are some configuration issues. See my build script:

cd nschaeff-shtns-341463846739
PATH_FFTW=/opt/intel/composer_xe_2013_sp1.0.040/mkl/include/fftw/

#./configure --enable-openmp --enable-mkl CC=gcc CFLAGS=-I$PATH_FFTW
./configure --enable-openmp --enable-mkl CC=icc CFLAGS=-I$PATH_FFTW
make clean
make
#make
make time_SHT
./time_SHT 1023 -fly -iter=5 -oop

I am building on a 64bit machine, and the errors I get are:

./libshtns.a(sht_init.o): In function `SH_to_point':
sht_init.c:(.text+0x72a2): undefined reference to `__builtin_ia32_vec_ext_v2df'
sht_init.c:(.text+0x72c0): undefined reference to `__builtin_ia32_vec_ext_v2df'
./libshtns.a(sht_init.o): In function `SH_to_grad_point':
sht_init.c:(.text+0x7856): undefined reference to `__builtin_ia32_vec_ext_v2df'
sht_init.c:(.text+0x7880): undefined reference to `__builtin_ia32_vec_ext_v2df'
sht_init.c:(.text+0x78ca): undefined reference to `__builtin_ia32_vec_ext_v2df'
./libshtns.a(sht_init.o):sht_init.c:(.text+0x78e4): more undefined references to `__builtin_ia32_vec_ext_v2df' follow
make: *** [time_SHT] Error 1

It seems like it should work in IA-32 mode only? Should I use the IA-32 ICC to build (I tried, but ran into other issues with missing compatible libraries for IA-32)?

By the way, if you can isolate a small test case that shows the performance issue, that would be better. I am wondering whether it is caused by wrong code generated for the vector extensions.

Thanks,

Shenghong

Hi Shenghong,

Yes, these are errors that I also had to fix. You should download the latest revision:

https://bitbucket.org/nschaeff/shtns/get/tip.tar.gz

Concerning the small test case, it will be difficult, I think. Both analysis and synthesis share the same loop structure; only the operations therein are a little different.
But I will try to pinpoint the exact loop that has performance issues with -O3.

regards,

Hi,

I get the same error with this version. Could you confirm whether this is the latest version? I still need to change the code in 'sht_func.c' to fix the issue we mentioned (the explicit cast). Maybe you did not check in your code?

If you could pinpoint the loop for which badly performing code is generated, that would also be helpful. I guess we can add a timing function around it to make sure the issue is related to a single loop.

Thanks,

Shenghong

Quote:

Nathanael S. wrote:

Hi Shenghong,

Yes, these are errors that I also had to fix. You should download the latest revision:

https://bitbucket.org/nschaeff/shtns/get/tip.tar.gz

Concerning the small test case, it will be difficult, I think. Both analysis and synthesis share the same loop structure; only the operations therein are a little different.
But I will try to pinpoint the exact loop that has performance issues with -O3.

regards,

Hi,

I'm really sorry, I forgot to include the last sht_func.c
Can you try again with the same link (updated) ?

Quote:

shenghong-geng (Intel) wrote:

Hi,

I get the same error with this version. Could you confirm whether this is the latest version? I still need to change the code in 'sht_func.c' to fix the issue we mentioned (the explicit cast). Maybe you did not check in your code?

Hello,

Thank you for the comments about MIC and beta version of the compiler.

Now, I'm facing another problem: Nathanael's program has been developed for SSE / AVX, so at some point it works on 128-bit vectors (composed of two doubles). I wish to keep that 128-bit vector format while porting this code to MIC. Since there is no support for SSE or AVX on MIC, is there any intrinsic to cast these 128-bit vectors into 512-bit vectors? I couldn't find one.

Also, gcc vector extensions work well on MIC, even for 128-bit vectors.
Here's a code example that works fine on MIC :
#include <stdio.h>

typedef double v2d __attribute__ (( vector_size(8*2) ));
typedef union { v2d i; double v[2]; } vec_v2d;

int main(void)
{
    v2d a = {1,1};
    v2d b = {4,5};
    v2d c = a + b;
    vec_v2d d;
    d.i = c;   /* read the lanes back through the union */
    printf("%g,%g\n",d.v[0],d.v[1]);
    return 0;
}

gives as output :
5,6

So gcc vector extensions are OK for simple operations such as +, -, *, /, etc.
However, when more complex instructions are needed, we are stuck, because no instructions casting from 128 bits to 512 bits exist on MIC.

Please can you confirm this or give a solution ?

Thank you,
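
One intrinsic-free way to express such a widening is sketched below, using only the vector extensions and a union. This does not answer whether a dedicated MIC intrinsic exists, and the thread does not establish how well icc for MIC compiles it. The names v8d and widen_v2d are hypothetical.

```c
typedef double v2d __attribute__ ((vector_size (16)));   /* 128 bits */
typedef double v8d __attribute__ ((vector_size (64)));   /* 512 bits */

/* Put the two lanes of lo into lanes 0-1 of a 512-bit vector and
   zero the remaining lanes. The result is written through a pointer
   so that no wide vector is passed or returned by value. */
void widen_v2d(v2d lo, v8d *out)
{
    union { v8d wide; v2d part[4]; } u;
    const v2d zero = {0.0, 0.0};

    u.part[0] = lo;
    u.part[1] = zero;
    u.part[2] = zero;
    u.part[3] = zero;
    *out = u.wide;
}
```

The union plays the same role as vec_v2d in the example above: it reinterprets the same bytes under two different vector types.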
