Function Vectorization

Recently I tried to vectorize some code with the simd pragma, and in the Intel VTune report I see that almost 20% of CPU time is spent in the "__svml_sincos4_e9" function, which is apparently the vectorized version of the trigonometric functions. My question is: why does this function take so much time, when the non-vectorized version takes less than 1% of CPU time?

I'm using Intel C++ 13.3 with the -xAVX and -axAVX flags set.

Evidently, a working example would go a long way to make our responses more productive than the wild speculation you are calling for.  Why not compare a single path vectorized code against a single path non-vector code, and show source code and options, including telling whether it's 64-bit or 32-bit mode?

sincos evidently would be a function for calculating sin() and cos() of the same argument.  Depending on what you have done, you may need to prevent in-lining or set -debug inline-debug-info to see math functions consistently accounted separately from the caller.

Short of that, tell us the reason for using pragma simd. Is it to avoid setting -ansi-alias or using __restrict pointers, or is it on account of a non-unity stride situation where vec-report would give "seems inefficient" as a reason for non-vectorization?
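To make the __restrict alternative mentioned above concrete, here is a minimal sketch. The function name sin_plus_cos and its body are made up for illustration and are not the poster's code. The idea is that promising the compiler that the input and output do not overlap removes one common obstacle to auto-vectorization, so the loop may vectorize without any pragma. Whether it actually does depends on the compiler and options.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// __restrict is a compiler extension (accepted by the Intel, GCC, Clang and
// MSVC C++ compilers). It promises that 'in' and 'out' never overlap.
// Hypothetical example function, not taken from the thread.
void sin_plus_cos(const double* __restrict in, double* __restrict out,
                  std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        // sin and cos of the same argument, as in the poster's description
        out[i] = std::sin(in[i]) + std::cos(in[i]);
    }
}
```

The vectorization report of the compiler in use would show whether the loop was vectorized in this form.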

Is that a VML library function?

Is this issue consistent for every argument being passed to the vectorized sincos() function? How do you call the functions? Do you have some variable interdependencies? I suppose, and I could be wrong, that the execution ports are stalled during the execution of the vectorized sincos() function, probably when another floating-point thread is utilizing them.

>>..."__svml_sincos4_e9" function which is apparently the vectorized version of trigonometric functions...

Yes, you're correct, and SVML stands for Short Vector Math Library. There is a description of SVML at:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013...

Regarding the '...20% of CPU time...' number, it is not clear why it happens, and a test case would really help.

And one more thing.

>>..."__svml_sincos4_e9" function

Since the function __svml_sincos4_e9 is optimized for processors with the Intel AVX instruction set on 64-bit platforms (the function with the e9 code is used), you need to check for AVX-to-SSE transitions in your code.

Thanks for your replies,

Basically, the reason I'm using "#pragma omp simd" is portability: in the future we may move to an AMD platform, other co-processors, or even other compilers, so using compiler-specific flags is not a very good idea.

In the code I've got three nested loops, of which I vectorized the innermost one. Inside that loop I'm calling an inline function which uses the sin(x) and cos(x) functions, where 'x' is calculated from the function's arguments. The pseudo-code is like:

for (int j = 0; j < PRTCL; ++j) {
    for (int k = 0; k < EBin; ++k) {
#ifdef VEC
#pragma omp simd
#endif
        for (int n = 0; n < VEC_LEN; ++n) {
            evolve(arg1[j][k][n], arg2[j][k][n], ...);
        }
    }
}

and VEC_LEN is 4, since an AVX register can hold up to four doubles. The platform is 64-bit, my CPU is an Intel Core i7-3930K, and almost everything in my code is declared as double.
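Since several replies ask for a test case, the loop structure described above could be reduced to something like the following sketch. It is an assumption-laden stand-in, not the poster's program: evolve_sketch and its formula are invented (the real evolve() is not shown in the thread), the three-dimensional arrays are flattened into one vector, and run_plain / run_simd are hypothetical names. It only gives both code paths the same work so that they can be compared in a profiler.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t VEC_LEN = 4;  // four doubles fit in one 256-bit AVX register

// Hypothetical stand-in for evolve(); the formula is made up.
inline double evolve_sketch(double a, double b) {
    const double x = a * b + 0.5;      // 'x' computed from the arguments
    return std::sin(x) * std::cos(x);  // sin and cos of the same argument
}

// Reference path: no simd pragma on the innermost loop.
void run_plain(const std::vector<double>& a, const std::vector<double>& b,
               std::vector<double>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    assert(a.size() % VEC_LEN == 0);
    for (std::size_t base = 0; base < a.size(); base += VEC_LEN) {
        for (std::size_t n = 0; n < VEC_LEN; ++n) {
            out[base + n] = evolve_sketch(a[base + n], b[base + n]);
        }
    }
}

// Same work with the pragma from the post on the innermost loop.
// Compilers without OpenMP SIMD support enabled simply ignore the pragma.
void run_simd(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    assert(a.size() % VEC_LEN == 0);
    for (std::size_t base = 0; base < a.size(); base += VEC_LEN) {
#pragma omp simd
        for (std::size_t n = 0; n < VEC_LEN; ++n) {
            out[base + n] = evolve_sketch(a[base + n], b[base + n]);
        }
    }
}
```

Note that the plain version may still be auto-vectorized by an optimizing compiler unless vectorization is disabled for it, so the vectorization report should be checked for both functions before comparing timings.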

Also, I declared the function as inline, because that is recommended for vectorized code.

Currently, I'm using a single-threaded program, but I'm gonna parallelize the other loop with OpenMP.

Another slow spot in the code is the call to __svml_exp4_e9, which comes from the "exp" function I'm using in another part of my code. According to the VTune analysis, exp takes ~1 sec in the non-vectorized code, but __svml_exp4_e9 takes ~4 sec in the vectorized code. Do I need to do some tuning before calling the math functions?
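One way to cross-check the exp numbers outside the profiler is a small timing harness along these lines. This is only a sketch: exp_plain, exp_simd and time_seconds are hypothetical names, the input data is arbitrary, and the timings it produces say nothing about the poster's real program, where the functions may be inlined or attributed differently.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Loop over exp() without a simd pragma.
void exp_plain(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = std::exp(in[i]);
    }
}

// Same loop with the OpenMP simd pragma; ignored if SIMD support is off.
void exp_simd(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    const double* src = in.data();
    double* dst = out.data();
    const std::size_t n = in.size();
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = std::exp(src[i]);
    }
}

// Wall-clock seconds spent in 'repeats' calls of fn(in, out).
template <typename Fn>
double time_seconds(Fn fn, const std::vector<double>& in,
                    std::vector<double>& out, int repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        fn(in, out);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_seconds(exp_plain, in, out, repeats) and time_seconds(exp_simd, in, out, repeats) on the same large input gives two numbers to set against the VTune figures. If the two disagree with the profile, the attribution of time (for example through inlining) is worth a second look.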

>>...Do I need to do some tuning before calling the math functions?...

It looks like no, because the code is portable (as you've mentioned) and implemented without any intrinsic functions (is that correct?). Inspect the assembly code to see whether there is any mixing of AVX and SSE instructions.

As was already mentioned, your code could have AVX-to-SSE transition penalties. Your program is single-threaded, so there are no execution port stalls caused by other threads. But I am thinking about the possibility that your thread has some interdependencies in its floating-point code, so the underlying hardware (Port 0 and Port 1) cannot fully exploit instruction-level parallelism.
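To illustrate what an interdependency that limits instruction-level parallelism looks like, here is a generic sketch. It is not taken from the poster's code, and whether anything like it applies there is pure speculation. Both functions (hypothetical names) compute a sum of squares. The first has a single loop-carried dependency chain, while the second splits the work into four independent chains that the hardware can overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One accumulator: each addition must wait for the previous one to finish.
double sum_one_chain(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v) {
        acc += x * x;
    }
    return acc;
}

// Four independent accumulators: the four additions in one iteration do not
// depend on each other, so they can be in flight at the same time.
double sum_four_chains(const std::vector<double>& v) {
    assert(v.size() % 4 == 0);
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    for (std::size_t i = 0; i < v.size(); i += 4) {
        a0 += v[i] * v[i];
        a1 += v[i + 1] * v[i + 1];
        a2 += v[i + 2] * v[i + 2];
        a3 += v[i + 3] * v[i + 3];
    }
    return (a0 + a1) + (a2 + a3);
}
```

The two versions add in a different order, so for general floating-point inputs the results can differ in the last bits. That reordering is also why a compiler will not make this transformation on its own under strict floating-point settings.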

Well, as you can see in the attachment, there are no AVX-to-SSE or SSE-to-AVX conversions in the __svml_ functions, so I'm still wondering what the reason for the __svml_ slowness is, as the non-vectorized functions are fast.

>>...I'm still wondering what the reason of __svml_ slowness is, as the non-vec functions are fast.

A test case is needed in order to understand what is going on and to answer your questions.
