Vectorization Issue with loop iterations

Hi All,

I am trying to compile the following sample kernel with Intel ICC 14.0.0 20130728 (and also versions > 12), and I see strange behaviour with vectorization. I have two questions:

  • If I change the _iml variable type from long int to int, the compiler doesn't vectorize the code. The vectorization report (-vec-report3) then shows a long list of assumed ANTI and FLOW dependencies, which seems plausible. But I don't understand what the compiler does differently to vectorize the loop when the iteration variable is long int.
  • The example below is an auto-generated kernel from a domain-specific language. We have a large array and process 18 elements of it per iteration (say those 18 elements represent a particle), so iterations are independent. But this memory layout looks like AoS (array of structs with 18 elements each). AoS is generally bad for vectorization, so I want to understand how the Intel compiler vectorizes this code.

compute() is the actual compute kernel that I want to vectorize. Please follow the comments for more explanation:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define AOS_BLOCK 18

void compute(double *pdata, int num_mechs) {

    double* _p;

    /* ISSUE :  If I change _iml to int instead of long int
     * compiler doesn't vectorize the code. Why? 
     */
    long int _iml;

    /* for each iteration of the loop, we process 18 elements of the pdata 1-D array */
    for (_iml = 0; _iml < num_mechs; ++_iml) {

        /* take a pointer to the start of this iteration's 18-element block */
        _p = &pdata[_iml*AOS_BLOCK];

        /* below calculations are generated by the DSL-to-C converter; looks ugly, I know!
         * we only do some computation on those 18 elements, so you don't need to
         * understand the details
         */

        if ( _p[16]  == - 35.0 ) {
            _p[16] = _p[16] + 0.0001 ;
        }

        _p[8] = ( 0.182 * ( _p[16] - - 35.0 ) )/ ( 1.0 - ( exp ( - ( _p[16] - - 35.0 ) / 9.0 ) ) ) ;
        _p[9] = ( 0.124 * ( - _p[16] - 35.0 ) )  / ( 1.0 - ( exp ( - ( - _p[16] - 35.0 ) / 9.0 ) ) ) ;
        _p[6] = _p[8] / ( _p[8] + _p[9] ) ;
        _p[7] = 1.0 / ( _p[8] + _p[9] ) ;

        if ( _p[16]  == - 50.0 ) {
            _p[16] = _p[16] + 0.0001 ;
        }

        _p[12] = ( 0.024 * ( _p[16] - - 50.0 ) ) / ( 1.0 - ( exp ( - ( _p[16] - - 50.0 ) / 5.0 ) ) ) ;

        if ( _p[16]  == - 75.0 ) {
            _p[16] = _p[16] + 0.0001 ;
        }

        _p[13] = ( 0.0091 * ( - _p[16] - 75.0 ) ) / ( 1.0 - ( exp ( - ( - _p[16] - 75.0 ) / 5.0 ) ) ) ;
        _p[10] = 1.0 / ( 1.0 + exp ( ( _p[16] - - 65.0 ) / 6.2 ) ) ;
        _p[11] = 1.0 / ( _p[12] + _p[13] ) ;

        _p[3] = _p[3] + (1. - exp(0.01*(( ( ( -1.0 ) ) ) / _p[7])))*(- ( ( ( _p[6] ) ) / _p[7] ) / ( ( ( ( -1.0) ) ) / _p[7] ) - _p[3]);
        _p[4] = _p[4] + (1. - exp(0.01*(( ( ( -1.0 ) ) ) / _p[11])))*(- ( ( ( _p[10] ) ) / _p[11] ) / ( ( ( ( -1.0) ) ) / _p[11] ) - _p[4]);
    }
}

int main(int argc, char *argv[])
{
    int i, n;
    double * data;

    if(argc < 2)
    {
        printf("\n Pass length of array as argument \n");
        exit(1);
    }

    n = atoi( argv[1] );

    //data = _mm_malloc( sizeof(double) * n, 32);
    data = (double *) malloc( sizeof(double) * n * AOS_BLOCK);

    /* main compute function */
    compute( data, n);

    if(argc > 3)
        for(i=0; i<n ; i++)
            printf("\t %lf", data[i]);

    free(data);
    //_mm_free(data);

    return 0;
}

Any comments that help me understand this code and its vectorization are appreciated.

Thanks!


To provide more information, here is my compilation/vectorization report:

[kumbhar@dom38 ~]$ icc -vec-report3 vec_test_intel.c 

vec_test_intel.c(66): (col. 5) remark: LOOP WAS VECTORIZED
vec_test_intel.c(69): (col. 9) remark: loop was not vectorized: existence of vector dependence
vec_test_intel.c(14): (col. 5) remark: LOOP WAS VECTORIZED

 

If I change _iml to int, I see

[kumbhar@dom38 ~]$ icc -vec-report3 vec_test_intel.c 
vec_test_intel.c(14): (col. 5) remark: loop was not vectorized: existence of vector dependence
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 37
vec_test_intel.c(37): (col. 13) remark: vector dependence: assumed FLOW dependence between _p line 37 and _p line 22
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 40
vec_test_intel.c(40): (col. 9) remark: vector dependence: assumed FLOW dependence between _p line 40 and _p line 22
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 41
vec_test_intel.c(41): (col. 9) remark: vector dependence: assumed FLOW dependence between _p line 41 and _p line 22
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 42
vec_test_intel.c(42): (col. 9) remark: vector dependence: assumed FLOW dependence between _p line 42 and _p line 22

 

 


The meaning of long int depends on the OS and on 32-/64-bit target selection. Is this Intel64 Linux or Mac, where it would be a 64-bit data type?

Available vectorization modes depend on the target architecture.  Presumably vectorization would use simulated gather/scatter (possibly vgather instructions on corei7-4) so as to pack multiple iterations into SIMD data and enable use of SVML functions (effectively converting to SoA on the fly).
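By "simulated gather" I mean something along these lines: a scalar sketch of what the compiler emits, not actual intrinsics, and the function name and vector length are made up:

```c
#define AOS_BLOCK 18
#define VLEN 4   /* 256-bit AVX register holds 4 doubles */

/* Pull field 16 of VLEN consecutive AoS records into a contiguous
 * temporary -- these scalar loads are the "gather" step that lets the
 * rest of the loop body operate on packed SIMD data. */
void gather_field16(const double *pdata, long base, double v[VLEN]) {
    int lane;
    for (lane = 0; lane < VLEN; ++lane)
        v[lane] = pdata[(base + lane) * AOS_BLOCK + 16];
}
```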

The compilers have made significant stability improvements with updates since the initial 14.0.

If you would attach code which could be compiled, we could find out ourselves what happens.

The constants don't look precise enough to justify use of double data types unless to avoid overflow in the exponentials at those arbitrarily shifted singularity points.

Thanks Tim for clarification. Details you asked are:

  •  x86_64 GNU/Linux
  • Xeon(R) CPU E5-2670 0 @ 2.60GHz
  • Compilers I tested: icc (ICC) 13.1.0 20130121, (ICC) 14.0.0 20130728

I have attached the code here (basically the same example I posted above). I am implementing an SoA version of the same kernel and want to compare the performance of the AoS and SoA memory layouts for it. So it would be helpful if you could confirm what is happening with vectorization in the attached kernel (I am not an assembly expert! :) ).

Thanks!

Attachments: 

vec_intel.c (2.16 KB)

I took a quick look with the 14.0.2 compiler on Windows, as that's the only AVX(2) machine I have.  I see the effect you reported: vectorization is enabled when I declare _iml as long long int (on Windows, plain long is only 32 bits, so it behaves like int).  The vectorization is done with mostly scalar memory accesses (on account of the AoS data) to assemble short vectors in registers, enabling use of the short-vector exp() calls.  Spills and reloads are done with AVX-128 memory accesses.

The scalar exp calls are also routed (for consistency) to the short-vector library, discarding the extra slots.

 

Could you try -opt-subscript-in-range compiler option?

 

Quote:

om-sachan (Intel) wrote:

Could you try -opt-subscript-in-range compiler option?

Yes, the compiler vectorizes the loop if I use the -opt-subscript-in-range compilation flag.  I see the following explanation for this flag:

 

If you specify -opt-subscript-in-range (Linux* OS and OS X*) or /Qopt-subscript-in-range (Windows* OS), the compiler assumes that there are no "large" integers being used or being computed inside loops. A "large" integer is typically > 2^31. This feature can enable more loop transformations.

But I don't completely understand this. In the following example, could you explain how the compiler decides whether to vectorize when we use -opt-subscript-in-range? (If you look at the example I posted, _iml is used only for the loop iteration and for calculating the offset.)

/* this is not vectorizable */
int _iml;
for (_iml = 0; _iml < 1000000; ++_iml) {
        ..............
}


/* this is vectorizable */
long int _iml;
for (_iml = 0; _iml < 1000000; ++_iml) {
     ..............
}

This would be a great help in understanding this better! (The full example for both loops is attached in the top post.)

Maybe the compiler took the magnitude of the _iml variable into account, but that doesn't make sense either, because _iml < 1e+6 and is well inside the int range.

_iml is multiplied explicitly by 18 in the posted sample, so there may be a concern about signed overflow.  The compiler must perform some kind of strength reduction in order to set up the gather of several iterations at stride 18 into parallel SIMD lanes.  It ought to pay off through the use of parallel SIMD divide and the short-vector exp function.
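To make the overflow concern concrete, here is a sketch with a made-up iteration number; the 32-bit multiply is done in unsigned arithmetic only so that the wraparound is well-defined to demonstrate:

```c
/* Element offset of iteration i, computed the two ways the compiler must
 * consider: in 32-bit arithmetic (can wrap past 2^32) and in 64-bit
 * arithmetic (exact for any realistic trip count on an LP64 target). */
unsigned int offset32(long i) { return (unsigned int)i * 18u; }
long offset64(long i) { return i * 18; }
```

With a hypothetical i = 300000000, offset32 wraps to 1105032704 while offset64 gives the exact 5400000000, so with an int subscript the compiler has to prove no wrap occurs before it can reorder iterations.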

In my simpler example, icc under-performs gcc when a subscript is calculated by multiplying the loop index. gcc performs classical strength reduction (and does not attempt vectorization), while icc "optimizes" to an lea chain and "vectorizes" to AVX-128 by simulated gather/scatter (it no longer performs the multiplication inside the inner loop).  opt-subscript-in-range makes no difference.
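The classical strength reduction gcc performs can also be written by hand: the per-iteration multiply becomes a pointer bump.  A sketch with a placeholder body, not the posted kernel:

```c
#define AOS_BLOCK 18

/* Same loop shape as compute(), with the multiply _iml*AOS_BLOCK
 * strength-reduced to a pointer that advances by AOS_BLOCK per
 * iteration, so no subscript widening question arises. */
void compute_sr(double *pdata, int num_mechs) {
    double *_p = pdata;
    long _iml;
    for (_iml = 0; _iml < num_mechs; ++_iml, _p += AOS_BLOCK) {
        _p[0] += 1.0;   /* placeholder body */
    }
}
```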

I agree with Tim.

 

Om
