    hpcmango, June 26, 2009 5:28 AM PDT
    OpenMP breaks auto-vectorization

    Hi,

    For quite some time I have regularly encountered the effect that loops are no longer vectorized when they are inside an outer OpenMP-parallel loop. The vectorization works fine, though, if I remove the '#pragma omp parallel for'.

    Example code:

    #pragma omp parallel for
    for(int i=1;i<sizeY-1;i++) {
        int curIndex=1+i*sizeX;
        for(int j=1;j<sizeX-1;j++) {
            dataB[curIndex]=0.1*(dataA[curIndex-1]+dataA[curIndex+1]+dataA[curIndex-sizeX]+dataA[curIndex+sizeX])+0.6*dataA[curIndex];
            curIndex++;
        }
        curIndex+=2;
    }

    The vectorization report says:

    ***.cxx(38): (col. 5) remark: loop was not vectorized: not inner loop.
    ***.cxx(40): (col. 7) remark: loop was not vectorized: existence of vector dependence.
    ***.cxx(41): (col. 2) remark: vector dependence: assumed FLOW dependence between dataB line 41 and dataB line 41.
    ***.cxx(41): (col. 2) remark: vector dependence: assumed ANTI dependence between dataB line 41 and dataB line 41.

    If I use a '#pragma ivdep' before the inner loop, I get:

    ***.cxx(38): (col. 5) remark: loop was not vectorized: not inner loop.
    ***.cxx(41): (col. 7) remark: loop was not vectorized: unsupported data type.

    If I additionally use a '#pragma vector always' before the inner loop, I still get the same result.

    I did this with compiler version 11.0 on x86_64 Linux, but I remember the result being much the same for 10.0 and 10.1.

    Can anyone explain this to me? I don't see a reason why vectorization should not work here. Is there a way to fix the problem, e.g. with pragmas?

    Best

    Oliver

    TimP (Intel), June 26, 2009 6:35 AM PDT
    Re: OpenMP breaks auto-vectorization

    We had a case where the poster gave a complete working example on the forum.  In that case, 2 steps were required to fix it:
    1) upgrade to 11.1
    2) set -inline-max-size=50  (this value was low enough to stop in-lining of a function with omp parallel)

    Even though 11.1 is intended to be less aggressive on in-lining than 10.1 and 11.0, in that case it still needed the option to help out. 

    Do you still get vectorization without -openmp but no vectorization with -openmp, if you turn off in-lining?

    A complete case would be required to see how you have set it up so that the compiler doesn't have to be concerned about aliasing between dataA and dataB when you don't set -openmp, but is concerned when -openmp is set.  If those were function parameters, appropriate restrict qualifiers would be needed.  It's possible that the analysis might be affected by a change from default static allocation without -openmp to stack allocation with -openmp, or by the compiler correctly stopping in-lining when you set -openmp.
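    To illustrate the remark about restrict qualifiers, here is a minimal sketch that assumes the stencil were moved into a function receiving both arrays as parameters. The name smoothRow and its signature are hypothetical and do not appear in the thread. __restrict is a compiler-extension spelling (standard C++ has no restrict keyword), and the float literals are my own choice to keep the arithmetic in single precision; the thread does not identify them as relevant.

```cpp
#include <cassert>

// Hypothetical helper (not from the thread): smooths one interior row.
// The __restrict qualifiers promise the compiler that 'in' and 'out' never
// overlap, so a store to out[k] cannot change any later load from in[...].
void smoothRow(const float* __restrict in, float* __restrict out,
               int sizeX, int row)
{
    assert(sizeX >= 3 && row >= 1);
    const int base = row * sizeX;
    for (int j = 1; j < sizeX - 1; j++) {
        const int k = base + j;
        // Same 5-point stencil as in the original post, with float literals.
        out[k] = 0.1f * (in[k - 1] + in[k + 1] + in[k - sizeX] + in[k + sizeX])
               + 0.6f * in[k];
    }
}
```

    The caller remains responsible for the promise: passing the same buffer as both in and out would make the behaviour undefined.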


    hpcmango, June 26, 2009 7:34 AM PDT
    Re: OpenMP breaks auto-vectorization

    Thanks for the answer.

    Unfortunately upgrading to 11.1 will take a week or so, because I'm not the admin on the machine with icc. I will try it of course, when the upgrade is done.

    I did try -inline-max-size=50 with version 11.0, but it didn't help. In any case, I wonder how it could have an effect, since it concerns function inlining and my code only has a main.

    While playing around, I found another way to get it to vectorize. I am using a C++ timer class (which simply wraps the POSIX high-resolution timers for convenience) to measure time. If I remove the usage of this class (and inline the timer code into main instead), vectorization also works with OpenMP.
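    For readers unsure what timing without the class might look like, here is a minimal sketch that calls clock_gettime directly. The helper names secondsBetween and timeEmptyRegion are hypothetical and only serve the illustration; in the variant described above, the two clock_gettime calls would sit directly in main around the loops.

```cpp
#include <cassert>
#include <time.h>

// Hypothetical helper: seconds elapsed between two CLOCK_MONOTONIC readings.
double secondsBetween(const struct timespec& start, const struct timespec& end)
{
    assert(end.tv_sec >= start.tv_sec);  // a monotonic clock never runs backwards
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

// Shows the call pattern without any timer object.
double timeEmptyRegion()
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    // ... the loops to be timed would go here ...
    clock_gettime(CLOCK_MONOTONIC, &end);
    return secondsBetween(start, end);
}
```

    On older glibc versions clock_gettime may require linking with -lrt.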

    Really strange. The timer code is completely outside the region of interest. Is it possible that OpenMP or vectorization doesn't like object-oriented programming and shuts down completely when it is used?

    --- Main.cxx ---
    #include <stdio.h>
    #include <stdlib.h>

    #include <omp.h>

    #include "Timer.hxx"

    int main()
    {
        #pragma omp parallel
        {
            printf("OpenMP thread = %i/%i.\n",omp_get_thread_num(),omp_get_num_threads());
        }

        const int sizeX = 8192;
        const int sizeY = 8192;
        const int loops = 100;

        float* __restrict dataA;
        float* __restrict dataB;

        int dataSize=sizeof(float)*sizeX*sizeY;

        dataA=(float*)malloc(dataSize);
        dataB=(float*)malloc(dataSize);

        for(int i=0;i<sizeY;i++) {
            for(int j=0;j<sizeX;j++) {
                dataA[i*sizeX+j]=0;
            }
        }
        dataA[(sizeY/2)*sizeX+(sizeX/2)]=1;

        Timer timer;
        for(int iLoop=0;iLoop<loops;iLoop++) {

            #pragma omp parallel for
            for(int i=1;i<sizeY-1;i++) {
                int curIndex=1+i*sizeX;
                for(int j=1;j<sizeX-1;j++) {
                    dataB[curIndex]=0.1*(dataA[curIndex-1]+dataA[curIndex+1]+dataA[curIndex-sizeX]+dataA[curIndex+sizeX])+0.6*dataA[curIndex];
                    curIndex++;
                }
                curIndex+=2;
            }

            #pragma omp parallel for
            for(int i=1;i<sizeY-1;i++) {
                int curIndex=1+i*sizeX;
                for(int j=1;j<sizeX-1;j++) {
                    dataA[curIndex]=0.1*(dataB[curIndex-1]+dataB[curIndex+1]+dataB[curIndex-sizeX]+dataB[curIndex+sizeX])+0.6*dataB[curIndex];
                    curIndex++;
                }
                curIndex+=2;
            }
        }
        double duration=timer.get();
        fprintf(stderr,"Time = %g s, Performance = %g FLOPS\n",duration,6.*(sizeX-1)*(sizeY-1)*2*loops/duration);

        fprintf(stderr,"\n");
        for(int i=sizeY/2-5;i<=sizeY/2+5;i++) {
            for(int j=sizeX/2-5;j<=sizeX/2+5;j++) {
                fprintf(stderr,"%f ",dataA[i*sizeX+j]);
            }
            fprintf(stderr,"\n");
        }

        free(dataA);
        free(dataB);

        return 0;
    }

    --- Timer.hxx ---
    #ifndef om_timer_hxx_
    #define om_timer_hxx_

    #include <time.h>

    class Timer {
    public:
        Timer() {
            reset();
        }
        void reset() {
            clock_gettime(CLOCK_MONOTONIC,&m_Timespec);
        }
        double get() {
            struct timespec endTimespec;
            clock_gettime(CLOCK_MONOTONIC,&endTimespec);
            return (endTimespec.tv_sec-m_Timespec.tv_sec)+
                (endTimespec.tv_nsec-m_Timespec.tv_nsec)*1e-9;
        }
        double getAndReset() {
            struct timespec endTimespec;
            clock_gettime(CLOCK_MONOTONIC,&endTimespec);
            double result=(endTimespec.tv_sec-m_Timespec.tv_sec)+
                (endTimespec.tv_nsec-m_Timespec.tv_nsec)*1e-9;
            m_Timespec=endTimespec;
            return result;
        }
    private:
        struct timespec m_Timespec;
    };

    #endif



    jimdempseyatthecove, June 26, 2009 8:48 AM PDT
    Re: OpenMP breaks auto-vectorization


    What happens when you place each parallel for into a separate function and then compile with and without IPO?

    Jim Dempsey


    Blog: The Parallel Void
    www.quickthreadprogramming.com

    jimdempseyatthecove, June 26, 2009 8:55 AM PDT
    Re: OpenMP breaks auto-vectorization


    I forgot to mention, have you tried #pragma vector always on the inner loop?

    Jim

    Blog: The Parallel Void
    www.quickthreadprogramming.com

    TimP (Intel), June 26, 2009 12:14 PM PDT
    Re: OpenMP breaks auto-vectorization

    My copies of icpc 11.0/083 and 11.1 for intel64 vectorize both parallel loops, when -ansi-alias is set.  If  you don't set that flag, you are telling the compiler that you may have violated the standards on data type aliasing.
    You may argue that there is nothing here in the line of aliasing (such as the possibility of  your float data updates over-writing curIndex) which the compiler should be concerned about, but I'll leave it to you if you wish to submit a report on premier.intel.com to make that case.
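    As background on what the type-aliasing rules permit, here is a minimal sketch. The function names are illustrative and not from the thread, and the bit pattern mentioned in the usage note assumes IEEE-754 single precision. The first function shows the kind of assumption a compiler may make once it trusts the rules; the second shows a conforming way to look at a float's bytes without casting pointers between unrelated types.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// When the compiler trusts the type-based aliasing rules, it may assume that
// a store through a float* never modifies an int object. It can then read
// *count once instead of reloading it after every store to data[i].
void fillWithOnes(float* data, const int* count)
{
    for (int i = 0; i < *count; i++) {
        data[i] = 1.0f;
    }
}

// Conforming way to inspect the bits of a float: copy the bytes with memcpy
// rather than dereferencing a reinterpret_cast'ed pointer.
std::uint32_t floatBits(float value)
{
    std::uint32_t bits;
    assert(sizeof(bits) == sizeof(value));
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}
```

    On an IEEE-754 platform, floatBits(1.0f) yields 0x3F800000. Code that instead reads a float through an int pointer is what the standards disallow, and such code can break once the compiler is told to rely on the rules.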

    As near as I can find out, some BSD variants use the keyword __restrict, but it would be ignored by icpc.  It's mentioned in Microsoft documentation, but I haven't found a Microsoft or Intel compiler which observes it.  It doesn't appear to make the difference here; the compiler apparently can see that you have malloc'd 2 distinct regions.


    hpcmango, June 29, 2009 1:24 AM PDT
    Re: OpenMP breaks auto-vectorization

    Thanks, Tim.

    Using '-ansi-alias' seems to be the simplest solution. I fear, though, that because it enables the optimization by letting the compiler assume less type aliasing, it could easily break again, e.g. by using ints and floats inside the timer class.

    Altogether I also get the impression that this should be considered a compiler bug.

    @Jim: no, '#pragma vector always' doesn't help. What does help, though, is moving the 'parallel for' section to a separate function and putting this function into a '#pragma auto_inline off' section.
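    A minimal sketch of what such an out-of-line function might look like. The post uses '#pragma auto_inline off'; the sketch swaps in the noinline function attribute instead, because that is the spelling I am sure of. The names NO_INLINE and smoothStep are hypothetical, and the float literals are my choice rather than part of the original code.

```cpp
#include <cassert>

// Portable spelling of "never inline this function" (assumed stand-in for
// the '#pragma auto_inline off' section mentioned in the post).
#if defined(_MSC_VER)
#define NO_INLINE __declspec(noinline)
#else
#define NO_INLINE __attribute__((noinline))
#endif

// One full sweep over the grid interior, kept out of line so the compiler
// analyses the parallel loop separately from the rest of main.
NO_INLINE void smoothStep(const float* __restrict src, float* __restrict dst,
                          int sizeX, int sizeY)
{
    assert(sizeX >= 3 && sizeY >= 3);
    // The guard keeps the sketch warning-free when OpenMP is not enabled.
#if defined(_OPENMP)
#pragma omp parallel for
#endif
    for (int i = 1; i < sizeY - 1; i++) {
        const int base = i * sizeX;
        for (int j = 1; j < sizeX - 1; j++) {
            const int k = base + j;
            dst[k] = 0.1f * (src[k - 1] + src[k + 1] + src[k - sizeX] + src[k + sizeX])
                   + 0.6f * src[k];
        }
    }
}
```

    In main, the two parallel loops would then become smoothStep(dataA, dataB, sizeX, sizeY) followed by smoothStep(dataB, dataA, sizeX, sizeY).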

    Best,

    Oliver




    jimdempseyatthecove, June 29, 2009 6:07 AM PDT
    Re: OpenMP breaks auto-vectorization


    Oliver,

    Good workaround. I've found that pushing code out of line works around other OpenMP problems as well. If your inner loop has a significant iteration count, the call overhead shouldn't be too bad. I would consider this a problem in the optimization code. If you can submit a simple code sample to premier support, they should be able to identify the problem and fix it.

    Jim


    Blog: The Parallel Void
    www.quickthreadprogramming.com

    TimP (Intel), June 29, 2009 6:07 AM PDT
    Re: OpenMP breaks auto-vectorization

    Situations are common where it's not possible to optimize without -ansi-alias. It would be poor practice to write code which violates the standard which has been in effect for 20 years, and has been the default requirement in all common compilers except Microsoft's for 10.
    I don't think you're clear on which aspect of this you wish to consider a bug, but you're welcome to file a bug report. I don't think Intel will adopt consistency with gcc or g++ when it conflicts with Microsoft.
    A possible feature request might be to fix the vec-report2 so it says "this loop is not vectorizable on account of -no-ansi-alias."
    I've never seen anyone propose a treatment more like HP's C, so I don't think that would be popular enough to be considered.


    hpcmango, June 30, 2009 12:58 AM PDT
    Re: OpenMP breaks auto-vectorization

    Quoting - tim18
    Situations are common where it's not possible to optimize without -ansi-alias. It would be poor practice to write code which violates the standard which has been in effect for 20 years, and has been the default requirement in all common compilers except Microsoft's for 10.
    I don't think you're clear on which aspect of this you wish to consider a bug, but you're welcome to file a bug report. I don't think Intel will adopt consistency with gcc or g++ when it conflicts with Microsoft.
    A possible feature request might be to fix the vec-report2 so it says "this loop is not vectorizable on account of -no-ansi-alias."
    I've never seen anyone propose a treatment more like HP's C, so I don't think that would be popular enough to be considered.

    Hi Tim,

    Of course it is debatable what exactly the bug is.

    What disturbs me is that the behaviour of the compiler is quite unpredictable. Apparently unrelated pieces of code (the timer object, the OpenMP pragmas) break the vectorization, which works fine in the simple case.

    Why should '-ansi-alias' be required with OpenMP but not without?

    Ideally, I would like the compiler to figure out that these things do not affect whether the loop can be assumed vectorizable.

    Best,

    Oliver



    TimP (Intel), June 30, 2009 6:23 AM PDT
    Re: OpenMP breaks auto-vectorization

    Quoting - hpcmango

    Why should '-ansi-alias' be required with OpenMP but not without?


    In general, violations of -ansi-alias could create race conditions which break OpenMP as well as -parallel.  I don't have high expectations for compilation without -ansi-alias. 
    I do agree that the compiler should be less obscure about which optimizations are disabled by default, as well as which options are needed for consistency with other compilers.  I put -ansi-alias in icc.cfg and icpc.cfg, so as not to have to remember to set it on command line.
    The first criterion often seems to be not to miss optimizations which MSVC performs, and vectorization is not one of those.

