Intel vectorization not efficient

unrue

Dear Intel developers,

I have a piece of Fortran code where my program spends a lot of time:

k = 0
id = 1

do j = start, end  
  do i = 1, ns(j)
     k = k + 1  
     if(selectT(lx00(i), j, id) > 1.00) &
      tco(k) = 10.0 
  end do
end do

I'm using intel/cs-xe-2012 on an Intel Xeon E5645. I compiled with -O3 -ip -ipo -xHost -vec-report=3. The compiler reports that the nested loop is vectorized, but the execution time of that piece of code is the same as without vectorization. I tried to linearize selectT, without any results. I also tried to build a linearized "truth table":

do j = start, end  
  do i = 1, ns(j)
     k = k + 1   
     tco(k) = 10.0*select_cond(offset + lx00(i))
  end do
end do

Do you have any idea how to implement good vectorization? I suspect the indirect addressing through lx00(i) breaks the vectorization, but it is unavoidable.

Steve Lionel (Intel)

It's difficult to comment without seeing a compilable example. Try adding -guide to get advice on how you might be able to improve vectorization. I am not sure which compiler you are using, as we didn't call any release "2012".

For more feedback here, include the output of the vectorization report.

Steve
Tim Prince

The OP was already advised to look into !dir$ vector nontemporal for the 2nd example, and to read about indirect prefetch, e.g.

 http://software.intel.com/sites/default/files/article/326703/5.3-prefetc...

so it's surprising that this was reposted without responding to previous suggestions.

In the case of indirect fetch, which is the question here, unless -guide produces alerts about streaming-store and indirect prefetch possibility, I'd continue to be disappointed.

I'm looking for ways myself to understand what might be gleaned from vec-report to improve effectiveness of vectorization.  There was an internal slide at Intel a few months ago bragging about the number of cases where it's possible to get report of vectorization, possibly by turning off the cost models, with no interest in whether there was useful performance gain.  This is an excessive swing of the pendulum from the days when we were encouraged not to vectorize in order to show more threaded performance gain.

The recommendation here may be of some interest:

http://software.intel.com/en-us/articles/vecanalysis-python-script-for-a...

These reports distinguish heavy-overhead vector operations from lightweight ones, giving an idea where to look for improvements.  Reporting unaligned and aligned vector operations falls short of what would be useful; vec-report6 is still needed to help identify those.

Martyn Corden (Intel)

The "gather" (indirect addressing of selectT) is less efficient than a contiguous memory access. You're only likely to get a speedup if there's sufficient other computational work in the loop, which doesn't look to be the case in your simple example. Some architectures have a gather hardware instruction, but similar considerations still apply. Vectorizing the gather doesn't necessarily make the gather faster, but it enables vectorization of the rest of the loop, which hopefully will go faster.

      You might remove the additional induction variable, to make things easier for the compiler, though since the loop was vectorized, the compiler probably figured it out already:

k = 0

do j = start, end  
  do i = 1, ns(j)
     if(selectT(lx00(i), j, id) > 1.00) &
      tco(k+i) = 10.0 
  end do

  k=k+ns(j)
end do

You can also use the LINEAR clause of a SIMD directive to specify an additional induction variable and require vectorization of the loop.

Explicit prefetching, as mentioned by Tim, can sometimes help, but that depends on the pattern of memory accesses and iteration counts, and is tricky to code.
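As a concrete sketch of the LINEAR clause suggestion (not from the original thread; it assumes the OP's variables and a compiler recent enough to support the SIMD directive, which ifort 12.0.1 may not be):

```fortran
! Keep the extra induction variable k, but assert to the compiler
! that it advances by exactly 1 per inner-loop iteration, so the
! loop can (and must) be vectorized.  Intel-specific spelling:
k = 0
do j = start, end
!dir$ simd linear(k:1)
   do i = 1, ns(j)
      k = k + 1
      if (selectT(lx00(i), j, id) > 1.00) tco(k) = 10.0
   end do
end do
```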

unrue

Quote:

Tim Prince wrote:

The OP was already advised to look into !dir$ vector nontemporal for the 2nd example, and to read about indirect prefetch, e.g.

 http://software.intel.com/sites/default/files/article/326703/5.3-prefetc...

so it's surprising that this was reposted without responding to previous suggestions.

In the case of indirect fetch, which is the question here, unless -guide produces alerts about streaming-store and indirect prefetch possibility, I'd continue to be disappointed.

Hi Tim, I tried your suggestion but the performance remains the same.

Quote:

Tim Prince wrote:

I'm looking for ways myself to understand what might be gleaned from vec-report to improve effectiveness of vectorization.  There was an internal slide at Intel a few months ago bragging about the number of cases where it's possible to get report of vectorization, possibly by turning off the cost models, with no interest in whether there was useful performance gain.  This is an excessive swing of the pendulum from the days when we were encouraged not to vectorize in order to show more threaded performance gain.

The recommendation here may be of some interest:

http://software.intel.com/en-us/articles/vecanalysis-python-script-for-a...

These reports distinguish heavy-overhead vector operations from lightweight ones, giving an idea where to look for improvements.  Reporting unaligned and aligned vector operations falls short of what would be useful; vec-report6 is still needed to help identify those.

Unfortunately, I'm using Intel Ifort 12.0.1

unrue

Quote:

Martyn Corden (Intel) wrote:

The "gather" (indirect addressing of selectT) is less efficient than a contiguous memory access. You're only likely to get a speedup if there's sufficient other computational work in the loop, which doesn't look to be the case in your simple example. Some architectures have a gather hardware instruction, but similar considerations still apply. Vectorizing the gather doesn't necessarily make the gather faster, but it enables vectorization of the rest of the loop, which hopefully will go faster.

      You might remove the additional induction variable, to make things easier for the compiler, though since the loop was vectorized, the compiler probably figured it out already:

k = 0

do j = start, end  
  do i = 1, ns(j)
     if(selectT(lx00(i), j, id) > 1.00) &
      tco(k+i) = 10.0 
  end do

  k=k+ns(j)
end do

You can also use the LINEAR clause of a SIMD directive to specify an additional induction variable and require vectorization of the loop.

Explicit prefetching, as mentioned by Tim, can sometimes help, but that depends on the pattern of memory accesses and iteration counts, and is tricky to code.

Hi Martyn, I used your suggestion. I also rewrote the code, since I discovered that this piece of code lives in a function called from three different parts of the program. In two of those, lx00 is an array holding the same value in every position, so I can avoid the indirect addressing by saving a single value on the first and second calls, but not on the third, where lx00 has different values. So I rewrote the code as:

if (lx00 has equal values) then

  do j = start, end
    value = 10.0*selectT_cond((j-1)*nsamples_start + lx00_value)
!dir$ vector always
    do i = k + 1, k + ns(j)
      tco(i) = value
    end do
    k = k + ns(j)
  end do

else

  do j = start, end
    offset = s + (j-1)*nsamples_start !+ lx00_value
!dir$ SIMD
    do i = 1, ns(j)
      tco(k + i) = 10.0*selectT_cond(offset + lx00(i))
    end do
    k = k + ns(j)
  end do

end if

where selectT_cond contains 0 or 1 and replaces the previous if test in the inner loop. The performance is now 60% faster. How should I use the LINEAR directive? As is?:

!dir$ LINEAR SIMD

Thanks.

Tim Prince

I suppose Martyn was suggesting making the linear designation for k where you had k = k + 1 in the loop, in case the compiler wasn't already recognizing and vectorizing it.  I've seen situations where trailing indexing statements like that inhibited default vectorization; moving them to the top of the loop could be another solution.  I try to do as you did, making the DO index perform all the linear indexing.

I don't think linear applies to the situation you posed on this last post, where k doesn't appear to be linear.

More documentation on the use of SIMD elementals and SIMD LINEAR for Fortran has been discussed, but I haven't seen it actually come out.  Possibly the situation hasn't been reconciled with the transition to !$omp simd (and may be pending changes for ifort 15.0).
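For reference, LINEAR is a clause of the SIMD directive rather than a standalone directive, so a bare "!dir$ LINEAR SIMD" would not be accepted. A sketch of the portable OpenMP form alluded to above (hypothetical; it assumes a compiler with !$omp simd support, such as ifort 15.0 with OpenMP 4.0 enabled):

```fortran
! OpenMP 4.0 spelling of the same hint: the LINEAR clause names
! the induction variable and its per-iteration step.
k = 0
do j = start, end
!$omp simd linear(k:1)
   do i = 1, ns(j)
      k = k + 1
      tco(k) = 10.0*selectT_cond(s + (j-1)*nsamples_start + lx00(i))
   end do
end do
```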
