# Help for vectorization

Dear Intel developers,

I need to vectorize the following code using Intel CS 2013:

```fortran
subroutine mysubroutine(n, q)
  ! long, stnd, base, base_dim and logarithm are defined elsewhere (not shown)
  integer(long),                      intent(IN)  :: n
  real(stnd),    dimension(base_dim), intent(OUT) :: q

  integer(long) :: nk
  integer(long) :: sk, bk
  integer(long) :: npow

  real(stnd)    :: x
  integer(long) :: i, j

  q  = 0.0
  nk = 0
  do i = 1, base_dim

    ! highest power of base(i) to visit
    x = logarithm( real(n + 1), real(base(i)) )
    npow = floor(x)

    sk = n
    do j = npow, 0, -1
      bk = base(i)**j
      nk = floor( real(sk) / real(bk) )   ! quotient at position j
      sk = sk - nk * bk                   ! remainder carried into the next iteration

      q(i) = q(i) + real(nk) / real(bk * base(i))

    end do

  end do

end subroutine mysubroutine
```

The compiler reports ANTI and FLOW dependences on sk, and ANTI and FLOW dependences on q.

Could you help me vectorize the inner loop? In particular, I have no idea how to resolve the sk dependence. Thanks in advance.

sk is written with sequential dependence (the value for the next iteration depends on the current one).

The same applies to q(i), which is accumulated across iterations.
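One possible way around the sk dependence, as a sketch only: as far as I can tell, the `nk` computed at step `j` is simply the digit of `n` at position `j` in base `base(i)`, so it can be obtained directly as `mod(n / b**j, b)` without carrying `sk` from one iteration to the next. What remains is the sum into `q(i)`, which is an ordinary reduction. The module name, the kind definitions for `long` and `stnd`, and the function `digit_sum` are assumptions made for this example; the thread does not show the originals.

```fortran
module radical_inverse_sketch
  implicit none
  ! Assumed kinds: the thread's own `long` and `stnd` are not shown.
  integer, parameter :: long = selected_int_kind(18)
  integer, parameter :: stnd = selected_real_kind(15)
contains

  ! Hypothetical helper: sum over j of digit_j / b**(j+1), where digit_j is
  ! the base-b digit of n at position j. Assumes n >= 0 and b >= 2.
  pure function digit_sum(n, b) result(q)
    integer(long), intent(in) :: n, b
    real(stnd)    :: q
    integer(long) :: bk
    integer       :: j, npow

    ! Highest digit position, found with integer arithmetic only.
    ! (bk * b may overflow if n is close to huge(n); not handled here.)
    npow = 0
    bk = b
    do while (bk <= n)
      bk = bk * b
      npow = npow + 1
    end do

    ! No iteration reads a value written by another one, apart from
    ! the running sum q.
    q = 0.0_stnd
    do j = npow, 0, -1
      bk = b**j
      q = q + real(mod(n / bk, b), stnd) / real(bk * b, stnd)
    end do
  end function digit_sum

end module radical_inverse_sketch
```

Calling `digit_sum(n, base(i))` for each `i` should reproduce `q(i)` from the original routine. Whether a given compiler actually vectorizes the loop (integer division and `b**j` are not cheap in SIMD) is something to check in the vectorization report.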

Depending on the parts you have removed, the compiler may prefer to optimize the outer loop. If base_dim is large enough and you use consistently typed reals and integers, it may vectorize portions of the inner loop by interchanging the loops, so that a group of i values is processed in parallel SIMD lanes.

Your logarithm function would apparently need to be in a form that can be inlined in terms of standard math intrinsics.
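For illustration, a logarithm to an arbitrary base can be written with the standard `log` intrinsic alone. This is a sketch that assumes the thread's `logarithm(x, b)` means "log of x to base b"; the actual routine is not shown, and the module name and the kind `stnd` are assumptions too.

```fortran
module log_base_sketch
  implicit none
  integer, parameter :: stnd = selected_real_kind(15)   ! assumed kind
contains

  ! Hypothetical stand-in for the thread's logarithm(x, b).
  ! Elemental, so it also works on whole arrays of x and b.
  elemental function logarithm(x, b) result(y)
    real(stnd), intent(in) :: x, b
    real(stnd) :: y
    y = log(x) / log(b)   ! change-of-base identity
  end function logarithm

end module log_base_sketch
```

Note that `floor(logarithm(...))` can be off by one when the argument is at or very near an exact power of the base, because of rounding. Counting digits with integer arithmetic avoids the issue if that matters for your range of n.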

How can I interchange the loops if the inner loop's trip count depends on npow, which is computed in the outer loop?

Is base_dim large enough for you to use a parallel loop?

Jim Dempsey

base_dim is about 400. Maybe that is too small to vectorize.

400 is large enough to justify vectorization, although it may be marginal on the MIC.

Please post an example that can be compiled, and possibly try a current compiler. The compiler I tried appears to distribute the outer loop inside the inner one in order to attempt that sort of vectorization.

I'd hope you were familiar enough with your algorithm to have your own ideas about how to interchange loops explicitly.

Tim,

I do not see how he can vectorize the inner loop, since each lane of the vector will potentially (and likely) have a different trip count.

That said, if he reworked the inner loop (or added another level of loop nesting), he could potentially run all lanes of the vector, provided that j is allowed to run into negative values and that, when j is negative, the reworked computation contributes 0.0 to the summation (and does so without any control-flow changes in the code).

This may be too hard to figure out, and it will rely on the compiler to make sense of the source code. This may be one of the cases where you hand write the code using intrinsics in C++.
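A rough Fortran sketch of the "same trip count in every lane" idea, under stated assumptions. It is a variation on what is described above, not the same scheme: instead of letting j go negative, it peels digits off from the least significant end, so once a base has run out of digits the remaining iterations add 0.0 with no branch. The loops are interchanged so that the inner loop runs over the bases. The names `radical_inverse_all`, `rest` and `weight`, and the kinds, are made up for this example, and `base` is passed as an argument rather than taken from a module.

```fortran
module fixed_trip_sketch
  implicit none
  integer, parameter :: long = selected_int_kind(18)   ! assumed kind
  integer, parameter :: stnd = selected_real_kind(15)  ! assumed kind
contains

  ! Assumes n >= 0, every base(i) >= 2 and size(q) == size(base).
  subroutine radical_inverse_all(n, base, q)
    integer(long), intent(in)  :: n
    integer(long), intent(in)  :: base(:)
    real(stnd),    intent(out) :: q(:)
    integer(long) :: rest(size(base))     ! part of n not yet consumed, per base
    real(stnd)    :: weight(size(base))   ! 1 / base(i)**k, per base
    integer(long) :: t, bmin
    integer       :: i, k, ndigits

    ! Common trip count: number of digits of n in the smallest base.
    bmin = minval(base)
    ndigits = 0
    t = n
    do while (t > 0)
      t = t / bmin
      ndigits = ndigits + 1
    end do

    rest   = n
    weight = 1.0_stnd
    q      = 0.0_stnd
    do k = 1, ndigits
      ! Every i is independent of the others: candidate for SIMD.
      do i = 1, size(base)
        weight(i) = weight(i) / real(base(i), stnd)
        ! Once rest(i) reaches 0 the digit is 0 and the term adds 0.0.
        q(i) = q(i) + real(mod(rest(i), base(i)), stnd) * weight(i)
        rest(i) = rest(i) / base(i)
      end do
    end do
  end subroutine radical_inverse_all

end module fixed_trip_sketch
```

The terms are summed in the opposite order to the original loop, so the last bits of `q(i)` may differ slightly. Larger bases do wasted iterations, which is the price for a uniform trip count.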

Jim Dempsey

What is the average value of npow?

IOW, what is the average trip count of the DO J loop?

If this is large enough, then making DO I= parallel might be worthwhile.
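As an illustration of making the DO I loop parallel, here is a minimal OpenMP sketch. It assumes the per-base work can be written with private scalars only; the routine name `radical_inverse_omp`, the kinds and the digit-by-digit formulation are choices made for this example, not the original code. Without OpenMP enabled, the directives are treated as comments and the loop runs serially.

```fortran
module parallel_outer_sketch
  implicit none
  integer, parameter :: long = selected_int_kind(18)   ! assumed kind
  integer, parameter :: stnd = selected_real_kind(15)  ! assumed kind
contains

  ! Assumes n >= 0, every base(i) >= 2 and size(q) == size(base).
  subroutine radical_inverse_omp(n, base, q)
    integer(long), intent(in)  :: n
    integer(long), intent(in)  :: base(:)
    real(stnd),    intent(out) :: q(:)
    integer(long) :: rest
    real(stnd)    :: weight, acc
    integer       :: i

    ! Each i writes only q(i); the scalars are private to each thread.
    !$omp parallel do private(rest, weight, acc) schedule(static)
    do i = 1, size(base)
      rest   = n
      weight = 1.0_stnd
      acc    = 0.0_stnd
      do while (rest > 0)
        weight = weight / real(base(i), stnd)
        acc    = acc + real(mod(rest, base(i)), stnd) * weight
        rest   = rest / base(i)
      end do
      q(i) = acc
    end do
    !$omp end parallel do
  end subroutine radical_inverse_omp

end module parallel_outer_sketch
```

With about 400 bases and a short inner loop, the thread start-up cost may well exceed the work, so this is only worth trying if the routine is called with a large average trip count or from inside an already parallel region.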

If npow is statistically small, you might be able to pre-compute the results and store them in a multidimensional array. Then replace the computation with an index calculation and just fetch the correct result.
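One possible reading of the pre-computation idea, sketched under assumptions: since the result depends on n, what can be tabulated once is the set of powers `base(i)**j` and their reciprocal weights, indexed by `(i, j)`. Each call then only does a lookup, an integer divide and a multiply. The routine names, the table layout, the upper bound `n_max` and the kinds are all hypothetical.

```fortran
module power_table_sketch
  implicit none
  integer, parameter :: long = selected_int_kind(18)   ! assumed kind
  integer, parameter :: stnd = selected_real_kind(15)  ! assumed kind
contains

  ! pw(i,j) = base(i)**j and inv(i,j) = 1 / base(i)**(j+1) for all powers
  ! not exceeding n_max. Unused slots keep pw = huge and inv = 0.
  ! Assumes n_max >= 0 and every base(i) >= 2.
  subroutine build_tables(base, n_max, pw, inv)
    integer(long), intent(in) :: base(:), n_max
    integer(long), allocatable, intent(out) :: pw(:,:)
    real(stnd),    allocatable, intent(out) :: inv(:,:)
    integer(long) :: p, t, bmin
    integer       :: i, j, jmax

    ! Deepest column: highest digit position of n_max in the smallest base.
    bmin = minval(base)
    jmax = 0
    t = n_max / bmin
    do while (t > 0)
      t = t / bmin
      jmax = jmax + 1
    end do

    allocate(pw(size(base), 0:jmax), inv(size(base), 0:jmax))
    pw  = huge(p)
    inv = 0.0_stnd
    do i = 1, size(base)
      p = 1
      do j = 0, jmax
        pw(i, j)  = p
        inv(i, j) = 1.0_stnd / (real(p, stnd) * real(base(i), stnd))
        if (p > n_max / base(i)) exit   ! next power would exceed n_max
        p = p * base(i)
      end do
    end do
  end subroutine build_tables

  ! Valid for 0 <= n <= n_max used when the tables were built.
  subroutine radical_inverse_tab(n, base, pw, inv, q)
    integer(long), intent(in)  :: n, base(:)
    integer(long), intent(in)  :: pw(:, 0:)
    real(stnd),    intent(in)  :: inv(:, 0:)
    real(stnd),    intent(out) :: q(:)
    integer :: i, j

    q = 0.0_stnd
    do j = 0, ubound(pw, 2)
      do i = 1, size(base)   ! contiguous in i, independent across i
        q(i) = q(i) + real(mod(n / pw(i, j), base(i)), stnd) * inv(i, j)
      end do
    end do
  end subroutine radical_inverse_tab

end module power_table_sketch
```

The tables would be built once, for example with `call build_tables(base, n_max, pw, inv)`, and then reused for every n. With about 400 bases and a few dozen digit positions the tables stay small.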

Jim Dempsey