ifort 13.0 generates unneeded code during vectorization

Posted by styc

Compile the attached source file with '-O3' and either '-xSSE4.2' or '-xAVX'. ifort 13.0 vectorizes the k-loop but also generates an unneeded scalar version. Since the loop count is always 4, that scalar version can never be used.

By the way, the compiler seems to be too aggressive in vectorizing. It generates simulated gathers for accesses to the o array. In order to use VPSLLD, it uses three instructions to pack four integers into a vector, then another seven instructions to unpack them into four GPRs. It would have been better to use GPRs from the start, with SHL/LEA instead of VPSLLD.

Attachment: test.f90 (1.3 KB)
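For context on where a shift can come from: c is declared as c(0 : 3, 0 : 6) in the code later in this thread, so in Fortran's column-major layout the element c(j, o) sits j + 4*o elements from the start of the array, and 4*o is a left shift of o by 2. The sketch below only checks that index arithmetic. That this is the computation the compiler carries out with VPSLLD is an assumption, and the program name and the flat helper array are made up for illustration.

```fortran
program column_offset
    implicit none
    double precision :: flat(0:27), c(0:3, 0:6)
    integer :: j, o, idx

    ! Number the 28 elements 0..27 in storage order (flat is a hypothetical helper).
    do idx = 0, 27
        flat(idx) = dble(idx)
    end do

    ! reshape fills c in column-major order: the first subscript varies fastest.
    c = reshape(flat, [4, 7])

    do o = 0, 6
        do j = 0, 3
            ! ishft(o, 2) equals 4 * o, the element offset of column o.
            idx = j + ishft(o, 2)
            if (c(j, o) /= flat(idx)) stop 1
        end do
    end do

    print *, "c(j, o) is element j + ishft(o, 2) in storage order"
end program column_offset
```

With scalar code, a scaling like this can be folded into ordinary integer address arithmetic, which is the alternative the post argues for.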
Posted by styc

Just to be sure: is Intel not interested in reports like this one, which are not bugs per se? If so, I will refrain from reporting such things in the future.

Posted by Kevin Davis (Intel)

We apologize for the delayed reply. I overlooked it. We do appreciate reports such as this here. I will investigate your findings and post an update soon.

Posted by jimdempseyatthecove

Until the fix can be implemented, consider generating pick lists:


subroutine foo(c, ld, n, m, o, x)
    double precision, intent(in) :: c(0 : 3, 0 : 6)
    integer, intent(in) :: ld, n, m
    integer, intent(in) :: o(ld, m)
    double precision, intent(inout) :: x(ld, m)
    double precision t(0 : 3, 0 : 3), u, v, w
    integer i, k
    integer o0k(0:3), o1k(0:3), o2k(0:3)

    t = 0.d0
    do i = 4, n - 5, 3
        do k = 0, 3
            o0k(k) = o(i, 1 + k)
            o1k(k) = o(i+1, 1 + k)
            o2k(k) = o(i+2, 1 + k)
        end do
        do k = 0, 3
            u = t(k, 1)
            v = t(k, 2)
            w = t(k, 3)
            t(k, 1) = x(i, 1 + k)
            t(k, 2) = x(i + 1, 1 + k)
            t(k, 3) = x(i + 2, 1 + k)
            x(i, 1 + k) = &
                      c(0, o0k(k)) * t(k, 1)&
                    + c(1, o0k(k)) * (w + t(k, 2))&
                    + c(2, o0k(k)) * (v + t(k, 3))&
                    + c(3, o0k(k)) * (u + x(i + 3, 1 + k))
            x(i + 1, 1 + k) = &
                      c(0, o1k(k)) * t(k, 2)&
                    + c(1, o1k(k)) * (t(k, 1) + t(k, 3))&
                    + c(2, o1k(k)) * (w + x(i + 3, 1 + k))&
                    + c(3, o1k(k)) * (v + x(i + 4, 1 + k))
            x(i + 2, 1 + k) = &
                      c(0, o2k(k)) * t(k, 3)&
                    + c(1, o2k(k)) * (t(k, 2) + x(i + 3, 1 + k))&
                    + c(2, o2k(k)) * (t(k, 1) + x(i + 4, 1 + k))&
                    + c(3, o2k(k)) * (w + x(i + 5, 1 + k))
        end do
    end do
end subroutine

Jim Dempsey

www.quickthreadprogramming.com
Posted by styc

@Jim
With your modifications, the compiler no longer vectorizes the loop. That eliminates the root cause of all raised issues. If the second k-loop is force-vectorized with '!dec$ simd', the scalar version is still generated.

Performance-wise, not vectorizing is perhaps the better decision. In the real code (not this reduced case), disabling vectorization with '!dec$ novector' for this loop improves performance. I cannot tell whether that is because this loop is simply not worth vectorizing, or because the extra dependence caused by VPSLLD incurs excessive delays. I can find no way to tell the compiler not to generate VPSLLD.
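A minimal sketch of where such a directive goes, assuming the Intel compiler's !dec$ directive prefix as used in this thread. The loop body is a made-up stand-in, not the real kernel, and the program and variable names are hypothetical. Compilers that do not recognize the directive treat that line as an ordinary comment.

```fortran
program novector_demo
    implicit none
    double precision :: a(0:3), b(0:3)
    integer :: k

    b = [1.d0, 2.d0, 3.d0, 4.d0]

    ! Intel-specific directive: ask the compiler not to vectorize the loop
    ! that immediately follows. It applies to that one loop only.
    !dec$ novector
    do k = 0, 3
        a(k) = 2.d0 * b(k)   ! stand-in for the real loop body
    end do

    print *, a
end program novector_demo
```

Replacing the directive with '!dec$ simd' at the same position is how the forced-vectorization variant mentioned above would be written.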

Posted by jimdempseyatthecove

>>With your modifications, the compiler no longer vectorizes the loop. That eliminates the root cause of all raised issues. If the second k-loop is force-vectorized with '!dec$ simd', the scalar version is still generated.

From my programming perspective:

My primary concern is not whether the compiler reports vectorization, or whether vectorization is used at all, but rather that the compiler uses vectorization when it is appropriate (read: faster code).

With the pick list modifications, did the code run faster than without the pick lists (with and without explicit simd vectorization)?

What I am trying to teach the readers of this thread is: do not assume vectorization is always best (do not force it when it is not appropriate), and at times help the compiler out (e.g., by incorporating the pick lists).

BTW, it was a good catch to look down at the disassembly level and notice the root cause of the additional overhead. Not all posters do this. It is not as hard as it seems.

Jim Dempsey

www.quickthreadprogramming.com
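One way to answer the "did it run faster" question is a small timing harness built once per variant (pick lists, '!dec$ simd', '!dec$ novector'). The following is only a sketch: the kernel routine is a hypothetical stand-in to be replaced by the routine under test, and the array sizes and repetition count are arbitrary assumptions.

```fortran
program time_kernel
    implicit none
    integer, parameter :: ld = 1024, m = 4, reps = 1000
    double precision :: x(ld, m)
    integer(kind=8) :: t0, t1, rate
    integer :: r

    x = 1.d0
    call system_clock(count_rate=rate)   ! clock ticks per second
    call system_clock(t0)
    do r = 1, reps
        call kernel(x, ld, m)
    end do
    call system_clock(t1)

    print *, "seconds per call:", dble(t1 - t0) / dble(rate) / reps
    ! Printing a value derived from x keeps the timed work from being removed.
    print *, "checksum:", sum(x)
contains
    subroutine kernel(x, ld, m)
        ! Hypothetical stand-in; substitute the variant being measured.
        integer, intent(in) :: ld, m
        double precision, intent(inout) :: x(ld, m)
        integer :: i, k
        do i = 1, ld
            do k = 1, m
                x(i, k) = 0.5d0 * x(i, k) + 1.d0
            end do
        end do
    end subroutine kernel
end program time_kernel
```

Comparing the reported time across builds of the same source with different directives gives a direct measure that does not depend on what the vectorization report says.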
Posted by styc

Maybe you misunderstood. I meant that the compiler generates a useless scalar remainder loop when it decides to vectorize. This is a separate issue from whether it makes good decisions about when to vectorize.

I did not test your code, but since it prevents vectorization, I presume that '!dec$ novector' will have the same effect (or a better one, because the compiler does not need to deal with the o?k arrays). And yes, my measurements did show that disabling vectorization results in better performance. The uncertain point, as I mentioned, is the reason for the performance degradation. Whether this code is worth vectorizing cannot be tested directly because the compiler generates less-than-ideal code.

Posted by Kevin Davis (Intel)

Development continues to investigate. Regarding the scalar version, they commented: "Can't get rid of scalar code ---- if vectorized. The array dim size LD may be zero and the scalar code is used for that fall back path."

I will post an update as I hear more and will pass along any comments you may have.

Posted by styc

The value of ld cannot affect the vectorizability of the k-loop. k only ever appears in the second subscripts of references to the o and x arrays, which have nothing to do with ld. Furthermore, if ld is indeed zero, then the i-loop must not run, i.e., n must be less than 9; otherwise all accesses to o and x would be out of bounds. In that case no code is needed at all, including the scalar loop.
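The bound on n can be checked with a few lines that count iterations of the same loop header, do i = 4, n - 5, 3, for several values of n. This is a sketch with made-up program and variable names; it only counts trips and touches no arrays.

```fortran
program trip_count
    implicit none
    integer :: n, i, trips

    do n = 6, 12
        trips = 0
        ! Same bounds and stride as the i-loop in the subroutine.
        do i = 4, n - 5, 3
            trips = trips + 1
        end do
        print '(a, i3, a, i2)', "n =", n, "  iterations =", trips

        ! The body runs only once the upper bound n - 5 reaches the start
        ! value 4, i.e. when n >= 9.
        if ((trips > 0) .neqv. (n >= 9)) stop 1
    end do
end program trip_count
```

Because even the first iteration (i = 4) references x(i + 5, 1 + k), any call in which the loop runs needs a first dimension of at least 9, so a zero ld and a non-empty i-loop cannot occur together in a valid call.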

Posted by Kevin Davis (Intel)

Thank you for the feedback. I failed to indicate earlier that this issue was reported to Developers under the internal tracking id, DPD200237580. I added your latest feedback and will update when I learn more.

(Internal tracking id: DPD200237580)

Posted by styc

Does the issue regarding the use of a vector shift instead of a scalar shift have a tracking id as well?

Posted by Kevin Davis (Intel)

Both issues were reported under the internal tracking id I noted earlier. I will split them if it becomes necessary, but Development is considering both issues at this time.
