ifort 13 does not vectorize Newton-Raphson iteration coded as elemental function

styc wrote:

Attached is a small program implementing the Newton-Raphson iteration for solving y = x * exp(x). ifort 13 does not vectorize the program unless the MIC architecture is targeted. Compared against the Fortran version, the equivalent C code written with the elemental function extension shows a 1.8x speedup when measured on Nehalem. Arguably, icc 13 is not optimizing hard enough, either: a version based on intrinsic functions shows a 2.1x speedup over the Fortran code. Greater gains can be expected on Sandy Bridge and Ivy Bridge.

Attachments: test.f90 (789 bytes), test.c (2.13 KB)
Tim Prince wrote:

I see that the Fortran elemental function doesn't have the same effect on optimization here as writing the parallel intrinsics in icc does. If I set more aggressive options, I get the message "not inner loop", indicating that the compiler hasn't learned outer-loop vectorization for this situation. In effect, in your C intrinsics code, you have explicitly pushed enough work inside the while loop to take advantage of SIMD.

styc wrote:

Quote:

TimP (Intel) wrote:

I see that the Fortran elemental function doesn't have the same effect on optimization here as writing the parallel intrinsics in icc does. If I set more aggressive options, I get the message "not inner loop", indicating that the compiler hasn't learned outer-loop vectorization for this situation. In effect, in your C intrinsics code, you have explicitly pushed enough work inside the while loop to take advantage of SIMD.

To clarify, I misread the generated assembly code for the MIC architecture. It is not vectorized, either.

I did not explicitly make the loop body heavier in the intrinsic code than in the scalar code. The algorithm is exactly the same. This is a case where vectorization is almost always beneficial. Masking adds some small overhead, but you save a lot from vectorized division alone.
