/QxHost vs /QxAVX

I have recently been trying to test the effectiveness of AVX instructions with OpenMP.
Unfortunately I am not getting the results I expected, and I suspect I do not fully understand the difference between the /QxHost and /QxAVX options.

I have ifort installed on a Xeon machine and have been running the resulting .exe files on the Xeon, i5 and i7 processors.
The compile options I am using are:
set options=/Tf %1.f90 /free /O2 /QxAVX /Qopenmp
set options=/Tf %1.f90 /free /O2 /QxHost /Qopenmp

The problem I am getting is that the .exe compiled with /QxAVX runs slower on the i5 and i7 than the one compiled with /QxHost.

I confirmed that the /QxAVX .exe would not run on the Xeon, which does not support AVX instructions.

I was expecting that the .exe generated with /QxHost would use instructions that are compatible with the Intel(R) Xeon(R) W3520 CPU @ 2.67 GHz : 12.0 GB : Win 7 Enterprise, which does not support AVX instructions.

This result makes me question: does the .exe compiled with /QxHost enable instructions based on the computer where the .exe was generated or does it provide for multiple instruction sets and adopt the instruction set of the computer on which it is being run?

The version of the compiler I am using is Version 12.1.5.344 Build 20120612.

John

To run the program on the i5 and i7, I included the file:
C:\Program Files (x86)\Intel\Composer XE 2011 SP1\redist\intel64\compiler\libiomp5md.dll
with the .exe. I did not get messages that other files were required.
The .exe files were generated using a .bat file, not in a VS project.

John

/QxHost chooses the latest instruction set available on the machine where the compiler is running.

Tim,

Thanks for the clarification, but I would have expected that an .exe built with /O2 /QxAVX /Qopenmp and run on an i7 (elapsed time = 84.021 seconds) would perform better than an .exe built with /O2 /QxHost /Qopenmp on a Xeon(R) W3520 CPU and then run on an i7 (elapsed time = 79.069 seconds), rather than 6% slower. This is a compute-intensive run, based on the recent stack overflow post, where I have been able to achieve significant improvements in the use of !$OMP.

Does /Qopenmp conflict with /QxAVX, or am I doing something else wrong?
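For reference, the 6% figure follows directly from the two elapsed times quoted above. A minimal sketch of the arithmetic (the function name percent_slower is only illustrative, not from any tool):

```python
def percent_slower(slow_seconds, fast_seconds):
    """Return how much longer slow_seconds is than fast_seconds, in percent."""
    return (slow_seconds - fast_seconds) / fast_seconds * 100.0

avx_time = 84.021   # /QxAVX build run on the i7 (from the post)
host_time = 79.069  # /QxHost build made on the W3520, run on the i7 (from the post)

# Report the slowdown of the AVX build relative to the /QxHost build
print(f"/QxAVX build is {percent_slower(avx_time, host_time):.1f}% slower")
```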

John

I guess you're comparing SSE4.2 code generated by /QxHost on I7-1 against AVX code running on I7-2 or maybe I7-3.

It's not unusual to see better performance with SSE4.1 or even SSE2 code than with AVX.  I think some AVX deficits may have been overcome in the current compiler release.

Beyond that, some factors which could result in SSE4.x code running faster than AVX include data alignment, loop lengths, and cache behavior.  Performance advantage for AVX on CPUs prior to Haswell usually requires extreme L1 cache locality.

32-byte data alignment, perhaps promoted by /align array32byte, can help performance of SSE4 even more than it helps AVX.

In order to get full advantage of AVX with OpenMP, you may need to arrange not only that the arrays are 32-byte aligned, but also that the array size is a multiple of 32 bytes times the number of threads, with static scheduling, so that each thread gets an aligned data chunk.
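To make the sizing rule concrete, here is a rough sketch of the arithmetic in Python. It assumes the array base is already 32-byte aligned, 8-byte (double precision) elements, and that static scheduling gives each thread one contiguous block of ceil(n / threads) iterations; the exact split is up to the OpenMP runtime, so treat that as an assumption. The function names are made up for illustration.

```python
def chunk_starts_aligned(n_elements, n_threads, elem_bytes=8, align_bytes=32):
    """True if every thread's block starts on an align_bytes boundary.

    Assumes the array base is aligned and each thread gets one contiguous
    block of ceil(n_elements / n_threads) elements (an assumed static split).
    """
    chunk = -(-n_elements // n_threads)  # integer ceiling division
    return all((t * chunk * elem_bytes) % align_bytes == 0
               for t in range(n_threads))

def padded_length(n_elements, n_threads, elem_bytes=8, align_bytes=32):
    """Round n_elements up to a multiple of (elements per 32 bytes) * threads."""
    block = (align_bytes // elem_bytes) * n_threads
    return -(-n_elements // block) * block

n, threads = 1000, 4
print(chunk_starts_aligned(n, threads))       # 250 doubles = 2000 bytes per chunk
n_padded = padded_length(n, threads)
print(n_padded, chunk_starts_aligned(n_padded, threads))
```

With 1000 doubles on 4 threads, each chunk is 2000 bytes, which is not a multiple of 32, so the second thread starts misaligned. Padding the array to 1008 elements gives 2016-byte chunks, and every thread then starts on a 32-byte boundary.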

I suppose when you use all the hyperthreads you could find there is no advantage in AVX even if you have L1 data locality, so you need to investigate whether setting 1 thread per core is helpful.  Most people who post here don't find such suggestions acceptable.

Thanks for your post. At present I only have the ifort compiler installed on the Xeon(R) W3520, so all /QxHost.exe files are generated for that instruction set.

I have been assuming the Xeon instructions are 128 bit, while AVX (i7) are 256 bit, so potentially twice as fast. I shall review your post and need to better understand the difference between SSE2, SSE4, SSE4.1 and SSE4.2.
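The "twice as fast" expectation comes from the register widths: a 256-bit register holds twice as many elements as a 128-bit one. A small sketch of that count (the helper name lanes is illustrative only); it gives an upper bound per instruction, not a measured speedup, since alignment, cache behavior and the hardware data paths also matter:

```python
def lanes(register_bits, element_bits):
    """Number of elements of element_bits that fit in one vector register."""
    return register_bits // element_bits

for name, bits in (("SSE (128-bit)", 128), ("AVX (256-bit)", 256)):
    # 32-bit single precision and 64-bit double precision elements
    print(f"{name}: {lanes(bits, 32)} singles or {lanes(bits, 64)} doubles per register")
```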

I have always been puzzled why the issue of alignment was not addressed by providing a suitable transfer instruction for non-aligned addresses, rather than requiring software to attempt to manage the problem.

John 

AVX on Core I7-3 "Ivy Bridge" has much improved hardware support for unaligned loads, but the compilers assume I7-2.  Full hardware support for 256-bit divide, stores and path to L2 cache is delayed to the next model.  This is a fairly common situation, where new features don't offer much advantage on the first CPUs which implement them.

Intel compiler options often fail to clarify the situation.  Although I7-1 supports SSE4.2 and that is what /QxHost chooses, SSE4.1 code may run better.  Likewise, /QxHost chooses a special AVX variant on I7-3; I don't believe it actually enables anything new in ifort, but it may prevent the generated code from running on I7-2.
