VisualStudio compiler parallelisation options and AVX


Hi all,

Basic question.

I have a Fortran code with OpenMP directives and would ideally like to compile it for the new Sandy Bridge processors. The code is a floating-point-intensive modelling code that would likely benefit from AVX.

Are the Sandy Bridge extensions incorporated within the current Intel Fortran compiler?
If so, can I simply select the processor that I want the code to be compiled for? I am happy to let the compiler decide how to parallelise, and/or to use just a few manually inserted OMP directives.

I currently have VS2008 with: Intel Visual Fortran Compiler Integration Package ID: w_cprof_p_11.1.065

Thanks in advance!
Al


Hi,
Yes, the 11.1 and 12.0 Fortran compilers support Sandy Bridge, as do the performance libraries. See http://software.intel.com/en-us/articles/how-to-compile-for-intel-avx/ .
The version 12 compiler, contained in Intel Visual Fortran Composer XE, contains more optimizations for Sandy Bridge than the 11.1 compiler that you are using. So if your support is current, you might consider upgrading.

If you have floating-point loops that can be vectorized, you'll likely get better performance on Sandy Bridge if the data can be 32-byte aligned.
OpenMP is typically more effective than automatic parallelization because you can parallelize at a higher level, provided you understand the dependencies in your code. There are articles on automatic parallelization at http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/ and http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/ .
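
To illustrate what parallelizing at a higher level can look like, here is a minimal sketch. The program and variable names are made up for illustration and are not taken from the original code. The outer loop is threaded with OpenMP, the scratch variables are declared private, and the inner loop is left for the vectorizer.

```fortran
program omp_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 500          ! hypothetical problem size
  real(dp), allocatable :: a(:,:), colsum(:)
  real(dp) :: tmp
  integer :: i, j

  allocate(a(n,n), colsum(n))
  call random_number(a)

  ! Thread the outer loop over columns. Each thread needs its own copy of
  ! the scratch variables i and tmp, so they are declared private.
  !$omp parallel do private(i, tmp)
  do j = 1, n
     tmp = 0.0_dp
     ! The inner loop runs down a column (contiguous in Fortran),
     ! which makes it a natural candidate for vectorization.
     do i = 1, n
        tmp = tmp + a(i,j) * a(i,j)
     end do
     colsum(j) = tmp
  end do
  !$omp end parallel do

  print *, 'sum of squared column norms:', sum(colsum)
end program omp_sketch
```

With the Intel compiler on Windows this would be built with /Qopenmp. Without that option the !$omp lines are treated as comments and the program runs serially.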

Martyn

Fantastic, that's exactly the response I was hoping for. Yes, my support is current, so I will update. Thanks for the links.

hi again

I compiled my code with Intel 11.1 and included the /QxAVX flag (alongside /Qparallel), but didn't specify that the code be optimised for any particular processor / instruction set (apart from AVX). When I ran the code on the computer with the i5-2500K processor, an error message was immediately displayed saying the processor did not support AVX. I know it's a Sandy Bridge processor. Do you have any ideas on what I am doing wrong?

Thanks
Alastair

Hi Alastair,
I think you are doing everything right. I would check the Windows 7 version. Is it Windows 7 SP1? Windows supports AVX only from SP1 onwards. If Service Pack 1 is not installed, then Windows has rejected the application.

That is to say, you may be able to create the .exe, but without SP1 it will not run, and may not even give a diagnostic.

Just to add, you can run the application for functional correctness with SDE:
software.intel.com/en-us/avx

Thanks again folks.

It's a new install of Win7 64, so it's entirely possible SP1 has not been installed. I'll check this eve and post on Monday.

Thanks
Al

I'm not having much joy with this, perhaps in part because I can't readily test whether AVX is being used. If it is being used, the difference is not apparent in the runtime, which is surprising as I'm testing a very computationally intensive modelling code (lots of simple matrix multiplies).

Thoughts:
I've installed all updates to Windows, and it's activated. Is SP1 an RC at the moment? Is that RC(?) required for AVX to be supported?

The system is overclocked (stable) to 4.7 GHz. Is it possible for AVX to have been inadvertently turned off in the BIOS? Memory is 1600; I assume this makes no difference.

With respect to compiler flags in VS2008 (Fortran 11.1), I've tried various combinations. I was anticipating that one of those below would be OK (I have also tried various combinations of Qpar and Qvec thresholds, with no significant runtime change):

/nologo /QxAVX /Qparallel /Qpar-threshold:0 /assume:buffered_io /assume:byterecl /module:"Release\" /object:"Release\" /libs:static /threads /c

/nologo /arch:SSE3 /QxAVX /Qparallel /Qpar-threshold:0 /Qvec-threshold:0 /assume:buffered_io /assume:byterecl /module:"Release\" /object:"Release\" /libs:static /threads /c

/nologo /QxSSSE3 /QxAVX /Qparallel /Qpar-threshold:0 /Qvec-threshold:0 /assume:buffered_io /assume:byterecl /module:"Release\" /object:"Release\" /libs:static /threads /c

There are certainly some areas of the code where there are potentially dependent variables within big loops. These could be set as private / not dependent in OpenMP. All I'm trying to do at the moment, though, is get AVX working. After that I'll spend time going through the code and trying to give the compiler instructions on what is dependent or not.

What do you think I'm doing wrong?

Thanks
Al

Windows 7 SP1 is easily found by a browser search. It seems to be advertised as a beta test, presumably meaning they can expire it and require you to back it off. Yes, you must have it installed in order to run AVX code.
Among the requirements for AVX to give a performance boost would be:
All data 32-byte aligned.
/Qvec-report shows vectorization under /QxAVX -O3
Data local to L1 data cache (if your matrices are large enough, you would require a current MKL). Your overclocking would increase the importance of this.

Attempting to cater to a variety of older CPUs is probably counter-productive when trying to maximize performance on a single CPU type. If you can't get 16-byte alignments, /QxSSE3 may be faster than newer architecture options for double precision, although I haven't tried this on SNB.
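
As a rough sketch of how the alignment requirement might be expressed in source: the Intel-specific ATTRIBUTES ALIGN and VECTOR ALIGNED directives can request and then assert 32-byte alignment. The array names and sizes below are hypothetical, and I am assuming your compiler version accepts both directives. Check the vectorization report to see whether they make any difference for your loops.

```fortran
program align_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024         ! hypothetical array length
  real(dp) :: a(n), b(n), c(n)
  ! Intel-specific directive: request 32-byte alignment for these arrays.
  ! Other compilers see this line as an ordinary comment.
  !dir$ attributes align : 32 :: a, b, c
  integer :: i

  b = 1.0_dp
  c = 2.0_dp

  ! Intel-specific directive: assert that every array reference in the
  ! following loop is aligned. Only safe when that is really the case.
  !dir$ vector aligned
  do i = 1, n
     a(i) = b(i) + 2.5_dp * c(i)
  end do

  print *, a(1), a(n)
end program align_sketch
```

Note that VECTOR ALIGNED is a promise to the compiler, not a request. If the data are not in fact aligned, the program may fault at run time.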

Thanks again. I'll give it a go and post - hopefully - tomorrow.
Ta
Al

You can check the assembly code (create with /FAs) in one or two of your kernels to see whether AVX instructions, and especially 256 bit AVX floating-point instructions, are being generated.
Look for vector instructions that begin with a "v", such as vaddpd and vmulpd. 128 bit instructions have xmm registers as arguments, 256 bit instructions have ymm registers as arguments. In the present processor generation, only floating-point instructions have 256 bit versions. These are what you hope will give your matrix multiplications a speedup. As Tim says, the vectorization report (either /Qopt-report-phase:hpo or /Qvec-report2) will tell you whether a loop was vectorized successfully using packed SIMD instructions, but it won't tell you whether 128 or 256 bit instructions were used. /Qopt-report-phase:hlo gives additional information about loop optimizations if you compile with /O3. Note that the compiler may generate multiple code versions for a single loop, some vectorized, some not, depending on iteration count, dependencies or data alignment at run-time.

I see that you compile with /Qparallel. In the 12.0 compiler, if you compile with /Qparallel /O3, the compiler may substitute an optimized (including threaded) library call for recognizable matrix multiply loop nests. There's also an optimized Fortran intrinsic MATMUL that you could call. I do not recommend the use of /Qpar-threshold:0 and /Qvec-threshold:0. This may thread and vectorize many unsuitable or short loops, and create a lot of unnecessary overhead, especially for threading. For vectorization, just use the default. If you want to experiment with auto-parallelization, try /Qpar-threshold:99. This allows loops to be threaded without requiring that the iteration count be known at compile time.
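
For reference, a minimal sketch comparing an explicit matrix multiply loop nest with the MATMUL intrinsic. The matrix size and names are invented for illustration. Whether the loop nest is recognized and replaced by a library call depends on the compiler version and options, so treat this only as a starting point for your own timing and report checks.

```fortran
program matmul_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200          ! hypothetical matrix size
  real(dp), allocatable :: a(:,:), b(:,:), c_loop(:,:), c_intr(:,:)
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c_loop(n,n), c_intr(n,n))
  call random_number(a)
  call random_number(b)

  ! Explicit loop nest, ordered so that the innermost loop runs over the
  ! first (contiguous) index of c_loop and a.
  c_loop = 0.0_dp
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c_loop(i,j) = c_loop(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do

  ! The same product using the standard Fortran intrinsic.
  c_intr = matmul(a, b)

  ! The two results should agree to within rounding error.
  print *, 'max abs difference:', maxval(abs(c_loop - c_intr))
end program matmul_sketch
```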

In the example I posted on the threading forum today, /Qpar-threshold88 is the highest value for which combined vectorization and threading occurs, where the array size is hidden from the compiler (using the 12.0 release which was announced yesterday). The higher settings of par-threshold are influenced by the expectation that threading isn't productive when the loop has the assumed length 100. Setting
!dir$ loop count(9999)
produces auto-parallelization without depending on par-threshold. When aligned data are asserted, it doesn't auto-parallelize even at /Qpar-threshold0.
I agree with Martyn that lower values of par-threshold lead to counter-productive threading. I found it nearly impossible to get effective results from /Qparallel with the 11.x compilers, on code samples other than those which were used to tune the compiler. The re-implementation of loop count directive provides an excellent alternative to simply requesting more aggressive parallelization when the compiler assumes too low a trip count.
With the current C++ compilers at /O3, Martyn's caution about multiple code versions is well taken; the reports will show vectorization or auto-parallelization without telling whether the expected code path is among those which are optimized.
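
A minimal sketch of where the loop count directive might go, assuming a routine that receives the array size as an argument so that the trip count is not visible at the loop itself. The routine name, sizes and arithmetic are invented for illustration and are not the example from the threading forum. In real code the routine would normally sit in a separate source file, since here the compiler can still see the caller.

```fortran
program loop_count_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: n
  real(dp), allocatable :: a(:), b(:), c(:)

  n = 100000                             ! hypothetical array length
  allocate(a(n), b(n), c(n))
  b = 1.0_dp
  c = 2.0_dp
  call scale_add(a, b, c, n)
  print *, a(1), a(n)

contains

  subroutine scale_add(x, y, z, m)
    integer, intent(in) :: m
    real(dp), intent(out) :: x(m)
    real(dp), intent(in) :: y(m), z(m)
    integer :: i
    ! Intel-specific hint: tell the optimizer to expect a long trip count
    ! rather than its default assumption. Other compilers ignore this line.
    !dir$ loop count(9999)
    do i = 1, m
       x(i) = y(i) + 2.5_dp * z(i)
    end do
  end subroutine scale_add

end program loop_count_sketch
```

Compiling with /Qparallel and the parallelization and vectorization reports turned on should show whether the hint changes the compiler's decision for this loop.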
