AVX vs. SSE4.2 performance on Sandybridge

I took the sample vectorization code matrix_vector_multiplication_f and modified it slightly to use allocatable memory.  Then I compiled the code two ways: 1) /QxSSE4.2 plus /QaxAVX; 2) /QxSSE4.2 only, and ran both on an E5-2690.  For single precision I was expecting an 8x peak speedup from vectorization with AVX and a 4x peak speedup with SSE4.2.  But the results did not match my expectation: I observed 4x with SSE compared to the no-vec version, but less than 5x with AVX.  Did I miss anything?

I used Fortran Compiler XE 13.0.1.119 and Visual Studio 2008 Version 9.0.30729.1 SP.  The OS is Windows Server 2008 R2 Standard SP1.  32-byte alignment and ipo are applied.  The baseline is compiled with -O1 and the vectorized versions with -O3.  I also varied the number of columns of the array and noticed that performance drops as the total data size reaches 32 KB and 256 KB.  I guess that is due to L1 and L2 cache misses.  Is that correct?

 


>>...For single precision I was expecting an 8x peak speedup from vectorization with AVX and a 4x peak speedup
>>with SSE4.2. But the results did not match my expectation: I observed 4x with SSE compared to the no-vec version,
>>but less than 5x with AVX. Did I miss anything?

Your expectation of an 8x peak improvement with AVX could be valid, but only when the computation really processes 8 single-precision values per instruction and there is no cache-line-related overhead. As the data set grows, performance drops because load and store operations start to dominate.

In reality... We recently completed a set of SSE vs. AVX tests on Sandy Bridge and Ivy Bridge, and the performance improvement ranged from ~3x to ~6x (for the sqrt operation); the C/C++ codes were aggressively optimized by Intel C++ compiler 13.0.0.089 (Initial Release).

>>...I also varied the number of columns of the array and noticed that performance drops as the total data size reaches
>>32 KB and 256 KB. I guess that is due to L1 and L2 cache misses. Is that correct?

Possibly yes. Please verify the L1 and L2 cache sizes for your CPU in the datasheet (PDF document, always on the right-hand side of the page) on Ark.intel.com.

Matrix multiplication is an ideal application for demonstrating AVX performance.  It depends strongly on tiling for L1 locality, hence the renewed emphasis on performance libraries such as MKL.

You may notice with -O3 compilation that the Intel Fortran compiler can perform an automatic unroll-and-jam transformation to reduce the number of data reads and writes, but it will not do so as aggressively as the MKL library code.  In my experience, MKL should begin to show an advantage as early as a minimum dimension of 32.
