I took the sample vectorization code matrix_vector_multiplication_f and modified it slightly to use allocatable arrays. I then compiled it with two sets of options, 1) /QxSSE4.2 together with /QaxAVX and 2) /QxSSE4.2 alone, and ran both builds on an E5-2690. For single precision I was expecting a peak improvement from vectorization of 8x with AVX and 4x with SSE4.2, but the results didn't match my expectation: I observed 4x with SSE compared to the novec version, but less than 5x with AVX. Did I miss anything?
I used Fortran compiler XE 18.104.22.168 and Visual Studio 2008 Version 9.0.30729.1 SP. The OS is Windows Server 2008 R2 Standard SP1. 32-byte alignment and IPO are applied. The baseline is compiled with -O1 and the vectorized versions are compiled with -O3. I also varied the number of columns of the array and noticed that performance drops as the total data size reaches 32 KB and again at 256 KB. I guess that is due to L1 and L2 cache misses. Is that correct?
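To make the cache arithmetic behind that guess concrete, here is a minimal sketch that estimates the working set of a single-precision matrix-vector multiply (matrix, input vector and result vector) for a range of column counts and reports which cache level it would fit in. The row count of 64 and the column sweep are hypothetical placeholders, not the values from my runs. The 32 KB and 256 KB limits are the per-core L1 data and L2 cache sizes of the E5-2690.

```fortran
program working_set
  implicit none
  ! Assumed values for illustration only: the row count and the column
  ! sweep are placeholders, not taken from the sample or from my runs.
  integer, parameter :: rows = 64
  integer, parameter :: bytes_per_real = 4          ! single precision
  integer, parameter :: l1_bytes = 32 * 1024        ! L1 data cache per core
  integer, parameter :: l2_bytes = 256 * 1024       ! L2 cache per core
  integer :: cols, total_bytes
  character(len=12) :: level

  cols = 16
  do while (cols <= 2048)
    ! Matrix a(rows, cols) + input vector b(cols) + result vector c(rows)
    total_bytes = bytes_per_real * (rows * cols + cols + rows)
    if (total_bytes <= l1_bytes) then
      level = 'fits in L1'
    else if (total_bytes <= l2_bytes) then
      level = 'fits in L2'
    else
      level = 'beyond L2'
    end if
    print '(a, i6, a, i8, 2a)', 'cols = ', cols, '  bytes = ', total_bytes, '  ', level
    cols = cols * 2
  end do
end program working_set
```

If the measured drops line up with the column counts where this estimate crosses each limit, that would support the cache-miss explanation. It would not by itself explain the gap between the 8x I expected and the less than 5x I measured with AVX.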