If you recall where we left off in yesterday's post, we compiled a test program with gcc and saw this code for the 'working' part of a loop. (Yes, I will be getting to the Intel C++ compiler in the next post, but I'll stick with what I've got so far so we can take baby steps.)
This article details optimized implementations of data transformations and algorithms, together with a performance analysis that reports measured speedups for Intel® SSE optimized code and estimated speedups for Intel® AVX optimized code.
This paper describes a case study in which AVX instructions are used to improve the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD implementation to an AVX-based SIMD implementation.
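For orientation, here is a rough sketch of what a non-SIMD starting point for such a filter might look like. It is not the paper's code. The `PixelRGB` layout, the function name `desaturate_scalar`, the `amount` parameter, and the Rec. 601 luma weights are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical pixel layout: interleaved RGB, one float per channel in [0, 1]. */
typedef struct {
    float r, g, b;
} PixelRGB;

/* Scalar (non-SIMD) de-saturation, processed in place.
 * amount = 0 leaves each pixel unchanged; amount = 1 yields a pure gray pixel.
 * The luma weights are the Rec. 601 coefficients (an assumption of this sketch). */
void desaturate_scalar(PixelRGB *px, size_t count, float amount)
{
    for (size_t i = 0; i < count; ++i) {
        float luma = 0.299f * px[i].r + 0.587f * px[i].g + 0.114f * px[i].b;

        /* Move each channel toward the gray value by the requested amount. */
        px[i].r += amount * (luma - px[i].r);
        px[i].g += amount * (luma - px[i].g);
        px[i].b += amount * (luma - px[i].b);
    }
}
```

Each loop iteration touches one pixel and is independent of every other iteration, which is the property that makes a loop like this a natural candidate for SIMD processing of several pixels at once.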