I have been asking many questions about the SSE4.2 and AVX compiler optimization; here is my paraphrase of some answers:
Q. Why doesn't SSE4.2 optimization make a difference to the compilers?
1. versioning for vector alignment
Background: Intel CPUs exhibit reduced store latency when parallel stores are 16-byte aligned. Intel and gnu compilers peel up to 3 initial loop iterations (for 4-byte data) into a scalar loop, in order to use parallel aligned stores, such as movaps.
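The peeling described above can be sketched by hand in C. This is only an illustration of the idea, not the code any of these compilers actually emits; the names peel_count and scale are made up for the example, and it assumes 4-byte float data that is at least 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading scalar iterations needed before p reaches a
   16-byte boundary (assumes p is at least 4-byte aligned). */
static size_t peel_count(const float *p)
{
    size_t misalign = (uintptr_t)p & 15u;              /* offset inside the 16-byte block */
    return ((16u - misalign) & 15u) / sizeof(float);   /* 0..3 for 4-byte floats */
}

/* a[i] = s * b[i], arranged so that the stores to a[] in the main
   loop start on a 16-byte boundary. */
void scale(float *a, const float *b, float s, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(a);
    if (peel > n)
        peel = n;

    for (; i < peel; i++)            /* scalar peel loop: at most 3 iterations */
        a[i] = s * b[i];

    /* Either everything was consumed by the peel, or a + i is now aligned. */
    assert(i == n || ((uintptr_t)(a + i) & 15u) == 0);

    for (; i + 4 <= n; i += 4) {     /* main body: 4 floats = one 16-byte store */
        a[i]     = s * b[i];
        a[i + 1] = s * b[i + 1];
        a[i + 2] = s * b[i + 2];
        a[i + 3] = s * b[i + 3];
    }

    for (; i < n; i++)               /* scalar remainder loop */
        a[i] = s * b[i];
}
```

Note that only the stored operand a is aligned by the peel; b may still be misaligned in the main loop, which is where the load-alignment questions below come in.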
Sun compilers don't perform this alignment; they use only 64-bit stores, in case there may be misalignment. This may work better for short vectorizable loops of unknown alignment.
In addition, Intel CPUs prior to SSE4.2 performed so much better with aligned loads that it was worthwhile to make 2 versions of vector code in loops with at most 3 operands, so as to optimize the case where all operands are consistently aligned, but also handle the case of differing alignment. Only the Intel compilers generate this additional version of a loop. Presentations about SSE4.2 architecture point out that it would be preferable to make only a single version.
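A rough C sketch of what such versioning amounts to at run time follows. It is my own illustration, not Intel's generated code: the function names and the enum are hypothetical, and the returned value exists only so the chosen version can be observed. In plain C the two loop bodies look identical; the difference would be in the instructions the compiler is free to pick (aligned loads in the first version, unaligned or split loads in the second).

```c
#include <stddef.h>
#include <stdint.h>

enum loop_version { ALL_ALIGNED, MIXED_ALIGNMENT };

/* True when p and q sit at the same offset inside a 16-byte block, so
   that peeling iterations to align one of them also aligns the other. */
static int same_alignment(const void *p, const void *q)
{
    return (((uintptr_t)p ^ (uintptr_t)q) & 15u) == 0;
}

/* Version 1: all operands consistently aligned (after peeling),
   so aligned loads could be used for b and c. */
static void add_all_aligned(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Version 2: operands differ in alignment, so loads from b and c
   would have to be unaligned or split. */
static void add_mixed(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Run-time dispatch between the 2 versions of a 3-operand loop. */
enum loop_version vec_add(float *a, const float *b, const float *c, size_t n)
{
    if (same_alignment(a, b) && same_alignment(a, c)) {
        add_all_aligned(a, b, c, n);
        return ALL_ALIGNED;
    }
    add_mixed(a, b, c, n);
    return MIXED_ALIGNMENT;
}
```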
The gnu compilers offer 2 ways to avoid performance problems with unaligned vectorized loads: splitting them into 64-bit loads (the default, which is not good for recent CPUs), or using movups and the like in spite of possible misalignment (-march=barcelona), so as to optimize for the named CPU and (with -msse4) for SSE4.2.
Q. Why doesn't Intel SSE4.2 suppress the unnecessary versioning?
A. AVX would restore the importance of this versioning. Don't want to break future optimizations.
Q. Why are there 2 versions, with additional scalar iterations for alignment, even when the code in the 2 versions is identical?
A. At the time when the decision to make 2 versions is finalized, the compiler can't yet predict whether the 2 versions will turn out to be identical.
Q. Why are there 2 versions, even for some cases of 2 operands, where the alignment adjustment assures that 1 of those versions is never executed?
A. It's a rare case, with no known performance impact, in the recent compilers which pick the correct version at run time.
2. optimizations specific to Intel -xSSSE3 may not be desirable in -xSSE4.2
Q. Use of palignr to avoid fetching data multiple times with varying alignments slows loop startup, and may show no advantage on SSE4.2. SSE4.2 code doesn't run on the CPUs which benefit from this "optimization." How can this complicated code generation be avoided?
A. In practice, production code for SSE4.2 processors is usually built with SSE2 or SSE3 optimization. For AVX, when we make a multiple architecture compilation, we will allow the current "Intel microarchitecture code-named Nehalem" CPUs to run the SSE2 (-axAVX) or SSE3 (-axAVX -mSSE3) code. Future compilers should have performance heuristics to decide when there is value in producing vectorized versions for both requested architectures. It's probably not worthwhile to include SSSE3 or SSE4 versions when there are already 2 architecture versions.
3. Intel compilers ignore many opportunities for scalar replacement in loop-carried recursion, e.g.
do j= 2,n
Reading the last stored value (presumably by a store-forward bypass on the memory bus) introduces a significant delay, compared against avoidance of reload by scalar replacement optimization, on SSE4.2 CPUs, as well as non-Intel CPUs. This may prevent any performance gain for SSE4.2 CPUs over older ones.
Q. How do we know when we must dictate scalar replacement with a local temporary in source code?
A. Look for loops which perform better with Sun or gnu compiler than with Intel, or find hot spots with a profiler, and look for latencies associated with reload. When Microsoft compilers don't perform such optimizations automatically, source code tuning is expected.
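To make the source-level fix concrete, here is a minimal C sketch of scalar replacement with a local temporary. The loop body from my example is not shown above, so the recurrence used here (a[j] = a[j-1]*b[j] + c[j]) and the function names are assumptions chosen only for illustration; the same transformation applies to the Fortran form.

```c
#include <stddef.h>

/* As written: each iteration reads a[j-1], the value stored by the
   previous iteration, so the loop depends on reloading that store. */
void recur_reload(double *a, const double *b, const double *c, size_t n)
{
    for (size_t j = 1; j < n; j++)
        a[j] = a[j - 1] * b[j] + c[j];
}

/* Scalar replacement done by hand: the carried value lives in a local
   temporary, so nothing stored in the loop is read back from memory. */
void recur_scalar(double *a, const double *b, const double *c, size_t n)
{
    if (n == 0)
        return;

    double prev = a[0];              /* value carried between iterations */
    for (size_t j = 1; j < n; j++) {
        prev = prev * b[j] + c[j];
        a[j] = prev;                 /* store only; no reload of a[j] */
    }
}
```

Both functions perform the same arithmetic in the same order; the second simply makes explicit the register reuse which a compiler performing scalar replacement would arrange automatically.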