Thanks for the response, tim18. Any idea whether auto-vectorization of this is planned for future compiler releases, or whether gcc -ftree-vectorize -msse4.2 might already do it?
I do have an example where ifort -xsse4.2 uses the horizontal dot product, but only in a remainder loop, so it's not significant for performance. I would expect the horizontal dot product to be useful only in limited situations, such as a fixed dot product length of 4. It may be that the compiler would optimize the code that way automatically in that situation.
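To make the fixed-length-4 case concrete, here is a minimal sketch of a dot product written directly with the SSE4.1 horizontal dot product intrinsic _mm_dp_ps. The function name dot4 is made up for illustration, and the scalar fallback is only there so the file still builds when SSE4.1 isn't enabled (e.g. without gcc -msse4.1). It's not a claim about what any compiler emits on its own.

```c
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_dp_ps */
#endif

/* Hypothetical helper: dot product of two 4-element float vectors. */
float dot4(const float a[4], const float b[4])
{
#if defined(__SSE4_1__)
    __m128 va = _mm_loadu_ps(a);  /* unaligned loads, no alignment assumed */
    __m128 vb = _mm_loadu_ps(b);
    /* Mask 0xF1: high nibble F multiplies all four lanes,
       low nibble 1 writes the sum to lane 0 only. */
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
#else
    /* Scalar fallback when SSE4.1 is not enabled at compile time. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

For long vectors the usual approach is still a vertical multiply and add across the loop with one horizontal sum at the end, which is why an instruction like this mostly shows up for short fixed lengths or remainders.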
The same examples, built with g++ or gfortran 4.5, generate identical code with the sse4.1 and sse4.2 options. Although the sse4 code from gcc/g++/gfortran shows some consistent performance gains over sse3, gcc and the Intel compilers don't use sse4.1 in the same ways in my code samples. The exception is _mm_set_ps, where both compilers switch to sse4.1 code (so it's not necessary to change the source to the corresponding sse4.1 intrinsic). g++ 4.5 has more effective auto-vectorization than previous g++ versions.
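As a rough illustration of the _mm_set_ps point, a source file like the sketch below can be left unchanged and simply rebuilt with different options (e.g. gcc -S -msse3 versus gcc -S -msse4.1) to compare the generated assembly. The name pack4 is hypothetical, and what each compiler version actually emits is something to check rather than assume.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ps, _mm_storeu_ps */

/* Hypothetical helper: pack four run-time scalars into out[0..3]. */
void pack4(float x0, float x1, float x2, float x3, float out[4])
{
    /* _mm_set_ps takes its arguments from the highest lane down to lane 0,
       so x0 ends up in the lowest lane. */
    __m128 v = _mm_set_ps(x3, x2, x1, x0);
    _mm_storeu_ps(out, v);  /* unaligned store, no alignment assumed */
}
```

The source only uses the plain SSE intrinsic; any use of sse4.1 instructions to build the vector is left to the compiler.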
I haven't found any use of sse4 code by the Sun compilers, but they frequently vectorize effectively for sse4.2 CPUs, using sse instructions, even in a few situations where the others don't.
The marketing people usually miss several points: the few situations where new instructions are beneficial are far outnumbered by those where the old instructions may be optimized better for the new CPUs. There isn't sufficient incentive to make applications incompatible with older CPUs, when the AVX instruction set will offer real gains in a year or two.