I'd like to make a few suggestions for additional SIMD instructions.
My main concern is the (automatic) vectorization of loops. SSE/AVX still lack a few instructions needed to make that straightforward. Basically, it requires that every common operation (typically 32-bit) in a language like C has a vector equivalent.
A first missing instruction is a shift with independent per-element shift amounts. SSE5 does have such an instruction (pshad). It would make it possible to vectorize a loop containing shift operations with non-constant shift amounts (the alternative code sequence is very long, while the instruction should be fairly simple to implement in hardware).
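To make the shift case concrete, here is a minimal sketch of the kind of loop I mean. The function name and signature are my own, purely for illustration, and the count is masked to 0..31 only to keep the C well-defined. Without a per-element vector shift, a compiler has to either keep this loop scalar or emit a long emulation sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example loop: every element is shifted by its own amount.
 * With a per-element vector shift, eight iterations of the body would map
 * onto one instruction. */
void shift_left_variable(uint32_t *dst, const uint32_t *src,
                         const uint32_t *count, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Mask the count to 0..31 so the shift is well-defined in C. */
        dst[i] = src[i] << (count[i] & 31u);
    }
}
```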
Other missing operations are parallel scatter/gather. These are the vector equivalents of simple store/load operations. Performing a gather of 8 x 32-bit elements with extract/insert instructions takes 16 instructions. Compared to arithmetic operations that require just one instruction, it's clear that this quickly becomes a huge bottleneck.
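As a rough sketch of what a gather has to do today, the plain C below models the emulation one lane at a time. The names (gather8_emulated, LANES) are made up for this example. Each iteration stands in for one extract of the index plus one load-and-insert of the element, which is where the 16 instructions for 8 elements come from.

```c
#include <stdint.h>

enum { LANES = 8 };  /* 8 x 32-bit elements, as in a 256-bit AVX register */

/* Hypothetical scalar model of an emulated gather:
 * dst[lane] = base[index[lane]] for each of the 8 lanes. */
void gather8_emulated(float dst[LANES], const float *base,
                      const int32_t index[LANES])
{
    for (int lane = 0; lane < LANES; ++lane) {
        int32_t i = index[lane];  /* step 1: extract the index from the vector */
        dst[lane] = base[i];      /* step 2: load the element and insert it */
    }
}
```

A real gather instruction would collapse this whole loop into a single operation on an index vector.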
Of course it can be argued that one should use Structure-of-Array data layouts and store/load entire vectors in one instruction, but that's simply not always possible. Automatic vectorizing compilers can't simply change the data layout. And even when the application developer is aware of the benefits of SoA some algorithms just require scatter/gather. Even for something like multiplying matrices a gather instruction would be of great help.
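To illustrate the layout issue, here is a small sketch with a made-up Vec3 type and helper names of my own. With Array-of-Structures data, the x components of eight consecutive elements sit three floats apart, so filling one vector needs a strided gather. With a Structure-of-Arrays layout, the same eight values are contiguous and a single vector load would do.

```c
#include <string.h>

/* Hypothetical Array-of-Structures element. */
typedef struct { float x, y, z; } Vec3;

/* AoS: the x fields are strided in memory, so this is effectively a
 * gather with indices 0, 3, 6, ... when seen as a float array. */
void load_x_aos(float dst[8], const Vec3 *points)
{
    for (int lane = 0; lane < 8; ++lane) {
        dst[lane] = points[lane].x;
    }
}

/* SoA: the x values are contiguous, so one 256-bit load covers them.
 * memcpy stands in for that single vector load here. */
void load_x_soa(float dst[8], const float *xs)
{
    memcpy(dst, xs, 8 * sizeof(float));
}
```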
Other fields in which these instructions would be useful are multimedia, HPC, physics, ray-tracing, etc. Basically anything that uses a stream processing model with loops that can be parallelized with SIMD. My own field of expertise is 3D software rendering (for games, medical imaging, etc.). In particular, the gather instruction would help considerably with texture sampling, especially with large filter kernels (anisotropic filtering).
I realize it's a huge challenge to implement a fast but generic gather instruction, but it looks like a crucial step to increase SIMD performance. As vectors get wider and arithmetic performance goes up, gathering data elements from different memory locations is quickly becoming the bottleneck and preventing the processor from ever reaching its theoretical throughput.
Note that a first implementation doesn't necessarily have to be capable of loading 8 x 32-bit elements all in parallel. Even loading two or four elements per cycle would be a worthwhile improvement compared to the alternatives. Also, memory ordering doesn't have to be that strict. For stream processing the input and output typically don't overlap, and to synchronize with other threads the existing memory fence instructions could be used to ensure that all vector elements have been read/written before any other memory operation executes.
I'd really love to discuss the possibility of adding these instructions to future AVX extensions. I believe they could revolutionize CPU performance in the next decade, by potentially increasing performance 8-fold for loops that previously processed just one stream element per iteration...