Hello, I would like to make a suggestion.
Very often, otherwise well-vectorizable algorithms need to read from or write to memory addresses that are calculated per channel (reading from a table, sampling a texture, etc.).
When you reach such a point, you are forced to make that part of the algorithm scalar: extract each channel in turn to a GP register, perform the memory operation, and then insert the result back into a vector register.
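A minimal sketch of that scalar workaround, assuming SSE2 intrinsics and a table of 32-bit values indexed per channel (the helper name gather4_scalar and the use of indices rather than raw addresses are my own choices for illustration):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: fetch table[idx[k]] for each of the four
   32-bit lanes of idx, one lane at a time. */
static __m128i gather4_scalar(const int32_t *table, __m128i idx)
{
    /* Extract each lane into a general-purpose register by shifting
       the wanted lane down to position 0 and moving it out. */
    int32_t i0 = _mm_cvtsi128_si32(idx);
    int32_t i1 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 4));
    int32_t i2 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 8));
    int32_t i3 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 12));

    /* Four independent scalar loads, then rebuild the vector.
       _mm_set_epi32 takes its arguments from the highest lane down. */
    return _mm_set_epi32(table[i3], table[i2], table[i1], table[i0]);
}
```

Every lane costs an extract, a scalar load and an insert, which is the overhead the suggestion is trying to reduce.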
I don't think a single instruction that interprets each channel as an address and reads from or writes to several memory locations at once is feasible in hardware (though it would be extremely good), but at least we could have something that would ease the situation.
My suggestion is a set of memory-access instructions that take the address directly from an SSE/AVX register:
loadd $(i + (j<<4)), %xmm0, %xmm1 - read a 32-bit word from the address specified in the i-th dword of xmm0 and store it in the j-th quarter of xmm1
stored $(i + (j<<4)), %xmm0, %xmm1 - read a 32-bit word from the j-th quarter of xmm1 and store it to the address specified in the i-th dword of xmm0
+ variants for 64-bit addresses and other data sizes (loadw, loadq, loaddq, storew, storeq, storedq), etc.
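To make the intended semantics concrete, here is a rough software model of loadd and stored in plain C. It is only a sketch under my own assumptions: the vec128 type and the function names are hypothetical, each dword "address" is treated as a byte offset into a mem buffer so the model stays portable, and only the low two bits of i and j are used since there are four dword lanes.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 128-bit register as four 32-bit lanes (hypothetical type). */
typedef struct { uint32_t lane[4]; } vec128;

/* Model of: loadd $(i + (j<<4)), addr, dst
   Reads a 32-bit word from the address held in lane i of addr
   and writes it to lane j of dst; other lanes of dst are untouched. */
static void loadd_model(const uint8_t *mem, unsigned imm,
                        const vec128 *addr, vec128 *dst)
{
    unsigned i = imm & 3;         /* low field: address lane (assumed 2 bits) */
    unsigned j = (imm >> 4) & 3;  /* high field: data lane (assumed 2 bits) */
    memcpy(&dst->lane[j], mem + addr->lane[i], sizeof(uint32_t));
}

/* Model of: stored $(i + (j<<4)), addr, src
   Reads lane j of src and writes it to the address held in lane i of addr. */
static void stored_model(uint8_t *mem, unsigned imm,
                         const vec128 *addr, const vec128 *src)
{
    unsigned i = imm & 3;
    unsigned j = (imm >> 4) & 3;
    memcpy(mem + addr->lane[i], &src->lane[j], sizeof(uint32_t));
}
```

With these, a full four-channel table read would be four loadd operations with (i, j) = (0, 0) through (3, 3), with no round trip through GP registers.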
something like that, you get the idea :)