With SSE, I used to sum all 4 values in an __m128 by calling _mm_add_ps twice together with _mm_shuffle_ps. What is the best way to do the same for an __m256? Also, I used to build the mask for _mm_shuffle_ps with the macro, e.g. _MM_SHUFFLE(3,2,1,0). How should I create the mask for _mm256_shuffle_ps now? I don't see a _MM256_SHUFFLE macro. Thanks.
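For reference, the SSE pattern described above might look roughly like the sketch below. The function names hsum_sse and hsum_avx are made up for illustration. The AVX variant is only one possible approach (add the two 128-bit halves, then reuse the SSE reduction) and is not claimed to be the fastest; it is compiled only when the compiler defines __AVX__ (e.g. with -mavx). Note that _mm256_shuffle_ps takes the same 8-bit immediate as _mm_shuffle_ps and applies it within each 128-bit lane, so the existing _MM_SHUFFLE macro can be used to build its mask.

```c
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* Hypothetical helper: sum the 4 floats of v with two shuffles and two adds. */
static inline float hsum_sse(__m128 v)
{
    /* Swap neighbouring elements: (v1, v0, v3, v2). */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = (v0+v1, v0+v1, v2+v3, v2+v3). */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Swap the two 64-bit halves: (v2+v3, v2+v3, v0+v1, v0+v1). */
    shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
    /* Every element now holds the total; return element 0. */
    sums = _mm_add_ps(sums, shuf);
    return _mm_cvtss_f32(sums);
}

#ifdef __AVX__
/* Hypothetical helper: sum the 8 floats of v by first adding the
   upper 128-bit half to the lower one, then reducing as in SSE. */
static inline float hsum_avx(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */
    return hsum_sse(_mm_add_ps(lo, hi));
}
#endif
```

Calling hsum_sse(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)) adds the four inputs; hsum_avx does the same for the eight floats of an __m256.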