AVX horizontal sum and store

AVX horizontal sum and store


I would like to do a full horizontal sum. I have variable a and b of type __m256d. Now I want to get a[0]+a[1]+a[2]+a[3]+b[0]+b[1]+b[2]+b[3] and store it.

In SSE I could easily do this with a = _mm_add_pd(a,b) and a =  _mm_hadd_pd(a, unused).

The next thing is how to store only, one value of an AVX register. In SSE I just used _mm_store_sd. I could do a cast from __m256d to __m128d and use SSE instruction. But Intel® Architecture Code Analyzer tells me this will result in a penaly because of swtich from AVX to SSE.

In AVX is it correct that the _mm256_maskstore_pd replaces the _mm_store_sd with the corrrect mask. Is there is a possibility to use a fixed mask? So far I declare a variable and use it, but as the mask is static, I consider this as unecessary overhead:

__m256i storeMask = _mm256_set_epi32(0, 1<<31, 0, 0, 0, 0, 0, 0);
_mm256_maskstore_pd(&res, storeMask, a);

And one last thing, how could I acchieve the same thing for __m256 (float). There are shuffle and permute functions, but all in all its hard to get what I want. I haven't found I way so far.

I would appreciate any tipps and hits.

Kind regards,


9 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Can anyone give me a hint?

I think I need some data permutation function combined with the horizontal adds? But there are so many data distribution function with parameters. How do I find the correct ones, or the simplest solution?

Now I found another interesting thing: There are also masked load and store operation for SSE but they are only supported on processors supporting AVX? Isn't this a little bit weird?

Many thanks!


If you use intrinsic and use compiler switch that generates AVX code, then 128-bit intrinsic should actuatally produce AVX.128 flavor of instructions.

let me restate the starting point of horizontal add of vector lengths of 2 doubles and 4 singles

vector-2 DP: t = _mm_add_pd(a, b); h = _mm_hadd_pd(t, t); _mm_store_sd(h);

vector-4 SP: t = _mm_add_ps(a, b);
h = _mm_hadd_ps(t, _mm_shuffe_epi32( _mm_castsi128_ps(_mm_castps_si128(t)), 0x4d));
_mm_store_ss( _mm_add_ps(h, h));

then using AVX,

vector-4 DP: t = _mm256_add_pd(a, b); h = _mm256_add_pd(t, _mm256_permute2f128_pd(t, t, 0x1));
_mm_store_sd( _mm_hadd( _mm256_castpd256_pd128(h), _mm256_castpd256_pd128(h) ) );

vector-8 SP: t = _mm256_add_ps(a, b);
h = _mm256_hadd_ps(t, _mm256_shuffe_epi32( _mm256_castsi256_ps(_mm256_castps_si256(t)), 0x4d));
h2 = _mm256_add_ps(h, _mm256_permute2f128_ps(h, h, 0x1));
_mm_store_ss( _mm256_castps256_ps128(h2));

VMASKMOV instruction are no substitute for VMOVSD.
VMASKMOV can selectively store elementes into memory as a vector where masked element will leave gaps in the corresponding memory locations. The intended software usage of VMASKMOV should not tried to consume (load) that vector in memory immediately. If immediate consumption is needed, then you want to work with normal store operations using VMOVDQU/VMOVD/VMOVQ.

I found out this interesting feature of the compiler generating AVX-128 instructions from SSE2 intrinsics. I didn't see it documented anywhere and haven't persuaded any customers to make use of it, even those who have already written both SSE2 and AVX intrinsics versions of their application.

Thank you for your answers!

I try out the code now.

One more question to the code: Also in AVX you use _mm_store_ss and _mm_store_sd. If I compile with AVX does this mean I get AVX 128 bit instructions and thus there is no aVX-SSE switching penalty?

When it comes to VMASKMOV you mentioned immediate consumption. Could you explain this more detailed? In my case I would store data in memory but not use it again the next time as it is the result and I operate on a very large vector.

Yes, my understanding is the Intel C++ compilers automatically generate AVX-128 from SSE2 intrinsics, when AVX compile line option is set. This would avoid the requirement for vzeroupper instructions when these instructions are mixed with AVX code. You could check by examining asm code or by profiling for switching penalties. Use of AVX-128 also solves many issues with misalignment.
The recent compilers will place vzeroupper instructions in function entry and exit sequences so as to avoid switching penalty there.
I'm not the expert on your question about vmaskmov, but my understanding is that it doesn't support effective cacheing. The compilers select vmaskmov only when they don't find any other means to enable vectorization. In that case vmaskmov can prove advantageous when data are re-used from last level cache.

I will use Intel Architecture Code Analyzer to examine this and post my results.

OK, so far I won't use vmaskmov instruction.

I made some changes to the code snippets provided. I changed AVX version for float, as integer operations are not available under AVX. At least I could not compile them. Here is my working solution:

// sums up 4 double elements horizontal

void HsumSseDbl(const double * const adr) {

	double sumSSE;

	__m128d sse1 = _mm_loadu_pd(&adr[0]);

	__m128d sse2 = _mm_loadu_pd(&adr[2]);
	__m128d hsum = _mm_add_pd(sse1, sse2);

	hsum = _mm_hadd_pd(hsum, hsum);

	_mm_store_sd(&sumSSE, hsum);
	cout << "Hsum SSE Double: " << sumSSE << endl;


// sums up 4 double elements horizontal

void HsumAvxDbl(const double * const adr) {

	double sumAVX;

	__m256d avx = _mm256_loadu_pd(&adr[0]);
	__m256d hsum = _mm256_add_pd(avx, _mm256_permute2f128_pd(avx, avx, 0x1));

	_mm_store_sd(&sumAVX, _mm_hadd_pd( _mm256_castpd256_pd128(hsum), _mm256_castpd256_pd128(hsum) ) );
	cout << "Hsum AVX Double: " << sumAVX << endl;


// sums up 8 float elements horizontal

void HsumSseFlt(const float * const adr) {

	float sumSSE;

	__m128 sse1 = _mm_loadu_ps(&adr[0]);

	__m128 sse2 = _mm_loadu_ps(&adr[4]);
	__m128 hsum = _mm_add_ps(sse1, sse2);

	hsum = _mm_hadd_ps(hsum, hsum);

	hsum = _mm_hadd_ps(hsum, hsum);

	_mm_store_ss(&sumSSE, hsum);
	cout << "Hsum SSE Double: " << sumSSE << endl;


// sums up 8 float elements horizontal

void HsumAvxFlt(const float * const adr) {

	float sumAVX = 0;

	__m256 avx = _mm256_loadu_ps(&adr[0]);
	__m256 hsum = _mm256_hadd_ps(avx, avx);

	hsum = _mm256_add_ps(hsum, _mm256_permute2f128_ps(hsum, hsum, 0x1));

	_mm_store_ss(&sumAVX, _mm_hadd_ps( _mm256_castps256_ps128(hsum), _mm256_castps256_ps128(hsum) ) );
	cout << "Hsum AVX Double: " << sumAVX << endl;


// EDIT:
As to AVX128 and penalties: I tested with Visual Studio 2010. If you compile the function HsumAvxFlt from above with /arch:AVX the Architecture Code Analyzer Version does not show any penalty. If you compile using /arch:SSE2 you get the penalties.

I suppose if even old Visual Studio does this thing right then Intel Compiler should do it correct, too. Maybe anyone could test it out.

I have done some further reading on the AVX128 and arch:AVX.

Is it correct that the compiler option arch:AVX is a possibility to generate VES encoded instructions. And in this mode SSE intrincs get compiled to VES encoded instructions. This way there is no penalty?

Yes, with /arch:AVX, ICL apparently aims to avoid SSE to AVX transition penalties by promotion to AVX, even interior to a function using SSE intrinsics which can be promoted to AVX-128. Of course, the generated code would no longer run on a non-AVX platform. I haven't seen this in writing, so, if you have a counter-example, it might be interesting to show.

Leave a Comment

Please sign in to add a comment. Not a member? Join today