What's the best way to sum up the values in an __m128?


This seems to me a very common operation.
I have an __m128 xx that contains the values (xx3, xx2, xx1, xx0),

and I would like to compute xx0 + xx1 + xx2 + xx3.

Right now, I can use:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // to get (0, 0, xx3 + xx2, xx1 + xx0)

xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));

Another way:

xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));

xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));

_mm_store_ss( &temp, xx );
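For reference, the second variant can be wrapped into a small helper. This is only a sketch: hsum_ps is a hypothetical name (not an existing intrinsic), and it returns the result with _mm_cvtss_f32 instead of storing to temp. It should need nothing beyond plain SSE (xmmintrin.h).

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_movehl_ps, ... */

/* Horizontal sum of the four floats in v.
   hsum_ps is a hypothetical helper name, not a library function. */
static float hsum_ps(__m128 v)
{
    /* movehl copies the upper two lanes into the lower two:
       hi = (v3, v2, v3, v2) */
    __m128 hi = _mm_movehl_ps(v, v);

    /* the lower two lanes now hold (v1 + v3, v0 + v2) */
    __m128 sum2 = _mm_add_ps(v, hi);

    /* bring lane 1 (v1 + v3) down to lane 0 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(0, 0, 0, 1));

    /* lane 0 = (v0 + v2) + (v1 + v3) */
    __m128 total = _mm_add_ss(sum2, lane1);

    return _mm_cvtss_f32(total);
}
```

Note that the additions are grouped as (xx0 + xx2) + (xx1 + xx3), so the rounding can differ slightly from a left-to-right scalar sum.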

Is there a better way? This seems like a very common operation. Are there any plans to add a native instruction for it?

Also, how about summing up numbers that are spread across more than one register?



Your first implementation (the one using hadd) requires 5 + 1 + 3 cycles (not counting the setzero or the store).
The second requires 1 + 3 + 1 + 3 cycles (again not counting the store).

Here's another one, if you have xx in memory:

xx[0] = (xx[0] + xx[1]) + (xx[2] + xx[3])

It requires 3 + 1 + 3 cycles (not counting the loads and stores). In this case, however, the loads will probably make a difference, so one of the versions above should be faster.
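A minimal sketch of the memory-based version, assuming the vector is first written to a small local array. hsum_via_memory is a hypothetical name, and the unaligned store is used only so that the sketch does not depend on alignment.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_storeu_ps */

/* Sum the four lanes by going through memory and adding scalars.
   hsum_via_memory is a hypothetical helper name. */
static float hsum_via_memory(__m128 v)
{
    float xx[4];

    /* write all four lanes to memory (no alignment requirement) */
    _mm_storeu_ps(xx, v);

    /* pairwise grouping: the two inner adds are independent of each other */
    return (xx[0] + xx[1]) + (xx[2] + xx[3]);
}
```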

In general, horizontal operations are not what SIMD is for. That's why your last question is so important. When you have more numbers to sum up, you can do as many vertical adds as you have registers. E.g. say you have four __m128 registers a, b, c, and d. Then you first do

_mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));

and then apply one of your horizontal-sum implementations to the result. This is now much faster than the scalar equivalent.
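To make that concrete, here is a rough sketch of summing a whole float array this way. The function name sum_floats, the use of unaligned loads, and the 16-floats-per-iteration layout are my own assumptions for illustration. It keeps four vertical accumulators a, b, c and d, combines them as above, and does a single horizontal sum at the end (the movehl/shuffle variant from the question).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, ... */

/* Sum n floats. sum_floats is a hypothetical helper name. */
static float sum_floats(const float *data, size_t n)
{
    /* four independent accumulators, one per group of four floats */
    __m128 a = _mm_setzero_ps();
    __m128 b = _mm_setzero_ps();
    __m128 c = _mm_setzero_ps();
    __m128 d = _mm_setzero_ps();
    size_t i = 0;

    /* vertical adds only: 16 floats per iteration */
    for (; i + 16 <= n; i += 16) {
        a = _mm_add_ps(a, _mm_loadu_ps(data + i));
        b = _mm_add_ps(b, _mm_loadu_ps(data + i + 4));
        c = _mm_add_ps(c, _mm_loadu_ps(data + i + 8));
        d = _mm_add_ps(d, _mm_loadu_ps(data + i + 12));
    }

    /* combine the four accumulators vertically */
    __m128 v = _mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));

    /* one horizontal sum for the whole array */
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_shuffle_ps(v, v, 1));
    float total = _mm_cvtss_f32(v);

    /* scalar tail for the remaining (fewer than 16) elements */
    for (; i < n; ++i)
        total += data[i];

    return total;
}
```

The horizontal step is paid once per array instead of once per vector, which is where the gain over the scalar loop comes from. Because the order of additions changes, results may differ from a plain scalar loop in the last bits.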
