float min_reduce(__m128 x)

I'm looking for an efficient way to implement a min_reduce on an __m128 vector. As far as I can tell, there is no single instruction that does this, so I tried the following:

#include <xmmintrin.h>

float min_reduce(__m128 a) {
    // lanes 0 and 1 now hold min(a0, a2) and min(a1, a3); lanes 2 and 3 are unchanged
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    // lane 0 now holds min(a0, a1, a2, a3); lanes 1..3 are not used
    a = _mm_min_ss(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(a);  // extract lane 0
}

It seems to work (at least for the cases I tested). But it looks more complicated to me than necessary. Is there something more efficient, or is this the best I can get already?

While on that topic: what would the equivalent be for a vector of shorts? Here's my idea:

#include <emmintrin.h>

short min(__m128i a) {
    a = _mm_min_epi16(a, _mm_shuffle_epi32  (a, _MM_SHUFFLE(1, 0, 3, 2)));  // fold the high 64 bits onto the low 64 bits
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 0, 3, 2)));  // fold words 2,3 onto words 0,1
    a = _mm_min_epi16(a, _mm_shufflelo_epi16(a, _MM_SHUFFLE(1, 1, 1, 1)));  // fold word 1 onto word 0
    return (short)_mm_cvtsi128_si32(a);  // keep only the low 16 bits
}

That's quite a long dependency chain:
(only if the compiler creates the necessary MOVs such that they can run in parallel with the PSHUF*).
Any better ideas?

For FP, what you have is probably the best. For a pre-Penryn target you might want to look at psrlq by 32 for the 2nd shuffle (pshuflw).
For shorts, if you know your values are always positive and you have an SSE4 target, you can use phminpos (PHMINPOSUW, which treats the words as unsigned).

So psrlq is faster on older processors, and shufps is faster on newer processors, where you pay the 2-cycle penalty for going float vector -> int vector -> float vector, right?
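
For reference, the psrlq suggestion could be sketched roughly like this. It is only a sketch: the name min_reduce_psrlq is made up for illustration, it needs SSE2 for the integer shift and the cast intrinsics, and whether it actually beats the shufps version depends on the microarchitecture, so it should be measured rather than assumed.

```c
#include <emmintrin.h>  /* SSE2: _mm_srli_epi64 and the cast intrinsics */

/* Hypothetical variant of min_reduce: the second shuffle is replaced by a
   64-bit logical right shift (PSRLQ) in the integer domain. */
float min_reduce_psrlq(__m128 a) {
    /* lanes 0 and 1 now hold min(a0, a2) and min(a1, a3) */
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));
    /* shifting each 64-bit half right by 32 bits moves lane 1 into lane 0 */
    __m128 lane1 = _mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(a), 32));
    /* lane 0 now holds the minimum of all four inputs */
    a = _mm_min_ss(a, lane1);
    return _mm_cvtss_f32(a);
}
```

The casts generate no instructions; they only reinterpret the register, which is why the float -> int -> float domain crossing mentioned above is the thing to watch on newer cores.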

phminpos I somehow overlooked. Ah, because it's not documented at http://www.intel.com/software/products/compilers/docs/clin/main_cls/mergedprojects/intref_cls/whnjs.htm. Thanks for the pointer.
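
For anyone finding this later, here is roughly what the phminpos route might look like, using the _mm_minpos_epu16 intrinsic from SSE4.1. This is a sketch under assumptions: the names min_reduce_epu16, min_reduce_epi16 and TARGET_SSE41 are made up for illustration, and the target attribute is a GCC/Clang extension (with other compilers, enable SSE4.1 for the whole file instead). The signed version is not from the replies above: it uses the usual bias trick of XORing with 0x8000 so that signed order maps onto unsigned order, which lifts the "always positive" restriction.

```c
#include <emmintrin.h>  /* SSE2 */
#include <smmintrin.h>  /* SSE4.1: _mm_minpos_epu16 */
#include <stdint.h>

/* Assumption: GCC/Clang. Lets these functions use SSE4.1 without -msse4.1. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: minimum of eight unsigned 16-bit words. */
TARGET_SSE41
uint16_t min_reduce_epu16(__m128i a) {
    /* PHMINPOSUW: minimum goes to bits 15:0, its index to bits 18:16 */
    __m128i m = _mm_minpos_epu16(a);
    return (uint16_t)_mm_cvtsi128_si32(m);  /* drop the index bits */
}

/* Hypothetical helper: minimum of eight signed 16-bit words. */
TARGET_SSE41
int16_t min_reduce_epi16(__m128i a) {
    /* flipping the sign bit maps signed order onto unsigned order */
    const __m128i bias = _mm_set1_epi16(INT16_MIN);
    __m128i m = _mm_minpos_epu16(_mm_xor_si128(a, bias));
    /* undo the bias on the result and keep only the low 16 bits */
    return (int16_t)(uint16_t)(_mm_cvtsi128_si32(m) ^ 0x8000);
}
```

Compared with the three-step shuffle/min chain, this is one XOR plus one PHMINPOSUW, at the cost of requiring an SSE4.1-capable CPU.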
