_mm_extract_ps returns int (for a long long time)


For years, many programmers have seen this as bad design or a bug, but the problem is still there.

Why does _mm_extract_ps return int? The intrinsics naming convention uses the _ps and _epi32 suffixes for float and int data respectively. We have _mm_extract_epi32, which maps to the pextrd instruction and returns int. Yet _mm_extract_ps, which maps to extractps, also returns int. Why? Will somebody fix it some day?

I want to write code like

template <int i> float get() const noexcept { return _mm_extract_ps(xmm_, i); }

and not like

template <int i> float get() const noexcept {
    int v = _mm_extract_ps(xmm_, i);
    float f;
    memcpy(&f, &v, sizeof(v)); // memcpy is the portable, standard-conforming way to type-pun in C++
    return f;
}

P.S. Can somebody also explain why we need both the extractps and pextrd instructions when technically they appear to do the same thing? I don't think they set any flags or perform any checks. As far as I can see, _mm_extract_ps is no different from

int _mm_extract_ps(__m128 xmm, int i) { return _mm_extract_epi32(_mm_castps_si128(xmm), i); }
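
The claimed equivalence is easy to check at the intrinsic level. The following is only a minimal sketch, assuming an SSE4.1-capable CPU and compiler; the helper names (bits_via_extractps, bits_via_pextrd, bits_to_float) and the SSE41_FN macro are made up for illustration and are not part of any library. It compares the results only and says nothing about how the two instructions differ in encoding or scheduling.

```cpp
#include <cstring>      // std::memcpy
#include <smmintrin.h>  // SSE4.1: _mm_extract_ps, _mm_extract_epi32

// Assumption: GCC/Clang need SSE4.1 enabled (per function here, or via -msse4.1);
// MSVC accepts these intrinsics without extra flags.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

// Lane I as raw 32-bit pattern, via the float-flavoured intrinsic.
template <int I>
SSE41_FN int bits_via_extractps(__m128 v) {
    return _mm_extract_ps(v, I);
}

// Lane I as raw 32-bit pattern, via the integer intrinsic on the same bits.
template <int I>
SSE41_FN int bits_via_pextrd(__m128 v) {
    return _mm_extract_epi32(_mm_castps_si128(v), I);
}

// Reinterpret a 32-bit pattern as float without undefined behaviour.
inline float bits_to_float(int bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f), both helpers should return the same bit pattern for every lane, and bits_to_float turns that pattern back into the stored float.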

Best regards, Vyacheslav


Assuming you're writing 64-bit code, floats are stored in xmm registers anyway.

So what you really want is a vector register shuffle that moves the floating-point value into the bottom of the vector register, and then to use that register in scalar mode.

See doug65536's answer here;


So something like:

template <int i> float get() const noexcept { return _mm_cvtss_f32(_mm_shuffle_ps(xmm_, xmm_, _MM_SHUFFLE(0, 0, 0, i))); }
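
To make that one-liner easier to try, here is a minimal self-contained sketch. The Vec4f struct is a hypothetical stand-in for whatever class owns xmm_ in the original code, and the static_assert is an addition that is not in the snippet above. Only baseline SSE is needed.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_cvtss_f32, _MM_SHUFFLE

// Hypothetical wrapper standing in for the class that owns xmm_.
struct Vec4f {
    __m128 xmm_;

    template <int i>
    float get() const noexcept {
        static_assert(i >= 0 && i < 4, "lane index must be in 0..3");
        // _MM_SHUFFLE(0, 0, 0, i) selects lane i for result lane 0;
        // _mm_cvtss_f32 then reads lane 0 as a scalar float.
        return _mm_cvtss_f32(_mm_shuffle_ps(xmm_, xmm_, _MM_SHUFFLE(0, 0, 0, i)));
    }
};
```

With Vec4f v{_mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f)}, v.get<2>() yields 30.0f, and no int round trip or memcpy is involved.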


Sorry, but please make no such assumptions. I need SIMD code that works on x86 and x64, across compilers and platforms (Windows, Linux, macOS).

Thank you for the link anyway. I found _MM_EXTRACT_FLOAT as the official solution, which is pretty interesting and amusing. To me it looks like bad design. I still wonder what the reason for this solution is.

I don't think using port 5 is a good idea anyway. Maybe a shift-based solution is simpler and faster for the CPU:
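
For reference, a minimal sketch of how the _MM_EXTRACT_FLOAT macro can be wrapped to get the float-returning accessor asked for at the top. The function name get_lane_macro is hypothetical. The macro comes from <smmintrin.h>, and as far as I know its expansion differs between compilers, so treat this as an illustration and not as a statement about the generated code.

```cpp
#include <smmintrin.h>  // declares _MM_EXTRACT_FLOAT(dest, src, index)

// Hypothetical helper: returns lane I of v as a float.
template <int I>
inline float get_lane_macro(__m128 v) noexcept {
    static_assert(I >= 0 && I < 4, "lane index must be in 0..3");
    float f;
    // The macro assigns lane I of v to f; the index must be a compile-time constant.
    _MM_EXTRACT_FLOAT(f, v, I);
    return f;
}
```

The destination is passed as an lvalue and not returned, which is why a small wrapper like this is needed to use it in an expression.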

template<int i> [[nodiscard]] float __vectorcall _mm_get_ps(__m128 v) {
    return _mm_cvtss_f32(_mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), i * 4)));
}
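
Since __vectorcall is an MSVC-specific calling convention, a variant that should build on other compilers as well might look like the sketch below. The name get_lane_by_shift is hypothetical and the static_assert is an addition. It needs only SSE2, and it makes no claim about which execution ports the shift uses on any particular CPU.

```cpp
#include <emmintrin.h>  // SSE2: _mm_srli_si128, _mm_castps_si128, _mm_castsi128_ps

// Hypothetical portable variant of the shift-based accessor.
template <int i>
inline float get_lane_by_shift(__m128 v) noexcept {
    static_assert(i >= 0 && i < 4, "lane index must be in 0..3");
    // Shift the whole register right by 4*i bytes so lane i lands in lane 0,
    // then read lane 0 as a scalar float. The casts only reinterpret the bits.
    return _mm_cvtss_f32(
        _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(v), i * 4)));
}
```

For v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f), get_lane_by_shift<1>(v) yields 2.0f.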
