Intrinsics - types and inits

ICC cannot compile the following:

__m128d x = {0.7, 1.1};
__m128 y = {0.7f, 1.1f, 2.7f, 3.1f};

These initializers are standard and unambiguous, unlike a brace initializer for __m128i, whose elements could be 8-, 16-, 32- or 64-bit integers. So why not allow them? By the way, VC6 accepts this and preserves the memory order.

Also, __m128 is not compatible with __m128i or __m128d. For example:

__m128 a = _mm_set_ps(0.7f, 1.1f, 2.7f, 3.1f);
__m128 b = _mm_shuffle_epi32(a, 0xB0); // error

Please note that I'm aware of conversions (_mm_cvt*), pointer casting (*(x*)&y) and unions. However, conversions change the value and emit extra instruction(s), while both casting and unions seem to require an unnecessary memory access, at least in ICC7. VC, by contrast, optimizes away a load that follows a store to the same location.
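Compilers newer than the ones discussed in this thread (later ICC versions, GCC, Clang and MSVC) provide SSE2 cast intrinsics such as `_mm_castps_si128` and `_mm_castsi128_ps` for this situation. They only reinterpret the 128 bits, perform no value conversion, and are documented as generating no instructions. A minimal sketch of the `_mm_shuffle_epi32` example rewritten with them, assuming a compiler that ships `<emmintrin.h>`; the helper name `shuffle_as_int` is hypothetical:

```c
#include <emmintrin.h> /* SSE2: __m128i, _mm_shuffle_epi32, cast intrinsics */

/* Hypothetical helper: apply the integer shuffle (PSHUFD) to float data.
   The casts only reinterpret the bits; no value conversion takes place. */
static __m128 shuffle_as_int(__m128 a)
{
    __m128i bits = _mm_castps_si128(a);           /* view the floats as integers */
    __m128i shuf = _mm_shuffle_epi32(bits, 0xB0); /* immediate must be a constant */
    return _mm_castsi128_ps(shuf);                /* view the result as floats again */
}
```

With `a = _mm_set_ps(0.7f, 1.1f, 2.7f, 3.1f)`, this produces the same bits as `_mm_shuffle_ps(a, a, 0xB0)`.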

Is the reason for this lack of compatibility (at the intrinsic level):

- conversion-proof (type-safe) code? (then why not a warning instead of an error?)
- future compatibility/performance issues? (then why does VC allow it?)
- cosmetic? (in VC this may be cosmetic, in ICC it is not)
- any other reason?

Best regards,

Anna Niedzicka


We made a conscious decision not to allow these initializers, because doing so would have undesirable side effects. For example, you can write the following using VC6.

__m128 y = {0.7f, 1.1f, 2.7f, 3.1f};
float f = y.m128_f32[2];

Accessing the individual elements of y in this manner can lead to undesirable and possibly unexpected performance problems. This sequence is effectively the same as using

float f = ((float*)&y)[2];

You've already observed the negative consequences of casting, namely memory accesses that are otherwise unnecessary.
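When a single element is needed, one way to avoid the explicit pointer cast is to shuffle the wanted element into position 0 and read it with `_mm_cvtss_f32`, which returns the lowest float of an `__m128`. A sketch, assuming an SSE-capable compiler; `get_elem2` is a hypothetical name, and whether memory traffic is actually avoided still depends on the compiler's optimizer:

```c
#include <xmmintrin.h> /* SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE, _mm_cvtss_f32 */

/* Hypothetical helper: read element 2 of y without casting &y to float*.
   _MM_SHUFFLE(2, 2, 2, 2) broadcasts element 2 into all four lanes,
   and _mm_cvtss_f32 returns lane 0 as a plain float. */
static float get_elem2(__m128 y)
{
    __m128 t = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_cvtss_f32(t);
}
```

For a vector built in memory order as {0.7f, 1.1f, 2.7f, 3.1f}, this returns 2.7f, the same value as `y.m128_f32[2]` in the VC6 example.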

The recommended alternative method of initialization is to use intrinsics. For example,

__m128 y = _mm_set_ps(3.1f, 2.7f, 1.1f, 0.7f);

This method always works in C++, and it works for local non-static variables in C. For global and static variables in C, you either need to initialize in code using intrinsics, or you need to use a union as follows.

union {
    float  f[4];
    __m128 m;
} y = { { 0.7f, 1.1f, 2.7f, 3.1f } };

There should be no performance drawbacks to this union provided that all other references to it are through y.m.
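Note that `_mm_set_ps` takes its arguments from the highest element down, so its last argument becomes element 0, while `_mm_setr_ps` ("reversed") takes them in memory order and therefore matches the brace initializer from the question. A minimal sketch checking that both intrinsics and the union agree, assuming an SSE-capable compiler; `y_union` and `same_layout` are hypothetical names, and the union is read only through its `m` member, as recommended above:

```c
#include <string.h>    /* memcmp */
#include <xmmintrin.h> /* SSE: _mm_set_ps, _mm_setr_ps, _mm_storeu_ps */

/* File-scope union, as in the C global/static case. */
static union {
    float  f[4];
    __m128 m;
} y_union = { { 0.7f, 1.1f, 2.7f, 3.1f } };

/* Hypothetical helper: returns 1 if all three initializations
   produce the same element order in memory, 0 otherwise. */
static int same_layout(void)
{
    float a[4], b[4], c[4];
    _mm_storeu_ps(a, _mm_set_ps(3.1f, 2.7f, 1.1f, 0.7f));  /* last argument -> element 0 */
    _mm_storeu_ps(b, _mm_setr_ps(0.7f, 1.1f, 2.7f, 3.1f)); /* first argument -> element 0 */
    _mm_storeu_ps(c, y_union.m);                           /* access only through .m */
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0;
}
```

The same convention holds for `__m128d`: `_mm_setr_pd(0.7, 1.1)`, or equivalently `_mm_set_pd(1.1, 0.7)`, gives the layout that `{0.7, 1.1}` describes in the question.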

We also made a conscious decision to use strict typing for the XMM data types. This avoids potential performance problems with future processor generations. You can freely mix types on a Pentium 4 processor without penalty, but that might not be true for future processors.

For the specific case you raise, you could use the following equivalent code that doesn't mix types.

__m128 a = _mm_set_ps(0.7f, 1.1f, 2.7f, 3.1f);
__m128 b = _mm_shuffle_ps(a, a, 0xB0);
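The immediate 0xB0 is easier to read when written with the `_MM_SHUFFLE` macro from `<xmmintrin.h>`, which packs four 2-bit source indices from the highest destination element down: 0xB0 is `_MM_SHUFFLE(2, 3, 0, 0)`. Because both source operands are `a`, result element i is element n of `a`, where n is the i-th 2-bit field, which is exactly what `_mm_shuffle_epi32` does for integers. A sketch; `shuffle_demo` is a hypothetical name:

```c
#include <xmmintrin.h> /* SSE: _mm_set_ps, _mm_shuffle_ps, _MM_SHUFFLE, _mm_storeu_ps */

/* Hypothetical demo: the shuffle from the reply with the immediate spelled out.
   _MM_SHUFFLE(z, y, x, w) == (z << 6) | (y << 4) | (x << 2) | w, so
   _MM_SHUFFLE(2, 3, 0, 0) == 0xB0 selects elements 0, 0, 3, 2 of a
   for result elements 0, 1, 2, 3. */
static void shuffle_demo(float out[4])
{
    __m128 a = _mm_set_ps(0.7f, 1.1f, 2.7f, 3.1f); /* memory order: 3.1, 2.7, 1.1, 0.7 */
    __m128 b = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 0));
    _mm_storeu_ps(out, b);                          /* out = {3.1, 3.1, 0.7, 1.1} */
}
```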

David Kreitzer
IA32 Code Generation Group
