strange behaviour - _mm_move_ss()

Hi All,

The behaviour of _mm_move_ss() is unexpected: with the Intel compiler in release mode, the result differs from what I expect.

I used the intrinsic _mm_move_ss() to copy data from one xmm register to another xmm register.

Ex:

//vrz=vrx

vrz = _mm_move_ss(vrx, vrx) - does not work in release mode but works in debug mode.

If we pass two different arguments to _mm_move_ss() then the behaviour is OK in release mode.

vrz = _mm_move_ss(vrx, _mm_set1_ps(0.0)); - works in release mode

Is there any restriction on arguments?

What could be the reason for this behaviour?

Note: I used the options below in release mode:

/Zi /nologo /W3 /O2  /D "_MBCS" /EHsc /MT /GS /QxCORE-AVX2 /Zc:wchar_t /Zc:forScope /Fp"Release\AlgoRomLib.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd 

 

Thanks,

Eswar Reddy K

>>...vrz = _mm_move_ss(vrx, vrx) - does not work in release mode but works in debug mode...

Could you provide a complete test case that demonstrates the issue?

Here is the test case:

#include <stdio.h>
#include <smmintrin.h> // SSE4.1: _mm_extract_epi32

class data32
{
public:
    typedef union union32
    {
        int i;
        float f;
    } union32;

    data32() {}
    data32(int ii) { val.i = ii; }
    data32(unsigned int ii) { val.i = (int)ii; }
    data32(unsigned long ii) { val.i = (int)ii; }
    data32(float ff) { val.f = ff; }
    inline data32 & operator= (const int & i) { val.i = i; return *this; }
    inline data32 & operator= (const float & f) { val.f = f; return *this; }
    inline operator int() { return val.i; }
    inline operator unsigned int() { return (unsigned int) val.i; }
    inline operator unsigned long() { return (unsigned long) val.i; }
    inline operator float() { return val.f; }
private:
    union32 val;
};

void test_move_ss()
{
    __m128 in, out;
    in = _mm_set1_ps(1.0);
    out = _mm_move_ss(in, in);

    // Extract each 32-bit lane as an int, then reinterpret its bits as a float through data32.
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),0)));
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),1)));
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),2)));
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),3)));
}

Release mode:

0.000000
0.000000
0.000000
0.000000

Debug mode:

1.000000
1.000000
1.000000
1.000000

I reproduced that strange output; however, everything is right with how the _mm_move_ss intrinsic function ( actually, the MOVSS instruction ) works. I'll provide more technical details soon.

Here are some details.

Eswar, take a look at the non-default constructor of the data32 class:

class data32
{
public:
...
data32( int ii )
{
val.i = ii;
};
...
};

Case 1: Let's say ii = 1065353216 ( 0x3F800000 ); then, as soon as initialization is completed:

val.i equals 1065353216, and
val.f equals 1.0

And,

Case 2: Let's say ii = 1; then, as soon as initialization is completed:

val.i equals 1, and
val.f equals 1.401e-045#DEN ( a denormal value )

This is how unions work: both members share the same 32 bits of storage, and you should always keep that in mind.
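The two cases above can be checked without the union. The following is a minimal sketch, assuming a 32-bit float; the helper names bits_to_float and float_to_bits are hypothetical and not part of the original test case. It uses memcpy because reading the inactive member of a union is not guaranteed by standard C++, even though the compilers discussed here accept it.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical helper: reinterpret the 32 bits of an integer as a float.
float bits_to_float(std::int32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: reinterpret the 32 bits of a float as an integer.
std::int32_t float_to_bits(float f)
{
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Prints the values from Case 1 and Case 2.
void print_union_cases()
{
    std::printf("%f\n", bits_to_float(1065353216)); // Case 1: bit pattern 0x3F800000
    std::printf("%g\n", bits_to_float(1));          // Case 2: a denormal value
}
```

Calling print_union_cases() from main shows why Test-Case 4 has to go through the integer lanes and back: the integer 1065353216 and the float 1.0 are the same 32 bits.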

Please make a couple of small modifications to your data32 class as follows ( in order to simplify debugging ):
...
class data32
{
public:
typedef union union32
{
int i;
float f;
} union32;

public:
data32()
{
val.i = 0; // Added by SergeyK
};
data32( int ii )
{
val.i = ii; // Set Breakpoint here!
};
data32( unsigned int ii )
{
val.i = ( int )ii;
};
data32( unsigned long ii )
{
val.i = ( int )ii;
};
data32( float ff )
{
val.f = ff;
};

inline data32 & operator=( const int &i )
{
val.i = i; return *this;
};
inline data32 & operator=( const float &f )
{
val.f = f; return *this;
};
inline operator int()
{
return val.i;
};
inline operator unsigned int()
{
return ( unsigned int )val.i;
};
inline operator unsigned long()
{
return ( unsigned long )val.i;
};
inline operator float()
{
return val.f; // Set Breakpoint here!
};

private:
union32 val;
};
...

Another thing is the name of your union.

It is called union32, so I would use the __int32 data type for the member i. I wouldn't use size_t for the declaration because, when compiling for a 64-bit operating system, sizeof( i ) would be equal to 8.
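The size concern can be turned into a compile-time check. This is only a sketch: union32_fixed is a hypothetical name, and std::int32_t from <cstdint> stands in for the compiler-specific __int32 mentioned above.

```cpp
#include <cstdint>

// Hypothetical variant of union32 with a fixed-width integer member.
union union32_fixed
{
    std::int32_t i; // always 32 bits, for 32-bit and 64-bit targets alike
    float f;
};

// Compilation fails if either member is not 32 bits wide.
static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
static_assert(sizeof(union32_fixed) == 4, "union32_fixed must occupy exactly 32 bits");
```

With these assertions in place, a member type that grows to 8 bytes on a 64-bit build is reported by the compiler instead of showing up as wrong output.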

I'm still investigating what is wrong with your output, but I can already see that something is wrong with the part that outputs the contents of the members. As I've said already, there is nothing wrong with _mm_move_ss.

The processing is as follows when data are displayed:

(1) Non-default C++ constructor data32( ... ) -> (2) C++ operator float( ... ) -> printf( ... )

Try to debug it yourself in order to see the internals.

Here is a set of test cases that work properly in both configurations ( Debug and Release ):
...
// Sub-Test 82 - Issues with '_mm_move_ss' intrinsic function

__m128 in = { 0.0f, 0.0f, 0.0f, 0.0f };
__m128 inA = { 0.1f, 0.2f, 0.3f, 0.4f };
__m128 inB = { 0.5f, 0.6f, 0.7f, 0.8f };
__m128 out = { 0.0f, 0.0f, 0.0f, 0.0f };

// in = _mm_set1_ps( 1.0f );
// out = _mm_move_ss( in, in );
out = _mm_move_ss( inA, inB );

// Test-Case 1
printf( "Test-Case 1\n" );
printf( "%f\n", out.m128_f32[0] );
printf( "%f\n", out.m128_f32[1] );
printf( "%f\n", out.m128_f32[2] );
printf( "%f\n", out.m128_f32[3] );

// Test-Case 2
printf( "Test-Case 2\n" );
printf( "%f\n", ( float )out.m128_f32[0] );
printf( "%f\n", ( float )out.m128_f32[1] );
printf( "%f\n", ( float )out.m128_f32[2] );
printf( "%f\n", ( float )out.m128_f32[3] );

// Test-Case 3
printf( "Test-Case 3\n" );
printf( "%f\n", ( float )( data32 )out.m128_f32[0] );
printf( "%f\n", ( float )( data32 )out.m128_f32[1] );
printf( "%f\n", ( float )( data32 )out.m128_f32[2] );
printf( "%f\n", ( float )( data32 )out.m128_f32[3] );

// Test-Case 4
printf( "Test-Case 4\n" );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 0 ) ) );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 1 ) ) );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 2 ) ) );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 3 ) ) );
...

Output in Debug configuration is as follows:
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
0.500000
0.200000
0.300000
0.400000
Test-Case 2
0.500000
0.200000
0.300000
0.400000
Test-Case 3
0.500000
0.200000
0.300000
0.400000
Test-Case 4
0.500000
0.200000
0.300000
0.400000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed
Memory Blocks Allocated : 0
Memory Blocks Released : 0
Memory Blocks NOT Released: 0
Memory Tracer Integrity Verified - Memory Leaks NOT Detected

Deallocating Memory Tracer Data Table
Completed
...

Output in Release configuration is as follows:
...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
0.500000
0.200000
0.300000
0.400000
Test-Case 2
0.500000
0.200000
0.300000
0.400000
Test-Case 3
0.500000
0.200000
0.300000
0.400000
Test-Case 4
0.500000
0.200000
0.300000
0.400000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed
...

Please also consider two more cases which break compilation ( in order to verify data type casts / they need to be commented out as soon as verification is done ):
...
// Test-Case 5 - Error: invalid type conversion: "__m128" to "float"
printf( "Test-Case 5\n" );
printf( "%f\n", ( float )out );
printf( "%f\n", ( float )out );
printf( "%f\n", ( float )out );
printf( "%f\n", ( float )out );

// Test-Case 6 - Error: no suitable user-defined conversion from "__m128" to "data32" exists
printf( "Test-Case 6\n" );
printf( "%f\n", ( float )( data32 )out );
printf( "%f\n", ( float )( data32 )out );
printf( "%f\n", ( float )( data32 )out );
printf( "%f\n", ( float )( data32 )out );
...

Let me know if you need some tips on debugging in Release configuration. And one more thing. A test case for the union32 would be nice to have.

Eswar, I looked at Intel SDE manuals for additional verification and this is a summary of what _mm_move_ss does:
...
Sets the low word to the single-precision, floating-point value of b.

__m128 _mm_move_ss( __m128 a, __m128 b );

MOVSS

The upper 3 single-precision, floating-point values are passed through from a.

r0 := b0
r1 := a1
r2 := a2
r3 := a3
...
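The lane mapping above can be checked with a few lines of code. This is a minimal sketch, assuming an x86 target with SSE; move_ss_lanes and print_move_ss_example are hypothetical helper names. It reads the lanes back with _mm_storeu_ps, so it does not depend on the data32 class or on compiler-specific members such as m128_f32.

```cpp
#include <xmmintrin.h> // SSE: _mm_loadu_ps, _mm_move_ss, _mm_storeu_ps
#include <cstdio>

// Hypothetical helper: computes _mm_move_ss(a, b) and stores the four
// result lanes into out[0..3], lane 0 first.
void move_ss_lanes(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_move_ss(va, vb)); // r0 := b0, r1..r3 := a1..a3
}

// Prints the result for the inA / inB values used in the test cases above.
void print_move_ss_example()
{
    const float a[4] = { 0.1f, 0.2f, 0.3f, 0.4f };
    const float b[4] = { 0.5f, 0.6f, 0.7f, 0.8f };
    float r[4];
    move_ss_lanes(a, b, r);
    std::printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
}
```

According to the mapping, passing the same array for both arguments should give back that array unchanged, which is the case reported as failing in this thread.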

Once again, I don't see any problems with _mm_move_ss. It actually belongs to the original set of SSE instructions ( see xmmintrin.h / almost 15 years old ).

Sergey,

Thanks for the detailed analysis.

out = _mm_move_ss( in, in );// => single var fails!
out = _mm_move_ss( inA, inB );//=> two different vars works in release mode.

When I use two different variables for _mm_move_ss(), it works in release mode. If I use a single variable, it fails in release mode.

>>Thanks for the detailed analysis.
>>
>>out = _mm_move_ss( in, in );// => single var fails!

Eswar,

I need detailed proof of it, like screenshots of the generated assembly code and of Visual Studio's Watch and Register windows.

Once again, your test case fails on the output of values, and I verified that the MOVSS instruction does the right job. However, I have not completed my investigation and I'll post my results as soon as it is completed ( I'll review the _mm_move_ss( in, in ) test case again ).

My question is: Did you Debug Release configuration of your test application?

Note: It's the weekend and let's take some break...

Here is a screenshot ( proof that the MOVSS instruction works correctly ); take a look:

Attachment: investigationresults.jpg ( image/jpeg, 163.93 KB )

Eswar,

Please provide technical details on the CPU in your computer and the Intel C++ compiler you're using.

I've finally reproduced the problem on a computer with an Ivy Bridge processor, but so far I do not have an exact answer on what is wrong.

To summarize: the test case from the initial post works on a computer with a Pentium 4 processor and fails on a computer with an Ivy Bridge processor.

What is your progress?

Here is an Update:

1. This is not a problem with the MOVSS instruction; it is incorrect code generation by some major versions of the Intel C++ compiler ( 13.x - confirmed / 14.x - not confirmed yet ).

2. Intel C++ compilers starting from version 13.x clear all members of the __m128 data type ( a union ) in Release configurations for 32-bit and 64-bit Windows platforms when _mm_move_ss is called with identical arguments.

3. Not verified for the Linux or Mac versions of the Intel C++ compiler. Everything is correct with Intel C++ compiler version 12.x. No verification was done for older versions of the Intel C++ compiler.

4. Microsoft C++ compilers from Visual Studio 2005 and 2008 generated correct code and passed all my tests.

Please take a look at two more posts with results. Thanks.

Application - IccTestApp - WIN32_ICC ( 64-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - ScaLibTestApp - WIN32_MSC ( 64-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - IccTestApp - WIN32_ICC ( 64-bit ) - Release - SergeyK comment - FAILED
Tests: Start
> Test1017 Start <
Test-Case 1
0.000000
0.000000
0.000000
0.000000
Test-Case 2
0.000000
0.000000
0.000000
0.000000
Test-Case 3
0.000000
0.000000
0.000000
0.000000
Test Completed in 0 ticks
> Test1017 End <

Application - ScaLibTestApp - WIN32_MSC ( 64-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release - SergeyK comment - FAILED
Tests: Start
> Test1017 Start <
Test-Case 1
0.000000
0.000000
0.000000
0.000000
Test-Case 2
0.000000
0.000000
0.000000
0.000000
Test-Case 3
0.000000
0.000000
0.000000
0.000000
Test Completed in 0 ticks
> Test1017 End <

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Sergey,

This is my config:

Processor:Intel(R) Core(TM) i7-4800MQ CPU @2.70GHz

Visual Studio 2010, Intel Compiler.

In debug mode it works fine; release mode results are attached.

Release Mode:

0.000000
0.000000
0.000000
0.000000

attached results:

Eswar, please get the exact version string of your Intel C++ compiler from the Command Prompt. Thanks.
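The compiler version can also be reported from inside the test application through predefined macros. This is a sketch rather than a replacement for the version string printed by the compiler itself; compiler_id is a hypothetical helper, and the exact numeric encoding of __INTEL_COMPILER should be checked against the compiler documentation.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: describes the compiler that built this file.
// The Intel check comes first because the Intel compiler may also define
// _MSC_VER or __GNUC__ for compatibility; Clang also defines __GNUC__.
std::string compiler_id()
{
    std::ostringstream os;
#if defined(__INTEL_COMPILER)
    os << "Intel C++ " << __INTEL_COMPILER;
#elif defined(__clang__)
    os << "Clang " << __clang_major__ << "." << __clang_minor__;
#elif defined(__GNUC__)
    os << "GCC " << __GNUC__ << "." << __GNUC_MINOR__ << "." << __GNUC_PATCHLEVEL__;
#elif defined(_MSC_VER)
    os << "Microsoft C++ " << _MSC_VER;
#else
    os << "unknown compiler";
#endif
    return os.str();
}
```

Printing compiler_id() at the start of each test run makes it easier to tell which compiler produced a given Debug or Release log.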

Sergey,

I am using latest compiler.

Intel® C++ Compiler 14.0

Intel® C++ Composer XE 2013 SP1

Thanks,

Eswar Reddy K

>>I am using latest compiler.
>>
>>Intel® C++ Compiler 14.0
>>
>>Intel® C++ Composer XE 2013 SP1

Please use the Microsoft C++ compiler to verify that the results are correct in both configurations, that is, all ones.

I'll try to do additional verification of _mm_move_ss intrinsic function with MinGW version 4.8.1.

>>...I'll try to do additional verification of _mm_move_ss intrinsic function with MinGW version 4.8.1...

Everything is fine. Here is a test-case for GCC / ICPC for Linux or for MinGW for Windows:
...
__v4sf in = { 0.0f, 0.0f, 0.0f, 0.0f };
__v4sf inA = { 0.1f, 0.2f, 0.3f, 0.4f };
__v4sf inB = { 0.5f, 0.6f, 0.7f, 0.8f };
__v4sf out = { 0.0f, 0.0f, 0.0f, 0.0f };

// Version 1 - Input: ArgA = ArgB ( in, in )
in = _mm_set1_ps( 1.0f );
out = _mm_move_ss( in, in );

// Version 2 - Input: ArgA != ArgB ( inA, inB )
// out = _mm_move_ss( inA, inB );

// Test-Case 1
printf( "Test-Case 1\n" );
printf( "%f\n", out[0] );
printf( "%f\n", out[1] );
printf( "%f\n", out[2] );
printf( "%f\n", out[3] );

// Test-Case 2
printf( "Test-Case 2\n" );
printf( "%f\n", ( float )out[0] );
printf( "%f\n", ( float )out[1] );
printf( "%f\n", ( float )out[2] );
printf( "%f\n", ( float )out[3] );

// Test-Case 3
printf( "Test-Case 3\n" );
printf( "%f\n", ( float )( data32 )out[0] );
printf( "%f\n", ( float )( data32 )out[1] );
printf( "%f\n", ( float )( data32 )out[2] );
printf( "%f\n", ( float )( data32 )out[3] );
...

Outputs with MinGW v4.8.1:

Output for Version 1 - Input: ArgA = ArgB ( in, in ) - Debug

Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed
Memory Blocks Allocated : 0
Memory Blocks Released : 0
Memory Blocks NOT Released: 0
Memory Tracer Integrity Verified - Memory Leaks NOT Detected

Output for Version 1 - Input: ArgA = ArgB ( in, in ) - Release

Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed

Outputs with MinGW v4.8.1:

Output for Version 2 - Input: ArgA != ArgB ( inA, inB ) - Debug

Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
0.500000
0.200000
0.300000
0.400000
Test-Case 2
0.500000
0.200000
0.300000
0.400000
Test-Case 3
0.500000
0.200000
0.300000
0.400000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed
Memory Blocks Allocated : 0
Memory Blocks Released : 0
Memory Blocks NOT Released: 0
Memory Tracer Integrity Verified - Memory Leaks NOT Detected

Output for Version 2 - Input: ArgA != ArgB ( inA, inB ) - Release

Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
0.500000
0.200000
0.300000
0.400000
Test-Case 2
0.500000
0.200000
0.300000
0.400000
Test-Case 3
0.500000
0.200000
0.300000
0.400000
Test Completed in 16 ticks
> Test1017 End <

Eswar,

Your Release configuration uses the /O2 option. To keep it and still solve the problem, here is a simple workaround:

...
#pragma intel optimization_level 0

void SomeFunction( ... )
{
...
// Your processing with _mm_move_ss intrinsic function
...
}
...

I verified the workaround for Release configuration of my test-cases with Intel C++ compiler v13.x and it works.

Thanks Sergey,

I have replaced _mm_move_ss() with vrz = _mm_shuffle_ps(vrx,vrx,_MM_SHUFFLE(3,2,1,0));

It works fine for me.
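For reference, the replacement relies on _MM_SHUFFLE(3,2,1,0) being the identity selection: each result lane takes the lane with the same index from its source. The sketch below illustrates this, assuming an x86 target with SSE; copy_via_shuffle is a hypothetical helper name. A plain assignment ( vrz = vrx ) also copies the whole register.

```cpp
#include <xmmintrin.h> // SSE: _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps

// Hypothetical helper: copies four floats through an identity shuffle.
void copy_via_shuffle(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    // Lanes 0 and 1 come from the first argument, lanes 2 and 3 from the
    // second; with both arguments equal to v, the result is v itself.
    __m128 copy = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 1, 0));
    _mm_storeu_ps(out, copy);
}
```

Whether a given compiler emits an actual shuffle instruction or folds it into a register move is up to its optimizer; the sketch only shows that the result equals the input.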

Thanks for your time! The only question remaining is why _mm_move_ss() fails in release mode with /O2; this is just for academic interest :)

Regards,

Eswar Reddy K

>>...The only question remaining is why _mm_move_ss() fails in release mode with /O2; this is just for academic interest :)

Let's leave that question for Intel software engineers. :)

And now seriously...

Eswar,

I don't think that there is any problem with the MOVSS instruction. I think there is a bug, still not confirmed, in the Intel C++ compiler when an application is compiled in Release configuration with /O2 optimization and uses ArgA = ArgB for the MOVSS instruction.
