Two bad optimization problems with SSE

Hi,

I have found two setups where the SSE optimization breaks down and the compiler generates very strange code.

Case 1: Unions with __m128
------------------------------
struct float4
{
    union
    {
        __m128 data;
        struct { float x, y, z, w; };
    };

    // overloaded operators like +, -, *, /, etc.
};

The thing is that even if I never use the members of the unnamed struct above, the compiler generates nonsense assembly for all code that operates on instances of the float4 struct. Of course, I am using all optimization options and the latest versions of both ICC and MSVC.

Case 2: Structs with __m128
------------------------------

struct Vec34
{
    struct Vec3 { __m128 x, y, z; };

    Vec3 containers[4];
};

When accessing members of the struct and doing some operations on them, like:

Vec34 v1;
// initialization
v1.containers[0].x = _mm_mul_ps(v1.containers[0].y, v1.containers[0].z);

the compiler again generates many and very messy instructions. Again, I've set all optimization options and latest version of compilers.

I have attached a ZIP file with test case code, with a README.txt and comments in the source. I can work around the first problem by simply commenting the float struct out of the union, and my application's performance increases by 20-40%, which is dramatic. However, I cannot work around the second problem :(

Now, the first test in particular compiles well on Linux with the latest compiler version, BUT the general problem appears there too, because my application's performance increases even more than on Windows.

The second test fails on both platforms. Please look at them and consider them. These are really basic constructs with SSE vectors, and I guess many people are affected by these problems.

Thank you in advance!


As for the union, I am seeing only vector instructions, so I am unable to reproduce the issue.

As for the struct, there is indeed a lot of scalar code involved, which I have seen before with some of my own code, but it wasn't important enough to pursue at the time.

Hint for Jennifer: what I find very ugly/problematic in the generated code is memory copying done using a single register. If you undefine UNION_TEST, compile main.cpp using /FAs, and search for ";;; const Container r = fun(6);" in the .ASM file, you will see the problem right below the funny packing commented with ";199.33". The compiler has generated 96 MOV instructions, of which 48 are reads and 48 are writes, and all of them access consecutive stack addresses, so a loop was most likely possible instead of such a waste of space and decoding bandwidth, which I am sure is also detrimental to performance.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Indeed, these unpack/move instructions are driving me crazy! The messed-up assembly for both tests looks very similar, and I guess it's the same compiler bug that produces them.

As for the union test, Igor, did you compile under Linux or Windows? As I said above, unfortunately this particular union test doesn't fail on Linux, but the problem is obviously there too, because working around it speeds up my application very much.

I always compile on Windows, and I used version 10.1.020 of ICC. As I said, for the union test I only saw vector instructions. If you could separate those two test cases, make them more compact, and report them to Intel Premier Support, that would be nice.

In the meantime, you can work around those problems by using assembler code in a separate .asm file. Basically, you write external functions in assembler and make them friends of your class. Then you call them from a class member function (which serves as a wrapper) and pass the this pointer explicitly as the first argument. If you are not familiar with what I am talking about, check Agner's optimization manuals.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Yes, for the union test only vector instructions are generated, but did you notice that the number of instructions generated for f1() is 3 times larger than for f2(), which essentially does exactly the same thing? Also, you can compare with the instructions generated by MSVC.

As for Agner's optimization manuals, I know them, but there's never time to read them thoroughly :) I didn't know about this trick, thanks, I can try it.

By the way, do I have a better chance of a faster bug fix if I send the test cases to Premier Support? I would assume so, but still?

Haven't compared those, will check again.

As for the trick I am mentioning, it goes like this:

// Example 7.1b. Member function changed to friend function:

// An incomplete class declaration is needed here:
class MyList;

// Function prototype for friend function with 'this' as parameter:
extern "C" int MyList_Sum(MyList * ThisP);

// Class declaration:
class MyList {
protected:
	// Data members:
	int length;
	int buffer[100];
public:
	MyList(); // Constructor
	void AttItem(int item); // Add item to list

	// Make MyList_Sum a friend:
	friend int MyList_Sum(MyList * ThisP);

	// Translate Sum to MyList_Sum by inline call:
	int Sum()
	{
		return MyList_Sum(this);
	}
};

The reason for doing it that way is twofold: first, you need to use pure assembler because inline assembler is not supported in x64, and second, using pure assembler to write C++ member functions is not portable because of different name mangling and different methods of passing the this pointer between various compilers.

As for reporting the issue: each issue you want to see fixed should be reported to Premier Support, unless it can be considered a show-stopper for everyone; in that case, reporting it here may help others and avoid duplicate reports. I have noticed so far that such cases get picked up by Intel engineers who frequent the forum and are submitted internally, so they seem to get some sort of priority treatment, but I wouldn't count on that for each and every case.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Igor,

The .zip file got lost. Could you reattach it? Thanks!

As far as getting a faster fix, it depends on the fix itself: how difficult it is, and how big a change is needed.

About issue reporting: one benefit of Premier Support is that it's a secure website. But it's OK to post here so others might benefit from it.

I am reattaching it. Hopefully this time it works.

Please report back whether you have been able to reproduce the bug.

Attachment: ICCTest.zip (7.06 KB)

Thanks "iliyan" and Igor.

I duplicated the problems. Our engineer will be looking into it soon. When there's a fix or workaround, I'll post it here.

This is great to hear! Thank you, and thanks to Igor for the help again!

Both of you are welcome.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Thanks for reporting these issues.

I thought I'd fill you in on the problems, and also ask a quick question.

The unpacks are pretty ugly, but we generate them on purpose (believe it or not). If you notice, we chose to lower the struct copy using field-by-field 4-byte copies. The destination is then referenced as a 128-bit load. This is a store-forwarding issue, since the hardware can't forward small stores into one large load (see the optimization guide for more info), resulting in a performance penalty. We use movss/unpacks to avoid this issue.

I'm not saying this is the best way to do it, but it's as good as we could do with field-by-field copies. We're taking a look now and will probably try to change the struct copy to do a full 128-bit copy in this case, to (a) do a faster copy and (b) do a faster load.

As for the inefficient memcpy (lots of inlined 4-byte copies), I agree that this can be done better with a smaller loop. I'm not sure how likely we are to fix this issue in our product (since it's primarily a code-size issue), but it will hopefully be addressed in the future.

The question I have, however, is this: I see you're using SSE* intrinsics, but are you also using any options equal to or greater than -QxW? If so, I think those copies should all get optimized into 128-bit (or at least 8-byte) copies.

Thanks,

Zia Ansari.

Hi MADzansari,

Thanks for looking into the problem.

Unfortunately, I am not an expert in architectures and cannot thoroughly understand what you are explaining. The thing is that I noticed strange compiler behavior for the code I sent.

While playing with the union test, I found that even a

union { __m128 vec1; __m128 vec2; };

confuses the compiler completely, even though I use only vec1 from this union. Actually, no matter what I put an __m128 in a union with, the compiler generates messy code. So, from a developer's point of view, putting an __m128 in a union means the compiler cannot generate good code. This has nothing to do, IMHO, with particular load/copy issues. When I compile and look at the disassembly, I see 110 instructions generated for the function f1. If I remove vec2 from the union (the test code doesn't use it at all!), I see 37 instructions for the function f1. And I think this should be a bug somewhere.

As for the second, struct test: depending on how you *express* the same data layout in C++, the compiler generates completely different instructions, messy in the same pattern. And I think this should be a bug somewhere else, or even the same bug.

And yes, I've tried using /QxW and everything above. Same result. I've run the test on a Pentium M and on a Core 2 Duo Penryn.

My general problem is that I simply cannot have a class like this:

struct float4
{
    union
    {
        __m128 data;
        struct { float x, y, z, w; };
    };
};

to have overloaded operators for it, and to work with it like this:

float4 f41, f42;
f41.x = 0.1f; f42.x = 0.1f; ...

float4 sum = f41 + f42;

And what's even stranger... I have a copy of an Intel SIMD math library where there is an SPVector class that uses exactly the same union as above! And it's from 2001! How can the optimizer not optimize this well?

Zia,

movss and 2x unpcklps can be replaced with movss and shufps. That is more compact and easier to follow when looking at the disassembly.

I have also noticed that the Intel compiler sometimes unpacks constants at compile time, thus wasting memory. Is the speed benefit of a single movaps over movss and 2x unpcklps so great as to justify such wastefulness, considering that it is usually a one-time operation outside of the loop?

As for the copying, it is unacceptable, and it must be fixed by converting it into a loop. Just imagine such a snail-paced memcpy() inside a function that gets called 1,000,000 times.

I saw those issues while compiling with -QxS, since I own a Core 2 Duo E8200.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Igor & Iliyan,

Yes, I agree that the union case is a bug, no matter what fields you put into it. The fix, on our side, is being worked on; it should get rid of all the unnecessary code, leaving just 128-bit operations.

As for the struct copy, I agree that the 4-byte copies are ugly. I'm just having a hard time reproducing it with the given switches. With your test case, with "UNION_TEST" undefined, when I build with "icl main.cpp -c -Fa -QxW" (or anything above), I see:

>>> const Container r = fun(6);

$B1$178: ; Preds $B1$178 $B1$3
movaps xmm0, XMMWORD PTR [-376+ebp+eax] ;196.33
movaps XMMWORD PTR [-568+ebp+eax], xmm0 ;196.33
movaps xmm1, XMMWORD PTR [-392+ebp+eax] ;196.33
movaps XMMWORD PTR [-584+ebp+eax], xmm1 ;196.33
movaps xmm2, XMMWORD PTR [-408+ebp+eax] ;196.33
movaps XMMWORD PTR [-600+ebp+eax], xmm2 ;196.33
movaps xmm3, XMMWORD PTR [-424+ebp+eax] ;196.33
movaps XMMWORD PTR [-616+ebp+eax], xmm3 ;196.33
sub eax, 64 ;196.33
jne $B1$178 ; Prob 90% ;196.33

Without -Qx?, I see the big string of 4-byte loads/stores.

Perhaps I'm doing something wrong?

Thanks,

Zia.

Hi MADzansari,

Yes, the fragment you pasted is OK. However, look at the block just above and compare the assembly when PROBLEMMATIC_CASE is defined and not defined. Just compile the two versions with the command line you use and put the assembly output in a diff tool :)

This is exactly why I put the "int 3" instructions there, so that you can compare what's happening in between them. The instructions around the SSE multiplication are messy.

I'm attaching the assembly generated between the two "int 3" instructions in a ZIP, where you can clearly see what is wrong. I am running Windows Vista with VS2008 and ICC 10.1.020, and with "icl main.cpp -c -Fa -QxW" I can clearly see the differences.

Best,
Iliyan

Attachment: assembly.zip (753 bytes)

Hi Iliyan,

Got it. Yes, the issue above stems from the same inefficient struct-copy problem that I explained earlier. The fix for all of these issues should be the same (for us). This is what I see in my workspace with a temp fix:

$B1$2: ; Preds $B1$180
; Begin ASM
int 3 ;194.9
; End ASM
; LOE
$B1$3: ; Preds $B1$2
movaps xmm0, XMMWORD PTR [-312+ebp] ;196.33
movaps XMMWORD PTR [-360+ebp], xmm0 ;196.33
movaps xmm1, XMMWORD PTR [-296+ebp] ;196.33
movaps XMMWORD PTR [-344+ebp], xmm1 ;196.33
movaps xmm2, XMMWORD PTR [-280+ebp] ;196.33
movaps XMMWORD PTR [-328+ebp], xmm2 ;196.33
; LOE
$B1$4: ; Preds $B1$3
movaps xmm0, XMMWORD PTR [-344+ebp] ;196.33
mulps xmm0, XMMWORD PTR [-328+ebp] ;196.33
movaps xmm1, XMMWORD PTR [-296+ebp] ;196.33
mulps xmm1, XMMWORD PTR [-280+ebp] ;196.33
movaps XMMWORD PTR [-360+ebp], xmm0 ;196.33
movaps XMMWORD PTR [-312+ebp], xmm1 ;196.33
mov eax, 192 ;196.33
; LOE eax
$B1$179: ; Preds $B1$179 $B1$4
movaps xmm0, XMMWORD PTR [-376+ebp+eax] ;196.33
movaps XMMWORD PTR [-568+ebp+eax], xmm0 ;196.33
movaps xmm1, XMMWORD PTR [-392+ebp+eax] ;196.33
movaps XMMWORD PTR [-584+ebp+eax], xmm1 ;196.33
movaps xmm2, XMMWORD PTR [-408+ebp+eax] ;196.33
movaps XMMWORD PTR [-600+ebp+eax], xmm2 ;196.33
movaps xmm3, XMMWORD PTR [-424+ebp+eax] ;196.33
movaps XMMWORD PTR [-616+ebp+eax], xmm3 ;196.33
sub eax, 64 ;196.33
jne $B1$179 ; Prob 90% ;196.33
; LOE eax
$B1$5: ; Preds $B1$179
; Begin ASM
int 3 ;198.9
; End ASM

Thanks,

Zia.

Zia, wouldn't it be better if those loads and stores were not interleaved like that? If I remember correctly, Intel's optimization manual advises against interleaving loads and stores.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Great! Thanks, Zia.

I suspected the issue would be the same for both cases: less work for you and (I hope) less waiting for me :)

Best,
Iliyan
