Optimization code generation issue. XE 2013 Update 3

Hi,

I would like to know what could be the reason for the compiler to generate such assembly code.

I am using the Intel(R) Composer XE 2013 Update 3 (package 171) under Windows 7 64 bit.

Flags : /EHa /GR /O3 /Oi /Ot /Oa- /Qip /QxSSE4.2 /Qstd=c++0x /Z7

Why the hell is it reading the _ptData variable so many times? (First in rax, then into r8, r9, r10 and finally back into rax!)

I do have the impression that it is falling back to debug code here.

And just to say it in advance: I can't provide special code to reproduce it (I don't have the time to try to reproduce the issue). I just want some general advice, if possible, about things I could try to remedy the issue.

Thanks.

Laurent

000007FEECB621CE 48 89 85 A8 00 00 00 mov         qword ptr [_ptData],rax
                        _ptData.common_data = &localData->common_data;;
000007FEECB621D5 48 8B 95 A8 01 00 00 mov         rdx,qword ptr [localData]
000007FEECB621DC 48 8D 4A 70      lea         rcx,[rdx+70h]
000007FEECB621E0 4C 8B 85 A8 00 00 00 mov         r8,qword ptr [_ptData]
000007FEECB621E7 49 89 48 10      mov         qword ptr [r8+10h],rcx
                        _ptData.x = x;
000007FEECB621EB 4C 8B 8D A8 00 00 00 mov         r9,qword ptr [_ptData]
000007FEECB621F2 66 45 89 61 28   mov         word ptr [r9+28h],r12w
                        _ptData.y = y;
000007FEECB621F7 4C 8B 95 A8 00 00 00 mov         r10,qword ptr [_ptData]
000007FEECB621FE 66 41 89 72 2A   mov         word ptr [r10+2Ah],si
                        _ptData.frameBuffer = frameBufferOrigin +y *bpl + x * bpp;
000007FEECB62203 44 8B 5D 0C      mov         r11d,dword ptr [bpl]
000007FEECB62207 41 0F AF F3      imul        esi,r11d
000007FEECB6220B 48 63 F6         movsxd      rsi,esi
000007FEECB6220E 48 03 B5 90 00 00 00 add         rsi,qword ptr [pos_delta]
000007FEECB62215 8B 45 10         mov         eax,dword ptr [bpp]
000007FEECB62218 44 0F AF E0      imul        r12d,eax
000007FEECB6221C 4D 63 E4         movsxd      r12,r12d
000007FEECB6221F 49 03 F4         add         rsi,r12
000007FEECB62222 48 8B 85 A8 00 00 00 mov         rax,qword ptr [_ptData]
000007FEECB62229 48 89 70 20      mov         qword ptr [rax+20h],rsi

>>...Why the hell is it reading the _ptData variable so many times?..

What if the compiler unrolls some C/C++ for loop?

The code is in a loop, but the loop is not unrolled as far as I can see. Since the number of loop iterations is computed just before the loop, I don't think the compiler would unroll it anyway.

I have also put that code inside its own block to make clear the variable lifetime.

For loop here //

{

                    {
                        MyDataStruct *_ptData = cached_points_ptr+cache_index;
                        _ptData->common_data = &localData->common_data; //pointer
                        _ptData->x = x; //ushort
                        _ptData->y = y; //ushort
                        _ptData->frameBuffer = frameBufferOrigin +y *bpl + x * bpp; //pointer
                        _ptData->zbuffer = zbuffer; //pointer
                        _ptData->color = col1; //F32vec4
                        _ptData->z0 = z0; //float
                        if ( use_texture )
                        {
                            _ptData->u0 = ((const float *)&uv1)[3]; //uv1 is a F32vec4
                            _ptData->v0 = ((const float *)&uv1)[2];
                        }
                    }
}
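
For anyone who wants to try this outside the original project, a minimal self-contained sketch of the block above might look like the following. The struct definitions, the `fill_cache` name, its parameter list and the `float color[4]` stand-in for F32vec4 are all assumptions made for illustration; the real MyDataStruct layout was not posted, and there is no guarantee this reduced version produces the same generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the types used in the post; the real layout is unknown.
struct CommonData { int placeholder; };
struct LocalData  { CommonData common_data; };

struct MyDataStruct {
    CommonData*   common_data;
    std::uint8_t* frameBuffer;
    float*        zbuffer;
    float         color[4];   // stands in for F32vec4
    float         z0, u0, v0;
    std::uint16_t x, y;
};

// Fills `count` cache entries for one row `y`, using the same stores as the post.
void fill_cache(MyDataStruct* cached_points_ptr, std::size_t count,
                LocalData* localData, std::uint8_t* frameBufferOrigin,
                float* zbuffer, int bpl, int bpp, std::uint16_t y)
{
    assert(cached_points_ptr != nullptr && localData != nullptr);
    for (std::size_t cache_index = 0; cache_index < count; ++cache_index) {
        // x is derived from the index here only to have something to store.
        const std::uint16_t x = static_cast<std::uint16_t>(cache_index);
        MyDataStruct* _ptData = cached_points_ptr + cache_index;
        _ptData->common_data = &localData->common_data;
        _ptData->x = x;
        _ptData->y = y;
        _ptData->frameBuffer = frameBufferOrigin + y * bpl + x * bpp;
        _ptData->zbuffer = zbuffer;
    }
}
```

In optimized code one would expect `_ptData` to be computed once per iteration and kept in a register for all the stores, rather than reloaded from the stack before each one.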

We are not calling icl directly but via an ant script, so I wonder if the order of my options has something to do with that.

I am going to try to remove the Z7 and see what code it produces without it.

Ok it has nothing to do with Z7 being on the command line. I removed it and the code is still the same.

At least that is a relief because I would have hated having to remove the debug info right now.

Take into account that you have three options, /O3 (the aggressive one), /Oi and /Ot, that are meant to increase execution speed, and the Intel C++ compiler will try to minimize overhead.

I agree that option /Z7 is not relevant in this case.

Actually I have tried with /O2 or /O3 and without /Ot for example and the generated code didn't change at all.

Even with /O1 I would expect the multiple reads of the structure pointer to go away; this is definitely not an advanced optimization. (Some code that I posted has apparently not appeared yet.)

I really wonder if there isn't something else there somewhere obvious that I am missing...

Now here is an interesting development.

If I put my simple code into a function then the code writing in my struct gets optimized correctly.

I have extra code because of the parameters being passed and read in the function but the code setting the data is not reading the data pointer for every member write. How strange...
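
The restructuring described here could be sketched roughly as follows. This is only an illustration under assumptions: `PointEntry` is a reduced, hypothetical version of the struct, and `store_point` and its parameters are made-up names, not the actual function from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, reduced version of the cached point structure.
struct PointEntry {
    const void*   common_data;
    std::uint8_t* frameBuffer;
    std::uint16_t x, y;
};

// All inputs are passed in as parameters and every store goes through one
// reference, so the destination address only needs to be held in one register.
void store_point(PointEntry& dst, const void* common,
                 std::uint8_t* origin, int bpl, int bpp,
                 std::uint16_t x, std::uint16_t y)
{
    assert(origin != nullptr);
    dst.common_data = common;
    dst.x = x;
    dst.y = y;
    // Widen before multiplying so the byte offset is computed in pointer width.
    dst.frameBuffer = origin + static_cast<std::ptrdiff_t>(y) * bpl
                             + static_cast<std::ptrdiff_t>(x) * bpp;
}
```

The caller would then do something like `store_point(cached_points_ptr[cache_index], &localData->common_data, frameBufferOrigin, bpl, bpp, x, y);` inside the loop. The cost is the extra parameter passing mentioned above.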

Trying to inline the same function to see if it makes a difference now.

Unfortunately, inlining the function results in the same unoptimized code...

I wonder if I need to try to change some inlining depth options too...

>>Now here is an interesting development.
>>
>>If I put my simple code into a function then the code writing in my struct gets optimized correctly...
>>...
>>Unfortunately, inlining the function results in the same unoptimized code...
>>
>>I wonder if I need to try to change some inlining depth options too...

Please provide a test case that reproduces the problem. Unfortunately, I don't think anybody will try to reconstruct the C/C++ code from the assembly listing in the initial post.

I will try to write a simple app doing the same thing but I can't guarantee that it will lead to the same generated code.

Laurent.

Ok, I have found the single line that toggles the optimization off.

No idea why it does that; I am trying to write some code to reproduce it. I had almost the same code except for that line, and it was compiling fine.

That line is just using a class storing 2 ints (Size2 class) and I can't see anything that could cause the issue.

20% speed improvement just by fixing that. I wonder how often that occurs in our code now.

I will try to prepare something tomorrow.

>>That line is just using a class storing 2 ints (Size2 class) and I can't see anything that could cause the issue.
>>
>>20% speed improvement just by fixing that. I wonder how often that occurs in our code now.

We recently discussed a case where the /O3 compiler option and a really small modification in C/C++ code negatively affected performance by a similar amount. (VTune showed that it was a cache-related issue.)

>>
>>I will try to prepare something tomorrow.

Thanks for the update. A complete test case should help to understand what is wrong, or possibly wrong, in the code you've implemented or in the Intel C++ compiler (I'm leaving a small possibility for that as well).

Just to be clear, the issue is happening with /O2 too. And the code works in both cases, it is just slower :)

The interesting part is that the first two lines of my function have almost no relation to the rest of it, and the variable causing the issue is not even used afterwards in the function, yet it affects all the code generated.

I should be able to extract something that compiles and shows the issue but it won't run. The code extract is part of something much bigger that would take too long to separate from the full code.

More tomorrow I hope.

Jennifer J. (Intel)

So icl does fine with a simple test case. If you could send the preprocessed .i file, that would be great. Please note the line that is causing the problem.

Or report it to Intel Premier Support.

thank you,

Jennifer

Ok, I have been able to create a small project showing the issue.

I will make an archive of it. To whom should I send it? Since it contains code that I don't really want to publish in the forum I would prefer if someone could send me a private message with the email address.

Or should I report it to Intel Premier Support instead?

Just to clarify again: ICL is not doing fine on my simple test case either :) so you should be able to investigate what is going wrong.

There are three options:

1. E-mail to Jennifer using Private Messaging system ( Recommended / Already aware of the problem )
2. Submit to Intel Premier Support
3. Upload to the thread ( in a new post ) and then everybody will be able to look into it

Thanks for all your efforts.

Ok, going for option 1.

Thanks everybody.

Actually it is probably related to the use of /EHa... Moving to /EHs fixes the issue too.
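
One way to check this kind of flag sensitivity is to compile a small probe file twice, changing only the exception-handling option, and compare the assembly listings. The sketch below is a guess at what such a probe could look like, not the actual test case from this thread: the `Size2` class is a hypothetical stand-in for the two-int class mentioned earlier, and `Entry` and `probe` are made-up names. The command lines in the comment assume the MSVC-compatible /FA and /Fa listing options and should be checked against the compiler documentation.

```cpp
#include <cassert>
#include <cstdint>

// Assumed command lines (only the /EH option differs between the two builds):
//   icl /c /O2 /EHa /FA /Faprobe_eha.asm probe.cpp
//   icl /c /O2 /EHs /FA /Faprobe_ehs.asm probe.cpp
// Then compare probe_eha.asm and probe_ehs.asm around the stores below.

// Hypothetical stand-in for a small class storing two ints.
class Size2 {
public:
    Size2(int w, int h) : w_(w), h_(h) {}
    int width() const  { return w_; }
    int height() const { return h_; }
private:
    int w_, h_;
};

struct Entry {
    std::uint16_t x, y;
    std::uint8_t* frameBuffer;
};

void probe(Entry* entries, int count, std::uint8_t* origin, int bpl, int bpp)
{
    assert(entries != nullptr);
    Size2 size(bpl, count);   // local object that is not used again afterwards
    (void)size;
    for (int i = 0; i < count; ++i) {
        Entry* p = entries + i;
        p->x = static_cast<std::uint16_t>(i);
        p->y = 0;
        p->frameBuffer = origin + i * bpp;
    }
}
```

If the /EHa listing reloads `p` from the stack before each store and the /EHs listing does not, that would match the behaviour described in this thread; if both look the same, the trigger is something this reduced probe does not capture.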

That's strange, because /EH '...Specifies the model of exception handling to be used by the compiler...' and the suffix s means:

'...The exception-handling model that does not catch asynchronous exceptions and tells the compiler to assume that extern C functions do throw an exception...'

Before, it was used with the suffix a, which means:

'...The exception-handling model that catches asynchronous (structured) exceptions and tells the compiler to assume that extern C functions do throw an exception...'.

Do you have any try-catch blocks in your code?

I know, and no, the code I provided to Jennifer doesn't have a simple bit throwing exceptions or catching them :)

Which is why I was also surprised to see the change. I can PM you the archive if you want to see the code.

Laurent

single. not simple...
