BUG: Poor hybrid SSE/AVX code generated

Hi,

I have a piece of code that I cannot disclose right now (I will try to reproduce it in a shorter example). The thing is, when I compile it with /QxAVX, it generates this code:

(VTune listing; only the Clockticks columns are shown here, the remaining microarchitecture metrics were almost all zero)

Address     Line  Assembly                               Clockticks: Total  Clockticks: Self
0x100b931e  962   vmovdqu xmm5, xmmword ptr [ebx]        0.0%               4,983,341
0x100b9322  962   vmovdqu xmmword ptr [esp+0xa0], xmm5   1.8%               554,297,692
0x100b932b  962   vmovdqu xmm0, xmmword ptr [ebx+0x10]   0.1%               22,394,000
0x100b9330  962   vmovdqu xmmword ptr [esp+0xb0], xmm0   0.0%               6,031,618
0x100b9339  962   test esi, esi                          0.0%               14,815,452
0x100b933b  962   jz 0x100b941e <Block 50>               0.0%               1,904,153
0x100b9341        Block 49:                              0.0%
0x100b9341  962   mov edi, dword ptr [esp+0x54]          0.0%
0x100b9345  962   vpaddw xmm3, xmm5, xmm4                0.0%               3,352,093
0x100b9349  962   vpaddw xmm2, xmm0, xmm4                0.0%               2,928,452
0x100b934d  962   mov eax, dword ptr [edi+0x4]           0.0%               1,376,764
0x100b9350  962   vpextrw edi, xmm3, 0x0                 0.0%
0x100b9355  962   vmovdqu xmm7, xmmword ptr [esp+0x80]   0.0%               4,592,970
0x100b935e  962   vmovdqu xmm1, xmmword ptr [esp+0x90]   0.0%

When I generate it with /QxSSE4.1 /QaxAVX, the AVX code path is as follows:

(VTune listing; only the Clockticks columns are shown here, the remaining microarchitecture metrics were almost all zero)

Address     Line  Assembly                               Clockticks: Total  Clockticks: Self
0x100b9594        Block 47:                              0.0%
0x100b9594  962   mov dword ptr [esp+0x8], eax           0.0%               2,106,301
0x100b9598  962   xor edx, edx                           0.0%
0x100b959a  962   mov esi, dword ptr [esp+0x28]          0.0%
0x100b959e        Block 48:                              0.0%
0x100b959e  962   movdqu xmm5, xmmword ptr [ebx]         0.0%               6,178,264
0x100b95a2  962   movdqa xmmword ptr [esp+0xa0], xmm5    1.7%               518,966,716
0x100b95ab  962   movdqu xmm0, xmmword ptr [ebx+0x10]    0.1%               18,849,972
0x100b95b0  962   movdqa xmmword ptr [esp+0xb0], xmm0    0.0%               11,067,064
0x100b95b9  962   test esi, esi                          0.0%               13,519,245
0x100b95bb  962   jz 0x100b96a1 <Block 50>               0.0%               1,146,115
0x100b95c1        Block 49:                              0.0%
0x100b95c1  962   mov edi, dword ptr [esp+0x54]          0.0%               7,793,002
0x100b95c5  962   vzeroupper                             0.0%
0x100b95c8  962   vpaddw xmm3, xmm5, xmm4                0.0%               11,755,160
0x100b95cc  962   vpaddw xmm2, xmm0, xmm4                0.0%               2,202,864
0x100b95d0  962   mov eax, dword ptr [edi+0x4]           0.0%               1,229,030
0x100b95d3  962   vpextrw edi, xmm3, 0x0                 0.0%               4,683,935
0x100b95d8  962   vmovdqa xmm7, xmmword ptr [esp+0x80]   0.0%               5,390,732
0x100b95e1  962   vmovdqa xmm1, xmmword ptr [esp+0x90]   0.0%               4,387,615

Note that I had to insert the _mm256_zeroupper() manually to work around the huge penalty caused by this generated code.
I think in that case it should only generate VEX-encoded instructions...

I'll try to reproduce it, but you really should investigate this.

Best regards


Here is a code portion to reproduce the supposed bug.

#include <immintrin.h>

typedef unsigned short int16_t;

__declspec(noinline)
static bool SomeFunction(int16_t * outputPtr, int16_t const * inputPtr)
{
    for (int x = 0; x < 16; x++)
    {
        __m256 input    = _mm256_loadu_ps((float *)(inputPtr + 16 * x));
        __m128i in_low  = _mm_castps_si128(_mm256_extractf128_ps(input, 0));
        __m128i in_high = _mm_castps_si128(_mm256_extractf128_ps(input, 1));

        __m128i out_low  = _mm_add_epi16(in_low, _mm_set1_epi16(1));
        __m128i out_high = _mm_add_epi16(in_low, _mm_set1_epi16(1));

        __m256 output = _mm256_insertf128_ps(_mm256_insertf128_ps(_mm256_undefined_ps(), _mm_castsi128_ps(out_low), 0), _mm_castsi128_ps(out_high), 1);
        _mm256_store_ps((float *)(outputPtr + 16 * x), output);
    }

    return true;
}

#include <stdlib.h>
#include <stdio.h>
#include <malloc.h>  // _aligned_malloc / _aligned_free (MSVC/ICL)

int main(int argc, char ** argv)
{
    int16_t * dest = (int16_t *)_aligned_malloc(65536*sizeof(int16_t), 32);
    int16_t * src = (int16_t *)_aligned_malloc(65536*sizeof(int16_t), 32);

    // Prevent input optimisations
    for (int i = 0; i < 32768; i++)
        src[i] = rand();

    SomeFunction(dest, src);

    // Prevent output optimisations
    for (int i = 0; i < 32768; i++)
        printf("%d\n", dest[i]);
    _aligned_free(src);
    _aligned_free(dest);
    return 0;
}

Here is the assembly code generated with /QxAVX for "SomeFunction":

01271090  push        esi  
01271091  sub         esp,18h  
01271094  xor         ecx,ecx  
01271096  vmovdqu     xmm0,xmmword ptr [___xi_z+34h (1273120h)]  
0127109E  mov         esi,eax  
012710A0  xor         eax,eax  
012710A2  vmovups     xmm1,xmmword ptr [eax+edx]  
012710A7  inc         ecx  
012710A8  vinsertf128 ymm2,ymm1,xmmword ptr [eax+edx+10h],1  
012710B0  vpaddw      xmm4,xmm2,xmm0  
012710B4  vinsertf128 ymm5,ymm4,xmm4,1  
012710BA  vmovups     ymmword ptr [eax+esi],ymm5  
012710BF  add         eax,20h  
012710C2  cmp         ecx,10h  
012710C5  jl          SomeFunction+12h (12710A2h)  
012710C7  vzeroupper  
012710CA  add         esp,18h  
012710CD  pop         esi  
012710CE  ret  

 

Now with /QxSSE4.1 /QaxAVX (in the AVX code path):

SomeFunction:
008D1090  push        esi  
008D1091  sub         esp,18h  
008D1094  xor         ecx,ecx  
008D1096  movdqa      xmm0,xmmword ptr [___xi_z+34h (8D3120h)]  
008D109E  mov         esi,eax  
008D10A0  xor         eax,eax  
008D10A2  vmovups     ymm1,ymmword ptr [eax+edx]  
008D10A7  inc         ecx  
008D10A8  vpaddw      xmm3,xmm1,xmm0  
008D10AC  vinsertf128 ymm4,ymm3,xmm3,1  
008D10B2  vmovaps     ymmword ptr [eax+esi],ymm4  
008D10B7  add         eax,20h  
008D10BA  cmp         ecx,10h  
008D10BD  jl          SomeFunction+12h (8D10A2h)  
008D10BF  add         esp,18h  
008D10C2  pop         esi  
008D10C3  ret 

There SHOULD NOT be a legacy MOVDQA but a VMOVUPS (or VMOVDQA) when loading the constant "_mm_set1_epi16(1)" into a register.

If I change SomeFunction by removing the for loop (so no hoisting of the _mm_set1_epi16(1) constant is needed), the generated code is correct again:

SomeFunction:
01101090  sub         esp,1Ch  
01101093  vmovups     ymm0,ymmword ptr [edx]  
01101097  vpaddw      xmm2,xmm0,xmmword ptr [___xi_z+34h (1103120h)]  
0110109F  vinsertf128 ymm3,ymm2,xmm2,1  
011010A5  vmovaps     ymmword ptr [eax],ymm3  
011010A9  add         esp,1Ch  
011010AC  ret  

This bug is really annoying, especially since /QaxAVX is mandatory when we want to mix SSE4 and AVX code in the same binary: with current versions of the Intel Compiler it is not possible to compile just one .cpp file with /QxAVX without "border effects" (it pollutes every STL class with AVX, for instance, or any class shared between the two .cpp files).

It took some time to reproduce it, so I hope it will be taken into consideration.

Best Regards

And I almost forgot:

The workaround is to insert _mm256_zeroupper() depending on where the AVX and SSE instructions are emitted (so it is a little bit empirical...).

But really, when using intrinsics with the Intel Compiler, we shouldn't have to add these (plus, they may throw away useful data in the registers that the processor then has to reload).

Hello Emma,

You are right.

As to 'since it's not possible with current versions of Intel Compiler to compile just one cpp file with /QxAVX without "border effects" (since it pollute with AVX every STL classes for instance or any shared classes between the 2 cpp).'

When using -xavx or /arch:AVX ,it is known that "A disadvantage of this method is that it requires access to the relevant source files, so it cannot avoid AVX-SSE transitions resulting from calls to functions that are not compiled with the –xavx or –mavx flag. Another possible disadvantage is that all Intel® SSE code within a file compiled with the –xavx or –mavx flag will be converted to VEX format and will only run on Intel® AVX supported processors."(https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf)

The Intel compiler, when /arch:AVX is set so that AVX intrinsics are supported, generates equivalent AVX-128 code from SSE intrinsics, so there should be no transition penalty. The legacy MOVDQA you observe ("There SHOULD not be a MOVDQA but a VMOVUPS") therefore looks like a failure of the compiler to handle this case. I will investigate this and report an internal bug if it is confirmed.


Thank you.
--
QIAOMIN.Q
Intel Developer Support
Please participate in our redesigned community support web site:

User forums:                   http://software.intel.com/en-us/forums/


Hi,

I find your example code interesting, as I've kind of assumed, possibly wrongly, that it's bad to mix SSE with AVX code.

If I don't have all the functionality I want in AVX, then I use only SSE for the whole piece of code. I never mix, precisely because of issues like the SSE performance penalty when you don't clear the AVX upper registers.

Hi Richard,

I have code mixing floating-point and fixed-point precision, and in that case mixing AVX and SSE offers a very good performance improvement that I cannot lose by moving back to SSE4.1 only.

The problem is that I'm trying to have the same binary contain both an SSE4.1-optimized code path and a hybrid SSE4.1/AVX-optimized code path (because of my customer's requirements), which is tedious (it's not my first problem with the compiler). FYI, the hybrid code is 20% faster, to give you an idea (once, of course, I fix all the penalties).

And this allowed me to notice that this /QxSSE4.1 /QaxAVX combination is very much flawed... (I guess IPO wrongly mixes parts of the code that should be isolated...).

Best Regards

I haven't worked with the multiple-path build. In a single-path build (e.g. /QxCORE-AVX2), ICL translates SSE4 intrinsics automatically to AVX-128 and skips _mm256_zeroupper() when it becomes unnecessary.

gcc doesn't translate to AVX and requires a block of SSE code to be followed by
#ifdef __AVX__
_mm256_zeroupper();
#endif
to maintain performance.

I avoid IPO unless I have a case which benefits from it.

The Intel 15.0 beta compiler and gcc 4.9 make more effective use of shuffles when translating C and Fortran, so there is less need for intrinsics.

Hi Tim,

This issue is specific to the fact that AVX is an alternate code path.

Regards
