AVX intrinsics: strange output with Intel XE 2011

I have noticed in production code that a single _mm256_load_ps often leads to several loads in the ASM dump (indexed addressing for one operand), even when register pressure is not high, i.e. it also happens in 64-bit builds. The simplified (and otherwise meaningless) example below shows the effect well:


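void Strange(float *dst, const float *src, int size)
{
  const __m256 k1 = _mm256_set1_ps(10.0), k2 = _mm256_set1_ps(20.0), k3 = _mm256_set1_ps(30.0), k4 = _mm256_set1_ps(40.0);

  // Loop body restored from the source comments in the compiler listing quoted
  // later in the thread; the k3/k4 half of the sum is inferred from the ASM dump.
  for (int i=0; i<size; i+=8)
  {
    const __m256 x = _mm256_load_ps(src+i);
    _mm256_store_ps(dst+i,_mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(k1,x),_mm256_mul_ps(k2,x)),
                                        _mm256_add_ps(_mm256_mul_ps(k3,x),_mm256_mul_ps(k4,x))));
  }
}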

.B8.3::                         ; Preds .B8.3 .B8.2
        vmulps    ymm5, ymm3, YMMWORD PTR [rdx+rax*4]           ;440.55
        vmulps    ymm6, ymm2, YMMWORD PTR [rdx+rax*4]           ;440.75
        vaddps    ymm0, ymm5, ymm6                              ;440.41
        vmulps    ymm5, ymm4, YMMWORD PTR [rdx+rax*4]           ;441.55
        vmulps    ymm6, ymm1, YMMWORD PTR [rdx+rax*4]           ;441.75
        vaddps    ymm5, ymm5, ymm6                              ;441.41
        vaddps    ymm0, ymm0, ymm5                              ;440.27
        vmovups   YMMWORD PTR [rcx+rax*4], ymm0                 ;440.21
        add       rax, 8                                        ;437.25
        cmp       rax, r8                                       ;437.19
        jl        .B8.3         ; Prob 82%                      ;437.19


I was expecting a single move such as vmovups ymm7, YMMWORD PTR [rdx+rax*4] at the start of the loop, with ymm7 then used 4 times, instead of 4 separate loads. Am I missing something here?

Did you check whether this code actually behaves any differently in terms of execution time?

It might not always be about register pressure; calling conventions and register usage on the different OSes and in 32/64-bit binaries can also play a role.

On Linux with icc version 12.1.0, I get the code you wanted:

..B2.3:                         # Preds ..B2.1 ..B2.3
        vmovups   (%rsi,%rax,4), %ymm6                       
        vmulps    %ymm6, %ymm3, %ymm4                        
        vmulps    %ymm6, %ymm2, %ymm5                        
        vmulps    %ymm6, %ymm1, %ymm7                        
        vmulps    %ymm6, %ymm0, %ymm8                        
        vaddps    %ymm5, %ymm4, %ymm9                        
        vaddps    %ymm8, %ymm7, %ymm10                       
        vaddps    %ymm10, %ymm9, %ymm11                      
        vmovups   %ymm11, (%rdi,%rax,4)                      
        addq      $8, %rax                                   
        cmpq      %rdx, %rax                                 
        jl        ..B2.3         

>Did you check whether this code actually behaves any differently in terms of execution time?

So far I haven't been able to compare the timings with another variant, since the compiler always generates the same ASM code whatever compilation flags I try. If you have an idea of which flag might affect this (Windows version), I'll be really glad to know it. Unfortunately, modifying the ASM dump by hand and feeding it back to the assembler isn't an option, since the syntax isn't correct due to a bug with the labels (the Intel compiler isn't compatible with itself in this respect). So for this test I'll have to resort to pure assembly code or inline assembly, which I have avoided like the plague for several years now.

>On Linux with icc version 12.1.0, I get the code you wanted

Thanks, very interesting. Now that's even stranger: one of the two versions should be better and should be used by both compilers, if you ask me (I don't see how the different ABIs have an impact on this inside a critical loop). Even if the timings are the same, one version should have better code density and better power usage, IMHO.

For the Windows variant, I can imagine that the CPU is smart enough not to reload the data 4 times if the cache line wasn't modified by another thread, but it will still need to probe the L1D cache, which looks less power efficient. I'd love to see this commented on by a CPU designer.
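
A minimal sketch of one way the two forms could be compared from C source, without editing the ASM dump. This is not from the thread: the names kernel_plain, kernel_pinned and time_kernel are hypothetical, and the "pinned" variant relies on a GCC/Clang-style empty extended-asm statement to force the loaded vector into a register. Whether the Windows Intel compiler accepts that syntax is an assumption that would need checking; on compilers without it, both kernels are identical. The sketch uses unaligned loads and stores, assumes size is a multiple of 8, and falls back to scalar code when AVX is not enabled at compile time.

```c
#include <time.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

typedef void (*kernel_fn)(float *dst, const float *src, int size);

/* dst = (10x + 20x) + (30x + 40x), the arithmetic of the thread's example.
   pin != 0 asks the compiler to keep the loaded vector in a register. */
static void kernel_impl(float *dst, const float *src, int size, int pin)
{
    (void)pin; /* unused in the scalar fallback and on non-GNU compilers */
#ifdef __AVX__
    const __m256 k1 = _mm256_set1_ps(10.0f), k2 = _mm256_set1_ps(20.0f),
                 k3 = _mm256_set1_ps(30.0f), k4 = _mm256_set1_ps(40.0f);
    for (int i = 0; i + 8 <= size; i += 8) {
        __m256 x = _mm256_loadu_ps(src + i);
#if defined(__GNUC__)
        /* Assumption: GNU-style extended asm is available. The empty
           statement claims to read and write x in a vector register,
           so the load cannot be folded into each multiply. */
        if (pin)
            __asm__ volatile("" : "+x"(x));
#endif
        _mm256_storeu_ps(dst + i,
            _mm256_add_ps(
                _mm256_add_ps(_mm256_mul_ps(k1, x), _mm256_mul_ps(k2, x)),
                _mm256_add_ps(_mm256_mul_ps(k3, x), _mm256_mul_ps(k4, x))));
    }
#else
    for (int i = 0; i < size; ++i) {
        const float x = src[i];
        dst[i] = (10.0f * x + 20.0f * x) + (30.0f * x + 40.0f * x);
    }
#endif
}

void kernel_plain(float *dst, const float *src, int size)
{
    kernel_impl(dst, src, size, 0);
}

void kernel_pinned(float *dst, const float *src, int size)
{
    kernel_impl(dst, src, size, 1);
}

/* CPU time in seconds for reps calls of fn, measured with clock(). */
double time_kernel(kernel_fn fn, float *dst, const float *src,
                   int size, int reps)
{
    const clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        fn(dst, src, size);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_kernel(kernel_plain, ...) and time_kernel(kernel_pinned, ...) on the same buffers with a large reps value gives a rough comparison. The generated ASM should still be inspected to confirm that the two kernels really differ, since clock() resolution is coarse and the compiler remains free to schedule the code as it sees fit.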

We are drifting away from AVX and into a compiler issue.

On Windows, ICC 12.0 (not the very latest version) gives the following:

;;; 		const __m256 x = _mm256_load_ps(src+i);   

  00036 c5 fc 10 2c 86   vmovups ymm5, YMMWORD PTR [esi+eax*4]  
$LN27:

;;; 		_mm256_store_ps(dst+i,_mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(k1,x),_mm256_mul_ps(k2,x)),  

  0003b c5 dc 59 f5      vmulps ymm6, ymm4, ymm5               
$LN28:
  0003f c5 e4 59 fd      vmulps ymm7, ymm3, ymm5               
$LN29:
  00043 c5 cc 58 c7      vaddps ymm0, ymm6, ymm7               
$LN30:
  00047 c5 ec 59 f5      vmulps ymm6, ymm2, ymm5               
$LN31:
  0004b c5 f4 59 ed      vmulps ymm5, ymm1, ymm5               
$LN32:
  0004f c5 cc 58 fd      vaddps ymm7, ymm6, ymm5               
$LN33:
  00053 c5 fc 58 c7      vaddps ymm0, ymm0, ymm7               
$LN34:
  00057 c5 fc 11 04 81   vmovups YMMWORD PTR [ecx+eax*4], ymm0 
$LN35:
  0005c 83 c0 08         add eax, 8                            
$LN36:
  0005f 3b c2            cmp eax, edx                          
$LN37:
  00061 7c d3            jl .B2.3 ; Prob 82%  

I would encourage you to check the version of the compiler you are currently using.

; mark_description "Intel C++ Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.0.233 Build 20110";
; mark_description "811";
; mark_description "-c -Qvc10 -Qlocation,link,$(VCInstallDir)binx86_amd64 -I..do3d -nologo -W3 -MP -O2 -Ob2 -Oi -Ot -Qip -";
; mark_description "Qftz -D WIN32 -D NDEBUG -D _LIB -D _CRT_SECURE_NO_WARNINGS -D USE_AVX -D PRODUCTIONX -EHs -EHc -MT -GS- -fp:";
; mark_description "fast -Zc:wchar_t -Zc:forScope -Qrestrict -FAs -Fax64AVX -Fox64AVX -Fdx64AVXvc100.pdb -TP -QxAVX";

Above are the top lines of the ASM dump.

VS 2010 About -> Copy Info exports the following:

Microsoft Visual Studio 2010
Version 10.0.40219.1 SP1Rel
Microsoft .NET Framework
Version 4.0.30319 SP1Rel

Installed Version: Professional

Microsoft Office Developer Tools 01018-169-2660007-70637
Microsoft Office Developer Tools

Microsoft Visual Basic 2010 01018-169-2660007-70637
Microsoft Visual Basic 2010

Microsoft Visual C# 2010 01018-169-2660007-70637
Microsoft Visual C# 2010

Microsoft Visual C++ 2010 01018-169-2660007-70637
Microsoft Visual C++ 2010

Microsoft Visual F# 2010 01018-169-2660007-70637
Microsoft Visual F# 2010

Microsoft Visual Studio 2010 Team Explorer 01018-169-2660007-70637
Microsoft Visual Studio 2010 Team Explorer

Microsoft Visual Web Developer 2010 01018-169-2660007-70637
Microsoft Visual Web Developer 2010

Crystal Reports Templates for Microsoft Visual Studio 2010
Crystal Reports Templates for Microsoft Visual Studio 2010

Hotfix for Microsoft Visual Studio 2010 Professional - ENU (KB2522890) KB2522890
This hotfix is for Microsoft Visual Studio 2010 Professional - ENU.
If you later install a more recent service pack, this hotfix will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/2522890.

Hotfix for Microsoft Visual Studio 2010 Professional - ENU (KB2529927) KB2529927
This hotfix is for Microsoft Visual Studio 2010 Professional - ENU.
If you later install a more recent service pack, this hotfix will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/2529927.

Hotfix for Microsoft Visual Studio 2010 Professional - ENU (KB2548139) KB2548139
This hotfix is for Microsoft Visual Studio 2010 Professional - ENU.
If you later install a more recent service pack, this hotfix will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/2548139.

Hotfix for Microsoft Visual Studio 2010 Professional - ENU (KB2549864) KB2549864
This hotfix is for Microsoft Visual Studio 2010 Professional - ENU.
If you later install a more recent service pack, this hotfix will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/2549864.

Hotfix for Microsoft Visual Studio 2010 Professional - ENU (KB2565057) KB2565057
This hotfix is for Microsoft Visual Studio 2010 Professional - ENU.
If you later install a more recent service pack, this hotfix will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/2565057.

Intel C++ Composer XE 2011 Update 6 Package ID: w_ccompxe_2011.6.233
Intel C++ Composer XE 2011 Update 6 Integration for Microsoft Visual Studio* 2010, Version 12.1.1095.2010, Copyright 2002-2011 Intel Corporation
* Other names and brands may be claimed as the property of others

This product includes software developed at The Apache Software Foundation (http://www.apache.org/).

Portions of this software were originally based on the following:
- software copyright (c) 1999, IBM Corporation., http://www.ibm.com.
- software copyright (c) 1999, Sun Microsystems., http://www.sun.com.
- the W3C consortium (http://www.w3c.org) ,
- the SAX project (http://www.saxproject.org)
- voluntary contributions made by Paul Eng on behalf of the Apache Software Foundation that were originally developed at iClick, Inc., software copyright (c) 1999.

This product includes updcrc macro, Satchell Evaluations and Chuck Forsberg. Copyright (C) 1986 Stephen Satchell.

This product includes software developed by the MX4J project (http://mx4j.sourceforge.net).

This product includes ICU 1.8.1 and later.Copyright (c) 1995-2006 International Business Machines Corporation and others.

Portions copyright (c) 1997-2007 Cypress Semiconductor Corporation. All rights reserved.

This product includes XORP. Copyright (c) 2001-2004 International Computer Science Institute

This product includes software from the book "Linux Device Drivers" by Alessandro Rubini and Jonathan Corbet, published by O'Reilly & Associates.

This product includes hashtab.c. Bob Jenkins, 1996.

Microsoft Visual Studio 2010 Professional - ENU Service Pack 1 (KB983509) KB983509
This service pack is for Microsoft Visual Studio 2010 Professional - ENU.
If you later install a more recent service pack, this service pack will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/983509.

Microsoft Visual Studio 2010 SharePoint Developer Tools 10.0.40219
Microsoft Visual Studio 2010 SharePoint Developer Tools

Thank you, I'm in contact with our compiler team about the case.

FYI, I just tested with the latest release of the Intel XE 2011 (Intel C++ Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.2.278 Build 20111) and I got exactly the same ASM. Out of curiosity I tried the /QxCORE-AVX2 option to use the FMA instructions, and there is also a lot of indexed addressing. I suppose it is in fact an optimization, but I'll be very interested to learn how the CPU deals with this.

.B15.3::                        ; Preds .B15.1 .B15.3
        vmulps    ymm4, ymm1, YMMWORD PTR [rdx+rax*4]           ;41.75
        vmulps    ymm5, ymm0, YMMWORD PTR [rdx+rax*4]           ;42.75
        vfmadd231ps ymm4, ymm2, YMMWORD PTR [rdx+rax*4]         ;41.41
        vfmadd231ps ymm5, ymm3, YMMWORD PTR [rdx+rax*4]         ;42.41
        vaddps    ymm4, ymm4, ymm5                              ;41.27
        vmovups   YMMWORD PTR [rcx+rax*4], ymm4                 ;41.21
        add       rax, 8                                        ;38.25
        cmp       rax, r8                                       ;38.19
        jl        .B15.3        ; Prob 82%                      ;38.19
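
For reference, a hedged sketch of how the shape of the listing above (two multiplies, two fused multiply-adds, one final add) could be written explicitly with intrinsics. The compiler in the thread produced the FMA form on its own from the original source; the function name strange_fma is hypothetical, the sketch uses unaligned loads and stores, assumes size is a multiple of 8, and falls back to scalar code when AVX and FMA are not both enabled at compile time.

```c
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* dst = (10x + 20x) + (30x + 40x), with each inner sum expressed as
   one multiply feeding one fused multiply-add. */
void strange_fma(float *dst, const float *src, int size)
{
#if defined(__AVX__) && defined(__FMA__)
    const __m256 k1 = _mm256_set1_ps(10.0f), k2 = _mm256_set1_ps(20.0f),
                 k3 = _mm256_set1_ps(30.0f), k4 = _mm256_set1_ps(40.0f);
    for (int i = 0; i + 8 <= size; i += 8) {
        const __m256 x = _mm256_loadu_ps(src + i);
        /* _mm256_fmadd_ps(a, b, c) computes a*b + c with one rounding. */
        const __m256 lo = _mm256_fmadd_ps(k1, x, _mm256_mul_ps(k2, x));
        const __m256 hi = _mm256_fmadd_ps(k3, x, _mm256_mul_ps(k4, x));
        _mm256_storeu_ps(dst + i, _mm256_add_ps(lo, hi));
    }
#else
    /* Scalar fallback with the same association of terms. */
    for (int i = 0; i < size; ++i) {
        const float x = src[i];
        dst[i] = (10.0f * x + 20.0f * x) + (30.0f * x + 40.0f * x);
    }
#endif
}
```

Because a fused multiply-add rounds once instead of twice, results can differ in the last bit from the plain multiply/add version for general inputs. Whether x is kept in a register or re-read from memory for each operation is still decided by the compiler.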
