<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Wed, 25 Nov 2009 06:53:40 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/feed" rel="self" type="application/rss+xml" />
    <title>Intel Software Network - <![CDATA[ Why "subq" as allocate by ICC-v10.0 but not as prologue, but ICC-v11.0 uses "pushq" as prologue? ]]> feed</title>
    <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Re: Why &quot;subq&quot; as allocate by ICC-v10.0 but not as prologue, but ICC-v11.0 uses &quot;pushq&quot; as prologue?</title>
      <description><![CDATA[ <div style="margin:0px;"></div>
<span style="font-size: 10.0pt; font-family: &quot;Verdana&quot;,&quot;sans-serif&quot;; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: &quot;Times New Roman&quot;; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA;">
<div>
<p class="MsoListParagraph" style="text-indent: -.25in; mso-list: l0 level1 lfo1;"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; mso-fareast-font-family: Tahoma; color: #1f497d;"><span style="mso-list: Ignore;">(1)<span style="font: 7.0pt &quot;Times New Roman&quot;;">  </span></span></span><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;">Version 11.0 of ICC generates a frame for main that is aligned to 128 bytes. This is intentional and is done for performance reasons, although it brings no benefit in this specific example. This is where all the uses of %rbp come from: it holds the value of %rsp from before the alignment. If the extra alignment is confusing, please move your code into a function other than “main”. Compare the prologue for “main” generated by 10.1:</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma;">main:</span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;;">        subq      $4198408, %rsp                           </span></p>
<p class="MsoNormal" style="text-indent: .5in;"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;">vs 11.0 prologue:</span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;"> <span style="color: #000000; font-family: Tahoma;">main:</span></span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;;">        pushq     %rbp                                        </span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;;">        movq      %rsp, %rbp                              </span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;;">        andq      $-128, %rsp                           </span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;;">        subq      $4198400, %rsp                      </span></p>
<p class="MsoNormal"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;"> </span></p>
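<p class="MsoNormal"><span style="font-family: Tahoma;">To make the arithmetic of the 11.0 prologue concrete, here is a minimal C sketch that models the four instructions above on plain integers. The function names and the starting stack pointer value used with it are hypothetical and chosen only for illustration; the sketch models the address computation, not what ICC does internally.</span></p>

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of alignment (which must be a power of two).
   For alignment 128 this is the same mask that "andq $-128, %rsp" applies,
   because ~(128 - 1) equals -128 in two's complement. */
uint64_t align_down(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}

/* Model of the 11.0 prologue for main (hypothetical helper).
   rsp        : stack pointer value on entry
   frame_size : bytes reserved by the final subq
   rbp_out    : receives the value kept in %rbp (the pre-alignment %rsp)
   Returns the stack pointer after the prologue. */
uint64_t model_prologue(uint64_t rsp, uint64_t frame_size, uint64_t *rbp_out)
{
    rsp -= 8;                     /* pushq %rbp         : save the caller's %rbp   */
    *rbp_out = rsp;               /* movq  %rsp, %rbp   : remember unaligned %rsp  */
    rsp = align_down(rsp, 128);   /* andq  $-128, %rsp  : align to 128 bytes       */
    rsp -= frame_size;            /* subq  $4198400, %rsp : reserve the frame      */
    return rsp;
}
```

<p class="MsoNormal"><span style="font-family: Tahoma;">Since 4198400 is itself a multiple of 128, the stack pointer stays 128-byte aligned after the final subtraction. The amount removed by the andq depends on the incoming %rsp, which is why the original value has to be kept in %rbp so that the epilogue can restore it.</span></p>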
<p class="MsoListParagraph" style="text-indent: -.25in; mso-list: l0 level1 lfo1;"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; mso-fareast-font-family: Tahoma; color: #1f497d;"><span style="mso-list: Ignore;">(2)<span style="font: 7.0pt &quot;Times New Roman&quot;;">  </span></span></span><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;">The “-fno-builtin” option tells the compiler not to expand intrinsic functions inline. The code in the example doesn’t use any intrinsics, which might suggest that the option should have no effect here. Interestingly, it does affect how the pattern of setting memory to zero is recognized. This may be considered a "bug" in the sense that the behavior is not what one would expect, but I don't feel too strongly about it. What matters is that both variants are correct with respect to the semantics of the option, and both look reasonable in terms of performance. It is generally true that “-fno-builtin” results in smaller though slower code, but it is incorrect to assume that the code will always be smaller. If you really care about code size vs. performance, please use the “-Os” option.</span></p>
<p class="MsoListParagraph"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;"> </span></p>
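<p class="MsoNormal"><span style="font-family: Tahoma;">The code from the original question is not reproduced in this reply, so the following is only a hypothetical C sketch of the kind of zeroing pattern being discussed. One function clears a buffer with a hand-written loop, the other calls memset explicitly. Compiling both to assembly (for example with -S), once with and once without “-fno-builtin”, is one way to see whether and how the loop is recognized; what the compiler emits depends on its version and options.</span></p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written zeroing loop. An optimizer may recognize this as a
   "set memory to zero" pattern and handle it specially. */
void zero_loop(char *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = 0;
}

/* The same effect expressed as an explicit library call. */
void zero_memset(char *buf, size_t n)
{
    memset(buf, 0, n);
}

/* Returns 1 if the first n bytes of buf are all zero, 0 otherwise. */
int all_zero(const char *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; ++i)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

<p class="MsoNormal"><span style="font-family: Tahoma;">Both functions leave the buffer in the same state, which is the sense in which either code variant is correct; only the generated instructions differ.</span></p>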
<p class="MsoListParagraph" style="text-indent: -.25in; mso-list: l0 level1 lfo1;"><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; mso-fareast-font-family: Tahoma; color: #1f497d;"><span style="mso-list: Ignore;">(3)<span style="font: 7.0pt &quot;Times New Roman&quot;;">  </span></span></span><span style="font-family: &quot;Tahoma&quot;,&quot;sans-serif&quot;; color: #1f497d;">The “pushq %rsi” that you refer to in (d) has nothing to do with parameter passing for the routine “__sti__$E”. It is just an easy way to adjust the stack pointer by 8 bytes so that it is properly aligned to a 16-byte boundary at the subsequent call, as the x86-64 ABI requires. The “popq %rcx” is its counterpart in the function epilogue.</span></p>
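<p class="MsoNormal"><span style="font-family: Tahoma;">The alignment argument in (3) can be checked with simple arithmetic. The C sketch below models the stack pointer as an integer; the helper names and the sample address used with them are hypothetical. It assumes, as the x86-64 ABI requires, that %rsp is 16-byte aligned at the point of a call, so that the 8-byte return address leaves it misaligned on entry to the callee.</span></p>

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if rsp is a multiple of 16, 0 otherwise. */
int is_aligned16(uint64_t rsp)
{
    return (rsp & 15) == 0;
}

/* Stack pointer on entry to a callee: the call instruction pushes an
   8-byte return address onto a stack that was 16-byte aligned. */
uint64_t rsp_after_call(uint64_t rsp_at_call)
{
    assert(is_aligned16(rsp_at_call));
    return rsp_at_call - 8;
}

/* Effect of any 8-byte push such as "pushq %rsi". Which register is
   pushed is irrelevant for alignment; only the 8-byte adjustment matters. */
uint64_t rsp_after_push(uint64_t rsp)
{
    return rsp - 8;
}

/* Effect of the matching 8-byte pop such as "popq %rcx". */
uint64_t rsp_after_pop(uint64_t rsp)
{
    return rsp + 8;
}
```

<p class="MsoNormal"><span style="font-family: Tahoma;">On entry the stack pointer is 8 bytes short of a 16-byte boundary; one push restores the alignment before the next call, and the pop in the epilogue undoes it before the return.</span></p>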
<p class="MsoListParagraph" style="text-indent: -.25in; mso-list: l0 level1 lfo1;"><span style="color: #1f497d; font-family: Tahoma;">(4) I cannot reproduce the behavior you describe with regard to MOVNTDQ, and your asm snippets don't contain any MOVNTDQ. Maybe you have more details that you haven't shared?</span></p>
<p class="MsoListParagraph" style="text-indent: -.25in; mso-list: l0 level1 lfo1;"><span style="color: #1f497d; font-family: Tahoma;">Regards,</span></p>
<p class="MsoListParagraph" style="text-indent: -.25in; mso-list: l0 level1 lfo1;"><span style="color: #1f497d; font-family: Tahoma;">-Sergey</span></p>
</div>
</span> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/</link>
      <pubDate>Tue, 27 Jan 2009 04:33:53 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Why &quot;subq&quot; as allocate by ICC-v10.0 but not as prologue, but ICC-v11.0 uses &quot;pushq&quot; as prologue?</title>
      <description><![CDATA[ <div style="margin:0px;"></div>
Hello,<br /><br />Thanks for your valuable input.<br /><br />I have a new query, similar to the one above:<br /><br />(a) In a file of a multi-file C++ package, when I apply vectorization pragmas to a section of code, the generated asm starts and ends as follows:<br />{<br /> 44d960:            55                                  push   %rbp<br /> 44d961:            48 83 ec 50                   sub    $0x50,%rsp<br /> 44d965:            49 89 f0                         mov    %rsi,%r8<br /> 44d968:            4c 63 c9                        movslq %ecx,%r9<br />...<br /><br />...<br /> 44dc84:            48 83 c4 50                  add    $0x50,%rsp<br /> 44dc88:            5d                                 pop    %rbp<br /> 44dc89:            c3                                 retq<br /> 44dc8a:            90                                 nop<br /> 44dc8b:            48 8d 74 26 00             lea    0x0(%rsi),%rsi<br />}<br /><br />(b) But for the same code without any pragmas, the asm starts and ends as follows:<br />{<br /> 44d960:           48 83 ec 68                  sub    $0x68,%rsp<br /> 44d964:           49 89 f9                        mov    %rdi,%r9<br /> 44d967:           49 89 d0                       mov    %rdx,%r8<br /> 44d96a:           4c 63 d1                       movslq %ecx,%r10<br /> ..<br /> ..<br /> ..<br /> 44dc4e:           48 83 c4 68                   add    $0x68,%rsp<br /> 44dc52:           c3                                  retq<br /> 44dc53:           90                                  nop<br /> 44dc54:           48 8d 74 26 00             lea    0x0(%rsi),%rsi<br /> 44dc59:           48 8d bf 00 00 00 00     lea    0x0(%rdi),%rdi<br />}<br />---<br /><br />Query:<br />(1) What explains the PUSH/POP pair that appears with the vectorization pragmas but not without them?<br /><br />(2) Without the pragmas, the asm in (b) ends with two "lea" instructions and starts with sub, mov, mov &amp; movslq, unlike the version with the pragmas. Why do the pragmas make such a difference?<br /><br />Sorry, I didn't think of creating a new thread.<br /><br />~BR ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/</link>
      <pubDate>Fri, 24 Apr 2009 02:29:46 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Why &quot;subq&quot; as allocate by ICC-v10.0 but not as prologue, but ICC-v11.0 uses &quot;pushq&quot; as prologue?</title>
      <description><![CDATA[ <div style="margin:0px;">
</div>
<br />
<div>The prolog/epilog sequences (i.e. what you refer to as "starting and ending asm") can indeed be very sensitive to the other code in the routine &amp; to the level of optimization applied. Two simple causes are different register pressure and different alignment constraints for the local variables. You should not, however, expect to be able to reason about why one sequence or another is chosen: it is completely implementation dependent. You can always make an educated guess, but it would be incorrect to make any assumptions based on such a guess.</div>
<div><br /></div>
<div>BTW, the "lea" instructions you are seeing at the end are just NOPs that are never executed and are there for code alignment purposes only.</div>
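<div>The padding can be checked against the addresses in the two listings from the question. The small C sketch below computes how many bytes are needed to reach the next aligned address. The function name is hypothetical, and the sketch assumes that the next routine starts on a 16-byte boundary, which the listings suggest but do not show.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes needed to advance addr to the next multiple of
   alignment (which must be a power of two). Returns 0 if addr is already
   aligned. */
uint64_t padding_to(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (addr & (alignment - 1))) & (alignment - 1);
}
```

<div>In listing (a) the byte after retq is at 0x44dc8a, and padding_to(0x44dc8a, 16) is 6, which matches the 1-byte nop plus the 5-byte lea. In listing (b) the byte after retq is at 0x44dc53, and padding_to(0x44dc53, 16) is 13, which matches the 1-byte nop plus the 5-byte and 7-byte lea instructions. That is why (b) shows two lea instructions and (a) only one.</div>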
<div><br /></div> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/</link>
      <pubDate>Mon, 02 Nov 2009 22:18:24 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/63072/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
  </channel></rss>