<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Michael Stoner (Intel)</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/michael-stoner/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Restructuring loops for LAME mp3 high-pass filter</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/10/restructuring-loops-for-lame-mp3-high-pass-filter/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/10/restructuring-loops-for-lame-mp3-high-pass-filter/#comments</comments>
		<pubDate>Thu, 10 Sep 2009 20:01:43 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Software Tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/10/restructuring-loops-for-lame-mp3-high-pass-filter/</guid>
		<description><![CDATA[Here’s another quick performance tip for LAME mp3 encoding.  This nested loop in the function ‘L3psycho_anal_ns’ is a hotspot for constant bit-rate encoding:         for (i = 0; i &#60; 576; i++)         {             FLOAT   sum1, sum2;             sum1 = firbuf[i + 10];             sum2 = 0.0;             for (j = 0; j &#60; ((NSFIRLEN - 1) [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal" style="none"><span style="yes;">Here’s another quick performance tip for LAME mp3 encoding.<span style="yes">  </span>This nested loop in the function ‘</span><span style="ZH-TW;">L3psycho_anal_ns’ is a hotspot for constant bit-rate encoding:</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"> </span><span style="ZH-TW;"><span style="yes">  </span><span style="yes">     </span></span>for (i = 0; i &lt; 576; i++)</span></p>
<p class="MsoNormal" style="none;"><span style="ZH-TW;"><span style="ZH-TW;"> </span><span style="ZH-TW;"><span style="yes">  </span><span style="yes">     </span></span>{</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span>FLOAT<span style="yes">   </span>sum1, sum2;</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span>sum1 = firbuf[i + 10];</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span>sum2 = 0.0;</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span></span><span style="ZH-TW;">for (j = 0; j &lt; ((NSFIRLEN - 1) / 2) - 1; j += 2) </span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span></span></span></span>{</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            <span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span></span></span></span></span></span></span>sum1 += fircoef[j] * (firbuf[i + j] + firbuf[i + NSFIRLEN - j]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            <span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span></span></span></span></span></span></span>sum2 += fircoef[j + 1] * (firbuf[i + j + 1] + firbuf[i + NSFIRLEN - j - 1]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="yes"><span style="ZH-TW;"><span style="yes"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes">            </span></span></span></span></span></span>}</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;"><span style="yes"><span style="ZH-TW;"><span style="yes">            </span></span></span></span></span><span style="ZH-TW;">ns_hpfsmpl[chn][i] = sum1 + sum2;</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="ZH-TW;">        </span>}</span><span style="ZH-TW;"> </span></p>
<div></div>
<p><span style="ZH-TW;"></p>
<p class="MsoNormal" style="none;"><span style="yes;">I'm guessing someone tried to optimize the inner loop by unrolling and computing two parallel sums, adding them together for the final sum outside the loop.<span style="yes;">  </span>The idea may have been to break the dependency chain on a single register which accumulates the sum.<span style="yes;">  </span>That could help on a machine that can issue two loads per cycle, where the multi-cycle latency of each add in the summing chain would be exposed.<span style="yes;">  </span>However, current Intel CPUs can only do one load per cycle, so the code will remain bound by the three loads required for each term of the sum, regardless of the unrolling.</span></p>
<p class="MsoNormal" style="none">Note that NSFIRLEN is a predefined macro equal to 21.<span style="yes">  </span>Since the inner loop count is therefore a constant, we can unroll it by hand:</p>
<p></span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"> </span><span style="ZH-TW;"><span style="yes">  </span><span style="yes">     </span><span style="#0000ff;">for</span> (i = 0; i &lt; 576; i++)</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;">        {</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="yes">            </span>FLOAT<span style="yes">   </span>sum1;</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="yes">            </span>sum1 = firbuf[i + 10];</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="2">            </span>sum1 += fircoef[0] * (firbuf[i + 0] + firbuf[i + NSFIRLEN - 0]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[1] * (firbuf[i + 1] + firbuf[i + NSFIRLEN - 1]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[2] * (firbuf[i + 2] + firbuf[i + NSFIRLEN - 2]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[3] * (firbuf[i + 3] + firbuf[i + NSFIRLEN - 3]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[4] * (firbuf[i + 4] + firbuf[i + NSFIRLEN - 4]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[5] * (firbuf[i + 5] + firbuf[i + NSFIRLEN - 5]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[6] * (firbuf[i + 6] + firbuf[i + NSFIRLEN - 6]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[7] * (firbuf[i + 7] + firbuf[i + NSFIRLEN - 7]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[8] * (firbuf[i + 8] + firbuf[i + NSFIRLEN - 8]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="3">            </span>sum1 += fircoef[9] * (firbuf[i + 9] + firbuf[i + NSFIRLEN - 9]);</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"> </span><span style="ZH-TW;"><span style="yes">           </span>ns_hpfsmpl[chn][i] = sum1;</span></p>
<p class="MsoNormal" style="none"><span style="ZH-TW;"><span style="yes">        </span>}</span></p>
<p class="MsoNormal" style="none"><span style="yes;">This allows the Intel Compiler to vectorize with packed SSE instructions around the outer loop and overall leads to a 3% reduction in encode time.<span style="yes">  </span></span></p>
<p class="MsoNormal" style="none"><span style="yes;">Since SSE operates on four single-precision float elements held in 128-bit registers, the compiler achieves vectorization by unrolling the loop by four and computing four sums line-by-line.<span style="yes">  </span>Vectorization around the original inner loop is possible but not as efficient since it has a small trip count (10) and requires horizontal summing of each result.</span></p>
<p class="MsoNormal" style="none"><span style="yes;">The outer-loop approach also takes advantage of the PALIGNR instruction to form operands for each array offset in registers rather than doing a bunch of unaligned 16-byte loads from memory.<span style="yes">  </span>It ends up being over 200 lines of assembly code for the whole loop... really makes you appreciate having a smart compiler!</span></p>
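<p class="MsoNormal" style="none">To make the outer-loop idea concrete, here is a minimal hand-written sketch of it. It is not the compiler's output: it uses plain unaligned loads instead of PALIGNR, the function names <code>hpf_scalar</code> and <code>hpf_outer_simd</code> are made up for illustration, and it assumes <code>firbuf</code> holds at least 576 + NSFIRLEN floats. Each SSE lane carries the running sum for one of four consecutive values of i, so no horizontal add is needed at the end.</p>

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

#define NSFIRLEN 21   /* value stated in the post */
#define NSAMPLES 576  /* outer-loop trip count from the post */

/* Scalar reference: the hand-unrolled body, written as a ten-term loop. */
static void hpf_scalar(const float *firbuf, const float *fircoef, float *out)
{
    for (int i = 0; i < NSAMPLES; i++) {
        float sum = firbuf[i + 10];
        for (int j = 0; j < 10; j++)
            sum += fircoef[j] * (firbuf[i + j] + firbuf[i + NSFIRLEN - j]);
        out[i] = sum;
    }
}

/* Outer-loop vectorization: four output samples are computed per pass,
 * one per SSE lane. Falls back to the scalar code when SSE is unavailable. */
static void hpf_outer_simd(const float *firbuf, const float *fircoef, float *out)
{
#if defined(__SSE__)
    for (int i = 0; i < NSAMPLES; i += 4) {   /* 576 is a multiple of 4 */
        __m128 sum = _mm_loadu_ps(&firbuf[i + 10]);
        for (int j = 0; j < 10; j++) {
            __m128 lo = _mm_loadu_ps(&firbuf[i + j]);
            __m128 hi = _mm_loadu_ps(&firbuf[i + NSFIRLEN - j]);
            __m128 c  = _mm_set1_ps(fircoef[j]);   /* broadcast one coefficient */
            sum = _mm_add_ps(sum, _mm_mul_ps(c, _mm_add_ps(lo, hi)));
        }
        _mm_storeu_ps(&out[i], sum);
    }
#else
    hpf_scalar(firbuf, fircoef, out);
#endif
}
```

<p class="MsoNormal" style="none">Both functions take the same arguments, so they can be run side by side on the same buffer to check that the vector version reproduces the scalar results before timing either one.</p>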
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/10/restructuring-loops-for-lame-mp3-high-pass-filter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using SSE4.1 for mp3 encoding quantization</title>
		<link>http://software.intel.com/en-us/blogs/2009/01/07/using-sse41-for-mp3-encoding-quantization/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/01/07/using-sse41-for-mp3-encoding-quantization/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 23:16:35 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[SSE4.1 LAME mp3 encoding]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/01/07/using-sse41-for-mp3-encoding-quantization/</guid>
		<description><![CDATA[In this post I'd like to promote the new SSE 4.1 instruction set extension as it relates to the quantization loop I wrote about a few months ago. As you may recall, the modified code from ‘quantize_xrpow_lines’ looked like this: for(i=0; i &#60; l; i++)    {       float x0 = xr[i] * istep;    [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal" style="0in 0in 0pt;">In this post I'd like to promote the new SSE 4.1 instruction set extension as it relates to the quantization loop I wrote about a few months ago. As you may recall, the modified code from ‘quantize_xrpow_lines’ looked like this:</p>
<p>  for(i=0; i &lt; l; i++)<br />
   {<br />
      float x0 = xr[i] * istep;<br />
      int rx0 = (int)x0;<br />
      x0 += adj43[rx0];<br />
      ix[i] = (int)x0;<br />
   }</p>
<p>The function consumes about 10% of the run-time for constant bit-rate encoding. I had enabled compiler vectorization using "#pragma vector always" but the resulting SSE2 code didn't make a dent in the overall encode time.</p>
<p>The critical bottleneck with this loop is the "gather" or table-lookup sequence which demands individual reads from the adj43 array for each result. Our instruction set currently does not provide a method to execute such parallel loads, so each rx0 element must be extracted into a general purpose register (eax, ebx, etc.) to set up the loads and likewise each float value retrieved as adj43[rx0] must be packed back into an xmm register. The SSE2 form of this algorithm needs 22 total instructions to produce four results in one register.</p>
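<p>The gather pattern described above can be sketched with SSE2 intrinsics as follows. This is an illustration only, not the 22-instruction sequence counted in this post: it spills the four indices to a small array and lets the compiler choose how to move them, the function names are hypothetical, and it assumes every truncated index is a valid position in <code>adj43</code>.</p>

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar form of the quantization loop quoted in the post. */
static void quantize_scalar(const float *xr, int *ix, int n,
                            float istep, const float *adj43)
{
    for (int i = 0; i < n; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;
        x0 += adj43[rx0];
        ix[i] = (int)x0;
    }
}

/* SSE2 sketch: the multiply, truncate and add are packed, but the table
 * lookup still needs four separate scalar loads per vector of results. */
static void quantize_sse2(const float *xr, int *ix, int n,
                          float istep, const float *adj43)
{
    int i = 0;
#if defined(__SSE2__)
    const __m128 istep4 = _mm_set1_ps(istep);
    for (; i + 4 <= n; i += 4) {
        __m128 x4 = _mm_mul_ps(_mm_loadu_ps(&xr[i]), istep4);
        __m128i rx4 = _mm_cvttps_epi32(x4);      /* truncating convert */
        int idx[4];
        _mm_storeu_si128((__m128i *)idx, rx4);   /* move indices out of the xmm register */
        /* gather: one scalar load per lane, then repack (lane 0 is the last argument) */
        __m128 adj = _mm_set_ps(adj43[idx[3]], adj43[idx[2]],
                                adj43[idx[1]], adj43[idx[0]]);
        x4 = _mm_add_ps(x4, adj);
        _mm_storeu_si128((__m128i *)&ix[i], _mm_cvttps_epi32(x4));
    }
#endif
    /* remaining elements (or everything, without SSE2) */
    quantize_scalar(xr + i, ix + i, n - i, istep, adj43);
}
```

<p>The arithmetic vectorizes cleanly; it is the index-out, value-in traffic around the <code>adj43</code> lookups that keeps the instruction count high.</p>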
<p>My first crack at an SSE 4.1 version used the PEXTRD and INSERTPS instructions to tighten up the sequence in moving the data to and from the GPRs. This shortened the whole loop to 14 instructions, but oddly didn't improve performance by any measure. Breaking the function out into a micro-kernel testing jig actually showed that this version was slightly slower than the SSE2 code on a Nehalem 3.2 GHz system.</p>
<p>So what happened? I understood this better after digging into the micro-op sequences used in decoding each instruction. Micro-ops are the RISC-like elements that combine to execute each x86 instruction inside the CPU execution pipeline. Naturally, I wanted to choose instructions that decode into shorter micro-op flows so the CPU could get by doing less work. Our CPU definition files showed that PEXTRD and INSERTPS decode to 2 and 3 micro-ops respectively. So, while I reduced instruction count at the externally-visible level, my SSE4.1 code actually increased the number of micro-ops from 25 to 26 for the whole loop.</p>
<p>Curiously, while I was inspecting the micro-op flows I noticed that EXTRACTPS and PINSRD each used one fewer micro-op than their counterparts noted above. The difference between PEXTRD and EXTRACTPS is mainly conceptual, based on the data type: PEXTRD is for extracting 32-bit integers (DWORDs) and EXTRACTPS is for 32-bit floats (Packed Single-precision data). I intuitively chose PEXTRD and INSERTPS because the algorithm was extracting DWORD array indices and inserting floats. So why the extra micro-ops? In the extract case, I am not sure... However, the INSERTPS instruction does offer the added ability to select the input operand from another xmm register, so that explains why it uses one more shuffle micro-op than PINSRD.</p>
<p>Another factor to consider is port bindings. Each micro-op can only be dispatched on a certain subset of the ports (0-5 on Nehalem) and each port is limited to dispatching one operation per cycle (try Googling "Nehalem block diagram" for a good visual). My initial SSE4.1 loop was bound by having seven micro-ops that map to Port 0, so it had an ideal throughput of 7 cycles. To figure this out I listed all 26 micro-ops and tried to arrange the port mappings so they would be spread as evenly as possible over all ports. It sounds complex but some people here can do it in their heads... for almost any piece of code... on any CPU from the 1995 Pentium Pro out to what we are designing for 2013. (In a similar fashion I often proof-read these blog posts over and over trying to figure out how I can reduce the number of words... but I'm not going to do that today.)</p>
<p>After all that I rewrote my algorithm using the combination of EXTRACTPS and PINSRD and saw a 2% speedup in encode time. Not much you might think, but consider I only used two new instructions. The selection went against the data type in each case but nonetheless reduced the micro-op count from 26 to 20. Also the throughput improved to 5 cycles/result, bound on ports 1 and 2.</p>
<p>Here is my intrinsics implementation, a bit ugly for the need to use casts to match the data types:</p>
<p><span style="#333399;">   int i, t0, t1, t2, t3;<br />
   __m128 x4h, xr4, x4, istep4 = _mm_set1_ps(istep);<br />
   __m128i rx4;</span></p>
<p><span style="#333399;">   for(i=0; i &lt; l; i+=4)<br />
   {<br />
      xr4 = _mm_load_ps(&amp;xr[i]);<br />
      xr4 = _mm_mul_ps(xr4, istep4);<br />
      rx4 = _mm_cvttps_epi32(xr4);</span></p>
<p><span style="#333399;">      t0 = _mm_cvtsi128_si32(rx4);<br />
      t1 = _mm_extract_ps(_mm_castsi128_ps(rx4), 1);<br />
      t2 = _mm_extract_ps(_mm_castsi128_ps(rx4), 2);<br />
      t3 = _mm_extract_ps(_mm_castsi128_ps(rx4), 3);</span></p>
<p><span style="#333399;">      x4 = _mm_load_ss(&amp;adj43[t0]);<br />
      x4 = _mm_castsi128_ps(_mm_insert_epi32(_mm_castps_si128(x4), *(int*)&amp;adj43[t1], 1));<br />
      x4 = _mm_castsi128_ps(_mm_insert_epi32(_mm_castps_si128(x4), *(int*)&amp;adj43[t2], 2));<br />
      x4 = _mm_castsi128_ps(_mm_insert_epi32(_mm_castps_si128(x4), *(int*)&amp;adj43[t3], 3));</span></p>
<p><span style="#333399;">      xr4 = _mm_add_ps(xr4, x4);<br />
      rx4 = _mm_cvttps_epi32(xr4);<br />
      _mm_store_si128((__m128i*)&amp;ix[i], rx4);</span></p>
<p><span style="#333399;">   }</span></p>
<p>Alternatively I have put in a request for the Intel Compiler to emit similar code when vectorizing with the /QxS switch (targeting SSE4.1 code generation). This could appear in the 11.0 release but no guarantees as yet.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/01/07/using-sse41-for-mp3-encoding-quantization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Another tip for faster mp3 encoding</title>
		<link>http://software.intel.com/en-us/blogs/2008/10/31/another-tip-for-faster-mp3-encoding/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/10/31/another-tip-for-faster-mp3-encoding/#comments</comments>
		<pubDate>Sat, 01 Nov 2008 00:05:46 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/10/31/another-tip-for-faster-mp3-encoding/</guid>
		<description><![CDATA[In this entry I want to highlight a loop in the ‘count_bits’ function which yielded a 1.15x app-level gain when we coaxed it to vectorize with the Intel Compiler.  After disabling Takehiro’s float-to-int hack, this was the top hotspot in our constant bit-rate encoding workload:  for (l = -width; l &#60; 0; l++)             [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal"><span>In this entry I want to highlight a loop in the ‘count_bits’ function which yielded a 1.15x app-level gain when we coaxed it to vectorize with the Intel Compiler.<span>  </span>After disabling Takehiro’s float-to-int hack, this was the top hotspot in our constant bit-rate encoding workload:</span><span> </span></p>
<p class="MsoNormal"><span>for</span><span> (l = -width; l &lt; 0; l++)</span></p>
<p class="MsoNormal"><span><span>      </span><span>      </span><span>if</span> (xr[j + l] &lt; roundfac)</span></p>
<p class="MsoNormal"><span><span> </span><span>                 </span>ix[j + l] = 0;</span></p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"><span>The loop exhibits no data dependencies between iterations and contains no unvectorizable operations.<span>  </span>The trip count (width) is occasionally as small as ‘4’ but often much larger.<span>  </span>However, the compiler by default would not emit SIMD code here, for a few different reasons:</span></p>
<p class="MsoNormal"><span> </span><span><span><span><span>1)</span><span>      </span></span></span><span dir="ltr"><span>Because the array arguments ‘xr’ and ‘ix’ are passed into the function, the compiler cannot assume their address ranges don’t overlap.<span>  </span>If they did, vectorizing the code could lead to incorrect results.<span>  </span>This is commonly known as ‘aliasing’ and can be circumvented either by adding “#pragma ivdep” above the for statement, or compiling with /Oa or /Ow switches.</span></span></span></p>
<p class="MsoNormal"><span><span><span><span>2)</span><span>      </span></span></span><span dir="ltr"><span>With that fixed, we get a report that the loop was not vectorized due to an “unsupported loop structure”.<span>  </span>Generally, the compiler likes simplicity, so we tried recoding the loop with single-operand indices, as such:</span></span></span></p>
<p class="MsoNormal"><span> </span><span>for</span><span> (k = j, j += width; k &lt; j; ++k)</span></p>
<p class="MsoNormal"><span><span>            </span><span>if</span> (xr[k] &lt; roundfac)</span></p>
<p class="MsoNormal"><span><span>                  </span>ix[k] = 0;</span></p>
<p class="MsoNormal"><span><span>                      </span></span></p>
<p class="MsoNormal"><span><span><span><span>3)</span><span>      </span></span></span><span dir="ltr"><span>That left us with one remaining issue, a bit more esoteric this time – the compiler reporting “loop not vectorized:<span>  </span>condition may protect exception”.<span>  </span>So it fears that the address range dereferenced by ‘ix’ may not be continuous?<span>  </span>That seems unlikely, but fair enough… A SIMD scheme would typically write to every address in the range, either with the existing value or zero, based on a mask generated by the conditional operation. </span></span></span></p>
<p class="MsoNormal"><span>The solution here required some input from compiler engineering.<span>  </span>Eventually we arrived at the loop below, to alleviate concerns about writing through the range of ‘ix’:</span></p>
<p class="MsoNormal"><span> </span><span><span>      </span><span>for</span> (k = j, j += width; k &lt; j; ++k)</span></p>
<p class="MsoNormal"><span><span>            </span>ix[k] = (xr[k] &gt;= roundfac) ? ix[k] : 0; </span></p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"><span>The compiler sure seems to like that ? operator.<span>  </span>It always worked best for min/max problems on some of my previous projects in financial computing (unfortunately none of those apps advised me to unload my whole portfolio...).</span></p>
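<p class="MsoNormal"><span>As a side-by-side illustration of the two loop shapes discussed above, here is a small sketch. The function names are invented for this example, and the <code>restrict</code> qualifiers are my own stand-in (standard C99) for the no-overlap promise that “#pragma ivdep” or /Oa gives the Intel Compiler; they are not part of the LAME source. Note that the two forms agree only when xr contains no NaN values, since a NaN fails both comparisons.</span></p>

```c
/* Conditional store: ix[k] is written only where the test is true, so a
 * vectorizer must prove that writing the other elements is harmless. */
static void zero_small_if(const float *xr, int *ix, int width, float roundfac)
{
    for (int k = 0; k < width; ++k)
        if (xr[k] < roundfac)
            ix[k] = 0;
}

/* Select form from the post: every ix[k] is read and written back, either
 * with its old value or with zero. restrict asserts xr and ix do not overlap. */
static void zero_small_select(const float *restrict xr, int *restrict ix,
                              int width, float roundfac)
{
    for (int k = 0; k < width; ++k)
        ix[k] = (xr[k] >= roundfac) ? ix[k] : 0;
}
```

<p class="MsoNormal"><span>Because the select form stores to the whole range unconditionally, the source itself states that every element of ix is accessible, which is the guarantee the “condition may protect exception” message was asking for.</span></p>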
<p class="MsoNormal"><span>Finally we came out with a concise SSE2 loop:</span></p>
<p class="MsoNormal"><span> </span></p>
<p class="MsoNormal"><span>.B7.32:</span></p>
<p class="MsoNormal"><span><span>        </span>movups<span>    </span>xmm0, XMMWORD PTR [ecx+eax*4]</span></p>
<p class="MsoNormal"><span><span>        </span>movups<span>    </span>xmm2, XMMWORD PTR [16+ecx+eax*4]<span>              </span></span></p>
<p class="MsoNormal"><span><span>        </span>movaps<span>    </span>xmm1, xmm3<span>                             </span><span>       </span></span></p>
<p class="MsoNormal"><span><span>        </span>cmpleps<span>   </span>xmm1, xmm0<span>                                    </span></span></p>
<p class="MsoNormal"><span><span>        </span>pand<span>      </span>xmm1, XMMWORD PTR [ebx+eax*4]<span>                 </span></span></p>
<p class="MsoNormal"><span><span>        </span>movdqa<span>    </span>XMMWORD PTR [ebx+eax*4], xmm1<span>                 </span></span></p>
<p class="MsoNormal"><span><span>        </span>movaps<span>    </span>xmm4, xmm3<span>                                    </span></span></p>
<p class="MsoNormal"><span><span>        </span>cmpleps<span>   </span>xmm4, xmm2<span>                                    </span></span></p>
<p class="MsoNormal"><span><span>        </span>pand<span>      </span>xmm4, XMMWORD PTR [16+ebx+eax*4]<span>              </span></span></p>
<p class="MsoNormal"><span><span>        </span>movdqa<span>    </span>XMMWORD PTR [16+ebx+eax*4], xmm4<span>              </span></span></p>
<p class="MsoNormal"><span><span>        </span>add<span>       </span>eax, 8<span>                                       </span></span></p>
<p class="MsoNormal"><span><span>        </span>cmp<span>       </span>eax, esi<span>                                    </span></span></p>
<p class="MsoNormal"><span><span>        </span>jb<span>        </span>.B7.32<span>        </span></span></p>
<p class="MsoNormal"><span> </span></p>
<p class="MsoNormal"><span>Again, this provided a 15% speedup for constant bit-rate encoding, as measured on a 2.9 GHz Core 2 Duo (Merom) system running WinXP 32-bit.</span></p>
<p class="MsoNormal"><span>The loop restructuring was checked into the LAME source tree back in August.<span>  </span>We’d hoped to have the “#pragma ivdep” inserted as well, but the development community prefers to avoid compiler-specific directives.<span>  </span>So, short of adding the #pragma yourself, we recommend using /Oa or /Ow along with your favorite SSE code generation switch (/QxW or better).</span></p>
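<p class="MsoNormal"><span>For readers who prefer intrinsics to a compiler listing, the compare-and-mask pattern in the SSE2 loop above can be approximated by hand as below. This is a sketch under assumptions: the function name is hypothetical, it processes four elements per pass rather than the eight in the compiler's unrolled loop, and it uses unaligned loads and stores throughout so that it makes no alignment demands on ix.</span></p>

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Zero every ix[k] whose xr[k] is below roundfac, keeping the others. */
static void zero_small_sse2(const float *xr, int *ix, int width, float roundfac)
{
    int k = 0;
#if defined(__SSE2__)
    const __m128 rf4 = _mm_set1_ps(roundfac);
    for (; k + 4 <= width; k += 4) {
        __m128 x = _mm_loadu_ps(&xr[k]);
        /* cmpleps: all-ones in each lane where roundfac <= xr[k] */
        __m128i keep = _mm_castps_si128(_mm_cmple_ps(rf4, x));
        __m128i v = _mm_loadu_si128((const __m128i *)&ix[k]);
        /* pand: lanes that failed the compare become zero */
        _mm_storeu_si128((__m128i *)&ix[k], _mm_and_si128(keep, v));
    }
#endif
    /* scalar tail, and the whole loop when SSE2 is unavailable */
    for (; k < width; ++k)
        ix[k] = (xr[k] >= roundfac) ? ix[k] : 0;
}
```

<p class="MsoNormal"><span>The compare produces a mask of all ones or all zeros per lane, so a single AND is enough to implement the select; no blend instruction is required.</span></p>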
<p class="MsoNormal"><span> </span></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/10/31/another-tip-for-faster-mp3-encoding/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open source project - LAME mp3 encoder optimization</title>
		<link>http://software.intel.com/en-us/blogs/2008/10/06/open-source-project-lame-mp3-encoder-optimization/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/10/06/open-source-project-lame-mp3-encoder-optimization/#comments</comments>
		<pubDate>Tue, 07 Oct 2008 00:17:04 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/10/06/open-source-project-lame-mp3-encoder-optimization/</guid>
		<description><![CDATA[One of the nice things about working on open source code is that any interesting findings can be freely discussed, such as in this blog.  With that in mind I recently took up a project to optimize performance of the popular LAME mp3 encoder.  Over the years I had seen LAME used in several other [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">One of the nice things about working on open source code is that any interesting findings can be freely discussed, such as in this blog.<span style="yes;">  </span>With that in mind I recently took up a project to optimize performance of the popular LAME mp3 encoder.<span style="yes;">  </span>Over the years I had seen LAME used in several other studies involving threading, compiler optimization, new architecture evaluation and the like. <span style="yes;"> </span>I wasn’t sure if any new frontier remained for me to discover.<span style="yes;">  </span>However, an initial VTune profiling session turned up some “low hanging fruit” optimization targets that I picked apart for a 70% reduction in encode time.<span style="yes;">  </span>I’ll try to cover these changes in detail over the next few posts.</span></p>
<div><span style="Times New Roman;"><span style="Times New Roman;">Before going deep on the optimization discussion, I should note that I tested on a 2.9 GHz Core2 Duo (Merom) system running WinXP 32-bit.<span style="yes;">  </span>I used MSVC++ 2005 for a baseline compile, with –O2 optimization settings.<span style="yes;">  </span>My workload was a 10-minute .wav file compressed with settings “-h –b160 –nores” (not coincidentally, the same settings used by TomsHardware.com in their benchmarks).</span></span></div>
<p><span style="Times New Roman;"> </p>
<p></span></p>
<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">The first thing that jumped out from the VTune run was a function called “quantize_xrpow_lines”.<span style="yes;">  </span>This was one of the top hotspot functions, consuming about 10% of the total run-time.<span style="yes;">  </span>Here is a link to the latest source file residing on Sourceforge:</span></p>
<p class="MsoNormal" style="0in 0in 0pt;"><a href="http://lame.cvs.sourceforge.net/viewvc/lame/lame/libmp3lame/takehiro.c?revision=1.75&amp;view=markup"><span style="Times New Roman;">http://lame.cvs.sourceforge.net/viewvc/lame/lame/libmp3lame/takehiro.c?revision=1.75&amp;view=markup</span></a></p>
<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">The code employs a bit of trickery known as the “Takehiro IEEE754 hack” which uses a sequence of adds to convert a floating point value to its integer counterpart.<span style="yes;">  </span>The hack was conceived back in 2000, during the era of MSVC++ 6.0 which used an expensive <em>_ftol</em> service routine to convert floats to ints.<span style="yes;">  </span>Coincidentally I wrote an article about this issue back then that is still on-line at </span><a href="http://software.intel.com/en-us/articles/fast-floating-point-to-integer-conversions/"><span style="Times New Roman;">http://software.intel.com/en-us/articles/fast-floating-point-to-integer-conversions/</span></a><span style="Times New Roman;"> (note you may need to Google for it, the page has moved around more than transient NBA coach Larry Brown over the years).<span style="yes;">  </span>The gist of the paper is that prior to the Pentium III, the x86 ISA did not provide instructions that explicitly performed the float-to-int truncation cast required by the ANSI C standard.<span style="yes;">  </span>The <em>_ftol</em> routine had to modify the FP control word to achieve this behavior on the x87 stack.<span style="yes;">  </span>At that time, the “magic float” hack was a method to improve conversion performance without requiring any of the new SSE instructions that were emerging on the latest CPUs.</span></p>
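<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">For readers who have not seen a “magic float” conversion, here is a generic sketch of the idea. It is not the code from takehiro.c: the names and constants below are chosen for illustration, and it assumes IEEE754 single precision, the default round-to-nearest mode, and an input between 0 and 2^23. Adding 2^23 pushes the fraction bits off the end of the mantissa, so the integer value can be read straight out of the float's bit pattern.</span></p>

```c
#include <stdint.h>
#include <string.h>

/* 2^23: at this magnitude a float has no fractional bits left. */
#define MAGIC_FLOAT 8388608.0f
#define MAGIC_BITS  0x4B000000   /* IEEE754 bit pattern of 8388608.0f */

/* Rounds 0 <= x < 2^23 to the nearest integer with one add and one
 * integer subtract, without touching the FP control word. */
static int magic_float_to_int(float x)
{
    /* volatile forces the sum to be rounded to single precision */
    volatile float biased = x + MAGIC_FLOAT;
    float tmp = biased;
    int32_t bits;
    memcpy(&bits, &tmp, sizeof bits);   /* reinterpret the float's bits */
    return (int)(bits - MAGIC_BITS);
}

/* The plain C cast, which truncates toward zero. */
static int cast_float_to_int(float x)
{
    return (int)x;
}
```

<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">For example, 3.7f comes back as 4 from the magic version and 3 from the cast, so the two are not interchangeable: the trick rounds to nearest, while the C cast truncates.</span></p>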
<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">Getting back to the present day mindset, SSE has been around for years and the most common compilers will at least generate scalar forms of the instructions.<span style="yes;">  </span>Also, the hardware implementations of convert-truncate instructions (e.g. CVTTSS2SI) have improved to where they only take a few cycles on Core2 Duo. <span style="yes;"> </span>Since the hotspot code does two such converts in a tight loop, I wanted to see if the Takehiro hack was still providing the benefits originally intended.</span></p>
<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">Ultimately I found that disabling the hack and reverting to the original code makes encoding 30% faster under certain compilation conditions.<span style="yes;">  </span>The best result comes from building with Intel Compiler 10.1 and the -QxT switch which targets Supplemental SSE3 code generation (note, -QxW SSE2 code generation is nearly as fast but I’m sure our compiler group would like to promote the latest switches).<span style="yes;">  </span>MSVC++ 2005 compilation also chips off significant encoding time as long as you use the switch combination “/arch:SSE2 /fp:fast”.<span style="yes;">  </span>Those parameters allow the compiler to generate SSE2 code by default and relax precision requirements.<span style="yes;">  </span>(Without the latter, MSVC will do all calculations in double precision, even if the source code specifies <em>float</em> data types. <span style="yes;"> </span>Double precision is free on the x87 floating-point stack, but in the SSE context you’ll see many CVTSS2SD and CVTSD2SS instructions throughout the code which will cripple performance.)</span></p>
<p class="MsoNormal" style="0in 0in 0pt;"><span style="Times New Roman;">Finally, though it only gave a modest performance gain, I came up with a more concise coding of the “quantize_xrpow_lines” loop:</span></p>
<blockquote><p><span style="yes;"><span style="1;">      </span><span style="blue;">for</span>(i=0; i &lt; l; i++)</span> <span style="yes;"><span style="1;">      </span>{</span> <span style="yes;"><span style="2;">            </span><span style="blue;">float</span> x0 = xr[i] * istep;</span> <span style="yes;"><span style="2;">            </span><span style="blue;">int</span> rx0 = (<span style="blue;">int</span>)x0;</span> <span style="yes;"><span style="2;">            </span>x0 += adj43[rx0];</span> <span style="yes;"><span style="2;">            </span>ix[i] = (<span style="blue;">int</span>)x0;</span> <span style="yes;"><span style="1;">      </span>}</span></p></blockquote>
<p><span style="Times New Roman;">The <em>adj43</em> table lookup prevents a straightforward SIMD implementation, but the Intel Compiler can still vectorize this if you specify “#pragma vector always”.<span style="yes;">  </span>It uses shift and unpack operations to extract the <em>rx0</em> indices into general purpose registers and gather the array values back into one <em>xmm</em> register.<span style="yes;">  </span>This measured about 20% faster in a microkernel (aka a test app separate from the full encoder), but didn’t trim off any appreciable encode time. <span style="yes;"> </span>Nonetheless, restructuring the loop in this fashion leaves it in better position to leverage SIMD hardware improvements down the line.</span><span style="Times New Roman;"> </span></p>
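<p>To make the “convert, look up, gather, convert” pattern concrete, here is a rough hand-written sketch using SSE2 intrinsics. It is not the compiler’s output: it spills the four indices through a small array instead of using shift and unpack sequences, and the function names are hypothetical. It assumes the caller guarantees that every truncated value is a valid index into the <em>adj43</em> table, and it falls back to the scalar loop when SSE2 is not available.</p>

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

// Scalar reference: same arithmetic as the restructured loop.
void quantize_scalar(const float* xr, std::size_t n, float istep,
                     const float* adj43, int* ix)
{
    for (std::size_t i = 0; i < n; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;            // truncating convert #1: table index
        x0 += adj43[rx0];
        ix[i] = (int)x0;              // truncating convert #2: final value
    }
}

// Four elements per iteration; any remainder goes through the scalar loop.
void quantize_by4(const float* xr, std::size_t n, float istep,
                  const float* adj43, int* ix)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2_SKETCH
    const __m128 step = _mm_set1_ps(istep);
    for (; i + 4 <= n; i += 4) {
        __m128 x0 = _mm_mul_ps(_mm_loadu_ps(xr + i), step);
        int idx[4];
        // CVTTPS2DQ: four truncating float-to-int converts at once
        _mm_storeu_si128(reinterpret_cast<__m128i*>(idx), _mm_cvttps_epi32(x0));
        // "Gather" by hand: four scalar loads packed into one xmm register
        const __m128 adj = _mm_set_ps(adj43[idx[3]], adj43[idx[2]],
                                      adj43[idx[1]], adj43[idx[0]]);
        x0 = _mm_add_ps(x0, adj);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(ix + i), _mm_cvttps_epi32(x0));
    }
#endif
    quantize_scalar(xr + i, n - i, istep, adj43, ix + i);
}
```

<p>The table lookups stay scalar either way, which is consistent with the modest gain reported above: only the multiply, add, and converts run four wide.</p>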
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" style="background-color: #555555; height: 30px; color: white;">Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors.  In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.  For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options."  Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors.  While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.</p>
<p>Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.  These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations.  Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.  Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.</p>
<p>While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements.  We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.</p>
<p>Notice revision #20101101</p>
</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/10/06/open-source-project-lame-mp3-encoder-optimization/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Assessing the accelerator buzz:  Another tip for faster Monte Carlo computing</title>
		<link>http://software.intel.com/en-us/blogs/2008/07/30/assessing-the-accelerator-buzz-another-tip-for-faster-monte-carlo-computing/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/07/30/assessing-the-accelerator-buzz-another-tip-for-faster-monte-carlo-computing/#comments</comments>
		<pubDate>Thu, 31 Jul 2008 01:18:17 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Intel SW Partner Program]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/07/30/assessing-the-accelerator-buzz-another-tip-for-faster-monte-carlo-computing/</guid>
		<description><![CDATA[Continuing with the GaussianRand example, a 1.5x gain is nice but were there additional opportunities for performance gains?  Of course there were! (That was a rhetorical question…)  Seeing as floating point divides are among the longer latency operations, we should look at the two that are coded into the do/while loop to normalize the random [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing with the GaussianRand example, a 1.5x gain is nice but were there additional opportunities for performance gains?  Of course there were! (That was a rhetorical question…)  Seeing as floating point divides are among the longer latency operations, we should look at the two that are coded into the do/while loop to normalize the random numbers:</p>
<p>        do {<br />
          x1 = 2.0 * random()/RAND_MAX - 1.0;<br />
          x2 = 2.0 * random()/RAND_MAX - 1.0;<br />
          w = x1 * x1 + x2 * x2;<br />
        } while ( w &gt;= 1.0 );</p>
<p>The x1 and x2 computations take a random integer from 0 to RAND_MAX and normalize it into the range -1.0 to 1.0.  While we might expect the compiler to reduce this to a single multiply by the constant (2.0/RAND_MAX) and then subtract 1.0, we can’t assume anything.  Take a look at the assembly listing:</p>
<p>        call      random                                        #55.15<br />
..B2.4:<br />
        cvtsi2sdq %rax, %xmm0                                   #55.15<br />
        addsd     %xmm0, %xmm0                                  #55.15<br />
        divsd     _2il0floatpacket.1(%rip), %xmm0               #55.24<br />
        subsd     _2il0floatpacket.3(%rip), %xmm0               #55.35<br />
        movsd     %xmm0, 24(%rsp)                               #55.35<br />
        call      random                                        #56.15<br />
..B2.5:<br />
        cvtsi2sdq %rax, %xmm4                                   #56.15<br />
        movsd     _2il0floatpacket.3(%rip), %xmm2               #56.35<br />
        addsd     %xmm4, %xmm4                                  #56.15<br />
        divsd     _2il0floatpacket.1(%rip), %xmm4               #56.24<br />
        movsd     24(%rsp), %xmm0                               #57.13<br />
        subsd     %xmm2, %xmm4                                  #56.35</p>
<p>Even without being an assembly wizard, you might detect that the two calls to random() are soon followed with divides by some constant value.  Those are going to chew up a lot of clock cycles.  So we should get a nice gain by explicitly folding this into a multiply:</p>
<p>    const double RMrcp = 2.0/RAND_MAX;</p>
<p>        for (int i = 0; i &lt; LENGTH; i++)<br />
          {<br />
            do<br />
            {<br />
              x1 = random()*RMrcp - 1.0;<br />
              x2 = random()*RMrcp - 1.0;<br />
              w = x1 * x1 + x2 * x2;<br />
            } while ( w &gt;= 1.0 );<br />
            <br />
            _x1[i] = x1;<br />
            _x2[i] = x2;<br />
            _w[i] = w;</p>
<p>          }</p>
<p>This quick code mod pushed the speedup to 1.9x.</p>
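<p>As for why the compiler kept the divides: dividing by a constant and multiplying by its reciprocal can round differently in the last bit, so without a relaxed floating-point model a compiler will generally not make that substitution unless the reciprocal is exact. The sketch below compares the two forms; the function names and the sampling helper are made up for illustration, and it assumes the generator’s maximum value is passed in as rmax.</p>

```cpp
#include <cmath>

// Form used in the original loop: one multiply, one divide, one subtract.
inline double normalize_div(long r, long rmax)
{
    return 2.0 * r / rmax - 1.0;
}

// Folded form: scale = 2.0 / rmax is computed once, outside the loop.
inline double normalize_mul(long r, double scale)
{
    return r * scale - 1.0;
}

// Hypothetical helper: largest difference between the two forms over an
// evenly spaced sample of inputs in [0, rmax].
inline double max_abs_difference(long rmax, int samples)
{
    const double scale = 2.0 / rmax;
    const long stride = rmax / samples;
    double worst = 0.0;
    for (int k = 0; k <= samples; k++) {
        const long r = stride * k;
        const double d = std::fabs(normalize_div(r, rmax) - normalize_mul(r, scale));
        if (d > worst) worst = d;
    }
    return worst;
}
```

<p>For a Monte Carlo run, a difference confined to the last bits is harmless, which is why doing the fold by hand is a safe trade here.</p>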
<p> </p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/07/30/assessing-the-accelerator-buzz-another-tip-for-faster-monte-carlo-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Assessing the accelerator buzz:  Vectorization of Monte Carlo algorithms</title>
		<link>http://software.intel.com/en-us/blogs/2008/07/15/assessing-the-accelerator-buzz-vectorization-of-monte-carlo-algorithms/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/07/15/assessing-the-accelerator-buzz-vectorization-of-monte-carlo-algorithms/#comments</comments>
		<pubDate>Tue, 15 Jul 2008 22:45:18 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Intel SW Partner Program]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/07/15/assessing-the-accelerator-buzz-vectorization-of-monte-carlo-algorithms/</guid>
		<description><![CDATA[Now we’ll take a look at optimizing something more interesting and complex.  Since we can’t show much of the customer source we work on, we’ll look at some public domain code from the internet, specifically this Box Muller random number transformation from http://www.taygeta.com/random/gaussian.html:       for (int i = 0; i &#60; LENGTH; i++)       [...]]]></description>
			<content:encoded><![CDATA[<p>Now we’ll take a look at optimizing something more interesting and complex.  Since we can’t show much of the customer source we work on, we’ll look at some public domain code from the internet, specifically this Box-Muller random number transformation from <a href="http://www.taygeta.com/random/gaussian.html">http://www.taygeta.com/random/gaussian.html</a>:</p>
<p> <br />
    for (int i = 0; i &lt; LENGTH; i++)<br />
      {<br />
        double w, x1, x2;</p>
<p>        do {<br />
          x1 = 2.0 * random()/RAND_MAX - 1.0;<br />
          x2 = 2.0 * random()/RAND_MAX - 1.0;<br />
          w = x1 * x1 + x2 * x2;<br />
        } while ( w &gt;= 1.0 );</p>
<p>        w = sqrt( (-2.0 * log( w ) ) / w );<br />
        y1[i] = x1 * w;<br />
        y2[i] = x2 * w;<br />
      }</p>
<p> <br />
This will produce a stream of Gaussian pseudo-random numbers saved into the arrays y1 and y2.</p>
<p>Going in we know that the Intel Compiler will not vectorize the do/while loop because it has an unpredictable loop count that is dependent on the results of each pass through the code.  The outer for loop won’t vectorize either as the compiler is not yet capable of such magic.</p>
<p>Between the square root, log, and divides we have a lot of expensive stuff going on here.  It would pay to restructure the code into something that the compiler can recognize as a vectorization candidate.  This can be done by splitting the outer loop into two for loops and saving intermediate values into arrays, as such:</p>
<p>double x1, x2, w;<br />
__declspec(align(16)) double _w[LENGTH], _x1[LENGTH], _x2[LENGTH];</p>
<p>for (int i = 0; i &lt; LENGTH; i++)<br />
      {<br />
       do<br />
            {<br />
              x1 = 2.0*random()/RAND_MAX - 1.0;<br />
              x2 = 2.0*random()/RAND_MAX - 1.0;<br />
              w = x1 * x1 + x2 * x2;<br />
            } while ( w &gt;= 1.0 );</p>
<p>            _x1[i] = x1;<br />
            _x2[i] = x2;<br />
            _w[i] = w;</p>
<p>    }</p>
<p>#pragma ivdep<br />
#pragma vector aligned<br />
for (int i = 0; i &lt; LENGTH; i++)<br />
          {<br />
            w = _w[i];<br />
            w = sqrt( (-2.0 * log( w ) ) / w );<br />
            y1[i] = _x1[i] * w;<br />
            y2[i] = _x2[i] * w;<br />
          }</p>
<p> <br />
With this restructuring we’ve isolated the sqrt/log/divide sequence into a loop simple enough for the compiler to vectorize.  The pragmas instruct it to ignore potential vector dependencies, e.g. pointer aliasing, and assume that the arrays exhibit 16-byte alignment (guaranteed by adding “__declspec(align(16))” to the declarations).</p>
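<p>For anyone who wants to try the same two-pass structure outside the Intel toolchain, here is a portable sketch in standard C++. Several things in it are assumptions and not part of the original code: the small LCG stands in for random(), the function and type names are invented, the Intel-specific pragmas and alignment declaration are left out, and the rejection test also discards w == 0 so that log() never sees zero. Whether the second loop vectorizes depends on the compiler and its math library.</p>

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for random(): a small LCG returning 31-bit values.
struct Lcg {
    std::uint32_t state;
    std::uint32_t next()
    {
        state = state * 1103515245u + 12345u;
        return state >> 1;                       // range 0 .. 2^31 - 1
    }
};

void gaussian_two_pass(std::size_t n, std::uint32_t seed,
                       std::vector<double>& y1, std::vector<double>& y2)
{
    y1.resize(n);
    y2.resize(n);
    std::vector<double> x1s(n), x2s(n), ws(n);   // intermediates between passes
    Lcg rng{seed};
    const double scale = 2.0 / 2147483647.0;

    // Pass 1: rejection sampling. The trip count of the do/while is data
    // dependent, so this loop stays scalar.
    for (std::size_t i = 0; i < n; i++) {
        double x1, x2, w;
        do {
            x1 = rng.next() * scale - 1.0;
            x2 = rng.next() * scale - 1.0;
            w = x1 * x1 + x2 * x2;
        } while (w >= 1.0 || w == 0.0);          // w == 0 check is an addition
        x1s[i] = x1;
        x2s[i] = x2;
        ws[i] = w;
    }

    // Pass 2: straight-line math over arrays, a candidate for vectorization.
    for (std::size_t i = 0; i < n; i++) {
        const double f = std::sqrt((-2.0 * std::log(ws[i])) / ws[i]);
        y1[i] = x1s[i] * f;
        y2[i] = x2s[i] * f;
    }
}
```

<p>The cost of the split is three temporary arrays of n doubles, so for very large counts it may be worth processing the data in blocks that fit in cache.</p>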
<p> <br />
$ icc -xP -vec_report3  GaussianRand.C<br />
...<br />
GaussianRand.C(85): (col. 21) remark: loop was not vectorized: unsupported loop structure.<br />
GaussianRand.C(95): (col. 2) remark: LOOP WAS VECTORIZED.</p>
<p> </p>
<p>This generated a healthy 1.5x speedup on the algorithm.</p>
<p>Note that the compiler employs a “short vector math library” which contains SIMD versions of math calls such as log, exp, sin, cos, etc.  This enables vectorization on loops such as the one above.  You can see how SVML is used in the assembly listing:</p>
<p>..B3.9:                        <br />
        movaps    16032(%rsp,%rbp), %xmm8                       #97.10<br />
        movaps    16032(%rsp,%rbp), %xmm0                       #98.29<br />
        call      __svml_log2                                   #98.29<br />
        mulpd     %xmm9, %xmm0                                  #98.24<br />
        divpd     %xmm8, %xmm0                                  #98.37<br />
        sqrtpd    %xmm0, %xmm2                                  #98.10<br />
        movaps    32(%rsp,%rbp), %xmm1                          #99.24<br />
...</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/07/15/assessing-the-accelerator-buzz-vectorization-of-monte-carlo-algorithms/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Assessing the accelerator buzz: Tips and Tricks for Intel® Compiler vectorization</title>
		<link>http://software.intel.com/en-us/blogs/2008/06/26/assessing-the-accelerator-buzz-tips-and-tricks-for-intel-compiler-vectorization/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/06/26/assessing-the-accelerator-buzz-tips-and-tricks-for-intel-compiler-vectorization/#comments</comments>
		<pubDate>Thu, 26 Jun 2008 18:08:16 +0000</pubDate>
		<dc:creator>Michael Stoner (Intel)</dc:creator>
				<category><![CDATA[Intel SW Partner Program]]></category>
		<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/06/26/assessing-the-accelerator-buzz-tips-and-tricks-for-intel-compiler-vectorization/</guid>
		<description><![CDATA[Here at Intel we have spent much of the last year assessing the rising buzz about GPGPU’s and other accelerator cards in the financial services community.  These technologies promise tremendous computing capability, but often we see performance claims that are exaggerated by comparing the best possible accelerator implementation to a very unoptimal version of the [...]]]></description>
			<content:encoded><![CDATA[<p>Here at Intel we have spent much of the last year assessing the rising buzz about GPGPUs and other accelerator cards in the financial services community.  These technologies promise tremendous computing capability, but often we see performance claims that are exaggerated by comparing the best possible accelerator implementation to a poorly optimized version of the software running on the CPU cores.</p>
<p>One of the first things we do in working toward a true top-end performance measurement is to rebuild the code with the Intel® Compiler.  Whether comparing against gcc 3.x or Microsoft Visual C++ compilations we often see a considerable performance gain right out of the box.</p>
<p>Beyond that we typically find further improvements by analyzing the source code hotspots and making adjustments to enable even better code generation from the Intel® Compiler.  The biggest gains often come from finding frequently executed computational loops that are not being vectorized, that is, they are not effectively using the SIMD capabilities of the x86 instruction set.  In this entry we’ll look at a couple examples where a simple tweak to a loop allowed it to vectorize and execute much more quickly.</p>
<p>In the first example, we have a “daxpy” loop that should vectorize using packed SSE3 instructions:</p>
<p><code>44: void DaxpyArray (double *x, double *y, double a, double *r)<br />
45: {<br />
46:<br />
47:      for (unsigned int i = 0; i &lt; LENGTH; i++)<br />
48:           r[i] = a * x[i] + y[i];<br />
49:<br />
50: }</code></p>
<p>We’ll compile this with icc 10.1, using the ‘-xP’ switch to target processors supporting SSE3 instructions and add ‘-vec_report3’ to get an explanation for loops that did not vectorize:</p>
<p><code>$ icc -xP -vec_report3 SSE3_example.C</code><br />
<code><br />
SSE3_example.C(47): (col. 5) remark: loop was not vectorized: existence of vector dependence.<br />
SSE3_example.C(48): (col. 8) remark: vector dependence: proven FLOW dependence between r line 48, and x line 48.<br />
SSE3_example.C(48): (col. 8) remark: vector dependence: proven ANTI dependence between x line 48, and r line 48.<br />
etc.</code></p>
<p>This loop looks simple enough, so why didn’t it vectorize? The reports can be cryptic but from experience we know that the compiler can be pretty picky about what it will accept. It requires a very specific ‘for’ loop structure and in this case the “unsigned int” loop counter is throwing it off course. The simplest fix is to change the type to a regular signed ‘int’. This will be fine as long as LENGTH does not exceed the range of 32-bit signed integers.</p>
<p><code>47: for (int i = 0; i &lt; LENGTH; i++)<br />
48: r[i] = a * x[i] + y[i];</code><br />
This was worth a 1.6x gain as measured on a 3.0 GHz “Woodcrest” CPU, so a little code change can go a long way. Of course we’d like to see the compiler be less stringent and vectorize with an unsigned loop index. This has already been fixed in the compiler mainline builds and will most likely appear in the 11.0 release. We’re also working on making the vectorization reports more intuitive and useful.</p>
<p><code>$ icc -xP -vec_report3 SSE3_example.C<br />
...<br />
SSE3_example.C(47): (col. 5) remark: LOOP WAS VECTORIZED.<br />
...</code></p>
<p>A second example involves the use of STL vector container classes. The compiler doesn’t yet know how to vectorize loops referencing data in that fashion. For example, the following:</p>
<p><code>void DaxpyVector (const vector&lt;double&gt;&amp; x, const vector&lt;double&gt;&amp; y, const double a, vector&lt;double&gt;&amp; r)<br />
{<br />
for (int i = 0; i &lt; LENGTH; i++)<br />
r[i] = a * x[i] + y[i];<br />
}<br />
</code><br />
must be recoded this way:<br />
<code>void DaxpyVector (const vector&lt;double&gt;&amp; x, const vector&lt;double&gt;&amp; y, const double a, vector&lt;double&gt;&amp; r)<br />
{<br />
double *xP, *yP, *rP;<br />
<br />
xP = (double*)&amp;x[0];<br />
yP = (double*)&amp;y[0];<br />
rP = &amp;r[0];<br />
<br />
for (int i = 0; i &lt; LENGTH; i++)<br />
rP[i] = a * xP[i] + yP[i];<br />
}<br />
</code><br />
... reassigning the vectors to double* pointers so the compiler can emit vectorized code. A fix for this issue is a bit more complex and we are not yet sure when the compilers will be able to handle it optimally.</p>
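<p>If your toolchain supports C++11 or later, vector::data() returns the same pointers without the casts. The sketch below is one way to write it; the function name is hypothetical, the loop bound comes from the vector size instead of LENGTH, and the size check is an addition. Whether a given compiler version still needs the raw pointers in order to vectorize is something to confirm with its vectorization report.</p>

```cpp
#include <cassert>
#include <vector>

void DaxpyVectorData(const std::vector<double>& x, const std::vector<double>& y,
                     const double a, std::vector<double>& r)
{
    assert(x.size() == r.size() && y.size() == r.size());

    // Plain pointers to the contiguous storage; const is preserved for inputs.
    const double* xP = x.data();
    const double* yP = y.data();
    double* rP = r.data();

    // Signed int trip count, copied to a local before the loop.
    const int n = static_cast<int>(r.size());
    for (int i = 0; i < n; i++)
        rP[i] = a * xP[i] + yP[i];
}
```

<p>Reading the size once into a local also keeps the size() call out of the loop condition.</p>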
<p>A few other things that can trip up the vectorizer:</p>
<ul>
<li>use of class member variables in the loop, either as array pointers or loop count variables. Either reassign them to local stack variables or experiment with no aliasing options like /Oa, /Qansi-alias, /Ow.</li>
<li>any manipulation of vector classes, for example, size() and resize() calls, even outside the target loop can disable vectorization because of C++ exception handling anomalies. Either move them outside the function scope or try using the ‘-fno-exceptions’ switch.</li>
</ul>
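<p>The first workaround in the list, copying class members to locals before the loop, can be sketched as follows. The class and member names are invented for illustration. The idea is that locals cannot be modified through the pointer being written inside the loop, so the compiler does not have to assume the array pointer or the trip count might change on each iteration.</p>

```cpp
#include <cstddef>
#include <vector>

class Scaler {
public:
    Scaler(int n, double factor)
        : count_(n), factor_(factor), data_(static_cast<std::size_t>(n), 1.0) {}

    void scale()
    {
        // Copy members to locals so the loop sees a fixed pointer, a fixed
        // trip count, and a fixed multiplier.
        double* const data = data_.data();
        const int n = count_;
        const double f = factor_;

        for (int i = 0; i < n; i++)
            data[i] *= f;
    }

    double at(int i) const { return data_[static_cast<std::size_t>(i)]; }

private:
    int count_;
    double factor_;
    std::vector<double> data_;
};
```

<p>Check the vectorization report before and after the change; if the loop was already vectorizing, the copies cost nothing and change nothing.</p>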
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/06/26/assessing-the-accelerator-buzz-tips-and-tricks-for-intel-compiler-vectorization/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

