<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Victoria Zhislina (Intel)</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/victoria-zhislina/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>-Mr Compiler, may I help you with the loop vectorization? -Not a disservice, please.</title>
		<link>http://software.intel.com/en-us/blogs/2011/10/12/mr-compiler-may-i-help-you-with-the-loop-vectorization-not-a-disservice-please/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/10/12/mr-compiler-may-i-help-you-with-the-loop-vectorization-not-a-disservice-please/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 09:42:56 +0000</pubDate>
		<dc:creator>Victoria Zhislina (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Auto-vectorization]]></category>
		<category><![CDATA[c++ Intel compiler]]></category>
		<category><![CDATA[Intel Compiler]]></category>
		<category><![CDATA[loop vectorization]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/10/12/mr-compiler-may-i-help-you-with-the-loop-vectorization-not-a-disservice-please/</guid>
		<description><![CDATA[Any parent knows the simple rule: "Never help a child with a task he can succeed at himself. Otherwise you do no good for the kid, for yourself, or for the whole planet". While a compiler is not a child (actually it is, because the Intel C/C++ Compiler is not yet 16 years old), the [...]]]></description>
			<content:encoded><![CDATA[<p>Any parent knows the simple rule: "Never help a child with a task he can succeed at himself. Otherwise you do no good for the kid, for yourself, or for the whole planet".<br />
While a compiler is not a child (actually it is, because the Intel C/C++ Compiler is not yet 16 years old), the rule applies to it just as well.</p>
<p><span id="more-37070"></span></p>
<p>To prove it, let's look at the following simple code from the open source <a href="http://opencv.willowgarage.com/wiki/">OpenCV library</a>.</p>
<pre name="code" class="cpp">template&lt;typename T, class Op&gt; static void
cvtScale_( const Mat&amp; srcmat, Mat&amp; dstmat, double _scale, double _shift )
{
    Op op;
    typedef typename Op::type1 WT;
    typedef typename Op::rtype DT;
    Size size = getContinuousSize( srcmat, dstmat, srcmat.channels() );
    WT scale = saturate_cast&lt;WT&gt;(_scale), shift = saturate_cast&lt;WT&gt;(_shift);

    for( int y = 0; y &lt; size.height; y++ )
    {
        const T* src = (const T*)(srcmat.data + srcmat.step*y);
        DT* dst = (DT*)(dstmat.data + dstmat.step*y);
        int x = 0;
        for(; x &lt;= size.width - 4; x += 4 )
        {
            DT t0, t1;
            t0 = op(src[x]*scale + shift);
            t1 = op(src[x+1]*scale + shift);
            dst[x] = t0; dst[x+1] = t1;
            t0 = op(src[x+2]*scale + shift);
            t1 = op(src[x+3]*scale + shift);
            dst[x+2] = t0; dst[x+3] = t1;
        }
        for( ; x &lt; size.width; x++ )
            dst[x] = op(src[x]*scale + shift);

    }
}</pre>
<p>It is a template function that works with chars, ints, shorts, floats and doubles.<br />
As you can see, its authors decided to help the compiler with loop vectorization by unrolling the inner "x" loop by 4 (and processing the remaining data tail separately).</p>
<p>So, do you think modern optimizing compilers will vectorize this loop properly?</p>
<p>Let's check it using the Intel Compiler 12.0 with the /QxSSE2 optimization option (the other SSEx options and the AVX option give the same result as below).</p>
<p>The assembly output generated by the compiler is very surprising: the compiler produces some SSE instructions, but only scalar ones, not vector ones. The unrolled loop is NOT vectorized, while the loop for the remaining data tail, which is not unrolled and processes only 1-3 elements, is vectorized!</p>
<p>If we remove the unrolling and make our code simple:</p>
<pre name="code" class="cpp">for( int y = 0; y &lt; size.height; y++ )
    {
        const T* src = (const T*)(srcmat.data + srcmat.step*y);
        DT* dst = (DT*)(dstmat.data + dstmat.step*y);
        int x = 0;
        for( ; x &lt; size.width; x++ )
            dst[x] = op(src[x]*scale + shift);
    }</pre>
<p> ... and check the asm output again, we find that the compiler fully vectorizes the code, making it up to 2-4 times faster depending on the input data type!<br />
<strong>Conclusion</strong>: <strong>More work for unrolling - lower performance. Don't do it.</strong></p>
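<p>To experiment with the compact form outside OpenCV, a minimal self-contained sketch could look like the one below. The function name <code>scaleRow</code> and the fixed <code>float</code> types are assumptions made for illustration (they stand in for the template parameters and for <code>op</code>); they are not part of the library.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one row of cvtScale_ with T = DT = float and an
// identity op. The loop body is a single statement with no manual unrolling,
// which leaves the compiler free to pick its own vector width and tail handling.
void scaleRow(const float* src, float* dst, std::size_t width, float scale, float shift)
{
    assert(src != nullptr && dst != nullptr);
    for (std::size_t x = 0; x < width; x++)
        dst[x] = src[x] * scale + shift;
}
```

<p>Whether this loop is actually vectorized still depends on the compiler and its options, so it is worth confirming with the assembly output or with the compiler's vectorization report (for example /Qvec-report with the Intel compiler).</p>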
<p>Please note that the Microsoft compiler (Visual Studio 2010 and 2008) with the /arch:SSE2 option does NOT vectorize the code above, neither the unrolled version nor the compact one. The code produced in both cases is very similar in appearance and performance. This just confirms the conclusion above.</p>
<p>And what if you want to keep the unrolling for some reason but still need the performance that vectorization brings?<br />
Then use the Intel compiler pragmas as shown below:<br />
----------------------------------------------------------<br />
<strong><span style="color: #0000ff;">#pragma simd</span></strong><br />
   for(x=0; x &lt;= size.width - 4; x += 4 )<br />
        {<br />
            DT t0, t1;<br />
            t0 = op(src[x]*scale + shift);<br />
            t1 = op(src[x+1]*scale + shift);<br />
            dst[x] = t0; dst[x+1] = t1;<br />
            t0 = op(src[x+2]*scale + shift);<br />
            t1 = op(src[x+3]*scale + shift);<br />
            dst[x+2] = t0; dst[x+3] = t1;<br />
        }<br />
<span style="color: #0000ff;"><strong>#pragma novector</strong></span><br />
        for( ; x &lt; size.width; x++ )<br />
            dst[x] = op(src[x]*scale + shift);<br />
</p>
<p>-----------------------------------------</p>
<p>It is self-explanatory, isn't it?</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/10/12/mr-compiler-may-i-help-you-with-the-loop-vectorization-not-a-disservice-please/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is AVX enabled?</title>
		<link>http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/#comments</comments>
		<pubDate>Thu, 14 Apr 2011 11:44:50 +0000</pubDate>
		<dc:creator>Victoria Zhislina (Intel)</dc:creator>
				<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[AVX]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/</guid>
		<description><![CDATA[If we ask anyone who uses, plans to use, or just advertises the compiler intrinsic functions for SIMD support (MMX, SSE, AVX) why they do so and why it is good, the answer will definitely be something like this: "Intrinsics provide a C/C++ language interface to assembly instructions, so that we don't need to deal with assembler". [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-family: Arial;font-size: x-small">If we ask anyone who uses, plans to use, or just advertises the compiler intrinsic functions for SIMD support (MMX, SSE, AVX) why they do so and why it is good, the answer will definitely be something like this: "Intrinsics provide a C/C++ language interface to assembly instructions, so that we don't need to deal with assembler".</span></p>
<p><span style="font-family: Arial"><span style="font-size: x-small"> Sounds more than good... but unfortunately it is not true. While intrinsics do make CPU-specific enhancements significantly easier to use, they don't entirely eliminate the need for some asm programming. The problem is that intrinsics don't provide a fallback path for systems without the corresponding SIMD support: if a program with "intrinsics inside" is executed on such a CPU, it crashes. To prevent this, one needs to create a "generic code" path and switch to or from it depending on the SIMD support of the host system. And to detect this support reliably, it is necessary to use assembler! No other common solution is available yet... The most burning issue here is detecting Advanced Vector Extensions support: AVX is not widespread yet, and it requires OS support. <span id="more-33289"></span></span></span><span style="font-family: Arial;font-size: x-small">While the Intel AVX Programming Reference contains asm pseudocode for AVX support detection, it is not enough. What most developers want is cut-and-paste code that can be used as is, even without any assembler knowledge. And such code is available - see below. </span></p>
<p><span style="font-family: Arial;font-size: x-small">The asm syntax differs between 32 and 64 bits, so you need to include the corresponding version in your project. In both cases, the C/C++ code calls the  <span style="font-family: Arial"><span style="font-size: x-small"><span style="font-size: x-small">isAvxSupported() function, which returns 1 if AVX is supported and zero otherwise.</span></span></span></span><span style="font-family: Arial;font-size: x-small"> </span></p>
<p><span style="font-family: Arial;font-size: x-small"> </span><span style="font-family: Arial"><span style="font-size: x-small"><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small">extern <span style="color: #a31515;font-size: x-small"><span style="color: #a31515;font-size: x-small">"C"</span></span><span style="font-size: x-small"> </span><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small">int </span></span></span></span></span><span style="font-size: x-small">isAvxSupported(); </span></span></span></p>
<p><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small"> </span></span><span style="color: #000000;font-size: x-small">AVXsupport = isAvxSupported();  </span><span style="font-size: x-small"><span style="color: #008000;font-size: x-small"> // = one if supported and zero otherwise</span></span></span></p>
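<p>The returned value would typically be used once, at startup, to choose between the "generic code" path and the AVX path. The sketch below shows one possible way to do that with a function pointer. All names in it (<code>ScaleFn</code>, <code>scaleGeneric</code>, <code>scaleAvx</code>, <code>selectScale</code>) are hypothetical, and the "AVX" function is only a placeholder that forwards to the generic code so that the example builds and runs on any machine.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel signature shared by both code paths.
typedef void (*ScaleFn)(const float* src, float* dst, std::size_t n, float scale);

// Generic fallback: safe on any CPU.
void scaleGeneric(const float* src, float* dst, std::size_t n, float scale)
{
    for (std::size_t i = 0; i < n; i++)
        dst[i] = src[i] * scale;
}

// Placeholder for a version written with AVX intrinsics. It forwards to the
// generic code here so that the sketch compiles and runs everywhere.
void scaleAvx(const float* src, float* dst, std::size_t n, float scale)
{
    scaleGeneric(src, dst, n, scale);
}

// Choose a path from the detection result (1 = AVX usable, 0 = not).
// In a real program the argument would be the value returned by isAvxSupported().
ScaleFn selectScale(int avxSupported)
{
    assert(avxSupported == 0 || avxSupported == 1);
    return avxSupported ? scaleAvx : scaleGeneric;
}
```

<p>With the assembly file linked in, the selection could be written as <code>ScaleFn scale = selectScale(isAvxSupported());</code>, after which every call goes through <code>scale</code> and the AVX code is never reached on a system that cannot run it.</p>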
<p><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small"><span style="color: #0000ff;font-size: x-small"><span style="font-size: x-small"><span style="color: #008000;font-size: x-small"> </span></span></span></span></span><span style="color: #008000">----------------------------------------cpuid64.asm----------------------------------------------------</span></p>
<pre name="code" class="shell">; CPUID Win64
.code   

; int isAvxSupported();
isAvxSupported proc
    mov r8, rbx ; CPUID overwrites ebx, and rbx is callee-saved in the Win64 ABI
    xor eax, eax
    cpuid
    cmp eax, 1 ; does CPUID support leaf eax = 1?
    jb not_supported
    mov eax, 1
    cpuid
    and ecx, 018000000h ; keep bit 27 (OSXSAVE: OS uses XSAVE/XRSTOR)
    cmp ecx, 018000000h ; and bit 28 (AVX supported by CPU); both must be set
    jne not_supported
    xor ecx, ecx ; XFEATURE_ENABLED_MASK/XCR0 register number = 0
    xgetbv ; XFEATURE_ENABLED_MASK register is in edx:eax
    and eax, 110b
    cmp eax, 110b ; bits 1 and 2: OS saves/restores XMM and YMM state at context switch
    jne not_supported
    mov eax, 1
    mov rbx, r8 ; restore rbx
    ret
not_supported:
    xor eax, eax
    mov rbx, r8 ; restore rbx
    ret
isAvxSupported endp
	END</pre>
<p><span style="color: #008000">------------------------cpuid32.asm--------------------------</span></p>
<pre name="code" class="shell">.686p
.xmm
.model  FLAT

; CPUID Win32
.code   

; int isAvxSupported();
_isAvxSupported proc
    push ebx ; CPUID overwrites ebx, which is callee-saved in the Win32 calling conventions
    xor eax, eax
    cpuid
    cmp eax, 1 ; does CPUID support leaf eax = 1?
    jb not_supported
    mov eax, 1
    cpuid
    and ecx, 018000000h ; keep bit 27 (OSXSAVE: OS uses XSAVE/XRSTOR)
    cmp ecx, 018000000h ; and bit 28 (AVX supported by CPU); both must be set
    jne not_supported
    xor ecx, ecx ; XFEATURE_ENABLED_MASK/XCR0 register number = 0
    xgetbv ; XFEATURE_ENABLED_MASK register is in edx:eax
    and eax, 110b
    cmp eax, 110b ; bits 1 and 2: OS saves/restores XMM and YMM state at context switch
    jne not_supported
    mov eax, 1
    pop ebx ; restore ebx
    ret
not_supported:
    xor eax, eax
    pop ebx ; restore ebx
    ret
_isAvxSupported endp
	END</pre>
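<p>For readers who do not read assembler, the three checks made by both listings can be restated in C++ roughly as follows. This is only a sketch of the decision logic: the helper <code>avxUsable</code> and its parameter names are hypothetical, and it takes the raw register values as arguments instead of executing CPUID and XGETBV itself, so it does not replace the assembly code.</p>

```cpp
#include <cassert>
#include <cstdint>

// Same masks as in the assembly listings.
const std::uint32_t kOsxsaveAndAvx = (1u << 27) | (1u << 28); // 018000000h
const std::uint32_t kXmmYmmState   = (1u << 1) | (1u << 2);   // 110b

// Hypothetical helper that mirrors the branches of isAvxSupported().
//   maxLeaf  : EAX after CPUID with EAX = 0
//   leaf1Ecx : ECX after CPUID with EAX = 1
//   xcr0Low  : EAX after XGETBV with ECX = 0 (only meaningful if OSXSAVE is set)
int avxUsable(std::uint32_t maxLeaf, std::uint32_t leaf1Ecx, std::uint32_t xcr0Low)
{
    assert(kOsxsaveAndAvx == 0x18000000u && kXmmYmmState == 0x6u);
    if (maxLeaf < 1)
        return 0; // CPUID leaf 1 is not available
    if ((leaf1Ecx & kOsxsaveAndAvx) != kOsxsaveAndAvx)
        return 0; // the CPU lacks AVX or the OS does not use XSAVE/XRSTOR
    if ((xcr0Low & kXmmYmmState) != kXmmYmmState)
        return 0; // the OS does not save both XMM and YMM state
    return 1;
}
```

<p>Note the order of the checks: the real code must not execute XGETBV unless the OSXSAVE bit is set, which is why the assembly tests the CPUID bits first.</p>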
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

