<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sat, 26 May 2012 04:03:52 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-parallel-composer-kb/type/performance-and-optimization/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/intel-parallel-composer-kb/type/performance-and-optimization/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Getting Started with Intel® Cilk™ Plus Array Notations</title>
      <description><![CDATA[ <strong><span >Introduction<br /><br /></span></strong>Array Notations is an Intel-specific language extension, part of the <a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">Intel® Cilk<sup>TM</sup> Plus</a> feature supported by the Intel® C++ Compiler, that provides ways to express data-parallel operations on ordinary declared C/C++ arrays.  By using array notations, you can improve the performance of your application through <a href="http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/?wapkw=%28vectorization+with+intel+compilers%29">Vectorization</a>.  Vectorization is the key to improving your application's performance by taking advantage of the processor's capability to operate on multiple array (or vector) elements at a time.  The Intel® Compilers provide unique capabilities to enable vectorization. The programmer may be able to help the compiler vectorize more loops through a simple programming style and by using compiler features designed to assist vectorization.  This article discusses how to use the Array Notations feature of Intel® Cilk<sup>TM</sup> Plus to help the compiler vectorize C/C++ code and improve performance.<br /><br /><a href="http://software.intel.com/file/42927">Click here to continue reading the article.</a> ]]></description>
      <link>http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-array-notations/</link>
      <pubDate>Sun, 25 Mar 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-array-notations/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-array-notations/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Getting Started with Intel® Cilk™ Plus SIMD Vectorization and Elemental Functions</title>
      <description><![CDATA[ <strong><span >Introduction<br /></span></strong><br />SIMD Vectorization and Elemental Functions are part of the <a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">Intel® Cilk<sup>TM</sup> Plus</a> feature supported by the Intel® C++ Compiler and provide ways to vectorize loops and user-defined functions.  <a href="http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/?wapkw=%28vectorization+with+intel+compilers%29">Vectorization</a> is the key to improving your application's performance by taking advantage of the processor's Single Instruction Multiple Data (SIMD) capability to operate on multiple array (or vector) elements at a time.  The Intel® Compilers provide unique capabilities to enable vectorization. The programmer may be able to help the compiler vectorize more loops through a simple programming style and by using compiler features designed to assist vectorization.  This article discusses how to use the vector elemental functions and the SIMD directive (#pragma simd) of Intel® Cilk<sup>TM</sup> Plus to help the compiler vectorize C/C++ code and improve performance.<br /><br /><a href="http://software.intel.com/file/42996">Click here to continue reading the article.</a><br /><br />Additional information about what sort of loops may be vectorized using the SIMD pragma/directive is available <a href="http://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd/">here</a>. ]]></description>
      <link>http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-simd-vectorization-and-elemental-functions/</link>
      <pubDate>Tue, 20 Mar 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-simd-vectorization-and-elemental-functions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-simd-vectorization-and-elemental-functions/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Performance Tools for Software Developers - Auto parallelization and  /Qpar-threshold</title>
      <description><![CDATA[ 
<table cellpadding="0" cellspacing="15" border="0">
<tbody>
<tr>
<td class="bodycopy">
<p>The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good work-sharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, as would otherwise be done by hand when programming with OpenMP directives. Applications that use OpenMP or auto-parallelization gain performance from shared memory on multiprocessor IA-32 and Intel 64 systems.</p>
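<p>As a rough sketch of the kind of loop the auto-parallelizer looks for (this example is not part of the original sample, and the function name scale_add is hypothetical), the code below shows a serial loop whose iterations are independent, annotated with the OpenMP directive that requests the same work sharing by hand. If it is built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.</p>
<pre name="code" class="cpp">#include &lt;stdio.h&gt;

/* Hypothetical example: y[i] = a * x[i] + y[i].
   Each iteration touches only element i, so the iterations are
   independent and can be divided among threads. */
void scale_add(int n, double a, const double *x, double *y)
{
    int i;
#pragma omp parallel for
    for (i = 0; i &lt; n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {10.0, 20.0, 30.0, 40.0};
    int i;

    scale_add(4, 2.0, x, y);
    for (i = 0; i &lt; 4; i++) {
        printf("%g\n", y[i]);   /* prints 12, 24, 36, 48 on separate lines */
    }
    return 0;
}
</pre>
<p>With auto-parallelization the compiler tries to discover such loops and apply the transformation itself. Whether it succeeds for a loop like this one is likely to depend on whether it can prove that x and y do not overlap.</p>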
<p>The following options enable auto-parallelization:</p>
<blockquote><b>/Qparallel:</b><br />Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. <br /><br /><b>/Qpar-threshold:n</b><br />This option sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. To use this option, you must also specify -parallel (Linux and Mac OS X) or /Qparallel (Windows). The default is /Qpar-threshold:100.</blockquote>
<p>This option is useful for loops whose computation work volume cannot be determined at compile-time. The threshold is usually relevant when the loop trip count is unknown at compile-time.</p>
<p>The compiler applies a heuristic that tries to balance the overhead of creating multiple threads versus the amount of work available to be shared amongst the threads.</p>
<p><i>n</i> is an integer from 0 through 100 that sets the threshold for the auto-parallelization of loops. If <i>n</i> is 0, loops are always auto-parallelized, regardless of computation work volume. If <i>n</i> is 100, loops are auto-parallelized only when the compiler analysis data predicts performance gains, that is, only if profitable parallel execution is almost certain. The intermediate values 1 to 99 represent the percentage probability of a profitable speed-up. For example, <i>n</i>=50 directs the compiler to parallelize only if there is a 50% probability of the code speeding up when executed in parallel.</p>
<p>Also, to be "100%" sure that a loop will benefit from parallelization, the compiler needs to know the iteration count at compile time. For a "99%" or lower threshold, knowing the iteration count at compile time is not a requirement.</p>
<p>This leads to a big difference in the number of loops parallelized at 99% compared to 100%. For many applications, 99% is the better setting, but applications with many short loops may run more slowly at 99%.</p>
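<p>To make the distinction concrete, here is a minimal sketch (the function names and the FIXED_N constant are hypothetical, not taken from the original article) of two loops that do the same work. In the first, the trip count is a compile-time constant, so the compiler can estimate the work volume. In the second, it arrives as a function argument, which is the case where, as described above, a threshold of 100 would rule out parallelization and a threshold of 99 or lower may allow it. Whether either loop is actually parallelized depends on the compiler version and its cost model.</p>
<pre name="code" class="cpp">#include &lt;stdio.h&gt;

#define FIXED_N 100000

static double data[FIXED_N];

/* Trip count known at compile time: FIXED_N is a constant. */
void fill_fixed(void)
{
    int i;
    for (i = 0; i &lt; FIXED_N; i++) {
        data[i] = 0.5 * i;
    }
}

/* Trip count unknown at compile time: n is only known at run time. */
void fill_runtime(double *a, int n)
{
    int i;
    for (i = 0; i &lt; n; i++) {
        a[i] = 0.5 * i;
    }
}

int main(void)
{
    fill_fixed();
    fill_runtime(data, 1000);
    printf("%g %g\n", data[999], data[FIXED_N - 1]);
    return 0;
}
</pre>
<p>Compiling such a file twice with /Qparallel and /Qpar-report3, once with /Qpar-threshold:100 and once with /Qpar-threshold:99, and comparing the two reports is one way to see which loops the threshold affects.</p>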
<p>The following example, int_sin.c, is not auto-parallelized when it is compiled with /Qpar-threshold:100 using the command line below:</p>
<blockquote>C:\&gt;icl -c /Qparallel /Qpar-report3 /Qpar-threshold:100 int_sin.c
<p>If we use /Qpar-threshold:99 then it is parallelized.</p>
<p><b>Example:</b></p>
<pre name="code" class="cpp">// int_sin.c
// Intel C++ compiler sample program

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;
#include &lt;mathimf.h&gt;

// Function to be integrated
// | sin(x) |
#define INTEG_FUNC(x) fabs(sin(x))

int main(void)
{
    // Loop counters and number of interior points
    unsigned int i, j, N;
    // Stepsize, independent variable x, and accumulated sum
    double step, x_i, sum;
    // Timing variables for evaluation
    double start, finish, duration;
    // Start integral from
    double interval_begin = 0.0;
    // Complete integral at
    double interval_end = 2.0 * 3.141592653589793238;

    // Start timing for the entire application
    start = clock();

    printf("\n");
    printf(" Number of | Computed Integral |\n");
    printf(" Interior Points | |\n");
    for (j = 2; j &lt; 10; j++)
    {
        printf("-------------------------------------\n");

        // Compute the number of (internal rectangles + 1)
        N = 1 &lt;&lt; j;

        // Compute stepsize for N-1 internal rectangles
        step = (interval_end - interval_begin) / N;

        // Approx. 1/2 area in first rectangle: f(x0) * [step/2]
        sum = INTEG_FUNC(interval_begin) * step / 2.0;

        // Apply midpoint rule:
        // Given length = f(x), compute the area of the
        // rectangle of width step
        // Sum areas of internal rectangles: f(x_i) * step
        for (i = 1; i &lt; N; i++)
        {
            x_i = i * step;
            sum += INTEG_FUNC(x_i) * step;
        }

        // Approx. 1/2 area in last rectangle: f(xN) * [step/2]
        sum += INTEG_FUNC(interval_end) * step / 2.0;

        printf(" %10u | %14e |\n", N, sum);
    }
    finish = clock();
    duration = (finish - start);
    printf("\n");
    printf(" Application Clocks = %10e\n", duration);
    printf("\n");
    return 0;
}
</pre>
</blockquote>
</td>
</tr>
</tbody>
</table>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/</link>
      <pubDate>Sun, 23 Jan 2011 10:30:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>How to manually target 2nd generation Intel Core processors with support for Intel AVX</title>
      <description><![CDATA[ <p><br /><strong>Product:</strong> Intel C++ Composer XE<br /><br /><strong>Version:</strong> 2011 (contains Intel C++ Compiler 12.0 or 12.1)<br /><br /><br />Manual processor dispatch allows you to write one or more versions of a function that will run only on specified types of Intel processor. The Intel processor type is detected at runtime, and the corresponding function version is executed. This feature is available only for Intel processors of IA-32 or Intel 64 architecture. It is not available for non-Intel processors or for Intel processors of IA-64 architecture. Applications built with the manual processor dispatch feature may be more highly optimized for Intel processors than for non-Intel processors.<br /><br />The <strong>__declspec(cpu_dispatch(cpuid,cpuid,…))</strong> syntax is used to provide a list of targeted processors along with an empty function body (i.e., a function stub). The <strong>__declspec(cpu_specific(cpuid))</strong> syntax is used to declare each function version that is targeted at a particular type or types of processor.<br /><br />The following table lists possible values for cpuid (names are not case-sensitive):<br /><br />
<table width="100%" cellpadding="0" cellspacing="0" border="1">
<thead>
<tr>
<td width="24%" valign="top">
<p align="center"><b>Argument for cpuid</b></p>
</td>
<td width="75%" valign="top">
<p align="center"><b>Processors</b></p>
</td>
</tr>
</thead>
<tbody>
<tr>
<td width="24%" valign="top">
<p>core_2nd_gen_avx</p>
</td>
<td width="75%" valign="top">
<p>2nd generation Intel® Core<sup>TM</sup> processor family with support for Intel® Advanced Vector Extensions (Intel® AVX).</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_aes_pclmulqdq</p>
</td>
<td width="75%" valign="top">
<p>Intel® Core<sup>TM</sup>  processors with support for Advanced Encryption Standard (AES) instructions and carry-less multiplication instruction</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_i7_sse4_2</p>
</td>
<td width="75%" valign="top">
<p>Intel® Core<sup>TM</sup>  processor family with support for Intel® SSE4 Efficient Accelerated String and Text Processing instructions  (SSE4.2)</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>atom</p>
</td>
<td width="75%" valign="top">
<p>Intel® Atom<sup>TM</sup> processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_2_duo_sse4_1</p>
</td>
<td width="75%" valign="top">
<p>Intel® 45nm Hi-k next generation Intel® Core<sup>TM</sup> microarchitecture processors with support for Intel® SSE4 Vectorizing Compiler and Media Accelerators instructions (SSE4.1)</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_2_duo_ssse3</p>
</td>
<td width="75%" valign="top">
<p>Intel® Core<sup>TM</sup>2 Duo processors and Intel® Xeon® processors with Intel® Supplemental Streaming SIMD Extensions 3 (SSSE3)</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_4_sse3</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3), Intel® Core<sup>TM</sup> Duo processors, Intel® Core<sup>TM</sup> Solo processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_4</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium 4 processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_m</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium M processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_iii</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium III processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>generic</p>
</td>
<td width="75%" valign="top">
<p>Other IA-32 or Intel 64 processors or compatible  processors not provided by Intel Corporation</p>
</td>
</tr>
</tbody>
</table>
<br /><br />If no other matching Intel processor type is detected, the “generic” version of the function will be executed. If the program is intended to execute on non-Intel processors, a “generic” function version must be provided. The degree of optimization of the generic function version and the processor features that it assumes are under the control of the programmer.<br /><br />The following framework illustrates how the <strong>cpu_dispatch</strong> and <strong>cpu_specific</strong> keywords might be used to create function versions for the 2nd generation Intel Core processor family, for the Intel Core processor family, for the Intel Core 2 Duo processor family, and for other Intel and compatible, non-Intel processors. Each processor-specific function body might contain processor-specific intrinsic functions, or it might be placed in a separate source file and compiled with a processor-specific compiler option. See <a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/">http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/</a> for more details of such options.</p>
<pre name="code" class="cpp">#include &lt;stdio.h&gt;

// need to create specific function versions for the following processors:
__declspec(cpu_dispatch(generic, core_2_duo_ssse3, core_i7_sse4_2, core_2nd_gen_avx))
void dispatch_func() {}       //  stub that will call the appropriate specific function version

__declspec(cpu_specific(generic))
void dispatch_func() {
    printf("\nCode for non-Intel processors and generic Intel processors goes here\n");
}

__declspec(cpu_specific(core_2_duo_ssse3))
void dispatch_func() {
    printf("\nCode for Intel Core 2 Duo processors with support for SSSE3 goes here\n");
}

__declspec(cpu_specific(core_i7_sse4_2))
void dispatch_func() {
    printf("\nCode for Intel Core processors with support for SSE4.2 goes here\n");
}

__declspec(cpu_specific(core_2nd_gen_avx))
void dispatch_func() {
    printf("\nCode for 2nd generation Intel Core processors goes here\n");
}

int main() {
    dispatch_func();
    printf("Return from dispatch_func\n");
    return 0;
}
</pre>
<p><br /><br />
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table>
</p>
<p> </p>
<p><i>[DISCLAIMER: The information on this web site is intended for hardware system manufacturers and software developers. Intel does not warrant the accuracy, completeness or utility of any information on this site. Intel may make changes to the information or the site at any time without notice. Intel makes no commitment to update the information at this site. ALL INFORMATION PROVIDED ON THIS WEBSITE IS PROVIDED "as is" without any express, implied, or statutory warranty of any kind including but not limited to warranties of merchantability, non-infringement of intellectual property, or fitness for any particular purpose. Independent companies manufacture the third-party products that are mentioned on this site. Intel is not responsible for the quality or performance of third-party products and makes no representation or warranty regarding such products. The third-party supplier remains solely responsible for the design, manufacture, sale and functionality of its products. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.]</i></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors/</link>
      <pubDate>Thu, 13 Jan 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Step-by-Step Application Performance Tuning with Intel Compilers</title>
      <description><![CDATA[ <span class="sectionHeading">Application Performance:  A Step-by-Step Introduction to Application Tuning with Intel® Compilers</span><br /><br /><span class="sectionBodyText">Before you begin performance tuning, you may want to check the correctness of your application by building it without optimization using /Od (Windows*) or -O0 (Linux* or Mac OS* X). In compiler versions 11 and later, all optimization levels assume support for the SSE2 instruction set by default. <br /><br /><span class="sectionHeading">1. </span>Use the general optimization options (Windows /O1, /O2 or /O3; Linux and Mac OS X -O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each. Most users should start at /O2 (-O2), the default, before trying more advanced optimizations. Next, for loop-intensive applications, try /O3 (-O3).  These options are available for both Intel® and non-Intel microprocessors but they may perform more optimizations for Intel microprocessors than they perform for non-Intel microprocessors.<br /><br /><span class="sectionHeading">2.</span> Fine-tune performance to target IA-32 and Intel 64-based systems using processor-specific options. Examples are /QxSSE4.2 (-xsse4.2) for the Intel® Core™ processor family, e.g. the Intel Core i7 processor, and /arch:SSE3 (-msse3) for compatible, non-Intel processors that support at least the SSE3 instruction set. Alternatively, you can use /QxHOST (-xhost), which will use the most advanced instruction set for the processor on which you compiled. This option is available for both Intel® and non-Intel microprocessors but it may perform more optimizations for Intel microprocessors than it performs for non-Intel microprocessors. 
For a more extensive list and description of options that optimize for specific processors or instruction sets, please see the online article “<a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/" title="SSE generation and processor-specific optimizations">Intel® compiler options for SSE generation and processor-specific optimizations</a>” and the Intel Compiler User and Reference Guides.<br /><br /><span class="sectionHeadingText">3.</span> Add interprocedural optimization (IPO), /Qipo (-ipo) and/or profile-guided optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use); then measure performance again to determine whether your application benefits from one or both of them.<br /><br /><span class="sectionHeadingText">4.</span> Optimize your application for vector and parallel execution on multi-threaded, multi-core and multi-processor systems using:<br />advice from the new Guided Auto-Parallelism (GAP) feature, /Qguide (-guide); <br />the Intel® Cilk™ Plus language extensions for C/C++;<br />the parallel performance options /Qparallel (-parallel) or /Qopenmp (-openmp);<br />the CoArray feature of Fortran 2008;<br />or by using the Intel® Performance Libraries included with the product. <br />These optimization steps are applicable to both Intel and non-Intel microprocessors, but may result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors.<br /><br /><span class="sectionHeading">5.</span> Use Intel® VTune™ Amplifier XE to help you identify serial and parallel performance “hotspots” so that you know which specific parts of your application could benefit from further tuning. Use Intel® Inspector XE to reduce the time to market for threaded applications by diagnosing memory and threading errors and speeding up the development process. 
These products cannot be used on non-Intel microprocessors.<br /></span><br />For more details, please consult the main product documentation, e.g. in the <a href="http://software.intel.com/en-us/articles/intel-software-technical-documentation/">Intel® Software Documentation Library</a>. A brief summary of the major optimization options of the Intel Compiler is available in the <a href="http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf" title="Quick-Reference Guide to Optimization with Intel® Compilers version 12">Quick-Reference Guide to Optimization with Intel® Compilers version 12</a>. ]]></description>
      <link>http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/</link>
      <pubDate>Thu, 11 Nov 2010 21:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
      <category>Intel® VTune™ Amplifier XE Knowledge Base</category>
    </item>
    <item>
      <title>A Guide to Auto-vectorization with Intel® C++ Compilers</title>
      <description><![CDATA[ <div class="sectionHeading">Introduction</div>
<br />The goal of this Guide is to provide guidelines for enabling compiler auto-vectorization with the Intel® C++ Compilers.  This document is aimed at C/C++ programmers working on systems based on Intel® processors or compatible, non-Intel processors that support SIMD instructions such as Intel® Streaming SIMD Extensions (Intel® SSE).  This includes Intel 64 and most IA-32 systems, but excludes systems based on Intel® Itanium® processors.  The examples presented refer to Intel SSE, but many of the principles also apply to other SIMD instruction sets.  While the examples used are specific to C++ programs, many of the concepts discussed are equally applicable to Fortran programs.<br /><br /><a href="http://software.intel.com/file/43567/">Click here to continue reading the article.</a><br /><br /><a href="http://software.intel.com/file/43565">Click here to download the sample code.</a> ]]></description>
      <link>http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/</link>
      <pubDate>Mon, 08 Nov 2010 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Information about the FTC Decision and Order on the Intel® Compilers Reimbursement Fund</title>
      <description><![CDATA[ Information on the Intel Compiler Reimbursement Fund referenced in Section VII.D of the FTC Decision and Order is available now. Please see the site, <a href="http://www.CompilerReimbursementProgram.com">www.CompilerReimbursementProgram.com</a>, for further information. ]]></description>
      <link>http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/</link>
      <pubDate>Mon, 01 Nov 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Software Development Tool Suites for Intel® Atom™ Processor Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Elemental functions: Writing data parallel code in C/C++ using Intel® Cilk™ Plus</title>
      <description><![CDATA[ Intel® Cilk™ Plus is a simple C/C++ language extension construct for data parallel operations.<br /><br />
<div class="sectionHeading">Introduction</div>
<br />Intel® Cilk™ Plus provides simple-to-use extensions to the C and C++ languages for expressing data and task parallelism. It is implemented by the Intel® C++ Compiler, which is part of Intel® Parallel Composer and Intel® Parallel Studio. This article describes one of these programming constructs: “elemental functions”.<br /><br />There are cases in which an algorithm performs operations on a large collection of data, with no dependence among the operations performed on the various data items. For example, at a certain point in the algorithm the programmer may want to add arrays a1 and a2 and store the result in array r. At that level, the programmer is thinking in terms of a single operation that needs to be performed on many elements, independently of each other. Unfortunately, the C/C++ programming languages do not provide constructs that allow expressing the operation as such. Instead, they force the programmer to express the operation as a loop that operates on each array element. The end result, in terms of the values stored in the result array, is the same. However, the loop introduces an unintended order of operations: the implementation has to add the first two array elements, store the result in the first location of the result array, move on to perform the same operation on the second pair of elements, and so on. 
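<br /><br />As a rough illustration of the loop form described above, here is a minimal sketch in standard C++ (not Intel® Cilk™ Plus syntax). The function name add_arrays and the use of std::vector are assumptions made for this example only, and the sketch assumes a2 has at least as many elements as a1.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the article): element-wise r = a1 + a2,
// written as an ordinary serial loop. Assumes a2.size() >= a1.size().
std::vector<float> add_arrays(const std::vector<float>& a1,
                              const std::vector<float>& a2) {
    std::vector<float> r(a1.size());
    for (std::size_t i = 0; i < a1.size(); ++i) {
        // Iteration i reads and writes only element i.
        r[i] = a1[i] + a2[i];
    }
    return r;
}
```

Each iteration touches only element i, so no iteration depends on another; the serial order comes from the loop syntax, not from the algorithm.<br /><br />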
An example later will show what is expected to be a typical use of an elemental function: use a single line to mark a standard C/C++ function as elemental, change the invocation loop from serial to parallel (in this example, a change of one keyword), and that is it.<br /><br />To continue reading the article, click on the link below.<br /><br /><br />Refer to <a href="http://software.intel.com/en-us/articles/optimization-notice/">http://software.intel.com/en-us/articles/optimization-notice</a> for more information regarding performance and optimization choices in Intel software products. ]]></description>
      <link>http://software.intel.com/en-us/articles/elemental-functions-writing-data-parallel-code-in-cc-using-intel-cilk-plus/</link>
      <pubDate>Wed, 01 Sep 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/elemental-functions-writing-data-parallel-code-in-cc-using-intel-cilk-plus/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/elemental-functions-writing-data-parallel-code-in-cc-using-intel-cilk-plus/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Accelerate Your Application via IPP Image Processing in Parallel Studio - C code vs. IPP Resize</title>
      <description><![CDATA[ <p align="left"><strong>Summary</strong><br />Intel® Parallel Studio 2011 was released recently. Intel® Integrated Performance Primitives (IPP), a key component of Intel® Parallel Composer, gives users an easy and fast way to accelerate digital applications. This article shows how to use an IPP image processing function to develop a parallel-ready application, and provides a sample that shows the performance difference between IPP and general C code when resizing an image, a widely used operation in image processing. Tests show that the IPP function can run 44x faster than the corresponding C code. With threading enabled, the speedup exceeds 50x on a Core 2 Quad 2.66GHz machine. <br /><br /><a href="http://software.intel.com/file/29998"><strong>Attached</strong></a> is the sample project, a Parallel Composer 2011 project for the Microsoft Visual Studio 2005 IDE. <br />For developers who have installed Intel Parallel Composer with Microsoft Visual Studio 2010, <a href="http://software.intel.com/file/32831"><strong>here</strong></a> is the project. <br /><b><br />How to build the Sample</b></p>
<p>1. Build system requirement</p>
<p>Software:<br />•   <a href="http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-for-windows-compiling-and-linking-with-microsoft-visual-c-and-intel-c-compilers/#10#10">Intel Parallel Studio 2011 and Microsoft* Visual Studio 2005 and later</a><br />•   (optional)  install static ipp library separately from http://software.intel.com/en-us/articles/intel-ipp-static-libraries/ <b></b></p>
<p>Hardware:  A recent dual-core or quad-core machine running Windows XP, Windows Vista, or Windows 7</p>
<p>2. Download Resize_Image_PS_VS2005.zip and unzip it to a directory, referred to here as &lt;WorkDIR&gt;</p>
<p>3. Go to &lt;WorkDIR&gt; and double-click Resize Image.sln.  The Visual Studio 2005 IDE opens automatically.</p>
<p>4. From the <b>main toolbar</b>, select <b>Project&gt;&gt;</b> <b>Intel Parallel Composer 2011 »</b> <b>Select Build Component</b></p>
<p>(or right-click the project in Solution Explorer), check <b>Use IPP</b>, and click OK.</p>
<p>5. Build the application: from the <b>main toolbar</b>, select<strong> Build &gt;&gt; Build Solution<br /></strong><br />For build details, see <a href="http://software.intel.com/en-us/articles/use-intel-ipp-in-intel-parallel-composer/"><strong>Use Intel IPP in Intel® Parallel Composer</strong></a></p>
<p><b>How to run the application</b> </p>
<p>1. Run the application<br />From the <strong>main toolbar</strong>, select <b>Debug</b> &gt;&gt;<strong> Start Without Debugging. </strong>When the application window opens, click Open File and select LennaC1.bmp. <br /><strong><img src="http://software.intel.com/file/29994" alt="ReadLenna.JPG" title="ReadLenna.JPG" /></strong></p>
<p>2. Click the menu "Process =&gt; Resize image" to resize the image. <br /> Enter the zoom factors for the horizontal (x) and vertical (y) directions in the Resize dialog box, then click OK.  <img src="http://software.intel.com/file/29995" alt="Process.JPG" title="Process.JPG" /></p>
<p>3. Click LennaC1.bmp and repeat step 2, this time making sure to click the USE_IPP button. The result is the image below.  <strong><img src="http://software.intel.com/file/29997" alt="result1.JPG" title="result1.JPG" /></strong></p>
<p><b>IPP Function Adoption: <br /></b>Assume the sample is an application whose performance we want to improve with an IPP function.  <br />1.  Find the C code resize function in RESIZE.cpp</p>
<p>unsigned long C_Code_Resize(unsigned char * src, int srcWidth, int srcHeight,   int srcStep, unsigned char* dst, int dstWidth, int dstHeight, int dstStep, double m_zoom_x, double m_zoom_y, int interpolation)</p>
<p> It is called by the function ProcessImage(CSampleDoc *pSrc) in ippiAddC.cpp<br /><br />2. Check the IPP manual, ippiman.pdf, and find that the function ippiResizeSqrPixel has the same functionality.  Then replace the C function with the IPP function.   <br />Declare a similar function in RESIZE.cpp<br />unsigned long IPP_Resize( void* src, int srcWidth, int srcHeight,int srcStep,  void* dst,  int dstWidth, int dstHeight, int dstStep, double m_nzoom_x, double m_nzoom_y, int interpolation)</p>
<p align="left"> Call it in ProcessImage(CSampleDoc *pSrc) in ippiAddC.cpp instead of calling C_Code_Resize().  (To compare the performance, we keep the C function call here.)</p>
<p> if (m_USE_IPP)<br />{<br />             ippStaticInit();<br />       //---- run the IPP function code to resize an image  -----//<br />         run_time = IPP_Resize(pSrc-&gt;DataPtr(),pSrc-&gt;Width(),pSrc-&gt;Height(),pSrc-&gt;Step(),(Ipp8u*)pDst-&gt;DataPtr(),        pDst-&gt;Width(),pDst-&gt;Height(),pDst-&gt;Step(),m_zoom_x,m_zoom_y,m_Interpolation);<br />}<br />else{         //---- run the C code to resize an image  -----//<br />         run_time = C_Code_Resize((unsigned char *)pSrc-&gt;DataPtr(),pSrc-&gt;Width(),<br />         pSrc-&gt;Height(),pSrc-&gt;Step(), (unsigned char *)pDst-&gt;DataPtr(), pDst-&gt;Width(),pDst-&gt;Height(),pDst-&gt;Step(),m_zoom_x,m_zoom_y,m_Interpolation);<br />}     <br /><br />3. Write the IPP code to replace the C code.  The table shows the original C code and the IPP code. </p>
<p>
<table width="588" cellpadding="0" cellspacing="0" border="1">
<tbody>
<tr>
<td width="284" valign="top">
<p>The C code</p>
</td>
<td width="304" valign="top">
<p>The IPP code</p>
</td>
</tr>
<tr>
<td width="284" valign="top">
<p>unsigned long C_Code_Resize(unsigned char * src, int srcWidth, int srcHeight,int srcStep, unsigned char* dst, int dstWidth, int dstHeight, int dstStep, double m_zoom_x, double m_zoom_y, int interpolation)</p>
<p align="left">{//---------- Perform 1 order linear ---<br />     //define record time variable<br />     unsigned long start_clock,stop_clock;    start_clock = RUNTIME;</p>
<p align="left">     const unsigned char *tmpSrc;<br />    unsigned char *tmpRef;<br />    int width = srcWidth;<br />    int height = srcHeight;<br />    double xInv = 1.0 /  m_zoom_x;<br />    double yInv = 1.0 /  m_zoom_y;</p>
<p align="left">    int colInd, rowInd;<br />    int i, j, xSrc0, xSrc1, ySrc0, ySrc1, wdroi, hdroi;<br />    int idxl, idyt, icol, jrow;<br />    double row, col;<br />    double y1, y2, y3, y4, v, v1, v2, tempV,tempV2;</p>
<p align="left">     idxl=0;<br />     idyt=0; <br />    wdroi = dstWidth;<br />    hdroi = dstHeight;</p>
<p align="left">     tmpSrc = src;<br />for(int kloop=0;kloop&lt;LOOP;kloop++) </p>
<p align="left">{  <br />  tmpRef = dst ;<br />    for (j = 0, jrow = idyt; j &lt; hdroi; j++, jrow++) {         row = (jrow + 0.5) * yInv - 0.5;</p>
<p align="left">        rowInd = (int)floor(row);<br />        ySrc0 = ts_iGetCoord_vs(rowInd, rowInd,  0, srcHeight, srcHeight);<br />        ySrc1 = ts_iGetCoord_vs(rowInd, rowInd + 1, 0, srcHeight, srcHeight);<br />        for (i = 0, icol = idxl; i &lt; wdroi; i++, icol++) { <br />            col = (icol + 0.5) * xInv - 0.5;<br />            colInd = (int)floor(col);<br />            xSrc0 = ts_iGetCoord_vs(colInd, colInd,   0, srcWidth, srcWidth);<br />            xSrc1 = ts_iGetCoord_vs(colInd, colInd + 1, 0, srcWidth, srcWidth);<br />            y1 = (double)tmpSrc[ySrc0 * srcStep + xSrc0];<br />            y2 = (double)tmpSrc[ySrc0 * srcStep + xSrc1];<br />            y3 = (double)tmpSrc[ySrc1 * srcStep + xSrc0];<br />            y4 = (double)tmpSrc[ySrc1 * srcStep + xSrc1];<br />            ts_iLinearCalcSP_vs(col + 0.5, colInd + 0.5, colInd + 1.5, y1, y2, &amp;v1);<br />            ts_iLinearCalcSP_vs(col + 0.5, colInd + 0.5, colInd + 1.5, y3, y4, &amp;v2);<br />            ts_iLinearCalcSP_vs(row + 0.5, rowInd + 0.5, rowInd + 1.5, v1, v2, &amp;v);<br />            //(ts_isaturate_vs(v);<br />            tempV = (int)(v + EXP + 0.5);<br />            tmpRef[i] =(unsigned char)((tempV &gt; 255) ? 255 : (tempV &lt; 0) ? 0 : tempV);<br />        }<br />        tmpRef += dstStep;<br />  }  <br />}</p>
<p align="left">     stop_clock = RUNTIME;</p>
<p align="left">     int mhz;</p>
<p align="left">    ippGetCpuFreqMhz(&amp;mhz);</p>
<p align="left">     return (stop_clock - start_clock)/mhz/LOOP;</p>
<p>}</p>
</td>
<td width="304" valign="top">
<p align="left">unsigned long IPP_Resize(void* src, int srcWidth, int srcHeight,int srcStep,  void* dst,  int dstWidth, int dstHeight, int dstStep,   double m_nzoom_x, double m_nzoom_y, int interpolation)</p>
<p align="left">  {</p>
<p align="left">      //   define record time variable<br />    unsigned long start_clock,stop_clock;     start_clock= RUNTIME;</p>
<p align="left"> // define IPP function parameter</p>
<p align="left">     IppiRect srcRoi = {0,0, srcWidth, srcHeight};</p>
<p align="left">     IppiRect dstRoi={0,0, dstWidth,dstHeight};</p>
<p align="left"> </p>
<p align="left">     IppiSize srcSize = {srcWidth, srcHeight};</p>
<p align="left">    IppiSize dstSize = {dstWidth, dstHeight};</p>
<p align="left"> </p>
<p align="left">     int BufferSize;</p>
<p align="left">     ippiResizeGetBufSize(srcRoi, dstRoi, 1, interpolation, &amp;BufferSize);</p>
<p align="left">     Ipp8u* pBuffer=ippsMalloc_8u(BufferSize);</p>
<p align="left"> <br /><br />     for(int i=0;i&lt;LOOP;i++)    </p>
<p align="left">     //---------- Perform IPP function:ippiResizeSqrPixel_8u_C1R  -------------------------------------------//</p>
<p align="left">     ippiResizeSqrPixel_8u_C1R((Ipp8u*)src, srcSize, srcStep, srcRoi, (Ipp8u*)dst, dstStep, dstRoi, m_nzoom_x,m_nzoom_y,0, 0, interpolation, pBuffer);</p>
<p align="left">    ippsFree(pBuffer);<br />    stop_clock = RUNTIME;<br />      int mhz;<br />    ippGetCpuFreqMhz(&amp;mhz);<br />     return (stop_clock - start_clock)/mhz/LOOP;<br />}</p>
</td>
</tr>
</tbody>
</table>
</p>
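<p>To make the structure of the C path easier to follow, here is a simplified, self-contained sketch of 8-bit single-channel bilinear resizing in standard C++. It is not the sample's code: the names resize_bilinear and clamp_index are hypothetical, the border handling (clamping to the nearest valid pixel) stands in for ts_iGetCoord_vs, whose behavior the article does not show, and the zoom factors are derived from the source and destination sizes rather than passed in.</p>

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: clamp a source index to [0, size - 1].
// Assumed stand-in for the sample's border handling.
static int clamp_index(int idx, int size) {
    return std::max(0, std::min(idx, size - 1));
}

// Simplified bilinear resize of a tightly packed 8-bit, one-channel image.
std::vector<unsigned char> resize_bilinear(const std::vector<unsigned char>& src,
                                           int srcWidth, int srcHeight,
                                           int dstWidth, int dstHeight) {
    std::vector<unsigned char> dst(static_cast<std::size_t>(dstWidth) * dstHeight);
    const double xInv = static_cast<double>(srcWidth) / dstWidth;   // 1 / zoom_x
    const double yInv = static_cast<double>(srcHeight) / dstHeight; // 1 / zoom_y

    for (int j = 0; j < dstHeight; ++j) {
        // Map the destination pixel centre back into source coordinates.
        const double row = (j + 0.5) * yInv - 0.5;
        const int rowInd = static_cast<int>(std::floor(row));
        const double fy = row - rowInd;  // vertical blend weight
        const int y0 = clamp_index(rowInd, srcHeight);
        const int y1 = clamp_index(rowInd + 1, srcHeight);

        for (int i = 0; i < dstWidth; ++i) {
            const double col = (i + 0.5) * xInv - 0.5;
            const int colInd = static_cast<int>(std::floor(col));
            const double fx = col - colInd;  // horizontal blend weight
            const int x0 = clamp_index(colInd, srcWidth);
            const int x1 = clamp_index(colInd + 1, srcWidth);

            // Blend horizontally on the two rows, then vertically.
            const double top = src[y0 * srcWidth + x0] * (1.0 - fx)
                             + src[y0 * srcWidth + x1] * fx;
            const double bottom = src[y1 * srcWidth + x0] * (1.0 - fx)
                                + src[y1 * srcWidth + x1] * fx;
            const double v = top * (1.0 - fy) + bottom * fy;

            // Round to nearest and saturate to the 8-bit range.
            const double rounded = std::min(255.0, std::max(0.0, v + 0.5));
            dst[static_cast<std::size_t>(j) * dstWidth + i] =
                static_cast<unsigned char>(rounded);
        }
    }
    return dst;
}
```

<p>The per-pixel work is two floor operations, four loads, and three linear blends; this is the kind of inner loop that the IPP function replaces with a single library call.</p>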
<p> </p>
<p><b>Performance Gain</b> </p>
<p>On one test machine (Core 2 Quad 2.66GHz), the result image shows a performance gain of 15654/353 = <strong>44x</strong>.</p>
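<p>The arithmetic behind these numbers can be sketched as follows. Both helper names are hypothetical, and the sketch assumes that RUNTIME in the sample reads a CPU tick counter, so that dividing a tick difference by the clock frequency in MHz gives microseconds; the article does not define RUNTIME or state the units of 15654 and 353.</p>

```cpp
// Hypothetical helper mirroring the sample's return expression
// (stop_clock - start_clock) / mhz / LOOP. Under the assumption that the
// inputs are CPU ticks, ticks / MHz yields microseconds; dividing by the
// loop count gives the average per call.
unsigned long average_microseconds(unsigned long start_ticks,
                                   unsigned long stop_ticks,
                                   int mhz, int loops) {
    return (stop_ticks - start_ticks) / mhz / loops;
}

// Hypothetical helper: speedup is the baseline time divided by the
// optimized time, both in the same unit.
double speedup(double baseline_time, double optimized_time) {
    return baseline_time / optimized_time;
}
```

<p>With the two timings reported above, speedup(15654, 353) is a little over 44, which the article rounds to 44x.</p>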
<p>This test links against the serial IPP static library.  ippiResize is threaded in the dynamic library and in the threaded IPP static library; with multithreading enabled, the performance gain is more than <strong>50x</strong> (depending on the number of cores and the image size).<br /><br /><strong>Conclusion<br /></strong>Intel® Parallel Studio 2011 provides developers with a suite of tools for easily developing parallel applications on multi-core platforms. IPP is a key component of Intel® Parallel Studio. It provides thousands of highly optimized functions that support the development of high-performance digital media applications. This article briefly describes how to replace source code with an IPP function in a Parallel Studio project and gain a speedup of over<strong> 40x</strong>.  </p>
<p>
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table>
</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/accelerate-your-application-via-ipp-image-processing-in-parallel-studio-c-code-vs-ipp-resize/</link>
      <pubDate>Sun, 29 Aug 2010 09:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/accelerate-your-application-via-ipp-image-processing-in-parallel-studio-c-code-vs-ipp-resize/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/accelerate-your-application-via-ipp-image-processing-in-parallel-studio-c-code-vs-ipp-resize/</guid>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Guided Auto-Parallel (GAP)</title>
      <description><![CDATA[ <p><b>Guided Auto-Parallel Overview</b> <br />The guided auto-parallelization feature of the Intel(R) Compiler is a tool that offers selective advice resulting in better performance of serially coded applications.<br />The advice generated by the compiler typically falls under three broad categories: <br /><br />• Advice to use local-variable: the compiler advises you to make simple source changes that are localized to a loop-nest or a routine. For example, you may receive advice to use a local-variable for the upper-bound of a loop (instead of a class member) OR to initialize a local variable unconditionally at the top of the loop-body OR to add the restrict keyword to pointer-arguments of a function definition (if it is appropriate). <br /><br />• Advice to apply pragmas: the compiler advises you to apply a new pragma on a certain loop-level if the pragma semantics can be satisfied (you have to verify this). In many cases, you may be able to apply the pragma (thus implicitly asserting new program/loop properties) that the compiler can take advantage of to perform enhanced optimizations. <br /><br />• Advice to add compiler options: the compiler advises you to add new command-line options that assert new properties.<br /><br />The advice is specific but optional; you can either implement it or reject it. To receive this advice, all you need to do is use the -guide [Linux* and Mac OS* X] or the /Qguide [Windows*] set of compiler options. The compiler does not generate any object files or executables during the guided auto-parallelization run.<br /><br />Use the -guide or /Qguide options in addition to your normally used compiler options. The compiler advice targets the optimizations enabled at the chosen optimization level. 
If you decide to take the advice suggested by the guided auto-parallelization compilation run, then make the suggested code changes or use the suggested compiler options and recompile the program, this time without using the -guide or /Qguide options. The performance of your program should improve.<br /><br />Use guided auto-parallelization along with auto-parallelization when you have serial code you wish to parallelize using the auto-parallelization options [-parallel or /Qparallel] and also wish to get advice on further parallelizing opportunities that the guided auto-parallelization may suggest. <br /><br />Use guided auto-parallelization without enabling auto-parallelization when you are interested in improving the performance of your single-threaded code or when you want to improve the performance of your applications with explicit threading without relying on the compiler for auto-parallelization.<br /><br /><br /><strong>Preparing the project to run Guided Auto Parallel (GAP)</strong><br />1) Convert project to use Intel C++ projects<br />2) Change configuration to “Release”. <br />       a. GAP only works with /O2 or higher optimization.<br />3) After conversion go to menu -&gt; Build -&gt; Clean All</p>
<p><img height="691" width="493" src="http://software.intel.com/file/28954" alt="Figure 1: Project Conversion" /><br /><br />Figure 1:  Convert to using Intel compiler project.<br /><br /><br /><b>Running Guided Auto Parallel (GAP)</b><br />There are several ways to invoke Guided Auto Parallel (GAP) in the IDE, depending on whether you want analysis for the whole solution, the project, a single file, a function, or a range of source lines. For the purpose of this tutorial, we will use single file analysis.<br /><br />1) Select scalar_dep.cpp, right click -&gt; Intel C++ Composer XE -&gt; Guide Auto Parallel -&gt; Run Analysis on file “scalar_dep.cpp”<br />a. Click Run Analysis in the Configure Analysis dialog box.<br /><br /><br /><br /><img height="725" width="944" src="http://software.intel.com/file/28956" /><br /><br />Figure 2:  Run Guided Auto-Parallel Analysis<br /><br /><br /><img height="496" width="581" src="http://software.intel.com/file/28957" /><br /><br /><br />Figure 3:  Configuring Analysis<br /><br /><br /><b>Viewing the results from Guided Auto Parallel (GAP)</b><br />The output generated by GAP analysis can be viewed in the standard Output Window of the IDE, or in the “Error List” window filtered by “Messages”.  Note that GAP messages in the standard Output Window are enclosed between “GAP REPORT LOG OPENED” and “END OF GAP REPORT LOG”.<br /><br /><br /><img height="558" width="924" src="http://software.intel.com/file/28958" /><br /><br />Figure 4 GAP messages in IDE standard Output Window.<br /><br /><br /><br /><img height="305" width="1280" src="http://software.intel.com/file/28959" /><br /><br />Figure 5 GAP message in Error List window filtered by Messages.<br /><br /><br />Users can also redirect GAP output to a file. 
To output GAP messages to a file, check the box “Send remarks to a file”.<br /><br /><img height="499" width="585" src="http://software.intel.com/file/28960" /><br /><br />Figure 6 Add option to output GAP messages to a file.<br /><br />Note that GAP messages will not be available in the IDE standard Output Window or Error List Window if this option is enabled.</p>
<br /><b>Analyzing GAP messages</b><br />Analyze the output generated by GAP analysis and determine whether the specific suggestions provided by GAP are appropriate for the source code in question. <br />For this sample tutorial, GAP generates the following output for the following loop at line 49 of scalar_dep.cpp:<br /><br />for (i=0; i&lt;n; i++) {<br />if (A[i] &gt; 0) {b=A[i]; A[i] = 1 / A[i]; }<br />if (A[i] &gt; 1) {A[i] += b;}<br />}<br /><br /><br /><i>1&gt;GAP REPORT LOG OPENED ON Tue Jun 29 12:13:54 2010<br />1&gt;<br />1&gt;remark #30761: Add -Qparallel option if you want the compiler to generate recommendations for improving auto-parallelization.<br />1&gt;C:\gap_test\test\scalar_dep.cpp(49): remark #30515: (VECT) Loop at line 49 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if<br />the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier <br />in the same iteration.<br />1&gt;Number of advice-messages emitted for this compilation session: 1.<br />1&gt;END OF GAP REPORT LOG<br /><br /></i>By default, the compiler emits remark #30761, which suggests enabling auto-parallelization in order to receive recommendations for improving it. Remark #30515 indicates that if variable b is assigned unconditionally, the compiler will be able to vectorize the loop.<br /><br />To get GAP advice for parallelization, enable parallelization (/Qparallel) and rerun the GAP analysis.<br /><br /><br /><img height="639" width="925" src="http://software.intel.com/file/28961" /><br /><br />Figure 7: Enabling Parallelization (/Qparallel).<br /><br /><br /><br />
<p><i>1&gt;GAP REPORT LOG OPENED ON Tue May 18 11:42:58 2010<br /></i><i>1&gt;<br /></i><i>1&gt;C:\test\scalar_dep.cpp(49): remark #30521: (PAR) Loop at line 49 cannot be parallelized due to conditional assignment(s) into the following variable(s): b. This loop will be parallelized if the <br />variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in <br />the same iteration.<br /></i><i>1&gt;C:\test\scalar_dep.cpp(49): remark #30525: (PAR) If the trip count of the loop at line 49 is greater than 188, then use "#pragma loop count min(188)" to parallelize this loop. [VERIFY] Make <br />sure that the loop has a minimum of 188 iterations.<br /></i><i>1&gt;Number of advice-messages emitted for this compilation session: 2.<br /></i><i>1&gt;END OF GAP REPORT LOG</i><i></i></p>
<p> </p>
<p>Remark #30521 indicates that the loop at line 49 cannot be parallelized because the variable b is conditionally assigned, and remark #30525 indicates that the loop must have a minimum of 188 iterations for the compiler to parallelize it.</p>
<p>The user needs to verify that the changes recommended by GAP are appropriate and do not change the semantics of the program.  Apply the necessary change(s) and re-compile the source file.  For this loop, we made the following changes to enable parallelization and vectorization of the loop as recommended by GAP:</p>
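<p>The [VERIFY] condition in these remarks can be reasoned about for this particular loop: b is read only when A[i] &gt; 1 after the first statement, which can happen only if A[i] was positive and b was therefore written in the same iteration. The following minimal sketch in standard C++ compares the two forms of the loop on the array contents. The function names, the use of std::vector&lt;double&gt;, and the initial value of b are assumptions; the article does not show the surrounding code.</p>

```cpp
#include <cstddef>
#include <vector>

// Loop as originally written: b is assigned only when A[i] > 0.
void loop_conditional(std::vector<double>& A) {
    double b = 0.0;  // assumed initial value; not shown in the article
    for (std::size_t i = 0; i < A.size(); ++i) {
        if (A[i] > 0) { b = A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

// Loop with b assigned unconditionally at the top of every iteration,
// as the GAP remarks suggest.
void loop_unconditional(std::vector<double>& A) {
    for (std::size_t i = 0; i < A.size(); ++i) {
        const double b = A[i];
        if (A[i] > 0) { A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}
```

<p>Both functions leave the array in the same state. One difference to keep in mind is the value of b after the loop: if later code reads b, the two forms are not interchangeable, and that is the kind of property the user has to check before accepting the advice.</p>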
<p> </p>
<p>#pragma loop count min (188)</p>
<p>for (i=0; i&lt;n; i++) {<br />b = A[i];<br />if (A[i] &gt; 0) {A[i] = 1 / A[i];}<br />if (A[i] &gt; 1) {A[i] += b;}  <br />}<br /><br />To verify that the loop was parallelized and vectorized:</p>
<p>1)   For code about which GAP issued a message, check the vectorizer report or parallel report after applying the change(s) suggested by GAP.<br />2)   Add the options /Qvec-report1  /Qpar-report1 to the Additional Linker command line options dialog box.  (Note that in Visual Studio /GL is on by default, which enables /Qipo if you are using the Intel compiler.  If /Qipo is not enabled, the options should be added to the C/C++ Additional Options instead.)<br />3)   Recompile without Guided Auto Parallel.<br /><br /><br /><img height="636" width="929" src="http://software.intel.com/file/28962" /><br /><br />Figure 8 Add /Qvec-report1 /Qpar-report1 to get parallelization and vectorization reports.<br /><br /><br />The output window will report that a function call at line 23 in main.cpp was vectorized and parallelized.  This is because /Qipo (inlining across multiple files) was enabled, and the function containing the loop in scalar_dep.cpp was inlined at the call in main.cpp, so the report attributes the vectorization and parallelization to line 23.<br /><br /><b>Conclusion</b><br />Adding parallelism to a serial application can be difficult. The Intel compiler's Guided Auto-Parallel feature provides a low-cost, effective tool for adding parallelism to your application.</p>
<br />
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/guided-auto-parallel-gap/</link>
      <pubDate>Mon, 28 Jun 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/guided-auto-parallel-gap/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/guided-auto-parallel-gap/</guid>
      <category>ISN General</category>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
  </channel></rss>
