<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Fri, 10 Feb 2012 05:25:50 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-kb/type/performance-and-optimization/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-kb/type/performance-and-optimization/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>First compile time slow down on Linux</title>
      <description><![CDATA[ <br />
<div id="art_pre_template"><b>Problem : The first compilation after a login, or after the compiler has not been run for several minutes, can take dramatically longer than subsequent compilations.<br /><br /></b><b>Environment : Red Hat Enterprise Linux and its derivatives</b><br /><br /><br /><b>Root Cause : A full lookup of multiple directories causes a timeout.</b><br /><br /><br /><b>Resolution 1 : Remove as many files and directories as you can from /tmp <br /></b>The first compilation is slow because the license manager examines every file in /tmp. This can take several seconds at first because the file information is not yet cached by the OS. To avoid long delays, remove all unnecessary files from /tmp. Alternatively, see Resolution 2 below to speed up the 'stat' operation on /tmp.<br /> <br /><br /><strong>Resolution 2 : Set your $LS_OPTIONS environment variable to --color=none -U<br /></strong>This is one of the faster ls option settings. It prevents ls from fetching all inode information unless you explicitly ask for it.<br /><br /><br /></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/first-compile-time-slow-down-on-linux/</link>
      <pubDate>Tue, 24 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/first-compile-time-slow-down-on-linux/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/first-compile-time-slow-down-on-linux/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
    </item>
    <item>
      <title>Inlining is disabled by -pg instrumentation for gprof</title>
      <description><![CDATA[ The Intel Compiler for Linux supports the option -pg. This instruments the binary to allow function level profiling using gprof. To do this, it also disables function inlining, which may result in some loss of performance. This consequence of -pg is not documented in version 12.1 of the Intel Compiler for Linux, but will be documented in future versions.<br />          For performance analysis and profiling of applications without impacting inlining, Intel(R) VTune(TM) Amplifier XE may be used. ]]></description>
      <link>http://software.intel.com/en-us/articles/inlining-is-disabled-by-pg-instrumentation-for-gprof/</link>
      <pubDate>Fri, 20 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/inlining-is-disabled-by-pg-instrumentation-for-gprof/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/inlining-is-disabled-by-pg-instrumentation-for-gprof/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
    </item>
    <item>
      <title>Performance Tools for Software Developers - Auto-parallelization and /Qpar-threshold</title>
      <description><![CDATA[ 
<table cellpadding="0" cellspacing="15" border="0">
<tbody>
<tr>
<td class="bodycopy">
<p>The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good work-sharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, as would otherwise be done by hand with OpenMP directives. Both OpenMP and auto-parallelization deliver the performance gains of shared memory on multiprocessor IA-32 and Intel 64 systems.</p>
<p>The following options control auto-parallelization:</p>
<blockquote><b>/Qparallel:</b><br />Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. <br /><br /><b>/Qpar-threshold:n</b><br />This option sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. To use this option, you must also specify -parallel (Linux and Mac OS X) or /Qparallel (Windows). The default is /Qpar-threshold:100.</blockquote>
<p>This option is useful for loops whose computation work volume cannot be determined at compile-time. The threshold is usually relevant when the loop trip count is unknown at compile-time.</p>
<p>The compiler applies a heuristic that tries to balance the overhead of creating multiple threads versus the amount of work available to be shared amongst the threads.</p>
<p>Here <i>n</i> is an integer from 0 through 100 that sets the threshold for the auto-parallelization of loops. If <i>n</i> is 0, loops are always auto-parallelized, regardless of computation work volume. If <i>n</i> is 100, loops are auto-parallelized only when the compiler's analysis predicts that profitable parallel execution is almost certain. The intermediate values 1 to 99 represent the percentage probability of a profitable speed-up. For example, <i>n</i>=50 directs the compiler to parallelize a loop only if there is a 50% probability that the code will speed up when executed in parallel.</p>
<p>Also, to be "100%" sure that a loop will benefit from parallelization, the compiler needs to know the iteration count at compile time. For a "99%" or lower threshold, knowing the iteration count at compile time is not a requirement.</p>
<p>This leads to a big difference in the number of loops parallelized at 99% compared to 100%. For many apps, 99% is a better setting, but for some apps with a lot of short loops, 99% will slow them down.</p>
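<p>As a rough illustration of the kind of loop the threshold affects, the sketch below uses a trip count <i>n</i> that is known only at run time. The function name scale_add and the file name scale_add.c are hypothetical and are not part of the original sample; whether the compiler actually parallelizes the loop depends on its own analysis.</p>
<pre name="code" class="cpp">// scale_add.c (hypothetical file name, for illustration only)
#include &lt;stddef.h&gt;

// Adds a scaled copy of b to a: a[i] += factor * b[i].
// The trip count n is supplied by the caller at run time, so the
// compiler cannot know the amount of work in the loop at compile time.
void scale_add(double *a, const double *b, double factor, size_t n)
{
    size_t i;
    for (i = 0; i &lt; n; i++) {
        a[i] += factor * b[i];
    }
}
</pre>
<p>One way to observe the effect would be to compile such a file once with /Qparallel /Qpar-report3 /Qpar-threshold:100 and once with /Qpar-threshold:99, then compare the two reports.</p>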
<p>The following example, int_sin.c, is not auto-parallelized when compiled with /Qpar-threshold:100 using the command line below:</p>
<blockquote>C: &gt;icl -c /Qparallel /Qpar-report3 /Qpar-threshold:100 int_sin.c
<p>If we use /Qpar-threshold:99 then it is parallelized.</p>
<p><b>Example:</b></p>
<pre name="code" class="cpp">// int_sin.c
// Intel C++ compiler sample program

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;
#include &lt;mathimf.h&gt;

// Function to be integrated
// Define and prototype it here
// | sin(x) |
#define INTEG_FUNC(x) fabs(sin(x))

// Prototype timing function
double dclock(void);

int main(void)
{
    // Loop counters and number of interior points
    unsigned int i, j, N;
    // Stepsize, independent variable x, and accumulated sum
    double step, x_i, sum;
    // Timing variables for evaluation
    double start, finish, duration;
    // Start integral from
    double interval_begin = 0.0;
    // Complete integral at
    double interval_end = 2.0 * 3.141592653589793238;

    // Start timing for the entire application
    start = clock();

    printf("\n");
    printf(" Number of | Computed Integral | \n");
    printf(" Interior Points | | \n");
    for (j = 2; j &lt; 10; j++)
    {
        printf("------------------------------------- \n");

        // Compute the number of (internal rectangles + 1)
        N = 1 &lt;&lt; j;

        // Compute stepsize for N-1 internal rectangles
        step = (interval_end - interval_begin) / N;

        // Approx. 1/2 area in first rectangle: f(x0) * [step/2]
        sum = INTEG_FUNC(interval_begin) * step / 2.0;

        // Apply midpoint rule:
        // Given length = f(x), compute the area of the
        // rectangle of width step
        // Sum areas of internal rectangle: f(xi + step) * step

        for (i = 1; i &lt; N; i++)
        {
            x_i = i * step;
            sum += INTEG_FUNC(x_i) * step;
        }

        // Approx. 1/2 area in last rectangle: f(xN) * [step/2]
        sum += INTEG_FUNC(interval_end) * step / 2.0;

        printf(" %10u | %14e | \n", N, sum);
    }
    finish = clock();
    duration = (finish - start);
    printf("\n");
    printf(" Application Clocks = %10e \n", duration);
    printf("\n");
    return 0;
}
</pre>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td><img height="5" width="388" src="http://software.intel.com/file/6324" /></td>
</tr>
<tr>
<td height="10"></td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/</link>
      <pubDate>Sun, 23 Jan 2011 10:30:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>How to manually target 2nd generation Intel Core processors with support for Intel AVX</title>
      <description><![CDATA[ <p><br /><strong>Product :</strong> Intel C++ Composer XE<br /><br /><strong>Version :</strong> 2011  (contains Intel C++ Compiler 12.0 or 12.1)<br /><br /><br />Manual processor dispatch allows you to write one or more versions of a function that will run only on specified types of Intel processors. The Intel processor type is detected at runtime, and the corresponding function version is executed. This feature is available only for Intel processors of IA-32 or Intel 64 architecture. It is not available for non-Intel processors or for Intel processors of IA-64 architecture. Applications built with the manual processor dispatch feature may be more highly optimized for Intel processors than for non-Intel processors.<br /><br />The  <strong>__declspec(cpu_dispatch(cpuid,cpuid,…))</strong>  syntax is used to provide a list of targeted processors along with an empty function body (i.e., a function stub). The <strong>__declspec(cpu_specific(cpuid))</strong> syntax is used to declare each function version that is targeted at a particular type or types of processor.<br /><br />The following table lists possible values for cpuid (names are not case-sensitive):<br /><br />
<table width="100%" cellpadding="0" cellspacing="0" border="1">
<thead>
<tr>
<td width="24%" valign="top">
<p align="center"><b>Argument for cpuid</b></p>
</td>
<td width="75%" valign="top">
<p align="center"><b>Processors</b></p>
</td>
</tr>
</thead>
<tbody>
<tr>
<td width="24%" valign="top">
<p>core_2nd_gen_avx</p>
</td>
<td width="75%" valign="top">
<p>2nd generation Intel® Core<sup>TM</sup> processor family with support for Intel® Advanced Vector Extensions (Intel® AVX).</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_aes_pclmulqdq</p>
</td>
<td width="75%" valign="top">
<p>Intel® Core<sup>TM</sup>  processors with support for Advanced Encryption Standard (AES) instructions and carry-less multiplication instruction</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_i7_sse4_2</p>
</td>
<td width="75%" valign="top">
<p>Intel® Core<sup>TM</sup>  processor family with support for Intel® SSE4 Efficient Accelerated String and Text Processing instructions  (SSE4.2)</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>atom</p>
</td>
<td width="75%" valign="top">
<p>Intel® Atom<sup>TM</sup> processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_2_duo_sse4_1</p>
</td>
<td width="75%" valign="top">
<p>Intel® 45nm Hi-k next generation Intel® Core<sup>TM</sup> microarchitecture processors with support for Intel® SSE4 Vectorizing Compiler and Media Accelerators instructions (SSE4.1)</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>core_2_duo_ssse3</p>
</td>
<td width="75%" valign="top">
<p>Intel® Core<sup>TM</sup>2 Duo processors and Intel® Xeon® processors with Intel® Supplemental Streaming SIMD Extensions 3 (SSSE3)</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_4_sse3</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3), Intel® Core<sup>TM</sup> Duo processors, Intel® Core<sup>TM</sup> Solo processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_4</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium 4 processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_m</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium M processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>pentium_iii</p>
</td>
<td width="75%" valign="top">
<p>Intel® Pentium III processors</p>
</td>
</tr>
<tr>
<td width="24%" valign="top">
<p>generic</p>
</td>
<td width="75%" valign="top">
<p>Other IA-32 or Intel 64 processors or compatible  processors not provided by Intel Corporation</p>
</td>
</tr>
</tbody>
</table>
<br /><br />If no other matching Intel processor type is detected, the “generic” version of the function will be executed. If the program is intended to execute on non-Intel processors, a “generic” function version must be provided. The degree of optimization of the generic function version and the processor features that it assumes are under the control of the programmer.<br /><br />The following framework illustrates how the <strong>cpu_dispatch</strong> and <strong>cpu_specific</strong> keywords might be used to create function versions for the 2nd generation Intel Core processor family, for the Intel Core processor family, for the Intel Core 2 Duo processor family, and for other Intel and compatible, non-Intel processors. Each processor-specific function body might contain processor-specific intrinsic functions, or it might be placed in a separate source file and compiled with a processor-specific compiler option. See <a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/">http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/</a> for more details of such options.</p>
<pre name="code" class="cpp">#include &lt;stdio.h&gt;

// need to create specific function versions for the following processors:
__declspec(cpu_dispatch(generic, core_2_duo_ssse3, core_i7_sse4_2, core_2nd_gen_avx))
void dispatch_func() {};      //  stub that will call the appropriate specific function version

__declspec(cpu_specific(generic))
void dispatch_func() {
printf("\nCode for non-Intel processors and generic Intel processors goes here\n");
}

__declspec(cpu_specific(core_2_duo_ssse3))
void dispatch_func() {
printf("\nCode for Intel Core 2 Duo processors with support for SSSE3 goes here\n");
}

__declspec(cpu_specific(core_i7_sse4_2))
void dispatch_func() {
printf("\nCode for Intel Core processors with support for SSE4.2 goes here\n");
}

__declspec(cpu_specific(core_2nd_gen_avx))
void dispatch_func() {
printf("\nCode for 2nd generation Intel Core processors goes here\n");
}

int main() {
dispatch_func();
printf("Return from dispatch_func\n");
return 0;
}
</pre>
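<p>Conceptually, the dispatch stub selects the function version written for the detected processor and falls back to the generic one otherwise. The portable sketch below imitates that selection in plain C with a table of function pointers. It is not how the Intel compiler implements dispatch; the names cpu_kind, versions and select_version are hypothetical and are introduced only for illustration.</p>
<pre name="code" class="cpp">#include &lt;stddef.h&gt;

// Hypothetical processor kinds; in the real feature, detection is
// performed at run time by code that the compiler generates.
enum cpu_kind { CPU_GENERIC, CPU_SSSE3, CPU_SSE4_2, CPU_AVX, CPU_KIND_COUNT };

typedef const char *(*version_fn)(void);

static const char *generic_version(void) { return "generic"; }
static const char *avx_version(void)     { return "core_2nd_gen_avx"; }

// One slot per processor kind; NULL means no specific version exists.
static const version_fn versions[CPU_KIND_COUNT] = {
    generic_version,  // CPU_GENERIC
    NULL,             // CPU_SSSE3
    NULL,             // CPU_SSE4_2
    avx_version       // CPU_AVX
};

// Returns the version for the detected kind, or the generic version
// when no specific one was provided.
version_fn select_version(enum cpu_kind detected)
{
    version_fn f = versions[detected];
    return f != NULL ? f : generic_version;
}
</pre>
<p>In this sketch, select_version(CPU_AVX) yields the AVX-specific function, while select_version(CPU_SSE4_2) yields the generic one because no SSE4.2 version was registered.</p>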
<p><br /><br />
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table>
</p>
<p> </p>
<p><i>[DISCLAIMER: The information on this web site is intended for hardware system manufacturers and software developers. Intel does not warrant the accuracy, completeness or utility of any information on this site. Intel may make changes to the information or the site at any time without notice. Intel makes no commitment to update the information at this site. ALL INFORMATION PROVIDED ON THIS WEBSITE IS PROVIDED "as is" without any express, implied, or statutory warranty of any kind including but not limited to warranties of merchantability, non-infringement of intellectual property, or fitness for any particular purpose. Independent companies manufacture the third-party products that are mentioned on this site. Intel is not responsible for the quality or performance of third-party products and makes no representation or warranty regarding such products. The third-party supplier remains solely responsible for the design, manufacture, sale and functionality of its products. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.]</i></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors/</link>
      <pubDate>Thu, 13 Jan 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processors/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Step-by-Step Application Performance Tuning with Intel Compilers</title>
      <description><![CDATA[ <span class="sectionHeading">Application Performance:  A Step-by-Step Introduction to Application Tuning with Intel® Compilers</span><br /><br /><span class="sectionBodyText">Before you begin performance tuning, you may want to check the correctness of your application by building it without optimization using /Od (Windows*) or -O0 (Linux* or Mac OS* X). In compiler versions 11 and later, all optimization levels assume support for the SSE2 instruction set by default. <br /><br /><span class="sectionHeading">1. </span>Use the general optimization options (Windows /O1, /O2 or /O3; Linux and Mac OS X -O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each. Most users should start at /O2 (-O2), the default, before trying more advanced optimizations. Next, for loop-intensive applications, try /O3 (-O3).  These options are available for both Intel® and non-Intel microprocessors but they may perform more optimizations for Intel microprocessors than they perform for non-Intel microprocessors.<br /><br /><span class="sectionHeading">2.</span> Fine-tune performance to target IA-32 and Intel 64-based systems using processor-specific options. Examples are /QxSSE4.2 (-xsse4.2) for the Intel® Core™ processor family, e.g. the Intel Core i7 processor, and /arch:SSE3 (-msse3) for compatible, non-Intel processors that support at least the SSE3 instruction set. Alternatively, you can use /QxHOST (-xhost), which will use the most advanced instruction set for the processor on which you compiled. This option is available for both Intel® and non-Intel microprocessors but it may perform more optimizations for Intel microprocessors than it performs for non-Intel microprocessors. 
For a more extensive list and description of options that optimize for specific processors or instruction sets, please see the online article “<a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/" title="SSE generation and processor-specific optimizations">Intel® compiler options for SSE generation and processor-specific optimizations</a>” and the Intel Compiler User and Reference Guides.<br /><br /><span class="sectionHeadingText">3.</span> Add interprocedural optimization (IPO), /Qipo (-ipo) and/or profile-guided optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use); then measure performance again to determine whether your application benefits from one or both of them.<br /><br /><span class="sectionHeadingText">4.</span> Optimize your application for vector and parallel execution on multi-threaded, multi-core and multi-processor systems using:<br />advice from the new Guided Auto-Parallelism (GAP) feature, /Qguide (-guide); <br />the Intel® Cilk™ Plus language extensions for C/C++;<br />the parallel performance options /Qparallel (-parallel) or /Qopenmp (-openmp);<br />the CoArray feature of Fortran 2008;<br />or by using the Intel® Performance Libraries included with the product. <br />These optimization steps are applicable to both Intel and non-Intel microprocessors, but may result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors.<br /><br /><span class="sectionHeading">5.</span> Use Intel® VTune™ Amplifier XE to help you identify serial and parallel performance “hotspots” so that you know which specific parts of your application could benefit from further tuning. Use Intel® Inspector XE to reduce the time to market for threaded applications by diagnosing memory and threading errors and speeding up the development process. 
These products cannot be used on non-Intel microprocessors.<br /></span><br />For more details, please consult the main product documentation, e.g. in the <a href="http://software.intel.com/en-us/articles/intel-software-technical-documentation/">Intel® Software Documentation Library</a>. A brief summary of the major optimization options of the Intel Compiler is available in the <a href="http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf" title="Quick-Reference Guide to Optimization with Intel® Compilers version 12">Quick-Reference Guide to Optimization with Intel® Compilers version 12</a>. ]]></description>
      <link>http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/</link>
      <pubDate>Thu, 11 Nov 2010 21:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
      <category>Intel® VTune™ Amplifier XE Knowledge Base</category>
    </item>
    <item>
      <title>A Guide to Auto-vectorization with Intel® C++ Compilers</title>
      <description><![CDATA[ <div class="sectionHeading">Introduction</div>
<br />The goal of this Guide is to provide guidelines for enabling compiler auto-vectorization with the Intel® C++ Compilers.  This document  is aimed at C/C++ programmers working on systems based on Intel® processors or compatible, non-Intel processors that support SIMD instructions such as Intel® Streaming SIMD Extensions (Intel® SSE).  This includes Intel 64 and most IA-32 systems, but excludes systems based on Intel® Itanium® processors.  The examples presented refer to Intel SSE, but many of the principles apply also to other SIMD instruction sets.  While the examples used are specific to C++ programs, many of the concepts discussed are equally applicable to Fortran programs.<br /><br /><a href="http://software.intel.com/file/38565/">Click here to continue reading the article.</a><br /><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/</link>
      <pubDate>Mon, 08 Nov 2010 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>New fast basic random number generator SFMT19937 in Intel MKL</title>
      <description><![CDATA[ <br /><br />Intel MKL 10.3 introduced a new basic generator: <strong>SFMT19937</strong>, a SIMD-oriented Fast Mersenne Twister pseudorandom number generator.<br /><br /><strong>SFMT19937</strong> is analogous to the Mersenne Twister (MT) basic generators, but it can take advantage of SIMD instructions to provide a fast implementation on processors that support them. <br /><br /><br />To learn more about the SFMT algorithm, see the article below.<br /><br /><em>Saito, M., and Matsumoto, M. SIMD-oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator. Monte Carlo and Quasi-Monte Carlo Methods 2006, Springer, Pages 607 – 622, 2008.<br /></em><a href="http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html"><em>http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html</em></a><br /><br /><br />The following is an example application using Intel MKL SFMT19937:<br /><br /><br />
<pre name="code" class="cpp">#include &lt;stdio.h&gt;
#include "mkl_vsl.h"

int main()
{
   double r[1000]; /* buffer for random numbers */
   double s; /* average */
   VSLStreamStatePtr stream;
   int i, j;

   /* Initializing */
   s = 0.0;
   vslNewStream( &amp;stream, VSL_BRNG_SFMT19937, 777 );

   /* Generating: 10 batches of 1000 Gaussian numbers (mean 5.0, sigma 2.0) */
   for ( i=0; i&lt;10; i++ )
   {
      vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, 1000, r, 5.0, 2.0 );
      for ( j=0; j&lt;1000; j++ )
      {
         s += r[j];
      }
   }
   s /= 10000.0;

   /* Deleting the stream */
   vslDeleteStream( &amp;stream );

   /* Printing results */
   printf( "Sample mean of normal distribution = %f\n", s );

   return 0;
}
</pre>
<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/new-fast-basic-random-number-generator-sfmt19937-in-intel-mkl/</link>
      <pubDate>Sat, 06 Nov 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/new-fast-basic-random-number-generator-sfmt19937-in-intel-mkl/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/new-fast-basic-random-number-generator-sfmt19937-in-intel-mkl/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Denormal paths speedup in VML by setting FTZ/DAZ setting</title>
      <description><![CDATA[ <p>Starting with Intel® MKL 10.3, the VML mode setting, which already controls accuracy, also accepts a new FTZ/DAZ (flush-to-zero / denormals-are-zero) flag.</p>
<p>Users can turn this setting on or off with VML_FTZDAZ_ON or VML_FTZDAZ_OFF (the default), either globally through vmlSetMode() or for a single call through the mode argument of VML functions such as vmdExp().</p>
<p>VML_FTZDAZ_ON mode speeds up computations that involve denormalized numbers, at the cost of some accuracy for values in or near the denormal range.</p>
<p>Enabling this mode changes the numerical behavior of the functions: denormalized input values may be treated as zeros, and denormalized results may be flushed to zero. Accuracy loss may occur if input and/or output values are close to the denormal range.</p>
<p>Usage example (set the mode globally, or pass it to a single call):</p>
<p>vmlSetMode( VML_LA | VML_FTZDAZ_ON);</p>
<p>vmdExp(1000, a, r, VML_LA | VML_FTZDAZ_ON);</p>
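<p>To make the numerical effect described above concrete, the following is a minimal sketch in portable C that emulates DAZ (denormalized inputs treated as zero) and FTZ (denormalized results flushed to zero) in software. It does not call Intel MKL and is not how VML implements the mode; the names flush_subnormal and scale_ftzdaz_emulated are hypothetical and exist only for this illustration.</p>

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Returns x with subnormal (denormalized) values replaced by a zero of the
   same sign. Zeros, normal numbers, infinities and NaNs pass through. */
double flush_subnormal(double x)
{
    if (x != 0.0 && x > -DBL_MIN && x < DBL_MIN)
        return (x < 0.0) ? -0.0 : 0.0;
    return x;
}

/* Hypothetical helper, not an MKL function: computes r[i] = a[i] * factor
   with emulated DAZ on each input and emulated FTZ on each result. */
void scale_ftzdaz_emulated(size_t n, const double *a, double factor, double *r)
{
    assert(n == 0 || (a != NULL && r != NULL));
    for (size_t i = 0; i < n; i++) {
        double in = flush_subnormal(a[i]);    /* DAZ: subnormal input -> 0  */
        r[i] = flush_subnormal(in * factor);  /* FTZ: subnormal result -> 0 */
    }
}
```

<p>With this emulation, the input DBL_MIN / 4.0 scaled by 1024.0 yields 0.0, whereas plain IEEE 754 arithmetic would yield DBL_MIN * 256.0. This is the kind of accuracy loss near the denormal range that the mode trades for speed.</p>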
<br /><br /><br />
<p>
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/denormal-paths-speedup-in-vml-by-setting-ftzdaz-setting/</link>
      <pubDate>Sat, 06 Nov 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/denormal-paths-speedup-in-vml-by-setting-ftzdaz-setting/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/denormal-paths-speedup-in-vml-by-setting-ftzdaz-setting/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Intel® AVX optimization in Intel® MKL</title>
      <description><![CDATA[ Intel® AVX (Intel® Advanced Vector Extensions) is the next step in the evolution of Intel processors. Intel® MKL has included Intel® AVX optimizations since Intel MKL 10.2; in version 10.2, however, users had to call mkl_enable_instructions() to activate the Intel AVX code. Starting with Intel MKL 10.3, the Intel AVX code is dispatched automatically and requires no special activation. In Intel MKL 10.3, Intel AVX optimization has been extended to DGEMM/SGEMM, radix-2 complex-to-complex FFTs, most real VML functions, and the VSL distribution generators.<br /><br />The following cases illustrate the speedups that Intel MKL 10.3 can achieve on Intel AVX-enabled processors running an Intel AVX-enabled operating system, relative to the Intel® Xeon® Processor 6000 and 7000 Sequence (Server):<br /><br />Intel AVX DGEMM (M, N, K = 8K x 4K x 128) delivers 1.8x the performance of the Intel® Xeon® Processor 6000 and 7000 Sequence (Server). <br /><br />Intel AVX DGEMM/SGEMM achieves 88-90% of machine peak.<br /><br />The Intel AVX/NHM speedup is 1.8x for radix-2 1D cluster FFTs with N=1024.<br /><br />The Intel® Optimized LINPACK benchmark, using Intel AVX optimizations, performs over 1.86x (or over 80% overall efficiency) on 4 cores with N=20000.<br /><br /><br />
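For readers who want to relate a measured DGEMM run to a "percent of machine peak" figure like the one above, the usual arithmetic can be sketched as follows. This is an illustration only: the helper names gemm_flops and fraction_of_peak are hypothetical, the conventional count of 2*M*N*K floating-point operations per GEMM is assumed, and the run time and peak rate must come from your own measurement and hardware specification.<br /><br />

```c
#include <assert.h>

/* Conventional operation count for C = alpha*A*B + beta*C with an
   M x K matrix A and a K x N matrix B: one multiply and one add per
   inner-product term, i.e. 2*M*N*K. */
double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Fraction of machine peak: achieved rate (flops / seconds) divided by
   the peak rate in floating-point operations per second. */
double fraction_of_peak(double flops, double seconds, double peak_flops_per_sec)
{
    assert(seconds > 0.0 && peak_flops_per_sec > 0.0);
    return (flops / seconds) / peak_flops_per_sec;
}
```

Assuming "8K x 4K x 128" means M=8192, N=4096, K=128, gemm_flops(8192, 4096, 128) is 2^33, or about 8.59 billion operations; dividing by the measured run time and by the machine's peak rate gives the efficiency.<br /><br />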
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-avx-optimization-in-intel-mkl-v103/</link>
      <pubDate>Wed, 03 Nov 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-avx-optimization-in-intel-mkl-v103/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-avx-optimization-in-intel-mkl-v103/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Information about the FTC Decision and Order on the Intel® Compilers Reimbursement Fund</title>
      <description><![CDATA[ Information on the Intel Compiler Reimbursement Fund referenced in Section VII.D of the FTC Decision and Order is available now. Please see the site, <a href="http://www.CompilerReimbursementProgram.com">www.CompilerReimbursementProgram.com</a>, for further information. ]]></description>
      <link>http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/</link>
      <pubDate>Mon, 01 Nov 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Software Development Tool Suites for Intel® Atom™ Processor Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    </item>
  </channel></rss>
