<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sun, 12 Feb 2012 07:59:28 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-fortran-compiler-for-linux-kb/type/performance-and-optimization/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/intel-fortran-compiler-for-linux-kb/type/performance-and-optimization/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>First compile time slow down on Linux</title>
      <description><![CDATA[ <br />
<div id="art_pre_template"><b>Problem : The first time the compiler is run after a login, or after not being run for several minutes, compilation can take dramatically longer than subsequent compilations.<br /><br /></b><b>Environment : Red Hat Enterprise Linux and its derivatives</b><br /><br /><br /><b>Root Cause : A full lookup of multiple directories causes a timeout.</b><br /><br /><br /><b>Resolution 1 : Remove as many files and directories as you can from /tmp <br /></b>The slowness of the first compilation is due to the license manager examining every file in /tmp. This can initially take several seconds because this information is not yet cached by the OS. To avoid long delays, remove all unnecessary files from /tmp. Alternatively, see Resolution 2 below to improve the speed of the 'stat' operation on /tmp.<br /> <br /><br /><strong>Resolution 2 : Modify your $LS_OPTIONS environment variable to --color=none -U<br /></strong>This is one of the faster ls option settings. It prevents ls from retrieving all inode information unless you explicitly request it.<br /><br /><br /></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/first-compile-time-slow-down-on-linux/</link>
      <pubDate>Tue, 24 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/first-compile-time-slow-down-on-linux/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/first-compile-time-slow-down-on-linux/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
    </item>
    <item>
      <title>Inlining  is disabled by -pg instrumentation for gprof</title>
      <description><![CDATA[ The Intel Compiler for Linux supports the option -pg. This instruments the binary to allow function level profiling using gprof. To do this, it also disables function inlining, which may result in some loss of performance. This consequence of -pg is not documented in version 12.1 of the Intel Compiler for Linux, but will be documented in future versions.<br />          For performance analysis and profiling of applications without impacting inlining, Intel(R) VTune(TM) Amplifier XE may be used. ]]></description>
      <link>http://software.intel.com/en-us/articles/inlining-is-disabled-by-pg-instrumentation-for-gprof/</link>
      <pubDate>Fri, 20 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/inlining-is-disabled-by-pg-instrumentation-for-gprof/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/inlining-is-disabled-by-pg-instrumentation-for-gprof/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
    </item>
    <item>
      <title>Distributed Memory Coarray Fortran with the Intel Fortran Compiler for Linux: Essential Guide</title>
      <description><![CDATA[ <br />
<div id="art_pre_template"><b>Introduction : </b><br />This is an essential guide to using the Coarray Fortran (CAF) feature of the Intel Fortran Composer XE 2011 for Linux on a distributed memory cluster.<br /><br /><b>Version : </b><br />To use the distributed memory feature of Intel CAF, you must have a licensed copy of the Intel<span >®</span> Cluster Studio 2011 (Formerly Intel<span >®</span> Cluster Toolkit Compiler Edition).  The shared memory CAF feature is available in the Intel<span >®</span> Cluster Studio 2011, the Intel<span >®</span> Composer XE 2011 for Linux, and Intel<span >® </span>Visual Fortran Composer XE 2011 for Windows.  The CAF feature is currently not available under Mac OS X, and the distributed memory CAF feature is only available under the Linux operating system.<b></b><br />Requires Intel MPI for Linux version 4.0 Update 1 or greater<br />Requires Intel Composer XE 2011 (original or any Update) <br />Requires Intel Cluster Studio 2011 license for COMPILATION ONLY (no runtime checks).<br /><br /><span ><b>Configuration Set Up : </b></span><br />In order to run a distributed memory Coarray Fortran application, you must have an established cluster and an installation of Intel MPI version 4.0.1.007 or greater.  BEFORE ATTEMPTING TO RUN INTEL CAF, make sure you can run a simple MPI 'hello world' program on your cluster across multiple nodes using Intel MPI. <br /><br />Successful configuration and running of MPI jobs under Intel MPI is a prerequisite to using the Intel Fortran CAF feature in distributed memory mode.  In order to get support on the Intel CAF feature you will be asked to demonstrate the ability to run a 'hello world' MPI program on your cluster.  
Please read the Intel MPI <span ><i>Release Notes</i></span> and <span ><i>Getting_Started</i></span> documents that come with your Intel MPI installation in the &lt;install directory&gt;/doc/ directory.<br /><br />Before running an Intel CAF application, perform the following steps:<br /><br />1) Set up an mpd.hosts file:  If your cluster hosts are fixed and you do NOT run under a batch system such as PBS, set up a static hosts file.  In your home directory, create a file with the names of your cluster hosts, one host per line, optionally with the number of cores (processors) on each host, similar to this:<br /> nodehostname1<br /> nodehostname2<br /> ...<br /> nodehostnameN<br /><br />If you run under a batch system such as PBS you will not know the hosts in your allocation until job dispatch.  In this scenario we will use the --file option to mpdboot, which will use a hosts file supplied by the batch system (see later in this document).<br /><br />2) Source the Intel MPI and Intel Fortran compiler "...vars.sh" or "...vars.csh" files:  You must set up the paths to Intel MPI and Intel Fortran in your environment.  Furthermore, these should be sourced by child processes.  Thus it is recommended to perform the following source commands and/or add them to the .bashrc or .cshrc file in your home directory:<br /><br />source &lt;path to Intel MPI installation&gt;/[ia32 | intel64]/bin/mpivars.sh<br />source &lt;path to Intel Fortran installation&gt;/bin/compilervars.sh [ia32 | intel64]<br /> where you choose between 32 and 64 bit environments with ia32 or intel64 respectively.<br /><br />3) Set up a Coarray Fortran (CAF) configuration file:  When you run a distributed memory Coarray Fortran program, the application first runs a job launcher.  The job launcher invoked to start a distributed memory CAF application will use the Intel MPI 'mpiexec' command to start the job on the hosts in the cluster.  
This CAF job launcher will first read the "CAF configuration file" to pick up arguments that will be passed to the Intel MPI mpiexec command.  Thus, the "CAF configuration file" is nothing more than arguments to the Intel MPI 'mpiexec' command.  An example CAF configuration file may contain:<br /> -envall -n 16 ./mycafprogram.exe<br /><br />Here, 16 CAF images of the program "mycafprogram.exe" are created.  Read the Intel MPI documentation on mpiexec for all possible options.  Some common options:<br /><br />-envall   copies your current environment variables to the environment of your CAF processes.  HIGHLY RECOMMENDED.<br /><br /> -n N     create N images (processes).  REQUIRED<br /><br /> -rr       round-robin distribution of images to nodes.  OPTIONAL. Default is to pack each node with the number of images equal to the number of cores (real cores PLUS any hyperthreaded virtual cores) one host at a time in the order specified in the mpd.hosts file.  Round-robin is one way to avoid using hyperthreaded cores.  With the -rr option to mpiexec, image 1 is assigned to host1 from your mpd.hosts or PBS_NODEFILE, image 2 to host2, etc. to image N on hostN, at which point the allocation cycles back to image N+1 on host1 and so on.<br /><br /> -perhost N    distribute images to hosts in groups of N.  OPTIONAL. This is another way to avoid hyperthreaded cores:  set N to the number of real cores on each host.  Images 1..N are allocated on host1, images N+1 to N+N on host2, etc.<br /> <br /><br /><span ><b>Building the Application : </b></span><br />You are now ready to compile your Coarray Fortran application.  Create or use an existing Coarray Fortran application.  
A sample Coarray Fortran 'hello world' application is included in the &lt;compiler install dir&gt;Samples/en_US/Fortran/coarray_samples/ directory.<br /><br />The essential compiler arguments to use for distributed memory coarray applications are:<br /><br />ifort -coarray=distributed -coarray-config-file=&lt;CAF config filename&gt;<br /><br />Some essential notes:  -coarray=distributed is necessary to create a distributed memory CAF application.  This option is only available on systems with a valid Intel Cluster Studio license.  Without this license you cannot create distributed memory Coarray Fortran applications - you can, however, create and use shared memory CAF applications with any existing Intel Composer XE 2011 for Linux or Windows license.<br /><br />-coarray-config-file=&lt;CAF configuration file&gt;   this option tells the CAF job launcher where to find the configuration file that holds the runtime arguments to 'mpiexec'.  This file need not exist at the time of compilation.  This file is ONLY read at job launch.  Thus, it can be changed or modified between job runs to change the number of images along with any other valid control option to 'mpiexec'.  This gives the programmer a way to change the number of images and other parameters without having to recompile the application.  A reasonable name for the file may be ~/cafconfig.txt, but the name and location of the file are up to the user to decide.  One essential note:  the executable name is hard-coded in the CAF config file, so be sure that the executable name in the config file matches the name you used with the 'ifort -o &lt;name&gt;' option.  Also, be sure to use either the full pathname to the executable OR the current directory "dot" name, such as './a.out' or './mycafprogram.exe'.<br /><br />Note:  the -coarray-num-images=N compiler option is ignored for -coarray=distributed.  This option is only used by shared memory Coarray Fortran applications.  
The number of images for distributed memory CAF applications is ONLY controlled at job launch by the '-n N' option in the CAF config file.<br /><br />Of course, you can include any other compiler options including all optimization options.<br /><br /><br /><span ><b>Running the Application : </b></span><br />Running an Intel CAF application involves 3 commands:<br /><br /><ol>
<li>mpdboot</li>
<li>&lt;running the application executable, for example ./a.out or ./mycafprogram.exe&gt;</li>
<li>mpdallexit</li>
</ol><br />1) <b>mpdboot:</b> This command sets up the underlying runtime daemons used to control the various processes in your CAF application.  'mpdboot' sets up the runtime on the hosts in either your mpd.hosts file and/or from a file specified in the --file= option.  The mpdboot command needs a '-n &lt;number of nodes&gt;' argument.  For example, if you wish to run across 4 nodes, and there are 4 hosts in your mpd.hosts file:<br /><br /> mpdboot -n 4 <br /><br />will start the runtime daemons needed on all 4 nodes in your cluster or batch allocation.<br /><br />If you are using a static list of hosts in your cluster, create an mpd.hosts file as documented in the <b>Configuration Set Up </b>section above, place it in your current directory or your home directory, and run mpdboot like this:<br /><br /> mpdboot --file=~/mpd.hosts -n 4<br /><br />for example for 4 nodes.<br /><br />If you run under PBS, the PBS batch system will create a hosts file for you with the list of hosts in your current batch allocation and set the environment variable PBS_NODEFILE to point to this hosts file.  Thus, you can use this mpdboot command:<br /><br /> mpdboot --file=$PBS_NODEFILE -n 4<br /><br />Other batch systems will similarly create a hosts file and set an appropriate environment variable.  Please consult your site system administrator for the environment variable and hosts file, as these host file conventions vary widely by batch system and local configuration.<br /><br />Another mpdboot option often needed is '--rsh=[rsh | ssh]', which specifies the remote connection method.  Again, consult your site system administrator to determine if MPI uses RSH or SSH for remote connections.  In most cases ssh is used.   
Thus, an example mpdboot command would look like:<br /><br /> mpdboot --rsh=ssh --file=~/mpd.hosts -n N<br />or<br /> mpdboot --rsh=rsh --file=$PBS_NODEFILE -n N<br /><br />CHECK:  run the command 'mpdtrace' or 'mpdtrace -l' to confirm that your mpdboot launched runtime daemons on each node and that each node is communicating.  If all is well, you will get a list of nodes participating in your runtime environment.  If not, you will get an error message such as:<br />mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_rwgreen); possible causes:<br /> 1. no mpd is running on this host<br /> 2. an mpd is running but was started without a "console" (-n option)<br /><br />YOU CANNOT RUN AN INTEL CAF APPLICATION UNTIL THE MPDTRACE COMMAND COMPLETES SUCCESSFULLY.<br /><br /><br />2) <b>Run your compiled Intel CAF application</b><br /><br />Simply invoke your compiled Intel CAF application.  This application contains a job launcher that invokes the Intel MPI 'mpiexec' command using arguments from your CAF config file.<br /><br />./a.out<br />./mycafprogram.exe<br /><br />along with any command line arguments to the application.<br /><br />Need to change the number of images launched or the arguments to mpiexec?  Simply change the CAF config file settings in the text file.  Remember, the -coarray-config-file= option used at compile time sets the name and location of this text file.  You should use a name and location you can remember for this file, such as -coarray-config-file=~/cafconfig.txt<br />Then just add mpiexec options to ~/cafconfig.txt, for example<br /> -perhost 2 -envall -n 64 ./a.out<br /><br />Note:  the environment variable FORT_COARRAY_NUM_IMAGES has no effect on distributed memory CAF applications.  This environment variable is only honored by a shared memory CAF image.  
Only the -n option in the CAF config file is used to control the number of CAF images for a distributed memory CAF application.<br /><br />Again, read the mpiexec documentation in the Intel MPI documentation set.<br /><br />3) <b>Cleanup/shutdown</b>:  run command 'mpdallexit' to kill the cluster runtime daemons.  This is the opposite of the 'mpdboot' command.<br /><br /><span ><b>Known Issues or Limitations : </b></span><br /><b></b><br />Many clusters have multiple MPI implementations installed along with Intel MPI.  The PATH and LD_LIBRARY_PATH variables MUST list the Intel MPI paths BEFORE those of any other MPI installed on your system.  Make sure to ONLY source the mpivars.sh script to set this correctly OR ensure that the correct Intel MPI paths appear before other MPI paths.<br /><br />Batch system notes:  In the above notes, we added the option '-envall' to the CAF config file.  This is an attempt to get your current working environment variables to be inherited by your spawned remote CAF processes.  This was done to help ensure that your PATH and LD_LIBRARY_PATH contain the paths to Intel MPI and Intel Fortran AND that those paths appear before other MPI implementations and compilers on your system.  HOWEVER, some batch scheduling systems will not allow environment inheritance: in other words they will throw out your current environment variables and use defaults for these.  That is why we suggested adding the 'source &lt;path to intel MPI&gt;/[ia32 | intel64]/bin/mpivars.sh' to your .bashrc, .cshrc, or .bash_profile.  These dot files are invoked by each child process created, and hence, SHOULD set the PATH and LD_LIBRARY_PATH appropriately.   When in doubt, execute 'which mpiexec' interactively, or put 'echo `which mpiexec`' in your batch script to ensure the Intel MPI mpiexec is being used.  
The 'mpiexec' commands of other MPI implementations cannot be used and will cause errors.<br /><br />It is critical to ensure that you can execute an Intel MPI application PRIOR to attempting to run an Intel CAF program.  Keep a simple MPI 'hello world' handy to debug your environment.  Here is a sample:<br /><br />
<pre name="code" class="plain">program hello_mpi
implicit none
include 'mpif.h'
integer :: size, rank, ierr, len
integer :: status
character*(MPI_MAX_PROCESSOR_NAME) name

call mpi_init(ierr)
call mpi_comm_size(MPI_COMM_WORLD, size, ierr)
call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_GET_PROCESSOR_NAME(name, len, ierr)

write(6, "(*(a,i3))") " MPI: size = ", size, " rank = ", rank
write(6, * ) "host is ", trim(name)

call mpi_finalize(ierr)

end program hello_mpi</pre>
<br />Compile with:   mpiifort -o hello_mpi hello_mpi.f90<br />run with:   mpdboot --rsh=ssh --file=~/mpd.hosts -n 2 ; mpiexec -n 4 ./hello_mpi ; mpdallexit<br /><br />and different number of processes for the -n argument.  Make sure you can run across all the nodes you believe are in your cluster or your batch allocation.<br /><br />READ: the Intel MPI Release Notes and the Getting_Started.pdf documents that come with Intel MPI in the &lt;installdir&gt;/doc/ directory.<br /><br /><br /><span ><b>GETTING HELP</b></span><br /><br />Our User Forums are great places to see current issues and to post questions:<br /><br /><a href="http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/">Intel MPI User Forum</a><br /><a href="http://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/">Intel Fortran User Forum</a><br /></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential-guide/</link>
      <pubDate>Fri, 01 Apr 2011 23:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential-guide/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential-guide/</guid>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
    </item>
    <item>
      <title>Distributed memory coarray programs with process pinning</title>
      <description><![CDATA[ <br />
<div id="art_pre_template"><strong>Introduction :</strong> This article describes a method to compile and run a distributed memory coarray program using Intel® Fortran Compiler XE 12.0. An example using Linux* is presented.<br /><br /><br /><strong>Version :</strong> Intel® Fortran Compiler XE 12.0 <br /><br /><br /><strong>Application Notes :</strong> To compile for distributed memory coarrays, use compiler option -coarray=distributed (Linux* OS) or /Qcoarray:distributed (Windows* OS).  This requires an Intel® Cluster Toolkit license.  To compile for shared memory coarrays, use compiler option -coarray=shared (Linux* OS) or /Qcoarray:shared (Windows* OS).  Compiling for shared memory coarrays does not require an Intel® Cluster Toolkit license.<br /><br /><br /><strong>Obtaining Source Code :</strong> The coarray example from the Composer XE 'coarray_samples' directory could be used.<br /><br /><br /><strong>Prerequisites :</strong> An Intel® Cluster Toolkit license is required for compilation, and the Intel® MPI Library must be installed on the cluster nodes.<br /><br /><br /><strong>Configuration Set Up :</strong> The key to running a distributed memory coarray program with process pinning on specific nodes is the compiler option -coarray-config-file=filename (Linux* OS) or /Qcoarray-config-file:filename (Windows* OS).  This enables you to take full advantage of Intel® MPI Library features in the coarrays environment, in the same way that 'mpiexec -config filename' allows mpiexec to take its commands from 'filename'.<br /><br />The contents of the configuration file for this example:<br /><br />-host host1 -env I_MPI_PIN_PROCESSOR_LIST 0,2,4 -n 3 &lt;path to executable&gt;coarry_dist_host.x : -host host2 -env I_MPI_PIN_PROCESSOR_LIST 1,3,5 -n 3 &lt;path to executable&gt;coarry_dist_host.x<br /><br />This says to execute six coarray images of 'coarry_dist_host.x' on nodes host1 and host2, using processors 0,2,4 on host1, and processors 1,3,5 on host2.  
The I_MPI_PIN_PROCESSOR_LIST environment variable is used to achieve the process pinning on the indicated nodes.<br /><br /><br /><strong>Source Code Changes :</strong> See <strong>Verifying</strong> <strong>Correctness</strong><br /><br /><br /><strong>Building the Application :</strong> Compile for distributed coarrays, create one coarray image, and specify the coarray configuration file:<br /><br />ifort -coarray=distributed -coarray-num-images=1 -coarray-config-file=coarray_config.txt coarry_dist_host.f90 -o coarry_dist_host.x<br /><br /><strong>Running the Application :</strong> Simply specify the name of the executable:<br />&gt; &lt;path to executable&gt;/coarry_dist_host.x<br />Hello from image 1 out of 6<br />total images, and running on host: host1<br /><br />Hello from image 2 out of 6<br />total images, and running on host: host1<br /><br />Hello from image 3 out of 6<br />total images, and running on host: host1<br /><br />Hello from image 5 out of 6<br />total images, and running on host: host2<br /><br />Hello from image 4 out of 6<br />total images, and running on host: host2<br /><br />Hello from image 6 out of 6<br />total images, and running on host: host2<br />&gt;<br /><br /><strong>Verifying Correctness :</strong> Embed 'call hostnm(hostname)' in your coarray program, then print 'hostname' to verify the images are executed on the correct nodes/processors.<br /><br /><br /><strong>Benefits :</strong> This method enables coarray image pinning on specific nodes/node processors.  Better load balance across cluster nodes might be obtained, or a subset of nodes easily partitioned.<br /><br /><br /><strong>Known Issues or Limitations :</strong> <br />-- Some users have reported MPI environment issues when trying to run the executable in a standalone fashion.  These issues are under investigation, but as a workaround try using mpiexec to launch the executable.  
<br />--Distributed memory coarrays only work with Intel® MPI; other implementations of MPI are not supported.<br /></div>
			 ]]></description>
      <link>http://software.intel.com/en-us/articles/distributed-memory-coarray-programs-with-process-pinning/</link>
      <pubDate>Sun, 27 Feb 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/distributed-memory-coarray-programs-with-process-pinning/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/distributed-memory-coarray-programs-with-process-pinning/</guid>
      <category>Parallel Programming</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® MPI Library for Linux* Knowledge Base</category>
      <category>Intel® MPI Library for Windows* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Step-by-Step Application Performance Tuning with Intel Compilers</title>
      <description><![CDATA[ <span class="sectionHeading">Application Performance:  A Step-by-Step Introduction to Application Tuning with Intel® Compilers</span><br /><br /><span class="sectionBodyText">Before you begin performance tuning, you may want to check the correctness of your application by building it without optimization using /Od (Windows*) or -O0 (Linux* or Mac OS* X). In compiler versions 11 and later, all optimization levels assume support for the SSE2 instruction set by default. <br /><br /><span class="sectionHeading">1. </span>Use the general optimization options (Windows /O1, /O2 or /O3; Linux and Mac OS X -O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each. Most users should start at /O2 (-O2), the default, before trying more advanced optimizations. Next, for loop-intensive applications, try /O3 (-O3).  These options are available for both Intel® and non-Intel microprocessors but they may perform more optimizations for Intel microprocessors than they perform for non-Intel microprocessors.<br /><br /><span class="sectionHeading">2.</span> Fine-tune performance to target IA-32 and Intel 64-based systems using processor-specific options. Examples are /QxSSE4.2 (-xsse4.2) for the Intel® Core™ processor family, e.g. the Intel Core i7 processor, and /arch:SSE3 (-msse3) for compatible, non-Intel processors that support at least the SSE3 instruction set. Alternatively, you can use /QxHOST (-xhost) which will use the most advanced instruction set for the processor on which you compiled. This option is available for both Intel® and non-Intel microprocessors but it may perform more optimizations for Intel microprocessors than it performs for non-Intel microprocessors. 
For a more extensive list and description of options that optimize for specific processors or instruction sets, please see the online article “<a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/" title="SSE generation and processor-specific optimizations">Intel® compiler options for SSE generation and processor-specific optimizations</a>” and the Intel Compiler User and Reference Guides.<br /><br /><span class="sectionHeadingText">3.</span> Add interprocedural optimization (IPO), /Qipo (-ipo) and/or profile-guided optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use); then measure performance again to determine whether your application benefits from one or both of them.<br /><br /><span class="sectionHeadingText">4.</span> Optimize your application for vector and parallel execution on multi-threaded, multi-core and multi-processor systems using:<br />advice from the new Guided Auto-Parallelism (GAP) feature, /Qguide (-guide); <br />the Intel® Cilk™ Plus language extensions for C/C++;<br />the parallel performance options /Qparallel (-parallel) or /Qopenmp (-openmp);<br />the CoArray feature of Fortran 2008;<br />or by using the Intel® Performance Libraries included with the product. <br />These optimization steps are applicable to both Intel and non-Intel microprocessors, but may result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors.<br /><br /><span class="sectionHeading">5.</span> Use Intel® VTune™ Amplifier XE to help you identify serial and parallel performance “hotspots” so that you know which specific parts of your application could benefit from further tuning. Use Intel® Inspector XE to reduce the time to market for threaded applications by diagnosing memory and threading errors and speeding up the development process. 
These products cannot be used on non-Intel microprocessors.<br /></span><br />For more details, please consult the main product documentation, e.g. in the <a href="http://software.intel.com/en-us/articles/intel-software-technical-documentation/">Intel® Software Documentation Library</a>. A brief summary of the major optimization options of the Intel Compiler is available in the <a href="http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf" title="Quick-Reference Guide to Optimization with Intel® Compilers version 12">Quick-Reference Guide to Optimization with Intel® Compilers version 12</a>. ]]></description>
      <link>http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/</link>
      <pubDate>Thu, 11 Nov 2010 21:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/step-by-step-application-performance-tuning-with-intel-compilers/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
      <category>Intel® VTune™ Amplifier XE Knowledge Base</category>
    </item>
    <item>
      <title>New fast basic random number generator SFMT19937 in Intel MKL</title>
      <description><![CDATA[ <br /><br />Intel MKL 10.3 introduced a new basic generator: <strong>SFMT19937</strong>, a SIMD-friendly Fast Mersenne Twister pseudorandom number generator.<br /><br /><strong>SFMT19937</strong> is analogous to the Mersenne Twister (MT) basic generators, but it can take advantage of SIMD instructions to provide a fast implementation on processors that support them.<br /><br /><br />To learn more about the SFMT algorithm, please see the article below.<br /><br /><em>Saito, M., and Matsumoto, M. SIMD-oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator. Monte Carlo and Quasi-Monte Carlo Methods 2006, Springer, Pages 607 – 622, 2008.<br /></em><a href="http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html"><em>http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html</em></a><br /><br /><br />The following is an example application using Intel MKL SFMT19937:<br /><br /><br />
<pre name="code" class="cpp">#include &lt;stdio.h&gt;
#include "mkl_vsl.h"
 
int main()
{
   double r[1000]; /* buffer for random numbers */
   double s; /* average */
   VSLStreamStatePtr stream;
   int i, j;
    
   /* Initializing */        
   s = 0.0;
   vslNewStream( &amp;stream, VSL_BRNG_SFMT19937, 777 );
    
   /* Generating: 10 batches of 1000 Gaussian numbers (mean 5.0, sigma 2.0) */        
   for ( i=0; i&lt;10; i++ )
   {
      vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, 1000, r, 5.0, 2.0 );
      for ( j=0; j&lt;1000; j++ )
      {
         s += r[j];
      }
   }
   s /= 10000.0;
    
   /* Deleting the stream */        
   vslDeleteStream( &amp;stream );
    
   /* Printing results */        
   printf( "Sample mean of normal distribution = %f\n", s );
    
   return 0;
}
</pre>
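<br />As a rough sanity check on what a sample like the one above should print, the sketch below computes the standard error of the mean for 10 x 1000 draws with mean 5.0 and standard deviation 2.0, the parameters passed to vdRngGaussian. It is plain C and does not call Intel MKL; the helper names standard_error and mean_is_plausible and the three-standard-error tolerance are assumptions made for this illustration only.<br /><br />
<pre name="code" class="cpp">#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Standard error of the mean of n independent draws with standard deviation sigma */
static double standard_error(double sigma, int n)
{
   return sigma / sqrt((double)n);
}

/* Returns 1 if sample_mean lies within k standard errors of the expected mean */
static int mean_is_plausible(double sample_mean, double mean, double sigma, int n, double k)
{
   return fabs(sample_mean - mean) &lt;= k * standard_error(sigma, n);
}

int main(void)
{
   const double mean = 5.0;   /* 'a' argument of vdRngGaussian in the sample */
   const double sigma = 2.0;  /* 'sigma' argument of vdRngGaussian in the sample */
   const int n = 10 * 1000;   /* 10 batches of 1000 numbers */

   printf("standard error = %f\n", standard_error(sigma, n));
   printf("5.03 plausible: %d\n", mean_is_plausible(5.03, mean, sigma, n, 3.0));
   printf("5.50 plausible: %d\n", mean_is_plausible(5.50, mean, sigma, n, 3.0));
   return 0;
}
</pre>
<br />With these parameters the standard error is 0.02, so a printed sample mean far from 5.0 (for example 5.5) would suggest a problem in the program rather than ordinary random variation.<br />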
<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/new-fast-basic-random-number-generator-sfmt19937-in-intel-mkl/</link>
      <pubDate>Sat, 06 Nov 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/new-fast-basic-random-number-generator-sfmt19937-in-intel-mkl/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/new-fast-basic-random-number-generator-sfmt19937-in-intel-mkl/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Speeding up denormal paths in VML with the FTZ/DAZ setting</title>
      <description><![CDATA[ <p>Starting with Intel MKL 10.3, the mode variable that controls Intel® MKL VML accuracy settings supports a new setting.</p>
<p>Users can turn this setting on or off by specifying VML_FTZDAZ_ON or VML_FTZDAZ_OFF (the default) in the mode passed to VML functions.</p>
<p>VML_FTZDAZ_ON mode improves the performance of computations that involve denormalized numbers, at the cost of some loss of accuracy.</p>
<p>Enabling this mode changes the numerical behavior of the functions: denormalized input values may be treated as zeros, and denormalized results may be flushed to zero. Accuracy loss may occur if input and/or output values are close to the denormal range.</p>
<p>Usage example:</p>
<p>vmlSetMode( VML_LA | VML_FTZDAZ_ON);</p>
<p>vmdExp(1000, a, r, VML_LA | VML_FTZDAZ_ON);</p>
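<p>To illustrate the numerical effect described above, the following is a minimal sketch in plain C that emulates flush-to-zero (FTZ) and denormals-are-zero (DAZ) behavior in software. It does not call Intel MKL, and the helper names flush_denormal and mul_ftzdaz are hypothetical, introduced only for this illustration; the actual VML implementation may behave differently in detail.</p>
<pre name="code" class="cpp">#include &lt;float.h&gt;

/* Hypothetical helper (not an MKL function): return a signed zero if x is
   denormalized, otherwise return x unchanged. */
static double flush_denormal( double x )
{
   double ax = ( x &lt; 0.0 ) ? -x : x; /* magnitude of x */
   if ( ax != 0.0 &amp;&amp; ax &lt; DBL_MIN )
   {
      return ( x &lt; 0.0 ) ? -0.0 : 0.0;
   }
   return x;
}

/* Hypothetical helper: multiply with emulated DAZ applied to the inputs
   and emulated FTZ applied to the result. */
static double mul_ftzdaz( double a, double b )
{
   return flush_denormal( flush_denormal( a ) * flush_denormal( b ) );
}
</pre>
<p>Under these assumptions, mul_ftzdaz( DBL_MIN, 0.5 ) returns 0.0 because the denormalized result is flushed, whereas the ordinary product DBL_MIN * 0.5 is a nonzero denormalized number. Likewise, mul_ftzdaz( DBL_MIN / 2, 4.0 ) returns 0.0 because the denormalized input is treated as zero, whereas the ordinary product is 2 * DBL_MIN.</p>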
<br /><br /><br />
<p>
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/denormal-paths-speedup-in-vml-by-setting-ftzdaz-setting/</link>
      <pubDate>Sat, 06 Nov 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/denormal-paths-speedup-in-vml-by-setting-ftzdaz-setting/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/denormal-paths-speedup-in-vml-by-setting-ftzdaz-setting/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Intel® AVX optimization in Intel® MKL</title>
      <description><![CDATA[ Intel® AVX (Intel® Advanced Vector Extensions) is the next step in the evolution of Intel processors. Intel® MKL has included Intel® AVX optimizations since Intel MKL 10.2; however, to activate the Intel AVX code in version 10.2, users needed to call mkl_enable_instructions(). Starting with Intel MKL 10.3, the Intel AVX code is dispatched automatically and does not require special activation. In Intel MKL 10.3, Intel AVX optimization has been extended to DGEMM/SGEMM, radix-2 complex-to-complex FFTs, most real VML functions, and VSL distribution generators.<br /><br />Examples of the speedups that Intel MKL 10.3 can achieve on Intel AVX-enabled processors running an Intel AVX-enabled operating system, compared with the Intel® Xeon® Processor 6000 and 7000 Sequence (Server), are as follows:<br /><br />Intel AVX DGEMM (M, N, K = 8K x 4K x 128) delivers 1.8x the performance of the Intel® Xeon® Processor 6000 and 7000 Sequence (Server).<br /><br />Intel AVX DGEMM/SGEMM achieves 88-90% of machine peak.<br /><br />The Intel AVX/NHM speedup is 1.8x for radix-2 1D cluster FFTs with N=1024.<br /><br />The Intel® Optimized LINPACK benchmark, using Intel AVX optimizations, performs over 1.86x (or over 80% overall efficiency) on 4 cores with N=20000.<br /><br /><br />
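A percentage-of-peak figure for DGEMM can be related to the operation count of the matrix multiplication. The following is a minimal sketch in plain C, assuming the conventional count of 2*M*N*K floating-point operations for a matrix multiplication and assuming that "8K" and "4K" denote 8192 and 4096. The helper names dgemm_flops and efficiency_percent are hypothetical, and the elapsed time and machine peak must come from your own measurements and hardware specifications.<br /><br />
<pre name="code" class="cpp">/* Hypothetical helper: conventional floating-point operation count
   for an M x K by K x N matrix multiplication. */
static double dgemm_flops( double m, double n, double k )
{
   return 2.0 * m * n * k;
}

/* Hypothetical helper: achieved rate (operations per second) expressed
   as a percentage of the machine peak rate. */
static double efficiency_percent( double flops, double seconds, double peak_flops_per_second )
{
   return 100.0 * ( flops / seconds ) / peak_flops_per_second;
}
</pre>
<br />Under these assumptions, dgemm_flops( 8192, 4096, 128 ) evaluates to 8589934592 operations.<br /><br />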
<table cellpadding="5" cellspacing="0" rules="none" border="1">
<tbody>
<tr>
<th align="left" valign="middle" >Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-avx-optimization-in-intel-mkl-v103/</link>
      <pubDate>Wed, 03 Nov 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-avx-optimization-in-intel-mkl-v103/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-avx-optimization-in-intel-mkl-v103/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Information about the FTC Decision and Order on the Intel® Compilers Reimbursement Fund</title>
      <description><![CDATA[ Information on the Intel Compiler Reimbursement Fund referenced in Section VII.D of the FTC Decision and Order is available now. Please see the site, <a href="http://www.CompilerReimbursementProgram.com">www.CompilerReimbursementProgram.com</a>, for further information. ]]></description>
      <link>http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/</link>
      <pubDate>Mon, 01 Nov 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/information-about-the-ftc-decision-and-order-on-the-intel-compilers-reimbursement-fund/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Software Development Tool Suites for Intel® Atom™ Processor Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>MKL performance degradation on SGI Altix UV system with Nehalem EX processor</title>
      <description><![CDATA[ <br />
<div id="art_pre_template"><b>Reference Number : DPD200155507</b><br /><br /><br /><b>Version : Intel® MKL 10.2 Update 5 and earlier</b><br /><br /><br /><b>Product : </b><span ><b>Intel® Math Kernel Library (Intel® MKL)</b></span></div>
<div id="art_pre_template"><br /><br /><b>Operating System : </b><br />Red Hat Enterprise Linux* 5 <br />SuSE Linux Enterprise Server* 10<br /><br /><br /><b>Problem Description : </b><br />Intel MKL experiences performance degradation in dgemm when running<br />on SGI Altix systems with Nehalem-EX CPUs.<br />The cause of the problem is that MKL incorrectly detects the number of threads available on this type of system.<br /><br /><br /><b>Resolution Status : </b><br />The problem has been fixed in Intel® MKL 10.2 Update 6 and later versions.<br /><br /><br /><i>[DISCLAIMER: The information on this web site is intended for hardware system manufacturers and software developers. Intel does not warrant the accuracy, completeness or utility of any information on this site. Intel may make changes to the information or the site at any time without notice. Intel makes no commitment to update the information at this site. ALL INFORMATION PROVIDED ON THIS WEBSITE IS PROVIDED "as is" without any express, implied, or statutory warranty of any kind including but not limited to warranties of merchantability, non-infringement of intellectual property, or fitness for any particular purpose. Independent companies manufacture the third-party products that are mentioned on this site. Intel is not responsible for the quality or performance of third-party products and makes no representation or warranty regarding such products. The third-party supplier remains solely responsible for the design, manufacture, sale and functionality of its products. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.]<br /><br /><br /><br /></i></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/mkl-performance-degradation-on-sgi-altix-uv-system-with-nehalem-ex-processor/</link>
      <pubDate>Sun, 18 Jul 2010 11:30:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/mkl-performance-degradation-on-sgi-altix-uv-system-with-nehalem-ex-processor/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/mkl-performance-degradation-on-sgi-altix-uv-system-with-nehalem-ex-processor/</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
  </channel></rss>
