<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sun, 21 Mar 2010 00:23:14 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-mkl-kb/type/performance-and-optimization/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles feed</title>
    <link>http://software.intel.com/en-us/articles/intel-mkl-kb/performance-and-optimization/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Parallelism in the Intel® Math Kernel Library</title>
      <description><![CDATA[ <h1 class="sectionHeading">Abstract</h1>
Software libraries provide a simple way to get immediate performance benefits on multicore, multiprocessor, and cluster computing systems. The Intel® Math Kernel Library (Intel® MKL) contains a large collection of functions that can benefit math-intensive applications. This article describes how Intel MKL can help programmers achieve superb serial and parallel performance in common application areas. This material is applicable to IA-32 and Intel® 64 processors on Windows*, Linux*, and Mac OS* X operating systems.<br /><br />This article is part of the larger series, "<a href="http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications">Intel Guide for Developing Multithreaded Applications</a>," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.<br /><br />
<h1 class="sectionHeading">Background</h1>
Optimal performance on modern multicore and multiprocessor systems is typically attained only when opportunities for parallelism are well exploited and the memory characteristics underlying the architecture are expertly managed. Sequential codes must rely heavily on instruction and register level SIMD parallelism and cache blocking to achieve best performance. Threaded programs must employ advanced blocking strategies to ensure that multiple cores and processors are efficiently used and the parallel tasks evenly distributed. In some instances, out-of-core implementations can be used to deal with large problems that do not fit in memory.<br /><br />
<h1 class="sectionHeading">Advice</h1>
One of the easiest ways to add parallelism to a math-intensive application is to use a threaded, optimized library. Not only will this save the programmer a substantial amount of development time, it will also reduce the amount of test and evaluation effort required. Standardized APIs also help to make the resulting code more portable.<br /><br />Intel MKL provides a comprehensive set of math functions that are optimized and threaded to exploit all the features of the latest Intel® processors. The first time a function from the library is called, a runtime check is performed to identify the hardware on which the program is running. Based on this check, a code path is chosen to maximize use of instruction- and register-level SIMD parallelism and to apply the best cache-blocking strategy. Intel MKL is also designed to be thread-safe, which means that its functions operate correctly when simultaneously called from multiple application threads.<br /><br />Intel MKL is built using the Intel® C++ and Fortran Compilers and threaded using OpenMP*. Its algorithms are constructed to balance data and tasks for efficient use of multiple cores and processors. The following tables show the math domains that contain threaded functions (this information is based on Intel MKL 10.2 Update 3):<br /><br />
<table width="735" cellpadding="0" cellspacing="0" border="0" class="tableformat1">
<tbody style="text-align: left;">
<tr style="text-align: left;">
<td width="230" style="text-align: left;"><b>Linear Algebra</b></td>
<td style="text-align: left;">Used in applications from finite-element analysis engineering codes to modern animation</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>BLAS (Basic Linear Algebra Subprograms)</b></td>
<td style="text-align: left;">All matrix-matrix operations (level 3) are threaded for both dense and sparse BLAS. Many vector-vector (level 1) and matrix-vector (level 2) operations are threaded for dense matrices in 64-bit programs running on the Intel® 64 architecture. For sparse matrices, all level 2 operations except for the sparse triangular solvers are threaded.</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>LAPACK (Linear Algebra Package)</b></td>
<td style="text-align: left;">Several computational routines are threaded from each of the following types of problems: linear equation solvers, orthogonal factorization, singular value decomposition, and symmetric eigenvalue problems. LAPACK also calls BLAS, so even non-threaded functions may run in parallel.</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>ScaLAPACK (Scalable LAPACK)</b></td>
<td style="text-align: left;">A distributed-memory parallel version of LAPACK intended for clusters.</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>PARDISO</b></td>
<td style="text-align: left;">This parallel direct sparse solver is threaded in its three stages: reordering (optional), factorization, and solve (if solving with multiple right-hand sides).</td>
</tr>
</tbody>
</table>
<table width="735" cellpadding="0" cellspacing="0" border="0" class="tableformat1">
<tbody style="text-align: left;">
<tr style="text-align: left;">
<td width="230" style="text-align: left;"><b>Fast Fourier Transforms</b></td>
<td style="text-align: left;">Used for signal processing and applications that range from oil exploration to medical imaging</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>Threaded FFTs (Fast Fourier Transforms)</b></td>
<td style="text-align: left;">Threaded with the exception of 1D real and split-complex FFTs.</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>Cluster FFTs</b></td>
<td style="text-align: left;">Distributed-memory parallel FFTs intended for clusters.</td>
</tr>
</tbody>
</table>
<table width="735" cellpadding="0" cellspacing="0" border="0" class="tableformat1">
<tbody style="text-align: left;">
<tr style="text-align: left;">
<td width="230" style="text-align: left;"><b>Vector Math</b></td>
<td style="text-align: left;">Used in many financial codes</td>
</tr>
<tr style="text-align: left;">
<td style="text-align: left;"><b>VML (Vector Math Library)</b></td>
<td style="text-align: left;">Arithmetic, trigonometric, exponential/logarithmic, rounding, etc.</td>
</tr>
</tbody>
</table>
<br />Because there is some overhead involved in the creation and management of threads, it is not always worthwhile to use multiple threads. Consequently, Intel MKL does not create threads for small problems. The size that is considered small is relative to the domain and function. For level 3 BLAS functions, threading may occur for a dimension as small as 20, whereas level 1 BLAS and VML functions will not thread for vectors much smaller than 1000.<br /><br />Intel MKL should run on a single thread when called from a threaded region of an application to avoid over-subscription of system resources. For applications that are threaded using OpenMP, this should happen automatically. If other means are used to thread the application, Intel MKL behavior should be set using the controls described below. In cases where the library is used sequentially from multiple threads, Intel MKL may have functionality that can be helpful. As an example, the Vector Statistical Library (VSL) provides a set of vectorized random number generators that are not threaded, but which offer a means of dividing a stream of random numbers among application threads. The <span style="font-family: courier;">SkipAheadStream()</span> function divides a random number stream into separate blocks, one for each thread. The <span style="font-family: courier;">LeapFrogStream()</span> function will divide a stream so that each thread gets a subsequence of the original stream. For example, to divide a stream between two threads, the Leapfrog method would provide numbers with odd indices to one thread and even indices to the other.<br /><br />
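<p>The difference between the two partitioning schemes can be sketched in plain C. The toy linear congruential generator and the helper names below (lcg_next, skip_ahead_block, leapfrog_subseq) are assumptions made for illustration only; they are not VSL functions, and a real application would call the VSL stream routines instead.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit linear congruential generator: a stand-in for a real VSL stream. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Skip-ahead: thread `tid` receives the contiguous block
 * [tid * block, (tid + 1) * block) of the original stream.
 * The skip is done here by discarding values; a real generator
 * would jump ahead without producing them. */
static void skip_ahead_block(uint32_t seed, int tid, int block, uint32_t *out)
{
    assert(tid >= 0 && block > 0);
    uint32_t s = seed;
    for (int i = 0; i < tid * block; i++)
        lcg_next(&s);
    for (int i = 0; i < block; i++)
        out[i] = lcg_next(&s);
}

/* Leapfrog: thread `tid` of `nthreads` receives elements
 * tid, tid + nthreads, tid + 2 * nthreads, ... of the original stream. */
static void leapfrog_subseq(uint32_t seed, int tid, int nthreads,
                            int count, uint32_t *out)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);
    uint32_t s = seed;
    for (int i = 0; i < tid; i++)
        lcg_next(&s);                 /* advance to this thread's first element */
    for (int i = 0; i < count; i++) {
        out[i] = lcg_next(&s);
        for (int k = 1; k < nthreads; k++)
            lcg_next(&s);             /* step over the other threads' elements */
    }
}
```

<p>With two threads, leapfrog_subseq(seed, 0, 2, ...) and leapfrog_subseq(seed, 1, 2, ...) return the alternating elements of the stream, while skip_ahead_block(seed, 0, n, ...) and skip_ahead_block(seed, 1, n, ...) return its first and second blocks of n values.</p>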
<h1 class="sectionHeading">Performance</h1>
Figure 1 provides an example of the kind of performance a user could expect from DGEMM, the double precision, general matrix-matrix multiply function included in Intel MKL. This BLAS function plays an important role in the performance of many applications. The graph shows the performance in Gflops for a variety of rectangular sizes. It demonstrates how performance scales across processors (speedups of up to 1.9x on two threads, 3.8x on four threads, and 7.9x on eight threads) and shows DGEMM reaching nearly 94.3% of peak performance at 96.5 Gflops.<br /><br />
<p style="text-align: center;"><img src="http://software.intel.com/file/24701" /></p>
<br />
<div style="text-align: center;"><b>Figure 1.</b> Performance and scalability of the BLAS matrix-matrix multiply function.<br /></div>
<br />
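<p>The scaling figures quoted for Figure 1 can be restated as parallel efficiency (speedup divided by thread count), and the quoted fraction of peak implies a peak rate for the test system. The sketch below only redoes that arithmetic; the function names are hypothetical and nothing here comes from Intel MKL itself.</p>

```c
#include <assert.h>

/* Parallel efficiency: speedup divided by thread count (1.0 is ideal scaling). */
static double parallel_efficiency(double speedup, int threads)
{
    assert(threads > 0);
    return speedup / threads;
}

/* Peak rate implied by an achieved rate and the fraction of peak it represents. */
static double implied_peak(double achieved_gflops, double fraction_of_peak)
{
    assert(fraction_of_peak > 0.0 && fraction_of_peak <= 1.0);
    return achieved_gflops / fraction_of_peak;
}
```

<p>For the numbers above, parallel_efficiency(1.9, 2) and parallel_efficiency(3.8, 4) both give 0.95, parallel_efficiency(7.9, 8) gives about 0.99, and implied_peak(96.5, 0.943) gives roughly 102 Gflops.</p>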
<h1 class="sectionHeading">Usage Guidelines</h1>
Since Intel MKL is threaded using OpenMP, its behavior can be affected by OpenMP controls. For added control over threading behavior, Intel MKL provides a number of service functions that mirror the OpenMP controls. These functions allow the user to control the number of threads the library uses, either as a whole or per domain (i.e., separate controls for BLAS, LAPACK, etc.). One application of these independent controls is the ability to allow nested parallelism. For example, behavior of an application threaded using OpenMP could be set using the <span style="font-family: courier;">OMP_NUM_THREADS</span> environment variable or <span style="font-family: courier;">omp_set_num_threads()</span> function, while Intel MKL threading behavior was set independently using the Intel MKL specific controls: <span style="font-family: courier;">MKL_NUM_THREADS</span> or <span style="font-family: courier;">mkl_set_num_threads()</span> as appropriate. Finally, for those who must always run Intel MKL functions on a single thread, a sequential library is provided that is free of all dependencies on the threading runtime.<br /><br />Intel® Hyper-Threading Technology is most effective when each thread performs different types of operations and there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria, because the threaded portions of the library execute at high efficiency using most of the available resources and perform identical operations on each thread. Because of that, Intel MKL will by default use only as many threads as there are physical cores.<br /><br />
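<p>When nested parallelism is enabled, one simple way to avoid over-subscription is to divide the physical cores between the application threads and the library threads. The helper below is a hypothetical sketch of that budgeting rule, not part of Intel MKL or OpenMP; its results would be passed to omp_set_num_threads() for the application and mkl_set_num_threads() for the library.</p>

```c
#include <assert.h>

/* Hypothetical helper: number of library threads each outer (application)
 * thread may use so that outer * inner stays within the physical core count,
 * provided outer_threads itself does not exceed physical_cores. */
static int inner_threads_per_outer(int physical_cores, int outer_threads)
{
    assert(physical_cores > 0 && outer_threads > 0);
    int inner = physical_cores / outer_threads;
    return inner > 0 ? inner : 1;   /* the library always gets at least one thread */
}
```

<p>On an eight-core system with two application threads, for example, inner_threads_per_outer(8, 2) suggests four library threads per application thread. Whether such a split performs better than a single level of threading depends on the workload and should be measured.</p>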
<h1 class="sectionHeading">Additional Resources</h1>
<a href="http://software.intel.com/en-us/parallel/">Intel® Software Network Parallel Programming Community<br /><br /></a><a href="http://software.intel.com/en-us/intel-mkl">Intel® Math Kernel Library<br /><br /></a><a target="_blank" href="http://www.netlib.org/">Netlib: Information about BLAS, LAPACK, and ScaLAPACK</a> ]]></description>
      <link>http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library</link>
      <pubDate>Tue, 09 Mar 2010 08:07:37 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library</guid>
      <category>Parallel Programming</category>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Different parallelization techniques and Intel® MKL FFT</title>
      <description><![CDATA[ <p>The following techniques can be used to parallelize applications that use FFTs from Intel MKL. In this article, the examples are threaded using OpenMP at the user level.</p>
<p><b>1. </b><b>You do not create threads in your application but specify the parallel mode within the FFT module of Intel MKL.</b></p>
<p style="padding-left: 30px;"><b>Example:  Using Intel MKL Internal Threading Mode</b></p>
<p style="padding-left: 30px;">#include "mkl_dfti.h"</p>
<p style="padding-left: 30px;">int main () {</p>
<p style="padding-left: 30px;">float x[200][100];</p>
<p style="padding-left: 30px;">DFTI_DESCRIPTOR_HANDLE my_desc1_handle;</p>
<p style="padding-left: 30px;">MKL_LONG status, len[2];</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 200; len[1] = 100;</p>
<p style="padding-left: 30px;">status = DftiCreateDescriptor( &amp;my_desc1_handle, DFTI_SINGLE,DFTI_REAL, 2,len);</p>
<p style="padding-left: 30px;">status = DftiCommitDescriptor(my_desc1_handle);</p>
<p style="padding-left: 30px;">status = DftiComputeForward(my_desc1_handle, x);</p>
<p style="padding-left: 30px;">status = DftiFreeDescriptor(&amp;my_desc1_handle);</p>
<p>}</p>
<p>See <a href="http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading/">Intel® MKL 10.0 threading</a> for more information on how to do this.</p>
<p><b>2. </b><b>You create threads in the application yourself and have each thread perform all stages of FFT implementation, including descriptor initialization, FFT computation, and descriptor deallocation.</b></p>
<p>In this case, each descriptor is used only within its corresponding thread. It is recommended to set single-threaded mode for Intel MKL.</p>
<p>Specify the number of threads as below:</p>
<p>Set MKL_NUM_THREADS = 1 so that Intel MKL works in single-threaded mode (recommended), or call the <br />mkl_set_num_threads( 1 ) threading control function.<br /><br />Set OMP_NUM_THREADS = n, where n is the number of cores, so that the user program works in multi-threaded mode if it is threaded using OpenMP.</p>
<p>The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have its default value of 1.</p>
<p><b>Example: Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region</b></p>
<p>#include "mkl_dfti.h"</p>
<p>#include "omp.h"</p>
<p>int main () {</p>
<p style="padding-left: 30px;">float _Complex x[200][100];</p>
<p style="padding-left: 30px;">MKL_LONG len[2];</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 50; len[1] = 100;</p>
<p style="padding-left: 30px;">// each thread calculates a complex FFT for its 50x100 block of x</p>
<p style="padding-left: 30px;">#pragma omp parallel</p>
<p style="padding-left: 30px;">{</p>
<p style="padding-left: 60px;">DFTI_DESCRIPTOR_HANDLE my_desc_handle;</p>
<p style="padding-left: 60px;">MKL_LONG myStatus;</p>
<p style="padding-left: 60px;">int myID = omp_get_thread_num ();</p>
<p style="padding-left: 60px;">myStatus = DftiCreateDescriptor (&amp;my_desc_handle, DFTI_SINGLE,</p>
<p style="padding-left: 60px;">DFTI_COMPLEX, 2, len);</p>
<p style="padding-left: 60px;">myStatus = DftiCommitDescriptor (my_desc_handle);</p>
<p style="padding-left: 60px;">myStatus = DftiComputeForward (my_desc_handle, &amp;x [myID * len[0]] [0] );</p>
<p style="padding-left: 60px;">myStatus = DftiFreeDescriptor (&amp;my_desc_handle);</p>
<p style="padding-left: 30px;">} /* End OpenMP parallel region */</p>
<p>}</p>
<br />
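<p>The expression &amp;x [myID * len[0]] [0] used above gives each thread the start of its own 50-row block of the 200x100 array, which only works if the parallel region runs with exactly 200 / 50 = 4 threads. The following sketch spells out that partitioning; the constant and function names are assumptions chosen for illustration and are not part of Intel MKL.</p>

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 200, COLS = 100, ROWS_PER_THREAD = 50 };

/* Number of threads needed so that each one owns ROWS_PER_THREAD full rows. */
static int threads_needed(void)
{
    return ROWS / ROWS_PER_THREAD;
}

/* Flat element offset of the block owned by thread_id, i.e. the offset of
 * x[thread_id * ROWS_PER_THREAD][0] within a ROWS x COLS array. */
static size_t block_offset(int thread_id)
{
    assert(thread_id >= 0 && thread_id < threads_needed());
    return (size_t)thread_id * ROWS_PER_THREAD * COLS;
}
```

<p>A thread ID outside the range 0 to 3 would address memory past the end of x, so the thread count of the parallel region needs to match threads_needed().</p>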
<p><b>3. </b><b>You create threads in the application yourself after initializing all FFT descriptors.</b></p>
<p>This implies that threading is employed for parallel FFT computation only, and the descriptors are released upon return from the parallel region.</p>
<p>In this case, each descriptor is used only within its corresponding thread. You must explicitly set single-threaded mode for Intel MKL; otherwise, the actual number of threads may differ from one, because the DftiCommitDescriptor function is not in a parallel region.</p>
<p> </p>
<p><b>Example: Using Parallel Mode with Multiple Descriptors Initialized in One Thread</b></p>
<p>#include "mkl_dfti.h"</p>
<p>#include "omp.h"</p>
<p>int main (){</p>
<p style="padding-left: 30px;">float _Complex x[200][100];</p>
<p style="padding-left: 30px;">MKL_LONG len[2];</p>
<p style="padding-left: 30px;">MKL_LONG i;</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 50; len[1] = 100;</p>
<p style="padding-left: 30px;">DFTI_DESCRIPTOR_HANDLE my_desc_handle[4];</p>
<p style="padding-left: 30px;">MKL_LONG myStatus;</p>
<p style="padding-left: 30px;">// create and commit all four descriptors in the main thread</p>
<p style="padding-left: 30px;">for (i=0;i&lt;4;i++) {</p>
<p style="padding-left: 60px;">myStatus = DftiCreateDescriptor (&amp;my_desc_handle[i], DFTI_SINGLE, DFTI_COMPLEX, 2, len);</p>
<p style="padding-left: 60px;">myStatus = DftiCommitDescriptor (my_desc_handle[i]);</p>
<p style="padding-left: 30px;">}</p>
<p style="padding-left: 30px;">// each thread calculates a complex FFT for its 50x100 block of x</p>
<p style="padding-left: 30px;">#pragma omp parallel</p>
<p style="padding-left: 30px;">{</p>
<p style="padding-left: 60px;">int myID = omp_get_thread_num ();</p>
<p style="padding-left: 60px;">MKL_LONG threadStatus = DftiComputeForward (my_desc_handle[myID], &amp;x [myID * len[0]] [0] );</p>
<p style="padding-left: 30px;">} /* End OpenMP parallel region */</p>
<p style="padding-left: 30px;">for (i=0;i&lt;4;i++) myStatus = DftiFreeDescriptor (&amp;my_desc_handle[i]);</p>
<p>}</p>
<p> </p>
<p>Specify the number of threads as:</p>
<p>Set MKL_NUM_THREADS = 1 so that Intel MKL works in single-threaded mode (obligatory), or call the <br />mkl_set_num_threads( 1 ) threading control function.</p>
<p>Set OMP_NUM_THREADS = n, where n is the number of cores, so that the user program works in multi-threaded mode if you are using OpenMP for threading.</p>
<p>The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have the default value of 1.</p>
<p> </p>
<p><b>4. You create threads in the application yourself using OpenMP after initializing a single FFT descriptor. </b></p>
<p>This implies that threading is employed for parallel FFT computation only, and the descriptor is released upon return from the parallel region. In this case, each thread uses the same descriptor.</p>
<p>The following example illustrates a parallel user program with a common descriptor used in several threads.</p>
<p><b>Example: Using Parallel Mode with a Common Descriptor</b></p>
<p>// There is no need to set the number of threads inside Intel MKL,</p>
<p>// because single-threaded mode is forced automatically in this case.</p>
<p>// set OMP_NUM_THREADS = 4 - multi-threaded mode for the user program</p>
<p> </p>
<p>#include "mkl_dfti.h"</p>
<p>#include "omp.h"</p>
<p>int main (){</p>
<p style="padding-left: 30px;">float _Complex x[200][100];</p>
<p style="padding-left: 30px;">MKL_LONG status;</p>
<p style="padding-left: 30px;">DFTI_DESCRIPTOR_HANDLE desc_handle;</p>
<p style="padding-left: 30px;">int nThread = omp_get_max_threads ();</p>
<p style="padding-left: 30px;">MKL_LONG len[2];</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 50; len[1] = 100;</p>
<p style="padding-left: 30px;">status = DftiCreateDescriptor (&amp;desc_handle, DFTI_SINGLE, DFTI_COMPLEX, 2, len);</p>
<p style="padding-left: 30px;">status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, nThread);</p>
<p style="padding-left: 30px;">status = DftiCommitDescriptor (desc_handle);</p>
<p style="padding-left: 30px;">// each thread calculates a complex FFT for its 50x100 block of x</p>
<p style="padding-left: 30px;">#pragma omp parallel num_threads(nThread)</p>
<p style="padding-left: 30px;">{</p>
<p style="padding-left: 60px;">MKL_LONG myStatus;</p>
<p style="padding-left: 60px;">int myID = omp_get_thread_num ();</p>
<p style="padding-left: 60px;">myStatus = DftiComputeForward (desc_handle,  &amp;x [myID * len[0]] [0] );</p>
<p style="padding-left: 30px;">} /* End OpenMP parallel region */</p>
<p style="padding-left: 30px;">status = DftiFreeDescriptor (&amp;desc_handle);</p>
<p>}</p>
<p> </p>
<p>In this case, the number of threads, like any other configuration parameter, must not be changed after the DftiCommitDescriptor() function has completed FFT initialization.</p>
<p>In cases "1", "2", and "3", listed above, set the parameter DFTI_NUMBER_OF_USER_THREADS to 1 (its default value), since each particular descriptor instance is used only in a single thread.</p>
<p>In case "4", you must use the DftiSetValue() function to set the DFTI_NUMBER_OF_USER_THREADS to the actual number of FFT computation threads, because multiple threads will be using the same descriptor. If this setting is not done, your program will work incorrectly or fail, since the descriptor contains individual data for each thread.</p>
<p> </p>
<p><b>Warning:</b></p>
<p>• Parallelizing your program and using Intel MKL internal threading at the same time is not recommended, because it degrades performance. Note that in case "4" above, FFT computation is automatically initiated in single-threaded mode.</p>
<p>• You must not change the number of threads after the DftiCommitDescriptor() function has completed FFT initialization.</p>
<p> </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft</link>
      <pubDate>Wed, 24 Feb 2010 03:23:35 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>New in Intel® MKL 10.2</title>
      <description><![CDATA[ <p><b>New in Intel® MKL 10.2 Update 4:</b></p>
<ul>
<li>New Features           
<ul>
<li>Introduced the single precision complex absolute value function SCABS1 </li>
<li>Introduced the solver ?DTSVB for diagonally dominant tri-diagonal systems which is up to 2x faster than the general solver with partial pivoting (?GTSV)             
<ul>
<li>Added routines for factorization (?DTTRFB) and the forward/backward substitution (?DTTRSB) of the diagonally dominant tri-diagonal systems </li>
</ul>
</li>
</ul>
</li>
<li>Performance improvements            
<ul>
<li>FFTs             
<ul>
<li>Enhanced performance for transforms which are a multiple of 8 or 13 </li>
<li>Optimized 1D complex cluster FFTs for non-power-of-2 vector lengths </li>
</ul>
</li>
<li>VSL             
<ul>
<li>Convolution and Correlation computations that require decimation show significant improvements </li>
</ul>
</li>
</ul>
</li>
<li>Bug fixes (see <a href="http://software.intel.com/en-us/articles/intel-mkl-102-fixes-list/">fixes list</a>)</li>
</ul>
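<p>A solver for diagonally dominant tridiagonal systems can be faster than a general one because diagonal dominance makes elimination without pivoting numerically safe. The sketch below shows the textbook Thomas algorithm as one instance of that idea. It is an illustration only: it is not the algorithm used by ?DTSVB, and the function name and argument layout are assumptions.</p>

```c
#include <assert.h>

/* Solve a tridiagonal system in place without pivoting (Thomas algorithm).
 * dl[0..n-2]: sub-diagonal, d[0..n-1]: diagonal, du[0..n-2]: super-diagonal,
 * b[0..n-1]: right-hand side on entry, solution on exit.
 * d is overwritten. Intended for diagonally dominant matrices, where no
 * pivot becomes zero during elimination. */
static void thomas_solve(int n, const double *dl, double *d,
                         const double *du, double *b)
{
    assert(n > 0);
    /* Forward elimination: remove the sub-diagonal row by row. */
    for (int i = 1; i < n; i++) {
        double m = dl[i - 1] / d[i - 1];
        d[i] -= m * du[i - 1];
        b[i] -= m * b[i - 1];
    }
    /* Back substitution, starting from the last unknown. */
    b[n - 1] /= d[n - 1];
    for (int i = n - 2; i >= 0; i--)
        b[i] = (b[i] - du[i] * b[i + 1]) / d[i];
}
```

<p>Each row is visited once in each sweep and no row exchanges are tracked, which is the work a solver with partial pivoting such as ?GTSV cannot skip.</p>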
<p><b>New in Intel® MKL 10.2 Update 3:</b></p>
<ul>
<li>Performance improvements      
<ul>
<li>BLAS: Several Level 1 &amp; 2 BLAS functions newly threaded; Improved scaling for DGEMM for skinny matrices</li>
<li> LAPACK: Improved scalability for LAPACK functions: ?POTRF, ?GEBRD, ?SYTRD, ?HETRD, and ?STEDC</li>
<li>FFTs: Extended threading to small-size multi-dimensional transforms and other cases</li>
<li>VML: Further optimizations: v(s,d)Asin, v(s,d)Acos, v(s,d)Ln, v(s,d)Log10, vsLog1p, v(s,d)Hypot</li>
<li>VSL: Improved performance for viRngPoisson and viRngPoissonV random number generators</li>
</ul>
</li>
<li>Usability/Interface improvements      
<ul>
<li>Improved example programs for uBLAS, Java, FFTW3, LAPACK95, and BLAS95</li>
<li>New 64-bit integer (ILP64) fftw_mpi interfaces for cluster FFTs</li>
</ul>
</li>
<li>Bug fixes (see <a href="http://software.intel.com/en-us/articles/intel-mkl-102-fixes-list/">fixes list</a>)</li>
</ul>
<p><b>New in Intel® MKL 10.2 Update 2:</b></p>
<ul>
<li>Performance improvements         
<ul>
<li>Many improvements in BLAS functions for Intel® Core™ i7 processors, and Intel® Xeon® processor 5300, 5400, and 5500 series </li>
<li>Improved scalability of the following LAPACK functions: ?POTRF, ?GEBRD, ?SYTRD, ?HETRD, and ?STEDC divide and conquer eigensolvers </li>
<li>PARDISO OOC performance has improved significantly for symmetric positive definite matrices </li>
<li>Improved performance for the double precision Sobol generator for dimensions &gt;= 16 </li>
<li>Improvements in many VML functions for Intel® Xeon® processor 5500 series and others: v(s,d)Pow, v(s,d)Ceil/Trunc/Floor, vsSin/Cos/SinCos, and vdSin/Cos/SinCos </li>
<li>Improved scalability of 1D, single precision, complex FFTs and improved performance for small 3D transforms </li>
</ul>
</li>
<li>Usability/Interface improvements         
<ul>
<li>Support for 64-bit integer parameters in FFTW wrappers </li>
<li>Intel MKL is now compatible with the representation of logical values in GCC 4.4.0 </li>
<li>All transpose functions now have a Fortran interface </li>
</ul>
</li>
<li>Bug fixes (see <a href="http://software.intel.com/en-us/articles/intel-mkl-102-fixes-list/">fixes list</a>)</li>
</ul>
<p><b>New in Intel® MKL 10.2:</b></p>
<ul>
<li>New features            
<ul>
<li>LAPACK 3.2 </li>
<li>Introduced implementation of the DZGEMM Extended BLAS function (as described at http://www.netlib.org/blas/blast-forum/). See the description of the ?gemm family of functions in the BLAS section of the reference manual. </li>
<li>PARDISO now supports real and complex, single precision data </li>
</ul>
</li>
<li>Usability/Interface improvements            
<ul>
<li>Sparse matrix format conversion routines which convert between the following formats:            
<ul>
<li>CSR (3-array variation) &lt;-&gt; CSC (3-array variation) </li>
<li>CSR (3-array variation) &lt;-&gt; diagonal format </li>
<li>CSR (3-array variation) &lt;-&gt; skyline </li>
</ul>
</li>
<li>Fortran95 BLAS and LAPACK mod files are now included            
<ul>
<li>Modules are pre-built with the Intel compiler and located in the include directory (see Intel® MKL User's Guide for full path) </li>
<li>Source is still included for use with other compilers </li>
<li>Documentation for these interfaces can be found in the Intel® MKL User's Guide </li>
</ul>
</li>
<li>The FFTW3 interface is now integrated directly into the main libraries            
<ul>
<li>Source code is still included to create wrappers for use with compilers not compatible with the default Intel® Fortran compiler convention for name decoration </li>
<li>See Appendix G of the Reference Manual for information </li>
</ul>
</li>
<li>DFTI_DESCRIPTOR_HANDLE now represents a true type name and can now be referenced as a type in user programs </li>
<li>Added parameter to Jacobi matrix calculation routine in the optimization solver domain to allow access to user data (see the description of the djacobix function in the reference manual for more information) </li>
<li>Added an interface on 64-bit architectures that maps calls to single precision BLAS functions in Intel® MKL (functions with 's' or 'c' initial letter) to 64-bit floating point precision functions (see 'sp2dp' in the Intel® MKL User's Guide for more information) </li>
<li>Compatibility libraries (also known as "dummy" libraries) have been removed from this version of the library </li>
</ul>
</li>
<li>Performance improvements            
<ul>
<li>Further threading in BLAS level 1 and 2 functions for Intel® 64 architecture            
<ul>
<li>Level 1 functions (vector-vector): (CS,ZD,S,D)ROT, (C,Z,S,D)COPY, and (C,Z,S,D)SWAP            
<ul>
<li>Increase in performance by up to 1.7-4.7 times over version 10.1 Update 1 on 4-core Intel® Core™ i7 processor depending on data location in cache </li>
<li>Increase in performance by up to 14-130 times over version 10.1 Update 1 on 24-core Intel® Xeon® processor 7400 series system, depending on data location in cache </li>
</ul>
</li>
<li>Level 2 functions (matrix-vector): (C,Z,S,D)TRMV, (S,D)SYMV, (S,D)SYR, and (S,D)SYR2            
<ul>
<li>Increase in performance by up to 1.9-2.9 times over version 10.1 Update 1 on 4-core Intel® Core™ i7 processor, depending on data location in cache </li>
<li>Increase in performance by up to 16-40 times over version 10.1 Update 1 on 24-core Intel® Xeon® processor 7400 series system, depending on data location in cache </li>
</ul>
</li>
</ul>
</li>
<li>Introduced recursive algorithm in 32-bit sequential version of DSYRK for up to 20% performance improvement on Intel® Core™ i7 processors and Intel® Xeon® processors in 5300, 5400, and 7400 series. </li>
<li>Improved LU factorization (DGETRF) by 25% over Intel MKL 10.1 Update 1 for large sizes on the Intel® Xeon® 7460 Processor; small sizes are also dramatically improved </li>
<li>BLAS *TBMV/*TBSV functions now use level 1 BLAS functions to improve performance by up to 3% on Intel® Core™ i7 processors and up to 10% on Intel® Core™2 processor 5300 and 5400 series. </li>
<li>Improved threading algorithms to increase DGEMM performance            
<ul>
<li>up to 7% improvement on 8 threads and up to 50% on 3,5,7 threads on the Intel® Core™ i7 processor </li>
<li>up to 50% improvement on 3 threads on Intel® Xeon® processor 7400 series. </li>
</ul>
</li>
<li>Threaded 1D complex-to-complex FFTs for non-prime sizes </li>
<li>New algorithms for 3D complex-to-complex transforms deliver better performance for small sizes (up to 64x64x64) on 1 or 2 threads </li>
<li>Implemented high-level parallelization of out-of-core (OOC) PARDISO when operating on symmetric positive definite matrices. </li>
<li>Reduced memory use by PARDISO for both in-core and out-of-core on all matrix types            
<ul>
<li>PARDISO OOC now uses less than half the memory previously used in Intel MKL 10.1 for real symmetric, complex Hermitian, or complex symmetric matrices </li>
</ul>
</li>
<li>Parallelized Reordering and Symbolic factorization stage in PARDISO/DSS </li>
<li>Up to 2 times better performance (30% improvement on average) on Intel® Core™ i7 and Intel® Core™2 processors for the following VML functions: v(s,d)Round, v(s,d)Inv, v(s,d)Div, v(s,d)Sqrt, v(s,d)Exp, v(s,d)Ln, v(s,d)Atan, v(s,d)Atan2 </li>
<li>Optimized versions of the following functions available for Intel® Advanced Vector Extensions (Intel® AVX)            
<ul>
<li>BLAS: DGEMM </li>
<li>FFTs </li>
<li>VML: exp, log, and pow </li>
<li>See important information in the Intel® MKL User's Guide regarding the mkl_enable_instructions() function for access to these functions </li>
</ul>
</li>
</ul>
</li>
</ul> ]]></description>
      <link>http://software.intel.com/en-us/articles/new-in-intel-mkl-10-2</link>
      <pubDate>Fri, 19 Feb 2010 14:45:43 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/new-in-intel-mkl-10-2#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/new-in-intel-mkl-10-2</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Using Intel® Math Kernel Library and Intel® Integrated Performance Primitives in the Microsoft* .NET* Framework</title>
      <description><![CDATA[ <p class="Default">Intel® Performance Libraries such as Intel® MKL and Intel® IPP are unmanaged code libraries. They are written in native programming languages and compiled to machine code which can run on a target computer directly. This article is intended to educate Intel® MKL and Intel® IPP users on the basics of calling these libraries from .NET Framework languages such as C#. <br /><br /><strong>To download complete article click here -  </strong><a target="_top" href="http://software.intel.com/file/24529"><strong>323195-001US.pdf</strong></a></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-math-kernel-library-and-intel-integrated-performance-primitives-in-the-microsoft-net-framework</link>
      <pubDate>Wed, 06 Jan 2010 00:59:04 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-math-kernel-library-and-intel-integrated-performance-primitives-in-the-microsoft-net-framework#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-math-kernel-library-and-intel-integrated-performance-primitives-in-the-microsoft-net-framework</guid>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    </item>
    <item>
      <title>Performance slow down when dynamically linking with Intel MKL</title>
      <description><![CDATA[ <p><br /><br />When linking dynamically with Intel MKL, users may see slower performance in some test applications. The cause is the overhead of loading the shared libraries at run time. The slowdown is noticeable only when the application's data set is very small, so that the library load time is a significant share of the total run time. <br /><br />The following conditions help to identify this problem: 1) the data set in the application is small; 2) a second run may perform better than the first run; 3) the problem does not occur when linking statically with Intel MKL.<br /><br />Note: when benchmarking small data sets with a test application, shared library loading may affect the measured performance. Link Intel MKL statically to avoid the problem. In real applications, the overhead of loading shared libraries is negligible.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl</link>
      <pubDate>Tue, 24 Nov 2009 18:39:33 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Intel MKL Threaded Functions</title>
      <description><![CDATA[ <p><br />Intel MKL is threaded extensively for different domains. The threaded function list includes:</p>
<table cellpadding="0" cellspacing="0" border="1" style="width: 646px; height: 592px;">
<tbody>
<tr>
<td width="142" valign="top">
<p><b>Domain </b></p>
</td>
<td width="423" valign="top">
<p><b> Threaded Functions</b></p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>Direct sparse solver</p>
</td>
<td width="423" valign="top">
<p><br />PARDISO Interface Routines</p>
<p>DSS Interface Routines</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>LAPACK</p>
</td>
<td width="423" valign="top">
<p align="left"><br />Linear equations, computational routines:<br />- factorization: *getrf, *gbtrf, *potrf, *pptrf, *sytrf, *hetrf, *sptrf, *hptrf<br />- solving: *gbtrs, *gttrs, *pptrs, *pbtrs, *pttrs, *sytrs, *sptrs, *hptrs, *tptrs, *tbtrs</p>
<p align="left">Orthogonal factorization, computational routines:<br />*geqrf, *ormqr, *unmqr, *ormlq, *unmlq, *ormql, *unmql, *ormrq, *unmrq.</p>
<p>Singular Value Decomposition, computational routines: *gebrd, *bdsqr</p>
<p align="left">Symmetric Eigenvalue Problems, computational routines:<br />*sytrd, *hetrd, *sptrd, *hptrd, *steqr, *stedc</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>BLAS</p>
</td>
<td width="423" valign="top">
<p align="left"><br />Level 1 BLAS: *axpy, *copy, *swap, ddot/sdot, drot/srot</p>
<p>Level 2 BLAS: *gemv, *trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv</p>
<p>Level 3 BLAS: all routines are threaded</p>
<p>Sparse BLAS: all routines are threaded except the Level 2 triangular solvers</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>VML</p>
</td>
<td width="423" valign="top">
<p align="left"><br />All Mathematical functions except the following are threaded:    Pack/Unpack family, Rounding family, Add, Sub, Mul, real Abs, Sqr</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>FFT</p>
</td>
<td width="423" valign="top">
<p align="left">DFT transforms are threaded, with the following exceptions:</p>
<p align="left">1) Real 1D transforms are not threaded<br /> 2) Split-complex 1D transforms are not threaded<br /> 3) Small multidimensional transforms are not threaded</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
<p align="left">Note 1: A number of <i>other</i> LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: *gesv, *posv, *gels, *gesvd, *syev, *heev, etc.<br /> <br />Note 2: Level 1 BLAS and Level 2 BLAS are threaded only for:<br /> 1) Intel® 64 architecture<br /> 2) Intel® Core<sup>TM</sup>2 Duo and Intel® Core<sup>TM</sup> i7 processors</p>
<p align="left">Note 3: All data in this article apply to Intel MKL version 10.2, the latest version at the time of writing.</p>
<p align="left">Note 4: The list of threaded functions for the current version of Intel MKL can be found in the MKL User's Guide (see chapter 6, "Using Intel(R) MKL Parallelism").</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-mkl-threaded-functions</link>
      <pubDate>Tue, 24 Nov 2009 18:05:14 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-mkl-threaded-functions#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-mkl-threaded-functions</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Enabling Intel® Advanced Vector Extensions (Intel® AVX) optimizations in Intel® MKL</title>
      <description><![CDATA[ <p><b>Enabling Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) optimizations in Intel<sup>®</sup> MKL</b></p>
<p>Intel® MKL 10.2 has early support for Intel® AVX and includes special optimizations for several BLAS, DFT and VML functions. The Intel® AVX code in Intel® MKL corresponds to programmer's reference 319433-004, available on the <a href="http://software.intel.com/sites/avx/">Intel® AVX</a> page. Actual Intel® AVX hardware may correspond to an updated version of the specification. To prevent this code from running on such hardware, Intel® MKL blocks dispatching of Intel® AVX code unless it is requested explicitly via an enabling function call. This document provides a step-by-step guide for enabling the optimizations in Intel® MKL.</p>
<p><b><em>Functions with Intel® AVX optimizations</em></b></p>
<p>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td width="111" valign="top">
<p><b>Domain</b></p>
</td>
<td width="113" valign="top">
<p><b>Functions</b></p>
</td>
</tr>
<tr>
<td width="111" valign="top">
<p>BLAS</p>
</td>
<td width="113" valign="top">
<p>dgemm</p>
</td>
</tr>
<tr>
<td width="111" valign="top">
<p>FFT</p>
</td>
<td width="113" valign="top">
<p>all</p>
</td>
</tr>
<tr>
<td width="111" valign="top">
<p>VML</p>
</td>
<td width="113" valign="top">
<p>exp, log, pow</p>
</td>
</tr>
</tbody>
</table>
</p>
<p><b><em>Building an Intel® AVX enabled application with Intel® MKL</em></b></p>
<p>To enable Intel® AVX optimizations in an Intel® MKL application, add a call to the "mkl_enable_instructions" function before any other Intel® MKL call. The application will then benefit from Intel® AVX optimizations when executed on the Intel® SDE emulator.</p>
<p>The "mkl_enable_instructions" function is declared in mkl_service.fi for the FORTRAN 77 interface and in mkl_service.h for the C interface, and has the following syntax:</p>
<p>irc = mkl_enable_instructions(MKL_AVX_ENABLE)</p>
<p>The function returns 0 if it fails, for instance when it is called after another Intel® MKL function. Otherwise the return value is 1.</p>
<p>With the call added, compile the application and link it with Intel® MKL in the usual way. See the Intel® MKL User's Guide for details.</p>
<p><b><em>Running an Intel® AVX enabled application</em></b></p>
<p>Exploring Intel® AVX starts with <a href="http://software.intel.com/en-us/articles/pre-release-license-agreement-for-intel-software-development-emulator-accept-end-user-license-agreement-and-download">downloading the Intel® SDE</a> emulator from the WhatIf site <a href="http://whatif.intel.com/">http://whatif.intel.com/</a>. Intel® SDE does not require installation: download and unpack the package that suits your system (Intel® MKL 10.2 supports Intel® 64 with Linux* or Windows* OS).</p>
<p>With Intel® SDE in place, run your Intel® AVX enabled application via the emulator to get the new instructions working.</p>
<p>Linux*:</p>
<p>&lt;path to SDE&gt;/sde -- &lt;application&gt; [args]</p>
<p>Windows*:</p>
<p>&lt;path to SDE&gt;\sde.exe -- &lt;application&gt; [args]</p>
<p>Refer to the <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">Intel® SDE page</a> for detailed information on how to use Intel® SDE for application analysis.</p>
<p><b><em>Code sample</em></b></p>
<p>This sample C code demonstrates multiplication of two 8x8 matrices using Intel® AVX optimized DGEMM from Intel® MKL 10.2.</p>
<p>#include &lt;stdio.h&gt;</p>
<p>#include "mkl_service.h"</p>
<p>#include "mkl_cblas.h"</p>
<p>#include "mkl_types.h"</p>
<p> </p>
<p>int main(void)</p>
<p>{</p>
<p>    MKLVersion ver;</p>
<p> </p>
<p>    const MKL_INT n = 8;</p>
<p>    double        alpha = 1.0, beta = 1.0;</p>
<p>    double        a[n*n], b[n*n], c[n*n];</p>
<p>    int i;</p>
<p> </p>
<p>// Enable Intel® AVX optimizations in Intel® MKL</p>
<p>     if ( MKL_Enable_Instructions(MKL_AVX_ENABLE) ) {</p>
<p>           puts("Intel(R)  AVX optimizations are enabled.");</p>
<p>     } else {</p>
<p>           puts("Intel(R)  AVX optimizations are not enabled. MKL_Enable_Instructions was called after another MKL function.");</p>
<p>     }</p>
<p> </p>
<p>// Print information on CPU optimization in effect</p>
<p>     MKLGetVersion(&amp;ver);</p>
<p>     printf("Processor optimization: %s\n",ver.Processor);</p>
<p> </p>
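<p>// Generate input matrices and clear the output matrix: with beta = 1.0, DGEMM adds the old contents of c to the product</p>
<p>    for (i=0;i &lt; n*n;i++) {</p>
<p>           a[i] = i;</p>
<p>           b[i] = n*n-i;</p>
<p>           c[i] = 0.0;</p>
<p>     }</p>
<p>// Call Intel® MKL DGEMM function</p>
<p>     cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, alpha, a, n, b, n, beta, c, n);</p>
<p> </p>
<p>     return 0;</p>
<p>}</p>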
<p> </p>
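<p>The DGEMM sample does not print or check the product. As a hedged aid for doing so, here is a minimal reference sketch in plain C that needs no Intel® MKL. It computes C = alpha*A*B + beta*C for square row-major matrices, which is the operation the cblas_dgemm call requests. The names ref_dgemm_square and ref_sample_entry are hypothetical helpers introduced for illustration only, and the sketch assumes that C starts out as all zeros.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Reference row-major update C = alpha*A*B + beta*C for n-by-n matrices.
   Hypothetical helper for checking results; it is not part of Intel MKL. */
void ref_dgemm_square(size_t n, double alpha, const double *a, const double *b,
                      double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}

/* Builds the sample's 8x8 inputs (a[i] = i, b[i] = 64 - i), assumes C is
   zero on entry, and returns the entry C[row][col] of the reference result. */
double ref_sample_entry(size_t row, size_t col)
{
    enum { N = 8, NN = N * N };
    double a[NN], b[NN], c[NN];

    assert(row < (size_t)N && col < (size_t)N);
    for (size_t i = 0; i < (size_t)NN; i++) {
        a[i] = (double)i;
        b[i] = (double)((size_t)NN - i);
        c[i] = 0.0;                    /* assumption: C starts at zero */
    }
    ref_dgemm_square((size_t)N, 1.0, a, b, 1.0, c);
    return c[row * (size_t)N + col];
}
```

<p>Comparing a few entries of the MKL result with ref_sample_entry(row, col) is a quick sanity check. All values involved are small integers, so the reference result is exact in double precision.</p>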
<p>When linked with Intel® MKL 10.2 (see the User's Guide and the MKL release examples for linking instructions) and executed on Intel® SDE, the sample program produces the following output:</p>
<p>Intel(R) AVX optimizations are enabled.</p>
<p>Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) Enabled Processor.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/enabling-avx-optimizations-in-mkl</link>
      <pubDate>Mon, 10 Aug 2009 11:05:47 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/enabling-avx-optimizations-in-mkl#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/enabling-avx-optimizations-in-mkl</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Intel® MKL Threaded 1D FFTs</title>
      <description><![CDATA[ <p>In Intel MKL 10.2, one-dimensional complex-to-complex fast Fourier transforms (FFTs) are threaded for non-prime sizes of 2<sup>16</sup> and larger, with the following exception:</p>
<ul>
<li>if at least one prime factor is larger than 2<sup>29</sup>, the transform is not supported</li>
</ul>
<p>An Intel MKL 1D FFT of size K=N*M (where N and M are chosen somewhat arbitrarily) is computed using the Cooley-Tukey factorization algorithm in 2 stages. During the first stage, N transforms of size M are performed in parallel with necessary permutation. For the second stage, multiplication by twiddle factors and M transforms of size N are done in parallel. Care is taken to avoid cache line splits between the threads for both stages. Additionally, the size of the twiddle factor table is reduced by a binary split.</p>
<p>On an Intel® Xeon® Processor X5492 (2 x 4-core, 3.4 GHz) running a 64-bit operating system, FFT performance with 8 threads improved by up to 5 times compared to Intel MKL 10.1 Update 2.</p>
<p>Customers only need to upgrade to Intel MKL 10.2 to take advantage of this new feature in multithreaded applications. No code changes are required.</p>
<p>For more information on the Intel MKL FFTs, please refer to <a target="_blank" href="http://software.intel.com/sites/products/documentation/hpc/mkl/mklman.pdf">Chapter 11 of the Reference Manual</a>. Also refer to the <a target="_blank" href="http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/">Intel MKL Linking Advisor</a> for details on how to link to the FFT functions.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/mkl-threaded-1d-ffts</link>
      <pubDate>Fri, 17 Jul 2009 16:37:27 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/mkl-threaded-1d-ffts#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/mkl-threaded-1d-ffts</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Setting thread affinity on SMT or HT enabled systems for better performance</title>
      <description><![CDATA[ Simultaneous MultiThreading (SMT) or Hyper-Threading Technology (HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor.  However, Intel MKL fits neither of these criteria, because the threaded portions of the library execute at high efficiency, use most of the available resources, and perform identical operations on each thread.  You may obtain higher performance by disabling HT/SMT Technology.  See "Using the Intel® MKL Parallelism" for information on the default number of threads, changing this number, and other relevant details.<br /><br />If you run with SMT/HT enabled, performance may be especially affected if you run on fewer threads than physical cores.  Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the others altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. <br /><br /><br />For Intel MKL, we recommend setting<br /><br />KMP_AFFINITY=granularity=fine,compact,1,0 ]]></description>
      <link>http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems</link>
      <pubDate>Mon, 22 Jun 2009 22:59:30 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Intel® MKL LAPACK 3.2</title>
      <description><![CDATA[ <p>LAPACK 3.2 features have been added to Intel® MKL 10.2. LAPACK 3.2 introduces the following new features as described on netlib.org:</p>
<ul>
<li>Extra Precise Iterative Refinement </li>
<li>Non-Negative Diagonals from Householder QR </li>
<li>High Performance QR and Householder Reflections on Low-Profile Matrices </li>
<li>New fast and accurate Jacobi SVD </li>
<li>Routines for Rectangular Full Packed format </li>
<li>Pivoted Cholesky </li>
<li>Mixed precision iterative refinement routines for exploiting fast single precision hardware </li>
<li>Some new variants added for the one sided factorization </li>
<li>More robust DQDS algorithm</li>
</ul> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-mkl-lapack-32</link>
      <pubDate>Mon, 22 Jun 2009 20:41:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-mkl-lapack-32#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-mkl-lapack-32</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
  </channel></rss>