<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Tue, 24 Nov 2009 20:04:49 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-mkl-kb/type/performance-and-optimization/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles feed</title>
    <link>http://software.intel.com/en-us/articles/intel-mkl-kb/performance-and-optimization/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Performance slow down when dynamically linking with Intel MKL</title>
      <description><![CDATA[ <br /><br />When dynamically linking with Intel MKL, users may see a performance slowdown in some applications. This is because loading dynamic libraries at run time adds overhead to the application. The slowdown is noticeable only when the data set is very small, so that the overhead of loading DLLs (or shared libraries on Linux) is no longer negligible compared with the overall run time of the application. <br /><br />The following conditions can help identify whether this is the problem in your application: 1) The data set in the application is small. 2) The second run may perform better than the first run. 3) The problem does not occur when linking statically with Intel MKL. <br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl</link>
      <pubDate>Tue, 24 Nov 2009 18:39:33 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/performance-slow-down-when-dynamically-linking-with-intel-mkl</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Intel MKL Threaded Functions</title>
      <description><![CDATA[ <p> <br />Intel MKL is threaded extensively for different domains. The threaded function list includes: </p>
<table cellpadding="0" cellspacing="0" border="1" style="width: 646px; height: 592px;">
<tbody>
<tr>
<td width="142" valign="top">
<p><b>Domain </b></p>
</td>
<td width="423" valign="top">
<p><b> Threaded Functions</b></p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>Direct sparse solver</p>
</td>
<td width="423" valign="top">
<p><br />PARDISO Interface Routines</p>
<p>DSS Interface Routines</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>LAPACK</p>
</td>
<td width="423" valign="top">
<p align="left"><br />Linear equations, computational routines:<br />- factorization: *getrf, *gbtrf, *potrf, *pptrf, *sytrf, *hetrf, *sptrf, *hptrf<br />- solving: *gbtrs, *gttrs, *pptrs, *pbtrs, *pttrs, *sytrs, *sptrs, *hptrs, *tptrs, *tbtrs</p>
<p align="left">Orthogonal factorization, computational routines:<br />*geqrf, *ormqr, *unmqr, *ormlq, *unmlq, *ormql, *unmql, *ormrq, *unmrq.</p>
<p>Singular Value Decomposition, computational routines: *gebrd, *bdsqr</p>
<p align="left">Symmetric Eigenvalue Problems, computational routines:<br />*sytrd, *hetrd, *sptrd, *hptrd, *steqr, *stedc</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>BLAS</p>
</td>
<td width="423" valign="top">
<p align="left"><br />Level1 BLAS: *axpy, *copy, *swap, ddot/sdot, drot/srot</p>
<p>Level2 BLAS: *gemv, *trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv</p>
<p>All Level 3 BLAS routines and all Sparse BLAS routines are threaded, except the Level 2 sparse triangular solvers</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>VML</p>
</td>
<td width="423" valign="top">
<p align="left"><br />All Mathematical functions except the following are threaded:    Pack/Unpack family, Rounding family, Add, Sub, Mul, real Abs, Sqr</p>
</td>
</tr>
<tr>
<td width="142" valign="top">
<p>FFT</p>
</td>
<td width="423" valign="top">
<p align="left">DFT transforms are threaded, with the following exceptions:</p>
<p align="left"> 1) Real 1D transforms are not threaded<br /> 2) Split-complex 1D transforms are not threaded<br /> 3) Small multidimensional transforms are not threaded</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
<p align="left">Note 1:  A number of <i>other </i>LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: *gesv, *posv, *gels, *gesvd,*syev, *heev, etc.<br /> <br />Note 2: Level1 BLAS  and Level2 BLAS are threaded only for:<br /> 1) Intel® 64 architecture<br /> 2) Intel® Core<sup>TM</sup>2 Duo and Intel® Core<sup>TM</sup> i7 processors</p>
<p align="left"> </p>
<p align="left"> </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-mkl-threaded-functions</link>
      <pubDate>Tue, 24 Nov 2009 18:05:14 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-mkl-threaded-functions#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-mkl-threaded-functions</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Different parallelization techniques and Intel® MKL FFT</title>
      <description><![CDATA[ <p>The following techniques can be used to parallelize applications that use the FFT functions from Intel MKL. In this article, the examples are threaded with OpenMP at the user level.</p>
<p><b>1: </b><b>You do not create threads in your application but specify the parallel mode within the FFT module of Intel MKL.</b></p>
<p style="padding-left: 30px;"><b>Example:  Using Intel MKL Internal Threading Mode</b></p>
<p style="padding-left: 30px;">#include "mkl_dfti.h"</p>
<p style="padding-left: 30px;">int main (void) {</p>
<p style="padding-left: 30px;">float x[200][100];</p>
<p style="padding-left: 30px;">DFTI_DESCRIPTOR_HANDLE my_desc1_handle;</p>
<p style="padding-left: 30px;">MKL_LONG status, len[2];</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 200; len[1] = 100;</p>
<p style="padding-left: 30px;">status = DftiCreateDescriptor( &amp;my_desc1_handle, DFTI_SINGLE,DFTI_REAL, 2,len);</p>
<p style="padding-left: 30px;">status = DftiCommitDescriptor(my_desc1_handle);</p>
<p style="padding-left: 30px;">status = DftiComputeForward(my_desc1_handle, x);</p>
<p style="padding-left: 30px;">status = DftiFreeDescriptor(&amp;my_desc1_handle);</p>
<p>}</p>
<p>See <a href="http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading/">Intel® MKL 10.0 threading</a> for more information on how to do this.</p>
<p><b>2. </b><b>You create threads in the application yourself and have each thread perform all stages of FFT implementation, including descriptor initialization, FFT computation, and descriptor deallocation.</b></p>
<p>In this case, each descriptor is used only within its corresponding thread. It is recommended to set single-threaded mode for Intel MKL.</p>
<p>Specify the number of threads as below:</p>
<p>Set MKL_NUM_THREADS = 1 so that Intel MKL works in single-threaded mode (recommended), or call the mkl_set_num_threads( 1 ) threading control function.<br /><br />Set OMP_NUM_THREADS = n, where n is the number of cores, so that the customer program works in multi-threaded mode if it is threaded with OpenMP.</p>
<p>The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have its default value of 1.</p>
<p><b>Example: Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region</b></p>
<p>Note that the program in this example can be made single-threaded at the customer level while still using parallel mode within Intel MKL (case "1"). To achieve this, set the parameter DFTI_NUMBER_OF_TRANSFORMS = n, where n is the number of cores, and set the corresponding parameter DFTI_INPUT_DISTANCE = 5000.</p>
<p>#include "mkl_dfti.h"</p>
<p>#include "omp.h"</p>
<p>int main (void) {</p>
<p style="padding-left: 30px;">float _Complex x[200][100];</p>
<p style="padding-left: 30px;">MKL_LONG len[2];</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 50; len[1] = 100;</p>
<p style="padding-left: 30px;">// each thread calculates a complex FFT for a (50*100) matrix</p>
<p style="padding-left: 30px;">#pragma omp parallel</p>
<p style="padding-left: 30px;">{</p>
<p style="padding-left: 60px;">DFTI_DESCRIPTOR_HANDLE my_desc_handle;</p>
<p style="padding-left: 60px;">MKL_LONG myStatus;</p>
<p style="padding-left: 60px;">int myID = omp_get_thread_num ();</p>
<p style="padding-left: 60px;">myStatus = DftiCreateDescriptor (&amp;my_desc_handle, DFTI_SINGLE,</p>
<p style="padding-left: 60px;">DFTI_COMPLEX, 2, len);</p>
<p style="padding-left: 60px;">myStatus = DftiCommitDescriptor (my_desc_handle);</p>
<p style="padding-left: 60px;">myStatus = DftiComputeForward (my_desc_handle, &amp;x [myID * len[0]] [0] );</p>
<p style="padding-left: 60px;">myStatus = DftiFreeDescriptor (&amp;my_desc_handle);</p>
<p style="padding-left: 30px;">} /* End OpenMP parallel region */</p>
<p>}</p>
<br />
<p><b>3. </b><b>You create threads in the application yourself after initializing all FFT descriptors.</b></p>
<p><b></b></p>
<p>This implies that threading is employed for parallel FFT computation only, and the descriptors are released upon return from the parallel region.</p>
<p>In this case, each descriptor is used only within its corresponding thread. You must explicitly set single-threaded mode for Intel MKL; otherwise, the actual number of threads may differ from one, because the DftiCommitDescriptor function is not in a parallel region.</p>
<p> </p>
<p><b>Example: Using Parallel Mode with Multiple Descriptors Initialized in One Thread</b></p>
<p>#include "mkl_dfti.h"</p>
<p>#include "omp.h"</p>
<p>int main (void){</p>
<p style="padding-left: 30px;">float _Complex x[200][100];</p>
<p style="padding-left: 30px;">MKL_LONG len[2];</p>
<p style="padding-left: 30px;">MKL_LONG i;</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 50; len[1] = 100;</p>
<p style="padding-left: 30px;">DFTI_DESCRIPTOR_HANDLE my_desc_handle[4];</p>
<p style="padding-left: 30px;">MKL_LONG myStatus;</p>
<p style="padding-left: 30px;">for (i=0;i&lt;4;i++) myStatus = DftiCreateDescriptor (&amp;my_desc_handle[i], DFTI_SINGLE, DFTI_COMPLEX, 2, len);</p>
<p style="padding-left: 30px;">// each thread calculates a complex FFT for a (50*100) matrix</p>
<p style="padding-left: 30px;">#pragma omp parallel</p>
<p style="padding-left: 30px;">{</p>
<p style="padding-left: 60px;">int myID = omp_get_thread_num ();</p>
<p style="padding-left: 60px;">myStatus = DftiCommitDescriptor (my_desc_handle[myID]);</p>
<p style="padding-left: 60px;">myStatus = DftiComputeForward (my_desc_handle[myID], &amp;x [myID * len[0]] [0] );</p>
<p style="padding-left: 30px;">} /* End OpenMP parallel region */</p>
<p style="padding-left: 30px;">for (i=0;i&lt;4;i++) myStatus = DftiFreeDescriptor (&amp;my_desc_handle[i]);</p>
<p>}</p>
<p> </p>
<p>Specify the number of threads as:</p>
<p>Set MKL_NUM_THREADS = 1 so that Intel MKL works in single-threaded mode (obligatory), or call the mkl_set_num_threads( 1 ) threading control function.</p>
<p>Set OMP_NUM_THREADS = n, where n is the number of cores, so that the customer program works in multi-threaded mode if you are using OpenMP for threading.</p>
<p>The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have the default value of 1.</p>
<p> </p>
<p><b>4. You create threads in the application yourself using OpenMP after initializing a single FFT descriptor. </b></p>
<p>This implies that threading is employed for parallel FFT computation only, and the descriptor is released upon return from the parallel region. In this case, each thread uses the same descriptor.</p>
<p>The following example illustrates a parallel user program with a common descriptor used in several threads.</p>
<p><b>Example: Using Parallel Mode with a Common Descriptor</b></p>
<p><b></b></p>
<p>// There is no need to set the number of threads inside Intel MKL,</p>
<p>// since single-threaded mode for Intel MKL is forced automatically.</p>
<p>// Set OMP_NUM_THREADS = 4 for multi-threaded mode at the customer level.</p>
<p> </p>
<p>#include "mkl_dfti.h"</p>
<p>#include "omp.h"</p>
<p>int main (void){</p>
<p style="padding-left: 30px;">float _Complex x[200][100];</p>
<p style="padding-left: 30px;">MKL_LONG status;</p>
<p style="padding-left: 30px;">DFTI_DESCRIPTOR_HANDLE desc_handle;</p>
<p style="padding-left: 30px;">int nThread = omp_get_max_threads ();</p>
<p style="padding-left: 30px;">MKL_LONG len[2];</p>
<p style="padding-left: 30px;">//...put input data into x[j][k] 0&lt;=j&lt;=199, 0&lt;=k&lt;=99</p>
<p style="padding-left: 30px;">len[0] = 50; len[1] = 100;</p>
<p style="padding-left: 30px;">status = DftiCreateDescriptor (&amp;desc_handle, DFTI_SINGLE, DFTI_COMPLEX, 2, len);</p>
<p style="padding-left: 30px;">status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, nThread);</p>
<p style="padding-left: 30px;">status = DftiCommitDescriptor (desc_handle);</p>
<p style="padding-left: 30px;">// each thread calculates a complex FFT for a (50*100) matrix</p>
<p style="padding-left: 30px;">#pragma omp parallel num_threads(nThread)</p>
<p style="padding-left: 30px;">{</p>
<p style="padding-left: 60px;">MKL_LONG myStatus;</p>
<p style="padding-left: 60px;">int myID = omp_get_thread_num ();</p>
<p style="padding-left: 60px;">myStatus = DftiComputeForward (desc_handle,  &amp;x [myID * len[0]] [0] );</p>
<p style="padding-left: 30px;">} /* End OpenMP parallel region */</p>
<p style="padding-left: 30px;">status = DftiFreeDescriptor (&amp;desc_handle);</p>
<p>}</p>
<p> </p>
<p>In this case, the number of threads, as well as any other configuration parameter, must not be changed after FFT initialization by the DftiCommitDescriptor() function is done.</p>
<p>In cases "1", "2", and "3", listed above, set the parameter DFTI_NUMBER_OF_USER_THREADS to 1 (its default value), since each particular descriptor instance is used only in a single thread.</p>
<p>In case "4", you must use the DftiSetValue() function to set the DFTI_NUMBER_OF_USER_THREADS to the actual number of FFT computation threads, because multiple threads will be using the same descriptor. If this setting is not done, your program will work incorrectly or fail, since the descriptor contains individual data for each thread.</p>
<p> </p>
<p><b>Warning:</b></p>
<p>• It is not recommended to simultaneously parallelize your program and employ the Intel MKL internal threading because this will slow down the performance. Note that in case "4" above, FFT computation is automatically initiated in a single-threading mode.</p>
<p>• You must not change the number of threads after the DftiCommitDescriptor() function has completed FFT initialization.</p>
<p> </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft</link>
      <pubDate>Sun, 22 Nov 2009 11:57:08 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Enabling Intel® Advanced Vector Extensions (Intel® AVX) optimizations in Intel® MKL</title>
      <description><![CDATA[ <p><b>Enabling Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) optimizations in Intel<sup>®</sup> MKL</b></p>
<p>Intel® MKL 10.2 has early support for Intel® AVX and includes special optimizations for several BLAS, DFT and VML functions. The Intel® AVX code in Intel® MKL corresponds to the programmer's reference 319433-004 available on the <a href="http://software.intel.com/sites/avx/">Intel® AVX</a> page. To prevent this code from running on actual Intel® AVX hardware, which may correspond to an updated version of the specification, Intel® MKL blocks dispatching of Intel® AVX code unless it is requested explicitly via an enabling function call. This document provides a step-by-step guide for enabling the optimizations in Intel® MKL.</p>
<p><b><em>Functions with Intel® AVX optimizations</em></b></p>
<p>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td width="111" valign="top">
<p><b>Domain</b></p>
</td>
<td width="113" valign="top">
<p><b>Functions</b></p>
</td>
</tr>
<tr>
<td width="111" valign="top">
<p>BLAS</p>
</td>
<td width="113" valign="top">
<p>dgemm</p>
</td>
</tr>
<tr>
<td width="111" valign="top">
<p>FFT</p>
</td>
<td width="113" valign="top">
<p>all</p>
</td>
</tr>
<tr>
<td width="111" valign="top">
<p>VML</p>
</td>
<td width="113" valign="top">
<p>exp, log, pow</p>
</td>
</tr>
</tbody>
</table>
</p>
<p><b><em>Building an Intel® AVX enabled application with Intel® MKL</em></b></p>
<p>To enable Intel® AVX optimizations in an Intel® MKL application, add a call to the "mkl_enable_instructions" function before any other Intel® MKL call. The application will then benefit from Intel® AVX optimizations when executed on the Intel® SDE emulator.</p>
<p>The "mkl_enable_instructions" function is declared in mkl_service.fi for FORTRAN 77 interface and in mkl_service.h for C interface and has the following syntax:</p>
<p>irc = mkl_enable_instructions(MKL_AVX_ENABLE)</p>
<p>The function returns '0' if it has failed, for instance when it has been called after another Intel® MKL function. Otherwise the return value is '1'.</p>
<p>With the call added, compile the application and link it with Intel® MKL in the usual way. See the Intel® MKL User's Guide for details.</p>
<p><b><em>Running an Intel® AVX enabled application</em></b></p>
<p>Exploring Intel® AVX starts with <a href="http://software.intel.com/en-us/articles/pre-release-license-agreement-for-intel-software-development-emulator-accept-end-user-license-agreement-and-download">downloading the Intel® SDE</a> emulator from the WhatIf site <a href="http://whatif.intel.com/">http://whatif.intel.com/</a>. Intel® SDE does not require installation: simply download and unpack the package that suits your system (Intel® 64 and Linux* or Windows* OS are supported by Intel® MKL 10.2).</p>
<p>With Intel® SDE in place, run your Intel® AVX enabled application via the emulator to get the new instructions working.</p>
<p>Linux*:</p>
<p>&lt;path to SDE&gt;/sde -- &lt;application&gt; [args]</p>
<p>Windows*:</p>
<p>&lt;path to SDE&gt;\sde.exe -- &lt;application&gt; [args]</p>
<p>Refer to the <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">Intel® SDE page</a> for detailed information on how to use Intel® SDE for application analysis.</p>
<p><b><em>Code sample</em></b></p>
<p>This sample C code demonstrates multiplication of two 8x8 matrices using Intel® AVX optimized DGEMM from Intel® MKL 10.2.</p>
<p>#include &lt;stdio.h&gt;</p>
<p>#include "mkl_service.h"</p>
<p>#include "mkl_cblas.h"</p>
<p>#include "mkl_types.h"</p>
<p> </p>
<p>int main(void)</p>
<p>{</p>
<p>    MKLVersion ver;</p>
<p> </p>
<p>    const MKL_INT n = 8;</p>
<p>    double        alpha = 1.0, beta = 1.0;</p>
<p>    double        a[n*n], b[n*n], c[n*n];</p>
<p>    int i;</p>
<p> </p>
<p>// Enable Intel® AVX optimizations in Intel® MKL</p>
<p>     if ( MKL_Enable_Instructions(MKL_AVX_ENABLE) ) {</p>
<p>           puts("Intel(R)  AVX optimizations are enabled.");</p>
<p>     } else {</p>
<p>           puts("Intel(R)  AVX optimizations are not enabled. MKL_Enable_Instructions was called after another MKL function.");</p>
<p>     }</p>
<p> </p>
<p>// Print information on CPU optimization in effect</p>
<p>     MKLGetVersion(&amp;ver);</p>
<p>     printf("Processor optimization: %s\n",ver.Processor);</p>
<p> </p>
<p>// Generate matrices</p>
<p>    for (i=0;i &lt; n*n;i++) {</p>
<p>           a[i] = i;</p>
<p>           b[i] = n*n-i;</p>
<p>           c[i] = 0.0; // c is read by DGEMM because beta is nonzero</p>
<p>     }</p>
<p>// Call Intel® MKL DGEMM function</p>
<p>     cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, alpha, a, n, b, n, beta, c, n);</p>
<p> </p>
<p>}</p>
<p> </p>
<p>When linked with Intel® MKL 10.2 (see User's Guide and MKL release examples for linking instructions) and executed on Intel® SDE this code produces the following output:</p>
<p>Intel(R) AVX optimizations are enabled.</p>
<p>Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) Enabled Processor.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/enabling-avx-optimizations-in-mkl</link>
      <pubDate>Mon, 10 Aug 2009 11:05:47 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/enabling-avx-optimizations-in-mkl#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/enabling-avx-optimizations-in-mkl</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Intel® MKL Threaded 1D FFTs</title>
      <description><![CDATA[ <p>In Intel MKL 10.2, one-dimensional complex-to-complex fast Fourier transforms (FFTs) are now threaded for non-prime sizes from 2<sup>16</sup> and larger with the following exception:</p>
<ul>
<li>if at least one prime factor is larger than 2<sup>29</sup>, the transform is not supported</li>
</ul>
<p>An Intel MKL 1D FFT of size K=N*M (where N and M are chosen somewhat arbitrarily) is computed using the Cooley-Tukey factorization algorithm in 2 stages. During the first stage, N transforms of size M are performed in parallel with necessary permutation. For the second stage, multiplication by twiddle factors and M transforms of size N are done in parallel. Care is taken to avoid cache line splits between the threads for both stages. Additionally, the size of the twiddle factor table is reduced by a binary split.</p>
<p>On an Intel® Xeon® Processor X5492 system (2 x 4-core, 3.4 GHz) running a 64-bit operating system, FFT performance improved by up to 5 times with 8 threads compared to Intel MKL 10.1 Update 2.</p>
<p>Customers only need to upgrade to Intel MKL 10.2 to take advantage of this new feature in multi-threaded applications. No code changes are required.</p>
<p>For more information on the Intel MKL FFTs, please refer to <a target="_blank" href="http://software.intel.com/sites/products/documentation/hpc/mkl/mklman.pdf">Chapter 11 of the Reference Manual</a>. Also refer to the <a target="_blank" href="http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/">Intel MKL Linking Advisor</a> for details on how to link to the FFT functions.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/mkl-threaded-1d-ffts</link>
      <pubDate>Fri, 17 Jul 2009 16:37:27 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/mkl-threaded-1d-ffts#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/mkl-threaded-1d-ffts</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>New in Intel® MKL 10.2</title>
      <description><![CDATA[ <p><strong>New in Intel® MKL 10.2:</strong></p>
<ul>
<li>New features 
<ul>
<li>LAPACK 3.2 </li>
<li>Introduced implementation of the DZGEMM Extended BLAS function (as described at http://www.netlib.org/blas/blast-forum/). See the description of the ?gemm family of functions in the BLAS section of the reference manual. </li>
<li>PARDISO now supports real and complex, single precision data </li>
</ul>
</li>
<li>Usability/Interface improvements 
<ul>
<li>Sparse matrix format conversion routines which convert between the following formats: 
<ul>
<li>CSR (3-array variation) &lt;-&gt; CSC (3-array variation) </li>
<li>CSR (3-array variation) &lt;-&gt; diagonal format </li>
<li>CSR (3-array variation) &lt;-&gt; skyline </li>
</ul>
</li>
<li>Fortran95 BLAS and LAPACK mod files are now included 
<ul>
<li>Modules are pre-built with the Intel compiler and located in the include directory (see Intel® MKL User's Guide for full path) </li>
<li>Source is still included for use with other compilers </li>
<li>Documentation for these interfaces can be found in the Intel® MKL User's Guide </li>
</ul>
</li>
<li>The FFTW3 interface is now integrated directly into the main libraries 
<ul>
<li>Source code is still included to create wrappers for use with compilers not compatible with the default Intel® Fortran compiler convention for name decoration </li>
<li>See Appendix G of the Reference Manual for information </li>
</ul>
</li>
<li>DFTI_DESCRIPTOR_HANDLE now represents a true type name and can now be referenced as a type in user programs </li>
<li>Added parameter to Jacobi matrix calculation routine in the optimization solver domain to allow access to user data (see the description of the djacobix function in the reference manual for more information) </li>
<li>Added an interface that maps calls to single precision BLAS functions in Intel® MKL (functions with 's' or 'c' initial letter) to 64-bit floating point precision functions on 64-bit architectures (see 'sp2dp' in the Intel® MKL User's Guide for more information) </li>
<li>Compatibility libraries (also known as "dummy" libraries) have been removed from this version of the library </li>
</ul>
</li>
<li>Performance improvements 
<ul>
<li>Further threading in BLAS level 1 and 2 functions for Intel® 64 architecture 
<ul>
<li>Level 1 functions (vector-vector): (CS,ZD,S,D)ROT, (C,Z,S,D)COPY, and (C,Z,S,D)SWAP 
<ul>
<li>Increase in performance by up to 1.7-4.7 times over version 10.1 Update 1 on 4-core Intel® Core™ i7 processor depending on data location in cache </li>
<li>Increase in performance by up to 14-130 times over version 10.1 Update 1 on 24-core Intel® Xeon® processor 7400 series system, depending on data location in cache </li>
</ul>
</li>
<li>Level 2 functions (matrix-vector): (C,Z,S,D)TRMV, (S,D)SYMV, (S,D)SYR, and (S,D)SYR2 
<ul>
<li>Increase in performance by up to 1.9-2.9 times over version 10.1 Update 1 on 4-core Intel® Core™ i7 processor, depending on data location in cache </li>
<li>Increase in performance by up to 16-40 times over version 10.1 Update 1 on 24-core Intel® Xeon® processor 7400 series system, depending on data location in cache </li>
</ul>
</li>
</ul>
</li>
<li>Introduced recursive algorithm in 32-bit sequential version of DSYRK for up to 20% performance improvement on Intel® Core™ i7 processors and Intel® Xeon® processors in 5300, 5400, and 7400 series. </li>
<li>Improved LU factorization (DGETRF) by 25% over Intel MKL 10.1 Update 1 for large sizes on the Intel® Xeon® 7460 Processor; small sizes are also dramatically improved </li>
<li>BLAS *TBMV/*TBSV functions now use level 1 BLAS functions to improve performance by up to 3% on Intel® Core™ i7 processors and up to 10% on Intel® Core™2 processor 5300 and 5400 series. </li>
<li>Improved threading algorithms to increase DGEMM performance 
<ul>
<li>up to 7% improvement on 8 threads and up to 50% on 3,5,7 threads on the Intel® Core™ i7 processor </li>
<li>up to 50% improvement on 3 threads on Intel® Xeon® processor 7400 series. </li>
</ul>
</li>
<li>Threaded 1D complex-to-complex FFTs for non-prime sizes </li>
<li>New algorithms for 3D complex-to-complex transforms deliver better performance for small sizes (up to 64x64x64) on 1 or 2 threads </li>
<li>Implemented high-level parallelization of out-of-core (OOC) PARDISO when operating on symmetric positive definite matrices. </li>
<li>Reduced memory use by PARDISO for both in-core and out-of-core on all matrix types 
<ul>
<li>PARDISO OOC now uses less than half the memory previously used in Intel MKL 10.1 for real symmetric, complex Hermitian, or complex symmetric matrices </li>
</ul>
</li>
<li>Parallelized Reordering and Symbolic factorization stage in PARDISO/DSS </li>
<li>Up to 2 times better performance (30% improvement on average) on Intel® Core™ i7 and Intel® Core™2 processors for the following VML functions: v(s,d)Round, v(s,d)Inv, v(s,d)Div, v(s,d)Sqrt, v(s,d)Exp, v(s,d)Ln, v(s,d)Atan, v(s,d)Atan2 </li>
<li>Optimized versions of the following functions available for Intel® Advanced Vector Extensions (Intel® AVX) 
<ul>
<li>BLAS: DGEMM </li>
<li>FFTs </li>
<li>VML: exp, log, and pow </li>
<li>See important information in the Intel® MKL User's Guide regarding the mkl_enable_instructions() function for access to these functions </li>
</ul>
</li>
</ul>
</li>
</ul> ]]></description>
      <link>http://software.intel.com/en-us/articles/new-in-intel-mkl-10-2</link>
      <pubDate>Tue, 23 Jun 2009 11:43:12 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/new-in-intel-mkl-10-2#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/new-in-intel-mkl-10-2</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Setting thread affinity on SMT or HT enabled systems for better performance</title>
      <description><![CDATA[ Simultaneous MultiThreading (SMT) or Hyper-Threading Technology (HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor.  However, Intel MKL fits neither of these criteria, because the threaded portions of the library execute at high efficiency using most of the available resources and perform identical operations on each thread.  You may obtain higher performance by disabling HT/SMT Technology.  See Using the Intel® MKL Parallelism for information on the default number of threads, changing this number, and other relevant details.<br /><br />If you run with SMT/HT enabled, performance may be especially impacted if you run on fewer threads than physical cores.  Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how best to set the thread affinity interface to avoid this situation. <br /><br /><br />For Intel MKL, the recommended setting is<br /><br />KMP_AFFINITY=granularity=fine,compact,1,0. ]]></description>
      <link>http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems</link>
      <pubDate>Mon, 22 Jun 2009 22:59:30 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Intel® MKL LAPACK 3.2</title>
      <description><![CDATA[ <p>LAPACK 3.2 features have been added to Intel® MKL 10.2. LAPACK 3.2 introduces the following new features as described on netlib.org:</p>
<ul>
<li>Extra Precise Iterative Refinement </li>
<li>Non-Negative Diagonals from Householder QR </li>
<li>High Performance QR and Householder Reflections on Low-Profile Matrices </li>
<li>New fast and accurate Jacobi SVD </li>
<li>Routines for Rectangular Full Packed format </li>
<li>Pivoted Cholesky </li>
<li>Mixed precision iterative refinement routines for exploiting fast single precision hardware </li>
<li>Some new variants added for the one sided factorization </li>
<li>More robust DQDS algorithm</li>
</ul> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-mkl-lapack-32</link>
      <pubDate>Mon, 22 Jun 2009 20:41:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-mkl-lapack-32#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-mkl-lapack-32</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Enabling parallelization in reordering and symbolic factorization steps in PARDISO and DSS</title>
      <description><![CDATA[ The reordering and symbolic factorization steps in PARDISO and DSS (Direct Sparse Solver) are threaded starting with Intel MKL 10.2. This feature is optional: by default, parallelization of these steps is turned off. This article explains how to enable it.<br /><br /><strong>PARDISO:</strong> To turn on OpenMP parallelization, set iparm(2) = 3 in the PARDISO interface. This enables the parallel (OpenMP) version of the nested dissection algorithm, which reduces the computation time of phase 1 of PARDISO on multi-core computers. <br /><br /><strong>DSS:</strong> Passing MKL_DSS_METIS_OPENMP_ORDER in the "opt" parameter of the dss_reorder function computes the permutation vector with the parallel (OpenMP) nested dissection algorithm, which minimizes fill-in during the factorization phase. This helps reduce the time of the dss_reorder call on multi-core computers. <br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/enabling-parallelization-in-reordering-and-symbolic-factorization-steps-in-pardiso-and-dss</link>
      <pubDate>Tue, 05 May 2009 03:01:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/enabling-parallelization-in-reordering-and-symbolic-factorization-steps-in-pardiso-and-dss#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/enabling-parallelization-in-reordering-and-symbolic-factorization-steps-in-pardiso-and-dss</guid>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
  </channel></rss>