Intel® Distribution for Python 2017 Update 2 accelerates five key areas for impressive performance gains

By Sergey A Maydanov

Published: 02/23/2017   Last Updated: 02/23/2017

Intel Corporation is pleased to announce the release of Intel® Distribution for Python* 2017 Update 2, which offers both performance improvements and new features. 

Update 2 offers great performance improvements for NumPy*, SciPy*, and Scikit-learn* that you can see across a range of Intel processors, from Intel® Core™ CPUs to Intel® Xeon® and Intel® Xeon Phi™ processors. 

Benchmarks for all these accelerations will be published soon. This post previews the nature and extent of the changes and what they mean for you.

Fast Fourier Transforms
In addition to the initial Fast Fourier Transform (FFT) optimizations offered in previous releases, Update 2 brings widespread optimizations for NumPy and SciPy FFT. It offers a layered interface for the Intel® Math Kernel Library (Intel® MKL) that allows efficient access to native FFT optimizations from a range of NumPy and SciPy functions. The optimizations cover real and complex data types in both single and double precision, and both 1D and multidimensional data, in place and out of place. As a result, performance may improve by up to 60x over Update 1 and is now close to native C/Intel MKL performance.
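As a rough illustration of the kinds of calls this covers, here is a minimal sketch using the standard NumPy FFT interface. The array sizes and variable names are arbitrary choices for the example, and nothing here is specific to the Intel build: the same code runs on any NumPy, and any speedup would come from the library underneath.

```python
import numpy as np

rs = np.random.RandomState(0)

# 1D complex transform in double precision, followed by the inverse.
x = rs.standard_normal(1024) + 1j * rs.standard_normal(1024)
X = np.fft.fft(x)
x_back = np.fft.ifft(X)

# 1D real-input transform: rfft returns only the non-negative frequencies.
r = rs.standard_normal(1024)
R = np.fft.rfft(r)
r_back = np.fft.irfft(R, n=r.size)

# Multidimensional transform over both axes of a 2D array.
img = rs.standard_normal((64, 128))
IMG = np.fft.fft2(img)
img_back = np.fft.ifft2(IMG).real

# Single-precision input is accepted as well; the dtype of the result
# depends on the NumPy version and build, so it is not checked here.
x32 = x.astype(np.complex64)
X32 = np.fft.fft(x32)

print("max round-trip error (1D complex):", np.abs(x_back - x).max())
print("max round-trip error (2D real):   ", np.abs(img_back - img).max())
```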

Arithmetic and transcendental expressions
NumPy is designed for high-performance basic arithmetic and transcendental operations on ndarrays. Some umath primitives are optimized to benefit from the SSE, AVX, and (recently) AVX2 instruction sets, but not from AVX-512. Also, the original NumPy functions did not take advantage of multiple cores. Update 2 makes substantial changes to the guts of NumPy to incorporate the Intel MKL Vector Math Library (VML) in the respective umath primitives, which enables support for all available cores on a system and all CPU instruction sets.

The logic in Update 2 NumPy umath works as follows:
•    For short NumPy arrays, the overheads to distribute work across multiple threads are high relative to the amount of computation work. In such cases, Update 2 uses the Intel MKL Short Vector Math Library (SVML), which is optimized for good performance across a range of Intel CPUs on short vectors. 
•    For large arrays, threading overheads are lower compared to the amount of computation and Update 2 uses the Intel MKL VML, which is optimized for utilizing multiple cores and a range of Intel CPUs.
NumPy arithmetic and transcendental operations on vector-vector and vector-scalar operands are accelerated by up to 400x on Intel® Xeon Phi™ processors.
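The operations in question are ordinary NumPy expressions. The sketch below shows vector-vector, vector-scalar, and transcendental expressions on a large array and on a short slice of it. The sizes and names are illustrative assumptions. The choice between the short-vector and multithreaded paths described above would happen inside the library, so the user code is the same in both cases.

```python
import numpy as np

n = 10**6
a = np.linspace(0.1, 10.0, n)
b = np.linspace(1.0, 2.0, n)

# Vector-vector arithmetic: both operands are arrays of the same shape.
vv = a * b + a / b

# Vector-scalar arithmetic: the scalar is broadcast across the array.
vs = 2.5 * a - 1.0

# Transcendental umath functions applied elementwise.
tr = np.exp(-a) * np.sin(b) + np.log(a)

# The same expression on a short array (16 elements).
short_a, short_b = a[:16], b[:16]
short_tr = np.exp(-short_a) * np.sin(short_b) + np.log(short_a)

print("large result shape:", tr.shape)
print("short result shape:", short_tr.shape)
```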

Memory management optimizations
Update 2 introduces widespread optimizations in NumPy memory management operations. As a dynamic language, Python manages memory for the user. Memory operations, such as allocation, de-allocation, copy, and move, affect performance of essentially all Python programs. 

Specifically, Update 2 ensures NumPy allocates arrays that are properly aligned in memory on Linux, so that NumPy and SciPy compute functions can benefit from the respective aligned versions of SIMD memory access instructions. This is especially relevant for Intel® Xeon Phi™ processors.
The most significant memory improvements in Update 2 come from replacing the original memory copy and move operations with optimized implementations from Intel MKL. The result is improved performance, because these Intel MKL routines are optimized for both a range of Intel CPUs and multiple CPU cores.
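To see what alignment means in practice, the following sketch inspects the address of an array's data buffer. The helper names `is_aligned` and `aligned_empty` are hypothetical, and the 64-byte boundary is an assumption chosen to match the width of an AVX-512 register. The post does not state which boundary Update 2 uses. The second helper shows a common way to obtain an aligned array by hand on any NumPy build, by over-allocating and slicing.

```python
import numpy as np

def is_aligned(arr, boundary=64):
    """Return True if the array's data pointer is a multiple of `boundary` bytes."""
    return arr.ctypes.data % boundary == 0

def aligned_empty(n, dtype=np.float64, boundary=64):
    """Allocate an uninitialized 1D array that starts on a `boundary`-byte address.

    Over-allocates a byte buffer by `boundary` bytes, then slices it at the
    first aligned offset and reinterprets the bytes as `dtype`.
    """
    itemsize = np.dtype(dtype).itemsize
    raw = np.empty(n * itemsize + boundary, dtype=np.uint8)
    offset = (-raw.ctypes.data) % boundary
    return raw[offset:offset + n * itemsize].view(dtype)

# A default allocation may or may not be 64-byte aligned, depending on the build.
a = np.empty(1000)
print("default allocation 64-byte aligned:", is_aligned(a))

# The manually aligned array always is.
b = aligned_empty(1000)
print("manual allocation 64-byte aligned: ", is_aligned(b))
```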

Faster Machine Learning with Scikit-learn
Scikit-learn is among the most popular Python machine learning packages. The initial release of Intel Distribution for Python provided Scikit-learn optimizations via the respective NumPy and SciPy functions accelerated by Intel MKL. Update 2 optimizes selected key machine learning algorithms in Scikit-learn, accelerating them with the Intel® Data Analytics Acceleration Library (Intel® DAAL).

Specifically, Update 2 optimizes Principal Component Analysis (PCA), Linear and Ridge Regressions, Correlation and Cosine Distances, and K-Means. Speedups may range from 1.5x to 160x.
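For reference, these algorithms are reached through the usual Scikit-learn interface. The sketch below runs each of them on synthetic data. The data shapes, hyperparameters, and variable names are assumptions made for the example, and the code contains nothing Intel-specific. Whether a given call is routed to Intel DAAL depends on the installed build.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import pairwise_distances

rs = np.random.RandomState(0)
X = rs.standard_normal((500, 20))
w = rs.standard_normal(20)
y = X.dot(w) + 0.01 * rs.standard_normal(500)

# Principal Component Analysis: project 20 features onto 5 components.
X_reduced = PCA(n_components=5).fit_transform(X)

# Linear and Ridge regressions on the same data.
lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Cosine and correlation distances between the first 100 samples.
d_cos = pairwise_distances(X[:100], metric="cosine")
d_corr = pairwise_distances(X[:100], metric="correlation")

# K-Means clustering into 4 clusters.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print("PCA output shape:       ", X_reduced.shape)
print("distance matrix shape:  ", d_cos.shape)
print("K-Means centers shape:  ", km.cluster_centers_.shape)
```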

Neural network enhancements for pyDAAL
Intel DAAL:
•    Introduces a number of extensions for neural networks, such as the transposed convolution layer and the reshape layer. 
•    Now supports input tensors of arbitrary dimension in loss softmax cross-entropy layers, sigmoid cross-entropy criterion, and truncated Gaussian initializer for tensors.  
•    Extends support for distributed computing by adding the objective function with pre-computed characteristics.  
pyDAAL comes with improved performance for neural network layers used in topologies such as AlexNet. 
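To make "input tensors of arbitrary dimension" concrete, here is a plain NumPy sketch of a softmax cross-entropy loss that accepts a tensor of any rank. This is not the pyDAAL API. The function name, the `axis` parameter, and the use of integer class labels are assumptions made for illustration only.

```python
import numpy as np

def softmax_cross_entropy(logits, labels, axis=1):
    """Mean softmax cross-entropy for a logits tensor of any rank.

    `axis` is the class dimension. `labels` holds integer class indices and
    has the shape of `logits` with `axis` removed.
    """
    # Subtract the per-position maximum for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    # Pick the log-probability of the true class at every position.
    picked = np.take_along_axis(log_probs, np.expand_dims(labels, axis), axis=axis)
    return -picked.mean()

# 2D case: a batch of 8 samples with 3 classes.
logits_2d = np.zeros((8, 3))
labels_2d = np.zeros(8, dtype=int)
loss_2d = softmax_cross_entropy(logits_2d, labels_2d)

# 4D case: batch of 2, 3 classes, and a 4 x 5 spatial grid.
logits_4d = np.zeros((2, 3, 4, 5))
labels_4d = np.zeros((2, 4, 5), dtype=int)
loss_4d = softmax_cross_entropy(logits_4d, labels_4d)

print("2D loss:", loss_2d)
print("4D loss:", loss_4d)
```

With all-zero logits every class is equally likely, so both losses equal the natural log of the number of classes.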

The Intel Distribution for Python is powered by Anaconda* and conda build infrastructures that give all Python users the benefit of interoperability within these two environments and access to the optimized packages through a simple conda install command.
Intel Distribution for Python 2017 Update 2 delivers significant performance optimizations for many core algorithms and Python packages, while maintaining the ease of download and install. 

Update 2 is available for free download at the Intel Distribution for Python website or through the Intel channel at
The Python team at Intel welcomes you to try it out and email us any feedback. 

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804