Today we released updates for our C++ and Fortran compilers, our Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP) libraries, and our Cluster toolkits. Noteworthy additions include outstanding performance enhancements, support for Intel® Advanced Vector Extensions (AVX), and the inclusion of some elements that debuted in Intel® Parallel Studio last month.
I can share some notes on the features, including our AVX and AES support in the tools (which I believe is the first product support in tools for Intel and compatible processors), our adaptation of some of the new features from Parallel Studio to Linux and Mac OS X, and some really great tuning of our performance-leading MPI library.
The specific new product versions are:
- Intel® Professional Edition Compilers 11.1 (Fortran & C/C++, for Windows, Linux, Mac OS X)
- Intel® Integrated Performance Primitives (IPP) 6.1 (for Windows, Linux, Mac OS X)
- Intel® Math Kernel Library (MKL) 10.2 (for Windows, Linux, Mac OS X)
- Intel® Cluster Toolkit, Compiler Edition 3.2.1 (for Windows, Linux)
- Intel® MPI Library 3.2.1 (for Windows, Linux)
If you've not moved from the 10.x to 11.x compilers, you will want to consider doing that. Aside from new functionality such as parallel debugging, OpenMP 3.0 and AVX support, you are very likely to see a pleasing performance boost, especially on the latest Intel and compatible processors. Several customers have told us of 10% performance gains in moving from 10.x to 11.1. While I can't promise such gains to everyone, you have a reasonable shot at seeing performance gains, based on the enhancements we made and the feedback we have been getting from users.
Likewise, moving from version 9.x to 10.x of the Intel Math Kernel Library (MKL) has shown up to 45% gains in key routines. This is incredible given how consistently MKL is the library to beat in performance - a leadership position our MKL developers are not just maintaining but enlarging! Of course, you don't have to take my word for it that we do this well for Intel and compatible processors - you can find reviews on the web including a recent one at http://www.digit-life.com/articles3/cpu/phenom-x4-matlab-p2.html.
Of course, Intel Integrated Performance Primitives (IPP) and the Intel MPI Library have similar success stories - and you will want to stay up-to-date for the latest performance. With IPP 6.1, the use of task parallelism gives as much as 250% multicore performance scaling, while the PNG codec added to the Unified Image Codec framework offers 300% faster encoding than the open source reference version. Intel MPI 3.2.1 offers industry-leading performance with low latency and high bandwidth, and now uses direct inter-process memory copy for increased bandwidth on Windows systems.
Intel® Advanced Vector Extensions (AVX) support
We have offered support for developing for AVX (a 256-bit instruction set extension to SSE, designed for floating-point-intensive applications) from Whatif.intel.com for about a year now, and we've enjoyed the feedback on these offerings and input on our future direction. One recurring request has been for us to make the support for AVX a feature in our compilers and library products now, before the processors supporting AVX are available to purchase. This makes it possible to create and ship software now that is ready to utilize processors with AVX support. We have validated our code using simulators for our future processors (you can get the Intel® Software Development Emulator from Whatif.intel.com).
Many software vendors will want to do some testing on real processors before they ship - and having these compilers and libraries now makes that easy and realistic. There is plenty of time to incorporate our latest versions into your build systems, validate them for usage, and be fully ready for testing with processors using AVX. We've been reminded often that it's naive to expect compilers and libraries released at the same time as new processors to be adopted quickly. We have listened and acted on this feedback!
Tilo Kühn at Maxon Computer said, “We’ve been enthusiastically using the new version of Intel® C++ Compiler Professional Edition that includes support for Intel® Advanced Vector Extensions (Intel AVX.) Being able to performance tune our software well in advance of processor availability gives us a major development head start to ensure that our Cinebench product will be ready when the first Intel AVX-enabled processor is delivered.”
Performance using AVX can be incredible, but it is important to know its limits. In general, code using AVX for data parallel problems should do at least as well as code using SSE; that is the key thing to know about using AVX. This does, of course, assume you overcome the overhead of any alignment and loop setup/teardown. If you have short vectors that are a multiple of 128 bits in length but not 256 bits, you may be better off with SSE. Aside from that, AVX should win in performance - which raises the question "by how much?" The answer, of course, is "it depends." It depends on the exact processor design, your algorithm, and your system design. The highest gains will come from code with intense computations running out of data cache. It isn't hard to imagine gains on such code approaching the theoretical doubling that moving from 128 bits to 256 bits allows, though that will depend on the processor and system design. Gains on other code will depend on the factors mentioned. With AVX generally better than SSE, the migration to AVX is easy to choose. Our compilers and libraries make it even easier by readily producing both code paths (SSE-only, and AVX+SSE).
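To make the short-vector caveat concrete, here is a minimal sketch of the loop arithmetic involved. It is plain, portable C++ rather than intrinsics, and the names `LoopSplit` and `split_for_width` are hypothetical helpers invented for this illustration, not part of any Intel product. It assumes single-precision floats that are 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only: how a loop over n floats divides
// into full vector iterations plus a leftover tail.
struct LoopSplit {
    std::size_t full_iterations;  // iterations that fill a whole vector register
    std::size_t leftover;         // elements left over for a narrower or scalar tail
};

// Split a loop over n single-precision floats for a register of register_bits bits.
inline LoopSplit split_for_width(std::size_t n, std::size_t register_bits) {
    const std::size_t float_bits = 8 * sizeof(float);  // 32 on typical platforms
    assert(register_bits >= float_bits);
    const std::size_t lanes = register_bits / float_bits;  // 4 for 128-bit, 8 for 256-bit
    return LoopSplit{n / lanes, n % lanes};
}
```

For a 12-float vector (384 bits), `split_for_width(12, 128)` gives three full iterations and no leftover, while `split_for_width(12, 256)` gives one full iteration and four leftover elements. That tail has to be handled by narrower code, which is the kind of setup/teardown overhead that can erase the advantage of the wider registers on short vectors.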
Advanced Encryption Standard (AES)
In addition to the anticipated AVX support, we have our earliest Advanced Encryption Standard (AES) support. Unlike AVX, I expect AES will be used by very few developers directly - but our compilers have intrinsics and inline assembly support for AES. Future versions of our Intel IPP library cryptographic algorithms will use AES, but those did not make it into the current release.
Ripped from Parallel Studio
I know the big news of Intel Parallel Studio last month created a few questions like "when will you have that for Linux? or Mac OS X?" I assure you - you will see many features adapted in time! Some of that is here now. Specifically, compiler and library features - including debugger extensions - from Intel Parallel Studio have arrived!
The Intel® Parallel Debugger Extensions have been added to Intel® C++ Compiler, Professional Edition for Windows. These allow serializing parallel regions, finding data sharing violations, breaking on re-entrant functions, and viewing all active thread structures, OpenMP* task teams and trees, barriers, locks, and waits. Of course, this works in current versions of Visual Studio (both 2005 and 2008).
The Intel C++ Compiler 11.1 offers all the functionality of Intel Parallel Composer, plus the AVX support, on Windows, Linux and Mac OS X. We've added Eclipse CDT 5.0 support, SLES11 support, and a native Intel® 64 compiler for Mac OS X, and we support the new Mac Xcode IDE ability to relocate the tools installation directories.
If you are wondering about updates for Intel Thread Checker and Intel VTune Performance Analyzer - you'll see that we are updating the compilers, libraries and cluster tools with this release - while analysis and tuning tool updates are still in the works. Rest assured we are working to make it a matter of "when" not "if."
Math Kernel Library
Like the compilers, we have performance improvements in many areas (reason enough to upgrade), AVX support, and well tuned support for the latest Intel® Xeon® 5500 processors.
FFT routines have been enhanced with 1/N and 1/sqrt(N) scaling factors, the addition of DFTI_FORWARD_SIGN, support for mixed radices of 7, 11 and 13, and optimized real data transforms in Cluster FFT. All this comes with strong support of FFTW interfaces. We've added single precision support in PARDISO (Parallel Direct and Iterative Solvers) and complete support for LAPACK 3.2. For .NET users, we have included .NET/C# examples for calling MKL functions.
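For readers wondering what the 1/N and 1/sqrt(N) scaling factors are for, here is a small sketch. It is a naive O(N^2) transform in plain C++, not the MKL interface, and the names `naive_dft` and `max_abs_diff` are hypothetical helpers for this illustration. It shows the two common conventions: scale only the backward transform by 1/N, or scale both directions by 1/sqrt(N). Either way, a forward transform followed by a backward transform returns the original data.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(N^2) discrete Fourier transform (illustration only, not an FFT).
// sign = -1 selects the forward exponent, +1 the backward exponent;
// every output element is multiplied by scale.
std::vector<cd> naive_dft(const std::vector<cd>& x, int sign, double scale) {
    assert(sign == 1 || sign == -1);
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce j*k modulo n to keep the angle small and accurate.
            const double angle = sign * two_pi *
                static_cast<double>((j * k) % n) / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        out[k] = scale * sum;
    }
    return out;
}

// Largest element-wise distance between two equally sized vectors.
double max_abs_diff(const std::vector<cd>& a, const std::vector<cd>& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::abs(a[i] - b[i]));
    return worst;
}
```

With an input `x` of length N, `naive_dft(naive_dft(x, -1, 1.0), +1, 1.0 / N)` reproduces `x` up to rounding error, and so does applying a scale of `1.0 / std::sqrt(N)` in both directions. With no scaling at all, the round trip returns N times the input, which is why a library needs to let the caller choose the factor.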
Fortran 2003 and beyond
For Fortran, as always we have focused aggressively on performance while implementing features from Fortran 2003 in the order our customers have encouraged. Version 11.1 adds most of the "object oriented" features. As of version 11.1, we have a majority of Fortran 2003 features implemented, with only a few smaller items remaining plus two bigger items: parameterized derived types (PDT) and user-defined derived type I/O (UDDTIO). These two features are demanding to implement and in low demand. They are also not supported in most Fortran compilers, which keeps them out of anything portable. While we plan to support these eventually, we expect to finish off all other issues and embark on some of Fortran 2008 first (much stronger customer demand).
The next revision of the Fortran standard is called Fortran 2008, with an expected publication in mid-2010. While there are many small changes in Fortran 2008, there are a few new features for which we’ve already received requests - coarrays, submodules and bumping up array dimensions from 7 to 15. We are gathering feedback on these now.
Intel Integrated Performance Primitives (IPP)
Like MKL and the compilers, we have performance improvements in many areas, AVX support, and well tuned support for the latest Intel® Xeon® 5500 processors. From early simulator-based evaluation, a select set of 65 optimized functions showed an average 50% speedup.
Intel IPP has loaded up on new functionality, including a novel way to beat back Amdahl's Law and get better cache utilization at the same time: the Deferred Mode Image Processing (DMIP) Framework. DMIP is worth a look - it's a new feature to help deliver pipelined parallelism. This reduces the serialization effect you normally see when using repeated library calls, and through better cache and multicore utilization it dramatically improves performance of pipelined image operations, especially on larger images. The 6.1 version introduces task parallelism and as much as 250% multicore performance scaling.
The PNG codec added to the Unified Image Codec framework offers 300% faster encoding than the open source reference implementation. This is very important as PNG is replacing GIF in usage around the world. Visual Studio integration improvements allow IntelliSense to autocomplete function names and expose parameter details for faster, more accurate inclusion of Intel IPP functions. Texture compression, advanced lighting, and 3D geometric super sampling functions have been added for improved image processing performance. Improved data compression deflate/inflate APIs provide better zlib compatibility and superior performance. New cryptography functions (RSA_SSA1.5, RSA_PKCSv1.5) have been added to support the HDCP 2.0 standard.
.NET users will find an intuitive programming layer for both C++ & .NET image processing application development.
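The idea behind deferred-mode processing can be sketched in a few lines. This is not the DMIP API; `Pixel`, `PixelOp`, `run_eager` and `run_deferred` are hypothetical names for this illustration, and the sketch assumes simple point-wise operations on a flat array of pixels. The eager version makes one full pass over the image per operation, the way repeated library calls would. The deferred version takes the whole chain and applies it slice by slice, so each slice can stay in cache while every operation runs over it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Pixel = float;
using PixelOp = std::function<Pixel(Pixel)>;  // hypothetical: point-wise operations only

// Eager style: each operation sweeps the whole image before the next one starts,
// so a large image is pulled through the cache once per operation.
std::vector<Pixel> run_eager(std::vector<Pixel> image, const std::vector<PixelOp>& ops) {
    for (const PixelOp& op : ops)
        for (Pixel& p : image)
            p = op(p);
    return image;
}

// Deferred style: the chain of operations is known up front and is applied
// one slice at a time, so each slice is loaded once and reused by every operation.
std::vector<Pixel> run_deferred(std::vector<Pixel> image,
                                const std::vector<PixelOp>& ops,
                                std::size_t slice) {
    assert(slice > 0);
    for (std::size_t begin = 0; begin < image.size(); begin += slice) {
        const std::size_t end = std::min(begin + slice, image.size());
        for (const PixelOp& op : ops)
            for (std::size_t i = begin; i < end; ++i)
                image[i] = op(image[i]);
    }
    return image;
}
```

Both functions produce the same pixels for point-wise chains. In this sketch the slices are also independent of each other, so they could be handed to different cores; how DMIP itself schedules work is not shown here.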
Intel MPI Library
Intel® MPI Library 3.2.1 leads with low latency and high bandwidth, and now uses direct inter-process memory copy for increased bandwidth on Windows. It features improved automatic process pinning for more performance. Scalable mpdboot startup is offered for faster cluster application launching.
Intel Cluster Toolkit Installations easier on Windows and Linux
Windows users will find Active Directory based user authorization for seamless integration into the Windows* environment. Linux users will find full support of Linux* Standard Base (LSB) compliant RPMs.
Of course, the Cluster Toolkits contain all the wonderful compiler and library enhancements mentioned earlier, and support for the Intel Cluster Ready program continues as well!
Links to more information
There is more information about AVX and AES at http://intel.com/sofware/avx.
Evaluation copies of everything I've mentioned are downloadable from http://intel.com/software/products - you can download the new compilers and libraries to evaluate the new versions starting today!