Ready for 2X Moore's Law: Intel Cluster Studio XE

Today we introduced Intel® Cluster Studio XE, an exciting collection of powerful tools for HPC programmers who use MPI along with other programming models to make the most of clusters and supercomputers. Intel Cluster Studio XE builds on the existing Intel Cluster Studio with two substantial new capabilities to assist in hybrid programming: additional MPI scaling and job control features, and substantial node-level analysis capabilities.

Hybrid programming combines MPI, used for internode parallelism, with a shared memory model such as OpenMP, Intel Threading Building Blocks (TBB) or Intel Cilk Plus, for intranode parallelism. To assist with hybrid programming, Cluster Studio XE includes cluster installation and usage support for Intel Inspector XE and Intel VTune Amplifier XE. Cluster installation for these tools makes getting started easier. Cluster usage allows the tools to gather node-level data on dozens, hundreds or thousands of processes. Both tools then present the results in a hierarchical format, starting with a “by rank” view of the application.
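To make the two-level structure described above concrete, here is a minimal sketch of how work is divided first across ranks and then across threads within each rank. It is an illustration only and makes several assumptions: the ranks are simulated with an ordinary loop rather than real MPI processes, Python's standard thread pool stands in for OpenMP, TBB or Cilk Plus, and the helper names (block_range, node_sum, hybrid_sum) are made up for this example. It shows the shape of the decomposition, not real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def block_range(n_items, n_parts, part):
    """Half-open [start, stop) range of block `part` out of `n_parts` near-equal blocks."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def node_sum(data, rank, n_ranks, n_threads):
    """Work of one simulated rank: its block of `data` is split again across threads."""
    lo, hi = block_range(len(data), n_ranks, rank)   # internode split (MPI's role)

    def thread_sum(t):
        s, e = block_range(hi - lo, n_threads, t)    # intranode split (threading's role)
        return sum(data[lo + s:lo + e])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(thread_sum, range(n_threads)))


def hybrid_sum(data, n_ranks=4, n_threads=3):
    # Assumption: in a real hybrid program each rank is a separate MPI process
    # and this final step is a reduction across ranks; here it is a plain loop.
    return sum(node_sum(data, r, n_ranks, n_threads) for r in range(n_ranks))


if __name__ == "__main__":
    print(hybrid_sum(list(range(1, 101))))  # sum of 1..100
```

The point of the sketch is that the outer split and the inner split are independent decisions, which is why the node-level tools and the MPI-level tools each look at a different layer of the same program.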

Intel Inspector XE pinpoints memory errors, such as memory leaks, as well as threading errors, such as race conditions and deadlocks. (Learn more with "Using Intel® Inspector XE 2011 to Find Data Races in Multithreaded Code.")

Intel VTune Amplifier XE gathers precise performance information so you can fully understand what is affecting application performance. VTune Amplifier XE probes node-level performance and beautifully complements the Intel Trace Analyzer and Collector, which probes MPI communication performance. Together, they offer an unequaled view of performance in a hybrid program.

New SLURM job manager support
The Intel MPI Library 4.0.3 offers better integration with the SLURM job manager. This provides tighter control over job submission and startup time. It also provides the information needed to clean up processes when a program terminates prematurely due to errors.

The MPI Library has been extended to give the job scheduler visibility into, and control over, how many ranks are running and their respective resource utilization (memory, CPU usage, access to cache, etc.). Before this Intel MPI work, a job scheduler didn’t know whether a rank had died or ended, which led to a condition akin to a resource leak (requiring a "kill -9" on the process). With many processes running, this could be quite a problem. Now SLURM has visibility into the process state across the ranks and is able to clean up properly. The documentation has additional information on how to set up and use this capability with any SLURM-compatible job scheduler.

Faster than Moore’s Law?
I’m fascinated by a trend that is running at a little more than 2X the annual rate of Moore’s Law: the increase in performance of supercomputers. The Top 500 cluster growth graph shown here (from www.Top500.org) shows that the performance of the Top 500 supercomputers has been consistently growing at an annual rate of over 80%, whereas Moore’s Law corresponds to a growth rate of about 40% per year.



Of course, Moore’s Law describes the doubling of transistor densities about every two years. The transistor density increases have in turn driven the computer industry to deliver more and more computer performance. Supercomputer designs have been able to use parallelism at multiple levels to double down on this trend and grow performance at a spectacular rate.
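As a rough check on the arithmetic behind these figures, the short sketch below converts "doubling every two years" into an annual growth rate and works out how quickly an 80% annual rate doubles. The function names are just illustrative, and the 80% figure is simply the one quoted above, not something recomputed from Top 500 data. Doubling every two years works out to roughly 41% per year, which is the "40%" used above after rounding.

```python
import math


def annual_rate_from_doubling(years_to_double):
    """Compound annual growth rate implied by doubling every `years_to_double` years."""
    return 2.0 ** (1.0 / years_to_double) - 1.0


def doubling_time(annual_rate):
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


def growth_over(annual_rate, years):
    """Total growth factor after compounding `annual_rate` for `years` years."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    moore_rate = annual_rate_from_doubling(2.0)  # sqrt(2) - 1, about 0.414
    top500_rate = 0.80                           # figure quoted in the text

    print(f"Moore's Law annual rate: {moore_rate:.1%}")
    print(f"Top 500 doubling time:   {doubling_time(top500_rate):.2f} years")
    print(f"Ten-year growth, Moore:  {growth_over(moore_rate, 10):.0f}x")
    print(f"Ten-year growth, Top500: {growth_over(top500_rate, 10):.0f}x")
```

Compounding is what makes the gap so striking: over a decade, doubling every two years gives a factor of 32, while 80% per year gives a factor of more than 350.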

Hybrid programming rides this wave. Ten years ago, MPI programming alone was usually enough for large systems. Over the last decade, we have seen the individual cluster nodes continue to get “fatter.” This “obesity” at the node level has driven HPC developers to program for internode parallelism differently than for node-level parallelism. This is most often seen as MPI + OpenMP, and node-level programming continues to get richer, with more options all the time.

Which brings us back to why Cluster Studio XE is so important, especially considering the new hybrid programming insights.

Cluster Studio XE is more than Inspector and VTune Amplifier
Cluster Studio XE is a combination of almost every HPC software development tool that Intel makes. This is because the largest scale machines, and the applications that go with them, exercise every method possible to keep up this “pacing at twice the rate of Moore’s Law.”

Cluster Studio XE includes the Intel C/C++ and Fortran compilers and related libraries, including the Intel Math Kernel Library (MKL), that offer unequaled optimization for Intel and compatible processors. Our goal is to offer superior performance and standards support. We have great performance, and we’ve included industry-leading support for (most of) C++11, Fortran 2003, Fortran 2008, and IEEE 754-2008. All four of these newer standards are mostly, but not completely, supported. No one has all four of these implemented, and we believe we have made at least as much progress as anyone else. Consult our documentation for details on what is done and what is not. I think you will find that we have implemented the most important and most requested portions of each standard already (with more to come). We also have the latest Cilk Plus 1.1, TBB 4.0 and OpenMP 3.1 standards fully implemented.

MKL offers core math functions including BLAS, LAPACK, sparse solvers, fast Fourier transforms, vector math, and more. It also includes a highly optimized version of ScaLAPACK that delivers significant performance improvements on clusters.

Multicore today, and ready for a many-core future
Cluster Studio XE contains the tools and models for multicore programming today, and we are aligned and ready for many-core programming tomorrow. We believe strongly that the growth in cores in our future should not force a developer to split methods of programming. Writing scalable applications is not an easy job, but we can at least make it a single job instead of two jobs. The techniques and tools for scaling on multicore today are the same ones we will employ for many-core as well. Future generations of Cluster Studio XE will include multicore and many-core support throughout. Today, you can rest assured that the multicore support for today’s systems is aligned with this future. We have many-core support in limited usage today with many-core prototype systems (Knights Ferry), and are getting ready for one of the first many-core systems to be delivered (Stampede).

Come see us at Supercomputing (in Seattle) to learn more. I’ll be there all week, as will many other members of the Intel Software Development Products team. You might even run into Dr. Fortran. Look for us at future software conferences around the world. We really enjoy meeting developers and talking about how we can help!

Intel Cluster Studio XE: try it now

Intel Cluster Studio XE provides the key functionality that MPI programmers need to develop optimal programs for HPC needs. Cluster Studio XE offers a single package that simplifies installation, at an economical price for those who want it all. Please try it, and let us know what you think and what more we can do for you.

For more complete information about compiler optimizations, see our Optimization Notice.