Modernizing Software with Future-Proof Code Optimizations

by Henry A. Gabb, Sr. Principal Engineer, Intel Software and Services Group

Create High Performance, Scalable and Portable Parallel Code with New Intel® Parallel Studio XE 2018

Intel® Parallel Studio XE is our flagship product for software development, debugging, and tuning on Intel processor architectures for HPC, enterprise, and cloud computing. It is a comprehensive tool suite that contains everything from compilers and high-performance math libraries all the way to debuggers and profilers for large-scale cluster applications. These tools enable developers to exploit the full performance potential of Intel® processors. Intel Parallel Studio XE is designed to help developers create high performance, scalable, reliable parallel code—faster.

The latest release, Intel Parallel Studio XE 2018, contains many new and interesting features [1]. Let’s start with parallelism. It’s in the product name, after all. Software development and parallelism used to be separate concerns, and parallel computing was mainly confined to high-performance computing practitioners. Today, however, parallel architectures are ubiquitous. Multicore processors are now found in everything from handheld devices to the world’s most powerful supercomputers.

The Intel® Compilers support the OpenMP* 4.5 standard for compiler-directed multithreading, plus initial support for the 5.0 draft. OpenMP is now 20 years old and continues to evolve with new hardware architectures [2, 3, 4]. The latest versions provide computation offload to accelerator devices, vectorization directives, enhanced control of thread placement, and much more [5]. For distributed-memory process-level parallelism, the Intel® MPI Library supports the latest message-passing interface (MPI) standard, and contains many optimizations for collective communication, job startup and shutdown, and support for the latest high-speed interconnects like the Intel® Omni-Path Architecture (Intel® OPA). Combining OpenMP and MPI in the same application has proven to be a powerful way to achieve scalable parallelism on modern clusters.
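As a rough illustration of compiler-directed multithreading, the sketch below threads a dot product with an OpenMP worksharing loop. The function name `dot` and the data are hypothetical and chosen only for this example. It covers the OpenMP half of a hybrid code; in an MPI application each rank would typically run a loop like this on its own portion of the data.

```cpp
#include <cassert>
#include <vector>

// Dot product threaded with an OpenMP worksharing loop.
// The reduction clause gives each thread a private partial sum and
// combines the partial sums when the loop ends, so there is no data
// race on `sum`.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The pragma only takes effect when OpenMP is enabled at compile time (for example, `-qopenmp` with the Intel compilers or `-fopenmp` with GCC and Clang). Without it, the loop simply runs serially and returns the same result for this data.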

The number of cores per socket has steadily increased since the first multicore processor was released, but while higher-level parallelism is important, lower-level code tuning should not be ignored. In fact, parallelizing code that has not been properly tuned can be counterproductive. There are few things more disheartening than going through the effort of parallelizing an application only to find that vectorizing a few key loops gives better performance and renders the previous parallelization unnecessary. Vectors continue to get wider in modern processor architectures, so the Intel compilers contain many new enhancements to enable efficient vectorization [6]. In addition to the OpenMP vectorization directives mentioned above, the Intel compilers exploit the latest Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions in Intel® Xeon® Scalable and Intel® Xeon Phi™ processor architectures [7].
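To make the OpenMP vectorization directives concrete, here is a minimal sketch of a loop annotated with `#pragma omp simd`. The `saxpy` function is a hypothetical example, not code from the product. Whether the compiler actually emits Intel AVX-512 instructions depends on the target options and the hardware, which this sketch does not control.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y element by element.
// The simd directive asserts that the iterations are independent, so
// the compiler may execute several of them per vector instruction.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The directive is honored when OpenMP SIMD support is enabled (for example, `-qopenmp-simd` with the Intel compilers or `-fopenmp-simd` with GCC and Clang). Otherwise the compiler falls back on its own auto-vectorization heuristics.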

The compilers in Intel Parallel Studio XE 2018 support the latest Fortran, C, and C++ standards. More recently, the Intel® Distribution for Python* was added to the suite. Our optimized Python distribution integrates the Intel® Performance Libraries into many Python packages (e.g., NumPy, SciPy, scikit-learn, mpi4py). (Other productivity languages like Julia* [8] and R* [9, 10], which are not part of the product, can also take advantage of the Intel performance libraries.) Intel Parallel Studio XE 2018 also includes the following highly optimized libraries: Intel® Math Kernel Library (Intel® MKL), Intel® Integrated Performance Primitives (Intel® IPP), Intel® Data Analytics Acceleration Library (Intel® DAAL), Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB). Intel® MKL provides tuned, parallel math functions for dense and sparse linear algebra, Fourier transforms, neural networks, random number generation, basic statistics, etc. The latest version contains new APIs to improve the performance of the bulk matrix multiplication and convolution required during neural network training. Common computations in image processing, computer vision, signal processing, compression/decompression, cryptography, and string processing are available in Intel® IPP [11]. The newest library in the suite, Intel® DAAL, supports basic statistics and machine learning (e.g., dimensionality reduction, anomaly detection, classification, regression, clustering) [9, 12, 13, 14].

For C++ programmers, Intel continues to support Intel® TBB (www.threadingbuildingblocks.org), the widely used template library for task parallelism [15]. (Note that in spite of the name, Intel TBB is open-sourced under an Apache 2.0 license. Intel has always preferred open, vendor-neutral standards over proprietary programming models.) Intel TBB fully leverages multicore processors, but its most exciting new feature is the flow graph coordination layer. Flow graph allows the programmer to describe a complex workflow as a graph of computations and the dependencies between them, which the Intel TBB runtime uses to extract parallelism. Intel TBB flow graph could become the preferred parallel programming model for heterogeneous processor environments. Intel Parallel Studio XE 2018 contains a preview feature under Intel Advisor called Flow Graph Analyzer to help create and optimize flow graphs [16].

In addition to compilers and performance libraries, Intel Parallel Studio XE 2018 contains powerful code analysis tools to assist with debugging and tuning at instruction-, thread-, and process-level parallelism. Intel® Inspector is a one-of-a-kind debugger that not only finds garden-variety bugs like memory leaks but also performs correctness checking on threaded code to identify data races, potential deadlocks, and other non-deterministic concurrency errors. Intel® VTune™ Amplifier provides basic profiling to find performance hotspots, but it does much more, such as microarchitecture analysis and memory and I/O analysis. Its latest release adds support for profiling applications running in containers, and the new Application Performance Snapshot feature provides a one-page overview of an application’s efficiency and performance characteristics across MPI, CPU, FPU, and memory use. Intel® Advisor, another one-of-a-kind tool, allows users to quickly prototype regions for potential parallelism and project likely speedup. However, its most exciting new feature is cache-aware roofline analysis, which pinpoints underperforming loops, graphically shows which are good candidates for code tuning, and gives advice about the likely performance bottlenecks [6, 17]. The Intel® Trace Analyzer and Collector performs correctness checking and communication profiling of MPI applications. Its latest version now supports OpenSHMEM (www.openshmem.org), an open standard API for parallelism in a partitioned global address space (PGAS). PGAS could become an important programming model for future parallel systems. Finally, Intel® Cluster Checker, a tool for analyzing cluster health, added new features to improve usability and diagnostic output, check Intel® Omni-Path Architecture (Intel® OPA), and much more [18].
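The idea behind a roofline chart can be sketched in a few lines. In the basic roofline model, a loop's attainable performance is bounded by the smaller of the machine's peak compute rate and the product of memory bandwidth and the loop's arithmetic intensity (floating-point operations per byte moved). The function below is a simplified illustration of that bound, not how Intel Advisor computes its charts, and the name `attainable_gflops` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Basic roofline bound: performance is capped either by peak compute
// (the flat roof) or by bandwidth times arithmetic intensity (the
// slanted roof), whichever is lower.
double attainable_gflops(double peak_gflops,
                         double bandwidth_gb_per_s,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0);
    assert(bandwidth_gb_per_s > 0.0);
    assert(flops_per_byte >= 0.0);
    return std::min(peak_gflops, bandwidth_gb_per_s * flops_per_byte);
}
```

With made-up machine numbers of 1000 GFLOP/s peak and 100 GB/s bandwidth, a loop at 0.25 FLOP/byte is bandwidth-bound at 25 GFLOP/s, while a loop at 20 FLOP/byte hits the compute roof. A loop measured well below its bound is the kind of candidate that roofline analysis flags for tuning.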

Few Intel Parallel Studio XE users realize how much this tool suite has evolved, how mature some of its components really are (more than 20 years old), and how it has driven new approaches and helped developers accelerate parallel programming performance significantly over the last decade. However, its design goal has remained the same: to enable future-proof code modernization. For example, the same cache optimization techniques (e.g., blocking and tiling) that were beneficial 20 years ago are still beneficial. Today, however, code modernization is about exploiting parallelism, starting with vectorization (instruction-level parallelism), then threading, and finally message-passing on distributed-memory clusters. What does the future hold: heterogeneous parallelism, PGAS languages, persistent memory, etc.? Whatever the future holds, Intel Parallel Studio XE will evolve accordingly.
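For readers unfamiliar with blocking and tiling, the following sketch shows the technique on a matrix transpose. The function name `transpose_blocked` and the default tile size of 32 are assumptions made for illustration; a good tile size depends on the cache sizes of the target processor and should be found by measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes an n x n row-major matrix one tile at a time.
// Working on small square tiles keeps both the rows being read and the
// columns being written resident in cache, instead of striding through
// the whole output matrix for every input row.
std::vector<double> transpose_blocked(const std::vector<double>& a,
                                      std::size_t n,
                                      std::size_t block = 32) {
    assert(a.size() == n * n);
    assert(block > 0);
    std::vector<double> t(n * n);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t jj = 0; jj < n; jj += block) {
            // Clamp the tile so edge tiles stay inside the matrix.
            const std::size_t i_end = std::min(ii + block, n);
            const std::size_t j_end = std::min(jj + block, n);
            for (std::size_t i = ii; i < i_end; ++i) {
                for (std::size_t j = jj; j < j_end; ++j) {
                    t[j * n + i] = a[i * n + j];
                }
            }
        }
    }
    return t;
}
```

The result is identical to a naive double loop; only the order in which elements are visited changes, which is why the technique has stayed useful across processor generations.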

References

The following articles were published in recent issues of The Parallel Universe.

  1. Jackson Marusarz “Modernize your code for performance, portability, and scalability: What’s new in Intel Parallel Studio XE 2018” The Parallel Universe #30.
  2. Rob Farber “Happy 20th Birthday, OpenMP: Making parallel programming accessible to C/C++ and Fortran programmers – and providing a software path to exascale computation” The Parallel Universe #28.
  3. Bronis R. de Supinski “OpenMP is turning 20! Making parallel programming accessible to C/C++ and Fortran programmers” The Parallel Universe #29.
  4. Barbara Chapman “Welcome to the adult world, OpenMP: After 20 years, it’s more relevant than ever” The Parallel Universe #30.
  5. Michael Klemm et al. “The present and future of the OpenMP API specification: How the gold standard parallel programming language has improved with each new version” The Parallel Universe #27.
  6. Robert H. Dodds “Vectorization becomes important – again: Open source code WARP3D exemplifies renewed interest in vectorization” The Parallel Universe #29.
  7. Martyn Corden “Vectorization opportunities for improved performance with Intel AVX-512: Examples of how Intel compilers can vectorize and speed up loops” The Parallel Universe #27.
  8. Ranjan Anantharaman et al. “Julia: A high-level language for supercomputing. The Julia Project continues to break new boundaries in scientific computing” The Parallel Universe #29.
  9. Steena Monteiro and Shaojuan Zhu “Accelerating linear regression in R with Intel DAAL: Make better predictions with this highly optimized open source package” The Parallel Universe #29.
  10. Drew Schmidt “HPC with R: The basics” The Parallel Universe #28.
  11. Chao Yu and Sergey Khlystov “Building fast data compression code for cloud and edge applications: How to optimize your compression with Intel Integrated Performance Primitives” The Parallel Universe #29.
  12. Vadim Pirogov et al. “Unleash the power of big data analytics and machine learning: How Intel performance libraries make it happen” The Parallel Universe #26.
  13. Oleg Kremnyov et al. “Solving real-world machine learning problems with Intel Data Analytics Acceleration Library” The Parallel Universe #28.
  14. Oleg Kremnyov et al. “Dealing with outliers: How to find fraudulent transactions in a real-world dataset” The Parallel Universe #30.
  15. “Intel Threading Building Blocks celebrates 10 years!” The Parallel Universe, Special Edition.
  16. Vasanth Tovinkere et al. “Driving code performance with Intel Advisor Flow Graph Analyzer: Optimizing performance for an autonomous driving application” The Parallel Universe #30.
  17. Kevin O’Leary et al. “Intel Advisor Roofline Analysis: A new way to visualize performance trade-offs” The Parallel Universe #27.
  18. Brock A. Taylor “Is your cluster healthy? Must-have cluster diagnostics in Intel Cluster Checker” The Parallel Universe #30.

About the Author

Henry A. Gabb, Senior Principal Engineer at Intel Corporation, is a longtime high-performance and parallel computing practitioner. He has published numerous articles on parallel programming, computational life science, and cheminformatics. In case you couldn’t tell from the reference list, Henry is the editor of The Parallel Universe, Intel’s quarterly magazine devoted to software innovation. He was also editor and coauthor of Developing Multithreaded Applications: A Platform Consistent Approach and was the program manager of the Intel/Microsoft Universal Parallel Computing Research Centers.

For more complete information about compiler optimizations, see our Optimization Notice.