Books - High Performance Parallelism Pearls

The two “Pearls” books contain an outstanding collection of code modernization examples, complete with discussions by software developers of how the code was modified and commentary on what worked as well as what did not! Code for these real-world applications is available for download whether or not you have bought the books. The figures are freely available as well, a real bonus for instructors who choose to use these examples when teaching code modernization techniques. The books, edited by James Reinders and Jim Jeffers, had 67 contributors for volume one and 73 contributors for volume two.

Experts wrote about their experiences in adding parallelism to their real-world applications. Most examples illustrate their results on processors and on the Intel® Xeon Phi™ coprocessor. The key issues of scaling, locality of reference, and vectorization are recurring themes, as each contributed chapter explains the thinking behind adding parallelism to the application. The actual code is shown and discussed, with step-by-step thinking and analysis of the results. While OpenMP* and MPI are the dominant methods for parallelism, the books also include usage of TBB, OpenCL, and other models. There is a balance of Fortran, C, and C++ throughout. With such a diverse collection of real-world examples, the opportunities to learn from other experts are quite amazing.


Volume 1 includes the following chapters:

Foreword by Sverre Jarp, CERN

Chapter 1: Introduction

Chapter 2: From ‘Correct’ to ‘Correct & Efficient’: A Hydro2D Case Study with Godunov’s Scheme

Chapter 3: Better Concurrency and SIMD on HBM

Chapter 4: Optimizing for Reacting Navier-Stokes Equations

Chapter 5: Plesiochronous Phasing Barriers

Chapter 6: Parallel Evaluation of Fault Tree Expressions

Chapter 7: Deep-Learning and Numerical Optimization

Chapter 8: Optimizing Gather/Scatter Patterns

Chapter 9: A Many-Core Implementation of the Direct N-body Problem

Chapter 10: N-body Methods

Chapter 11: Dynamic Load Balancing Using OpenMP 4.0

Chapter 12: Concurrent Kernel Offloading

Chapter 13: Heterogeneous Computing with MPI

Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor

Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment

Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors

Chapter 17: NWChem: Quantum Chemistry Simulations at Scale

Chapter 18: Efficient Nested Parallelism on Large-Scale Systems

Chapter 19: Performance Optimization of Black-Scholes Pricing

Chapter 20: Data Transfer Using the Intel COI Library

Chapter 21: High-Performance Ray Tracing

Chapter 22: Portable Performance with OpenCL

Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations

Chapter 24: Profiling-Guided Optimization

Chapter 25: Heterogeneous MPI Optimization with ITAC

Chapter 26: Scalable Out-of-Core Solvers on a Cluster

Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization

Chapter 28: Morton Order Improves Performance


Volume 2 includes the following chapters:

Foreword by Dan Stanzione, TACC

Chapter 1: Introduction

Chapter 2: Numerical Weather Prediction Optimization

Chapter 3: WRF Goddard Microphysics Scheme Optimization

Chapter 4: Pairwise DNA Sequence Alignment Optimization

Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery     

Chapter 6: Amber PME Molecular Dynamics Optimization

Chapter 7: Low Latency Solutions for Financial Services

Chapter 8: Parallel Numerical Methods in Finance    

Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization

Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism In Practice  

Chapter 11: Visual Search Optimization

Chapter 12: Radio Frequency Ray Tracing

Chapter 13: Exploring Use of the Reserved Core

Chapter 14: High Performance Python Offloading

Chapter 15: Fast Matrix Computations on Asynchronous Streams 

Chapter 16: MPI-3 Shared Memory Programming Introduction

Chapter 17: Coarse-Grain OpenMP for Scalable Hybrid Parallelism  

Chapter 18: Exploiting Multilevel Parallelism with OpenMP

Chapter 19: OpenCL: There and Back Again

Chapter 20: OpenMP vs. OpenCL: Difference in Performance?      

Chapter 21: Prefetch Tuning Optimizations

Chapter 22: SIMD functions via OpenMP

Chapter 23: Vectorization Advice  

Chapter 24: Portable Explicit Vectorization Intrinsics

Chapter 25: Power Analysis for Applications and Data Centers

