What is Code Modernization?

Modern high-performance computers are built with a combination of resources, including multi- and manycore processors, large caches, fast memory, high-bandwidth inter-processor communications fabric, and broad support for I/O capabilities. High-performance software needs to be designed to take full advantage of this wealth of resources.

Whether re-architecting or tuning existing applications for maximum performance or creating new applications for existing and future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources.

Consider this a starting point for information regarding code modernization.


Vectorization

A key ingredient for good parallel performance on modern hardware is taking full advantage of vector instructions, also known as Single Instruction, Multiple Data (SIMD) instructions. Learn how to optimize scalar and serial operations by maintaining the proper precision, typing constants correctly, and using appropriate functions and precision flags.
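As a minimal sketch of both points, the C functions below show a loop written so a compiler can auto-vectorize it, and a single-precision constant that avoids an unintended promotion to double. The function names `saxpy` and `halve` are illustrative choices, not taken from any particular library, and whether the loops actually vectorize depends on the compiler and its optimization flags.

```c
#include <stddef.h>

/* Scale-and-add kernel: y[i] = a * x[i] + y[i].
 * `restrict` promises the compiler that x and y do not overlap, which
 * removes an aliasing concern that can block auto-vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Halve every element in single precision.
 * The `f` suffix keeps the constant a float. Writing 0.5 instead would
 * make it a double, so the language rules would convert each element to
 * double, multiply, and convert back to float. */
void halve(size_t n, float *restrict v)
{
    for (size_t i = 0; i < n; ++i)
        v[i] *= 0.5f;
}
```

Both loops have a simple trip count, unit-stride accesses, and no dependence between iterations, which are the properties auto-vectorizers generally look for.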


Multithreading

Get more done by increasing the number of active threads in your software, and take advantage of all of the available cores on modern hardware.
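One common way to spread a loop across cores in C is an OpenMP directive. The sketch below is illustrative only: it assumes an OpenMP-capable compiler (for example, GCC or Clang with `-fopenmp`), and the function name `dot` is a hypothetical choice. Without OpenMP enabled, the pragma is ignored and the loop simply runs on one thread.

```c
#include <stddef.h>

/* Dot product with the loop iterations shared among threads.
 * With OpenMP enabled, each thread accumulates a private partial sum and
 * the reduction clause adds the partial sums together at the end.
 * Without OpenMP, the pragma is ignored and the loop runs serially. */
double dot(const double *a, const double *b, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The reduction clause matters here: having every thread update one shared `sum` without it would be a data race.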

Multinode (Cluster)

A cluster architecture can achieve high levels of parallel performance that scale with the algorithm. Learn how to use the Message Passing Interface (MPI) and the distributed memory model to architect your applications.
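In the distributed memory model, each process owns only part of the data, so a first design step is often deciding which slice belongs to which process. The sketch below shows one possible block decomposition in plain C. The names `range_t` and `block_range` are hypothetical. In an MPI program, `rank` and `size` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`; those calls are left out so the example compiles without an MPI installation.

```c
#include <stddef.h>

/* Half-open index range [begin, end) owned by one process. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Split n items among `size` processes as evenly as possible.
 * The first (n % size) ranks each receive one extra item. */
range_t block_range(size_t n, int rank, int size)
{
    size_t base  = n / (size_t)size;   /* items every rank gets */
    size_t extra = n % (size_t)size;   /* leftover items to hand out */
    size_t r     = (size_t)rank;
    range_t out;
    out.begin = r * base + (r < extra ? r : extra);
    out.end   = out.begin + base + (r < extra ? 1 : 0);
    return out;
}
```

Each process would then loop only over its own range and exchange boundary or summary data with other processes through message passing.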

Memory Optimization

On all systems, from laptops to supercomputers, the cores can only operate at full compute capacity if they are provided with data at the maximum rate at which they can process it. Therefore, for both HPC and regular applications, performance is higher when the majority of memory requests hit nearby caches. When they do not, vectorizing and parallelizing the code can be ineffective. Learn to recognize and fix this situation.
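A small illustration of the problem is loop order over a two-dimensional array. The sketch below assumes a matrix stored in row-major order in one flat array; the function names are hypothetical. Both functions return the same sum, but the first touches memory sequentially while the second strides through it, which tends to miss in cache once the matrix is large.

```c
#include <stddef.h>

/* Cache-friendly: the inner loop walks consecutive addresses (stride 1),
 * so each cache line fetched from memory is fully used. */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];
    return s;
}

/* Cache-unfriendly: the inner loop jumps `cols` elements at a time,
 * so consecutive accesses usually land on different cache lines. */
double sum_col_order(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
    return s;
}
```

Timing the two versions on a matrix much larger than the last-level cache is one simple way to see whether a code is limited by memory access rather than arithmetic.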

Non-Uniform Memory Access (NUMA)

You need the compute power of multicore Intel® Xeon® processors, but the system’s DIMMs are no match for the needs of the many vector processing units, so your program stalls. Learn to change your application’s data access characteristics so that the L1 and L2 caches provide the needed 10,000+ GB/s, and end the delays spent waiting for data from the 90+ GB/s DIMMs.
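One widely used way to change data access characteristics is cache blocking (tiling), where work is reorganized so that a small block of data is reused while it is still in cache. The sketch below applies the idea to a matrix transpose. The function name `transpose_tiled` and the tile edge of 32 are assumptions for illustration; a suitable tile size depends on the cache sizes of the target processor.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune for the target cache */

/* Transpose an n x n row-major matrix `src` into `dst`, one
 * TILE x TILE block at a time. Within a block, both the rows read
 * from `src` and the rows written to `dst` stay within a small
 * working set that can remain in cache. */
void transpose_tiled(size_t n, const double *src, double *dst)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block so partial tiles at the edges are handled. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

A plain two-loop transpose produces the same result; the tiled version only changes the order in which elements are visited.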