Large Matrix Operations with SciPy* and NumPy*: Tips and Best Practices

Introduction

Large matrix operations are the cornerstones of many important numerical and machine learning applications. In this article, we provide some recommendations for using operations in Scipy or Numpy for large matrices with more than 5,000 elements in each dimension.

General Advice for Setting up Python*

Use the latest version of Intel® Distribution for Python* (version 2018.0.0 at the time of this writing), preferably with Python (version 3.6), which offers better memory deallocation and timing performance. This means downloading the Miniconda* installation script (that is, Miniconda3-latest-Linux-x86_64.sh) for Python (version 3.6). After downloading the script, you can use bash commands like the following to run it and install conda*, the Python package manager that we will use throughout this guide:

INSTALLATION_DIR=$HOME/miniconda3
bash <DOWNLOAD_PATH>/Miniconda3-latest-Linux-x86_64.sh -b -p $INSTALLATION_DIR -f
CONDA=${INSTALLATION_DIR}/bin/conda
ACTIVATE=${INSTALLATION_DIR}/bin/activate

For the installation of the Python packages from Intel, you can create a conda environment called idp as follows:

$CONDA create -y -c intel -n idp intelpython3_core python=3 

To activate the conda environment for use, run

source $ACTIVATE idp

Then you will be able to use the installed packages from the Intel Distribution for Python.
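As a quick sanity check (a minimal sketch; the exact output varies by NumPy version), you can confirm that the installed Numpy reports its build configuration, which for the Intel Distribution for Python lists Intel® Math Kernel Library (Intel® MKL) among the linked BLAS/LAPACK libraries:

```python
import numpy as np

# Print the Numpy version and the BLAS/LAPACK configuration it was built with.
# In the Intel Distribution for Python, the linked libraries include "mkl".
print(np.__version__)
np.show_config()
```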

Tips for Using the Matrix Operations

These recommendations may help you obtain faster computational performance for large matrix operations on compatible Intel® processors. Our benchmarks show substantial speedups for these large matrix operations when they run in parallel on server-class Intel processors, such as Intel® Xeon® processors and Intel® Xeon Phi™ processors. These speedups come from parallel computation at the multithreading layer of the Numpy and Scipy libraries.

We based the five tips described below on the performance observed for the matrix operation benchmarks, which are maintained in a version-controlled repository. After activating the conda environment for the Intel Distribution for Python, you can install our benchmarks with these steps at the bash command line:

git clone https://github.com/IntelPython/ibench.git
cd ibench
python setup.py install

The example SLURM job script for running the benchmark on an Intel Xeon Phi processor (formerly code-named Knights Landing) can be found at the following link, while the example SLURM job script for an Intel Xeon processor can be found at the following link.

Tip 1: Tune the relevant environmental variable settings

To make the best use of the multithreading capabilities of server-class Intel processors, we recommend you tune the threading and memory allocation settings accordingly. Factors that can affect the performance of the matrix operations include:

  • Shape and size of the input matrix
  • Matrix operation used
  • Amount of computational resources on the system
  • Usage pattern of computational resources by the specific code base that runs the matrix operation(s)

For Intel® Xeon® Phi™ processors

Some example (bash environmental) settings are listed below as a baseline for tuning the performance.

The first set of parameters determines how many threads will be used.

export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS / 4  ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS / 4  ))

The parameters OMP_NUM_THREADS and MKL_NUM_THREADS specify the number of threads for the matrix operations. On an Intel Xeon Phi processor node dedicated for the computation job, we recommend using one thread per available core as a starting point for tuning the performance.

The second set of parameters specifies the behavior of each thread.

export KMP_BLOCKTIME=800
export KMP_AFFINITY=granularity=fine,compact
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t

The parameter KMP_BLOCKTIME specifies how long a thread should stay active after the completion of a compute task. When KMP_BLOCKTIME is longer than the default value of 200 milliseconds, there will be less overhead for waking up the thread for subsequent computation(s).

The KMP_AFFINITY parameter dictates the placement of neighboring OpenMP thread contexts. For instance, the compact setting assigns OpenMP thread <n>+1 to a free thread context as close as possible to the context where OpenMP thread <n> was placed. A detailed guide to the KMP_AFFINITY setting can be found at this link.

The KMP_HW_SUBSET setting specifies the number of threads placed onto each processor core.

The last parameter controls the memory allocation behavior.

export HPL_LARGEPAGE=1

The parameter HPL_LARGEPAGE enables a large page size for the memory allocation of data objects. Huge pages can enable more efficient memory access patterns for large matrices because of better translation lookaside buffer (TLB) behavior.

For Intel® Xeon® processors (processors released with a code-name newer than Haswell)

Some (bash environmental) example settings are:

export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS ))
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t  
export HPL_LARGEPAGE=1
export KMP_BLOCKTIME=800

Tip 2: Use Fortran* memory layout for Numpy arrays (assuming matrix operations are called repeatedly)

Numpy arrays can use either Fortran memory layout (column-major order) or C memory layout (row-major order). Many Numpy and Scipy APIs are implemented with LAPACK and BLAS, which require Fortran memory layout. If a C-layout Numpy array is passed to a Numpy or Scipy API that uses Fortran order internally, the routine must first make a transposed copy, which is costly. If a Numpy array is used repeatedly, convert it to Fortran order before the first use.

The Python function that can enable this memory layout conversion is numpy.asfortranarray. Here is a short code example:

import numpy as np
matrix_input = np.random.rand(5000, 5000)
matrix_fortran = np.asfortranarray(matrix_input)
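You can verify the layout conversion through the array's flags attribute; the values are unchanged, and only the memory order differs (a smaller matrix is used here to keep the example quick):

```python
import numpy as np

matrix_input = np.random.rand(1000, 1000)         # C (row-major) order by default
matrix_fortran = np.asfortranarray(matrix_input)  # copy in Fortran (column-major) order

# The flags attribute reports the memory layout of each array.
print(matrix_input.flags['C_CONTIGUOUS'])     # True
print(matrix_fortran.flags['F_CONTIGUOUS'])   # True

# The conversion changes the layout, not the contents.
print(np.array_equal(matrix_input, matrix_fortran))  # True
```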

Tip 3: Save the result of a matrix operation in the input matrix (kwargs: overwrite_a=True)

It is natural to obtain large outputs from matrix operations that have large matrices as inputs. The creation of additional data structures can add overhead.

Many Scipy matrix linear algebra functions have an optional parameter called overwrite_a, which can be set to True. This option makes the function provide the result by overwriting an input instead of allocating a new Numpy array. Using the LU decomposition from the Scipy library as an example, we have:

import scipy.linalg
p, l, u = scipy.linalg.lu(a=matrix, overwrite_a=True)
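A more complete sketch (using a small random matrix for illustration) shows the pattern; note that after the call, the input array is no longer guaranteed to hold its original values:

```python
import numpy as np
from scipy.linalg import lu

# Fortran-ordered input, per Tip 2, so no internal copy is needed.
matrix = np.asfortranarray(np.random.rand(500, 500))
reference = matrix.copy()  # kept here only to verify the result

# overwrite_a=True lets the routine reuse the input array's storage.
p, l, u = lu(matrix, overwrite_a=True)

# The factorization still reconstructs the original matrix.
print(np.allclose(p @ l @ u, reference))  # True
```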

Tip 4: Use the TBB or SMP Python modules to avoid the oversubscription of threads

Efficient parallelism is a known recipe for unlocking the performance on an Intel processor with more than one core. The reasoning about parallelism within Python, however, is sometimes less than transparent. Individual Python libraries may implement their own mechanism and level of parallelism. When a combination of Python libraries is used, it can result in an unintended usage pattern of the computational resources. For example, some Python modules (for example, multiprocessing) and libraries (for example, Scikit-Learn*) introduce parallelism into some functions by forking multiple Python processes. Each of these Python processes can spin up a number of threads as specified by the library.

Sometimes the number of threads can be bigger than the available amount of CPU resources in a way that is known as thread oversubscription. Thread oversubscription can slow down a computational job. Within the Intel Distribution for Python, several modules help to manage threads more efficiently by avoiding oversubscription. These are the Static Multi-Processing library (SMP) and the Python module for Intel® Threading Building Blocks (Intel® TBB).
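As a rough illustration of the arithmetic behind oversubscription (the process and thread counts below are hypothetical assumptions, not measurements): if a library forks worker processes and each process runs an MKL-threaded routine with the default thread count, the total number of active threads can exceed the number of logical CPUs:

```python
import os

logical_cpus = os.cpu_count()

# Hypothetical scenario: 8 worker processes (e.g. from multiprocessing),
# each defaulting to one MKL thread per logical CPU.
n_worker_processes = 8
threads_per_process = logical_cpus

total_threads = n_worker_processes * threads_per_process
if total_threads > logical_cpus:
    print(f"Oversubscribed: {total_threads} threads on {logical_cpus} CPUs")
```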

In Table 1, Intel engineers Anton Gorshkov and Anton Malakhov provide recommendations for which module to use for a Python application based on the parallelism characteristics, in order to achieve good thread subscription for parallel Python applications. These recommendations can serve as a starting point for tuning the parallelism.

Table 1. Module recommendations based on parallelism characteristics.

  Innermost             Outermost parallelism level
  parallelism level     Balanced work,        Balanced work,        Unbalanced work
                        low subscription      high subscription
  -------------------   -------------------   -------------------   -----------------------------
  Low subscription      Python                Python with SMP       Python with Intel® Threading
                                                                    Building Blocks
  High subscription     KMP_COMPOSABILITY

Below we give the exact command line instructions for each entry in the table.

The SMP and Intel TBB modules can be easily installed in an Anaconda* or a Miniconda installation of Python using the conda installer commands:

conda install -c intel smp  
conda install -c intel tbb

More specifically, we explain the bash commands for each possible scenario outlined in Table 1. Assume the Python script that we want to run is called PYTHONPROGRAM.py; the syntax for invoking the script at a bash command line is:

python PYTHONPROGRAM.py                  # No change 

If we want to use the Python modules with the script, we invoke one of the Python modules by supplying one of them as the argument of the -m flag of the Python executable:

python -m smp PYTHONPROGRAM.py           # Use SMP with Python script
python -m tbb PYTHONPROGRAM.py           # Use TBB with Python script

Additionally, the TBB module has various options. For example, we can supply the -p flag to the TBB module to limit the maximum number of threads:

python -m tbb -p $MAX_NUM_OF_THREADS PYTHONPROGRAM.py

And for a Python script with high thread subscription in the inner parallelism level and unbalanced work, we can set the variable KMP_COMPOSABILITY for a bash shell as follows:

KMP_COMPOSABILITY=mode=exclusive python PYTHONPROGRAM.py

To read about additional resources for Intel TBB and SMP, we recommend the SciPy 2017 tutorial given by Anton Malakhov. Or you can find a tutorial at Unleash the Parallel Performance of Python* Programs.

Tip 5: Turn on transparent hugepages for memory allocation on Cray Linux*

The Linux* OS has a feature called hugepages that can accelerate programs with a large memory footprint by 50 percent or more. For many Linux systems, the hugepages functionality is transparently provided by the kernel. Enabling hugepages for Python on Cray Linux requires some extra steps. Our team at Intel has verified that hugepages are needed to achieve the best possible performance for some large matrix operations (> 5000 x 5000 matrix size). These operations include LU, QR, SVD, and Cholesky decompositions.

First, on the relevant Cray system, check that the system hugepages module is loaded. For example, if Tcl* modules are used, you can find and load the system hugepages module via instructions similar to:

module avail craype-hugepages2M
module load craype-hugepages2M

You will also need to run the one-line installation instruction of the hugetlbfs package within a Conda Python environment or installation:

conda install -c intel hugetlbfs

After that, the Python binary within the conda environment will allocate the memory for objects using hugepages.

For more complete information about compiler optimizations, see our Optimization Notice.