Unleash the Parallel Performance of Python* Programs

By Anton Malakhov,

Published:04/04/2016   Last Updated:04/04/2016

[updated 10/5/2018]

Threading composability

In the Beta release of Intel® Distribution for Python* we have introduced an experimental module, which unlocks additional performance for Python programs by composing threads coming from different Python modules better, i.e. by improving threading composability of compute-intensive modules.

Better threading composability can accelerate programs by avoiding inefficient threads allocation (called oversubscription) and related performance issues when there are more software threads than available hardware resources.

The biggest improvement is achieved when a task pool like the ThreadPool from standard library or libraries like Dask or Joblib execute tasks calling compute-intensive functions of Numpy/Scipy/PyDAAL and others, which in turn are parallelized using Intel® MKL or/and Intel® Threading Building Blocks (Intel® TBB).

The module implements a Pool class with the standard interface using Intel® TBB, which is used to replace Python’s Pool and ThreadPool implementations. Thanks to the monkey-patching technique, no source code modification is needed in order to enable TBB-based parallelism.


Everything is included into Intel® Distribution for Python already. You can also install tbb4py (starting from 2018.0.4 version or just tbb before that) from intelconda-forge, and default channels using conda install or from PyPi using pip install - just make sure that you are using Numpy with Intel® MKL. Also, oversubscription starts to impact performance with a big enough number of available CPUs, laptops are not usually affected.

Let’s try it!

For our example, we need Dask library, which makes nested parallelism implicit and natural for Python:

conda install dask

Now, let’s write a simple program (QR verification) in bench.py, which exploits nested parallelism and prints time spent for the computation, like the following:

import dask, time
import dask.array as da

x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()

q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert(test.compute()) # compute(scheduler="threads") by default

print(time.time() - t0)

Here, Dask splits the array into chunks of specified sizes and processes them in parallel using P worker threads, where P = number of CPUs. Each Dask thread executes expensive numpy operations, which are accelerated using Intel® MKL under the hood and thus are multi-threaded on their own. It results in nested parallelism, with up to P2 threads in total when running with default OpenMP settings.

To run it as is (baseline):

python bench.py

And to unlock additional performance using TBB-based composable threading:

python -m tbb bench.py

That's it! On my machine with 48 logical CPUs, I can get up to 2x difference in time. Depending on particular machine configuration and the number of processors, this number can be significantly different for you.

Disclaimers: TBB module does not work well for blocking I/O operations, it is applicable only for tasks, which do not block in the operating system. This version of TBB module is experimental and might be not sufficiently optimized and might not support all the use-cases.

For additional details on how to use the TBB module, please refer to built-in documentation, e.g. run `pydoc tbb` and `python -m tbb --help`.

See also

We’ll greatly appreciate your feedback and issues! Please get back to me, especially if you are interested enough to use it in your production/every-day environment.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.