Unleash the Parallel Performance of Python* Programs

[updated 10/5/2018]

Threading composability

In the Beta release of Intel® Distribution for Python* we introduced an experimental module that unlocks additional performance for Python programs by better composing threads coming from different Python modules, i.e. by improving the threading composability of compute-intensive modules.

Better threading composability can accelerate programs by avoiding inefficient thread allocation (called oversubscription) and the related performance issues that arise when there are more software threads than available hardware resources.
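To see why oversubscription happens, consider what nested thread pools do. The sketch below is a made-up illustration using only the standard library (no TBB involved): each of the P outer workers starts its own pool of P threads, so the peak software-thread count grows roughly as P squared.

```python
# Illustration only: nested thread pools multiply the software-thread count.
# In a real workload the inner pool would be the OpenMP team inside an
# MKL-backed NumPy call rather than an explicit ThreadPool.
import os
from multiprocessing.pool import ThreadPool

ncpus = os.cpu_count() or 1

def inner_work(i):
    # Stand-in for a compute kernel.
    return i * i

def outer_task(i):
    # Each outer worker starting its own pool multiplies the thread count:
    # up to ncpus * ncpus software threads exist at the peak.
    with ThreadPool(ncpus) as inner:
        return sum(inner.map(inner_work, range(10)))

with ThreadPool(ncpus) as outer:
    results = outer.map(outer_task, range(ncpus))

print(results)  # one identical partial sum per outer worker
```

With many CPUs, those quadratically many threads contend for the same cores, which is exactly the inefficiency the composability module targets.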

The biggest improvement is achieved when a task pool such as the standard library's ThreadPool, or a library like Dask or Joblib, executes tasks that call compute-intensive functions of NumPy/SciPy/pyDAAL and others, which in turn are parallelized using Intel® MKL and/or Intel® Threading Building Blocks (Intel® TBB).

The module implements a Pool class with the standard interface on top of Intel® TBB, and this class is used to replace Python's Pool and ThreadPool implementations. Thanks to the monkey-patching technique, no source-code modification is needed to enable TBB-based parallelism.
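The monkey-patching idea itself is simple. The following is a minimal sketch of the general technique, not the tbb module's actual internals: a drop-in replacement class (here a hypothetical LoggingThreadPool that merely counts instantiations) is installed into multiprocessing.pool before user code looks the name up, so callers need no source changes.

```python
# Sketch of monkey-patching a standard pool class.
# LoggingThreadPool is a made-up stand-in; the real tbb module substitutes
# a TBB-backed Pool with the same standard interface.
import multiprocessing.pool

_OriginalThreadPool = multiprocessing.pool.ThreadPool

class LoggingThreadPool(_OriginalThreadPool):
    """Drop-in replacement that just records how often it is created."""
    created = 0

    def __init__(self, *args, **kwargs):
        LoggingThreadPool.created += 1
        super().__init__(*args, **kwargs)

# Install the replacement before user code imports the name.
multiprocessing.pool.ThreadPool = LoggingThreadPool

# Unmodified user code written against the standard interface
# now transparently gets the patched class:
from multiprocessing.pool import ThreadPool

with ThreadPool(2) as pool:
    squares = pool.map(lambda x: x * x, range(5))

print(squares)  # [0, 1, 4, 9, 16]
print(LoggingThreadPool.created)  # 1
```

Running a script via `python -m tbb` applies the real substitution in the same spirit before the script's own imports execute.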


Everything is already included in Intel® Distribution for Python. You can also install tbb4py (starting with version 2018.0.4; just tbb before that) from the intel, conda-forge, and default channels using conda install, or from PyPI using pip install; just make sure that you are using NumPy with Intel® MKL. Also note that oversubscription starts to impact performance only with a large enough number of available CPUs, so laptops are not usually affected.

Let’s try it!

For our example, we need the Dask library, which makes nested parallelism implicit and natural for Python:

conda install dask

Now, let's write a simple program (a QR verification) in bench.py that exploits nested parallelism and prints the time spent on the computation, like the following:

import dask, time
import dask.array as da

x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()

q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert(test.compute()) # compute(scheduler="threads") by default

print(time.time() - t0)

Here, Dask splits the array into chunks of the specified size and processes them in parallel using P worker threads, where P is the number of CPUs. Each Dask thread executes expensive NumPy operations, which are accelerated by Intel® MKL under the hood and are thus multi-threaded on their own. This results in nested parallelism, with up to P² threads in total when running with default OpenMP settings.
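A quick back-of-the-envelope check makes the P² figure concrete (assuming the defaults described above: P Dask workers, each triggering an MKL/OpenMP team of up to P threads):

```python
# Estimate the peak software-thread count for the nested case,
# assuming default settings: P Dask workers, each NumPy/MKL call
# spawning an OpenMP team of up to P threads.
import os

P = os.cpu_count() or 1
dask_workers = P
threads_per_worker = P  # OpenMP team inside each MKL call

total = dask_workers * threads_per_worker
print(f"{P} CPUs -> up to {total} software threads (P^2)")
```

On a 48-CPU machine like the one used below, that is up to 2304 software threads competing for 48 cores.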

To run it as is (baseline):

python bench.py

And to unlock additional performance using TBB-based composable threading:

python -m tbb bench.py

That's it! On my machine with 48 logical CPUs, I see up to a 2x difference in execution time. Depending on the particular machine configuration and the number of processors, this number can differ significantly for you.

Disclaimers: the TBB module does not work well for blocking I/O operations; it is applicable only to tasks that do not block in the operating system. This version of the TBB module is experimental: it might not be sufficiently optimized and might not support all use cases.

For additional details on how to use the TBB module, please refer to the built-in documentation, e.g. run `pydoc tbb` and `python -m tbb --help`.


We'll greatly appreciate your feedback and issue reports! Please get back to me, especially if you are interested enough to use it in your production or everyday environment.



Anton Malakhov (Intel):

Updated the blog with the new name of the TBB module for the Python package (tbb4py), a mention of the built-in help (python -m tbb --help), and other minor fixes.

Anton Malakhov (Intel):

Hi, it looks like I missed a bunch of comments, sorry. Answering now.

Jupyter/IPython do not support it the same way as python itself, since the kernel is loaded beforehand in a separate process, which is unaffected by monkey-patching. However, if you configure IPython kernels to run with `-m tbb`, it will work the same way. Please check out https://github.com/IntelPython/composability_bench/tree/master/scipy2018_demo#jupyter-notebook for an example.

It's not clear what one might want from Intel TBB without more context here. TBB is a native library that is popular for multithreaded code in general. Speaking about the Python environment: as a C++ library, TBB can be beneficial for building C extensions for Python. As a Python module, it can help compute-intensive native C extensions built on top of TBB and Python's multithreaded code to better co-exist across separate components. However, it is rather limited to compute-intensive code, since oversubscription does not usually hurt I/O; if anything, it helps I/O-bound applications.

Use `da.random.random` if you `import dask.array as da`.

Ranjan, Rajeev (Intel):

A very superficial question: if I am not using SciPy, NumPy, or any other compute-intensive libraries, can I still exploit the goodness of Intel TBB?

kehw:

Hi Anton,

How can I use the "-m TBB" magic in jupyter notebook?

I observed oversubscription when I run code in jupyter notebook.


jay d.:

I followed the instructions but am getting this:

(root) Stefan@SMALLPC C:\IntelPython3
> python bench.py
Traceback (most recent call last):
  File "bench.py", line 2, in <module>
    x = dask.array.random.random((100000, 2000), chunks=(10000, 2000))
AttributeError: module 'dask' has no attribute 'array'

'import dask' works fine so I know it's there.


Andrew D.:


Thank you for posting the example. I seem to have an issue where running "python -m TBB bench.py" provides only a modest performance improvement (19.9 sec vs 21.9 sec). But there is no evidence that Intel Python is actually using the Phi cores: micsmc does not show a change in core utilization.

micinfo and micheck seem to show that all is OK and that I am using the Intel Python and dask.

Any thoughts?


gaston-hillar:


I do believe this feature is really great for Python developers who want to unleash parallel performance. It would be great to have more examples like this one.

