Unleash the Parallel Performance of Python* Programs

In the Beta release of Intel® Distribution for Python* we introduced an experimental module which unlocks additional performance for multi-threaded Python programs by enabling threading composability between two or more thread-enabled libraries.

Threading composability can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when there are more software threads than available hardware resources.

The biggest improvement is achieved when a task pool, such as the ThreadPool from the standard library, or a library like Dask or Joblib executes tasks that call compute-intensive functions of NumPy, SciPy, PyDAAL, and others, which in turn are parallelized using Intel® MKL and/or Intel® Threading Building Blocks (Intel® TBB).

The module implements a Pool class with the standard interface using Intel® TBB, which can be used to replace Python’s Pool and ThreadPool. Thanks to monkey-patching, no source code changes are needed to unlock the additional speedups.
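As a rough illustration of what "no source code change" means, here is a minimal sketch of a script that uses only the standard-library ThreadPool. The `work` and `run` functions and the task sizes are made up for this example; the point is that nothing in the script refers to TBB, so a script of this shape is the kind of code the monkey-patching is meant to pick up unchanged.

```python
from multiprocessing.pool import ThreadPool


def work(n):
    # Hypothetical stand-in for a compute-intensive call
    # (for example, a NumPy routine backed by Intel MKL).
    return sum(i * i for i in range(n))


def run(tasks, workers=4):
    # Plain standard-library pool; no TBB-specific code appears in the script.
    with ThreadPool(workers) as pool:
        return pool.map(work, tasks)


results = run([10, 100, 1000])
print(results)
```

Whether such a script actually speeds up depends on the work being compute-bound and internally threaded; this toy `work` function is pure Python and is only there to keep the sketch self-contained.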


Everything is already included in Intel® Distribution for Python. You can also install the package from the intel channel on anaconda.org; it is named tbb4py starting from version 2018.0.4, and just tbb before that.

Let’s try it!

For our example, we need the Dask library, which makes parallelism very simple for Python:

source <path to Intel® Distribution for Python*>/bin/pythonvars.sh
conda install dask

Now, let’s write a simple program (QR verification) in bench.py that exploits nested parallelism and prints the time spent on the computation, like the following:

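import time
import dask
import dask.array as da

# 100000 x 2000 random matrix, split into row chunks of 10000 x 2000
x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()

q, r = da.linalg.qr(x)  # lazy: only builds the task graph
test = da.all(da.isclose(x, q.dot(r)))  # check that Q*R reproduces x
assert test.compute()  # compute(get=dask.threaded.get) by default

print(time.time() - t0)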

Here, Dask splits the array into chunks and processes them in parallel using multiple threads. But each Dask task executes expensive operations, which are accelerated using Intel® MKL under the hood and are thus multi-threaded themselves. This results in nested parallelism, which is handled best with Intel® TBB.
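To get a feel for why nested parallelism can hurt, the following sketch estimates the worst-case number of software threads. It assumes, purely for illustration, that both the outer task pool and the inner threaded library default to one thread per logical CPU and that every outer task starts a full inner team at the same time; the function names are hypothetical and real runtimes may behave differently.

```python
import os


def software_threads(outer_workers, inner_threads):
    # Worst case: every outer task starts a full inner thread team at once.
    return outer_workers * inner_threads


def oversubscription_factor(outer_workers, inner_threads, hardware_threads):
    # Ratio of software threads to hardware threads; above 1.0 means oversubscription.
    return software_threads(outer_workers, inner_threads) / hardware_threads


hw = os.cpu_count() or 1  # logical CPUs visible to Python
# Assumption: both layers default to one thread per logical CPU.
print(hw, software_threads(hw, hw), oversubscription_factor(hw, hw, hw))
```

Under these assumptions, a machine with 88 logical CPUs could end up with 88 * 88 = 7744 software threads competing for 88 hardware threads.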

To run it as is (baseline):

python bench.py

And to unlock additional performance:

python -m tbb bench.py

That's it! Depending on the machine configuration, and in particular on the number of processors, you can get about a 50% reduction in compute time for this particular example, or even more on machines with more processors.

Disclaimers: The TBB module does not work well with blocking I/O operations; it is applicable only to tasks that do not block in the operating system. This version of the TBB module is experimental and might not be sufficiently optimized or verified against different use cases.

For additional details on how to use the TBB module, please refer to built-in documentation, e.g. run `pydoc tbb` and `python -m tbb --help`.

This module is available in source form as a preview feature of the Intel TBB 4.4 Update 5 release.

See also my talk "Composable Multi-threading for Python Libraries" at SciPy 2016 in Austin, Texas. More benchmarks can be found here: github.com/IntelPython/composability_bench

We’ll greatly appreciate your feedback! Please get back to me, especially if you are interested enough to use it in your production/every-day environment.



Anton Malakhov (Intel)

Updated the blog with the new name of the TBB module for Python package (tbb4py), a mention of the built-in help (python -m tbb --help), and other minor fixes.

Anton Malakhov (Intel)

Hi, it looks like I missed a bunch of comments, sorry. Answering now.

Jupyter/IPython do not support it the same way as python itself, since they are loaded before TBB gets a chance to monkey-patch the standard library. But if you run Jupyter/IPython as a module, you can use TBB's option to load a module, the same way it works with plain python, e.g.: `python -m tbb -m IPython`

It's not clear what one might want from Intel TBB without any positive context here. TBB is a native library that is popular for multithreaded code in general. In a Python environment, TBB as a C++ library can be beneficial for building C extensions for Python. As a Python module, it can help compute-intensive native C extensions built on top of TBB and Python's own multithreaded code to co-exist better across separate components. However, it is largely limited to compute-intensive code, since oversubscription does not usually hurt I/O and can even help I/O applications.

Use `da.random.random` if you import `dask.array` as `da`.

Ranjan, Rajeev (Intel)

A very superficial question: if I am not using SciPy, NumPy, or any other compute-intensive libraries, can I still exploit the goodness of Intel TBB?

kehw

Hi Anton,

How can I use the "-m TBB" magic in jupyter notebook?

I observed oversubscription when I run code in jupyter notebook.


jay d.

I followed the instructions but am getting this:

(root) Stefan@SMALLPC C:\IntelPython3
> python bench.py
Traceback (most recent call last):
  File "bench.py", line 2, in <module>
    x = dask.array.random.random((100000, 2000), chunks=(10000, 2000))
AttributeError: module 'dask' has no attribute 'array'

'import dask' works fine so I know it's there.


Andrew D.


Thank you for posting the example. I seem to have an issue: running "python -m TBB bench.py" provides a performance improvement (19.9 sec vs 21.9 sec), but there is no evidence that Intel Python is actually using the Phi cores. micsmc does not show a change in core utilization.

micinfo and miccheck seem to show that all is OK, and I am using the Intel Python and dask.

Any thoughts?


gaston-hillar


I do believe this feature is really great for Python developers that want to unleash parallel performance. It would be great to have more examples like this one.

Clay B.

Great example, Anton.

My first run of this ran into an OpenMP "problem." I put that in quotes because Anton reminded me that I should have expected it, since the Intel Python uses MKL in NumPy. In the example, the Dask array library creates its own threads and then makes calls to the threaded MKL. This all leads to a nested threading situation with multiple OpenMP parallel regions working concurrently. I was running on a platform with 88 logical threads; there were 88 Python threads and each MKL instance also started 88 threads! Too many threads (88*88=7744) for the machine resources.

I can control this by setting OMP_NUM_THREADS and/or MKL_NUM_THREADS to reasonable values. But, as Anton demonstrated, it is much easier to type "-m TBB" to the execution command and let TBB handle the nesting than it is to try to remember to set/reset a couple environment variables.

This will teach me to pay attention and read things more carefully in future. Thanks again for the help, Anton.

