Unleash the Parallel Performance of Python* Programs

With the Beta release of Intel® Distribution for Python*, I am proud to introduce something new and unusual for the Python world: an experimental module that unlocks additional performance for multi-threaded Python programs by enabling threading composability between two or more thread-enabled libraries.

Threading composability can accelerate programs by avoiding inefficient thread allocation (known as oversubscription), which occurs when there are more active software threads than available hardware resources.
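To see what oversubscription looks like, here is a minimal sketch using only the standard library (the pool sizes are illustrative, not from the original article): an outer thread pool whose tasks each start their own inner pool, the way a threaded application calling into a threaded math library would.

```python
from multiprocessing.pool import ThreadPool
import threading

def inner_work(_):
    # Stand-in for a compute kernel that an underlying library
    # (e.g. Intel MKL) would parallelize internally.
    return threading.active_count()

def outer_task(_):
    # Each outer task starts its own inner pool, mimicking a
    # thread-enabled library that spawns its own worker threads.
    with ThreadPool(4) as inner:
        return max(inner.map(inner_work, range(4)))

with ThreadPool(4) as outer:
    peak = max(outer.map(outer_task, range(4)))

# With 4 outer workers each creating a 4-worker inner pool, the process
# briefly runs on the order of 4 * 4 worker threads (plus pool helper
# threads) -- oversubscription if the machine has fewer cores.
print("peak thread count:", peak)
```

On a machine with, say, 4 cores, the nested pools compete for the same hardware, and the scheduler wastes time switching between them instead of doing useful work.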

The biggest improvement is achieved when a task pool such as ThreadPool from the standard library, or a library like Dask or Joblib (used in multi-threading mode), executes tasks that call compute-intensive functions of NumPy/SciPy/PyDAAL, which in turn are parallelized using Intel® MKL and/or Intel® Threading Building Blocks (Intel® TBB).

The module implements a Pool class with the standard interface using Intel® TBB, which can be used to replace Python’s ThreadPool. Thanks to the monkey-patching technique implemented in the Monkey class, no source-code changes are needed to unlock the additional speedups.
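As a hedged sketch of that drop-in replacement (the exact import path and constructor signature of the experimental module should be confirmed with `pydoc TBB`; the ImportError fallback to the standard pool is my addition so the snippet runs even without the module installed):

```python
# Sketch only: assumes the experimental TBB module exposes its Pool class
# at top level -- verify with `pydoc TBB`. Falls back to the standard pool.
try:
    from TBB import Pool
except ImportError:
    from multiprocessing.pool import ThreadPool as Pool

def work(x):
    # Stand-in for a compute-intensive call into NumPy/SciPy
    return x * x

pool = Pool(4)
results = pool.map(work, range(8))
pool.close()
pool.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because both pools share the standard `map`/`close`/`join` interface, the calling code is identical either way, which is what makes the monkey-patching approach possible.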

Let’s try it!

Assuming you have installed Intel® Distribution for Python*, we need to install the Dask library, which makes parallelism very simple in Python:

source <path to Intel® Distribution for Python*>/bin/pythonvars.sh
conda install dask

Now, let’s write a simple program (a QR verification) in bench.py that exploits nested parallelism and prints the time spent on the computation:

import dask, time
import dask.array as da

# A 100000 x 2000 random matrix, split into ten 10000 x 2000 chunks
x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()

q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert test.compute()  # uses the threaded scheduler (dask.threaded.get) by default

print(time.time() - t0)

Here, Dask splits the array into chunks and processes them in parallel using multiple threads. However, each Dask task executes expensive operations that are accelerated by Intel® MKL under the hood and are therefore multi-threaded themselves. The result is nested parallelism, which is handled best with Intel® TBB.

To run it as is (baseline):

python bench.py

And to unlock additional performance:

python -m TBB bench.py

That's it! Depending on your machine configuration, and in particular on the number of processors, you can get about a 50% reduction in compute time for this example, and even more on machines with more processors.

Disclaimers: the TBB module does not work well with blocking I/O operations; it is applicable only to tasks that do not block in the operating system. This version of the TBB module is experimental and may not be sufficiently optimized or verified against all use cases.

For additional details on how to use the TBB module, please refer to the built-in documentation, e.g. by running `pydoc TBB`.

The module’s sources are available as a preview feature of the Intel TBB 4.4 Update 5 release.

See also my talk, "Composable Multi-threading for Python Libraries," presented at SciPy 2016 in Austin, Texas. More benchmarks can be found at github.com/IntelPython/composability_bench.

We’d greatly appreciate your feedback! Please get in touch, especially if you are interested in using the module in your production or everyday environment.



kehw:

Hi Anton,

How can I use the "-m TBB" magic in jupyter notebook?

I observed oversubscription when I run code in jupyter notebook.


jay d.:

I followed the instructions but am getting this:

(root) Stefan@SMALLPC C:\IntelPython3
> python bench.py
Traceback (most recent call last):
  File "bench.py", line 2, in <module>
    x = dask.array.random.random((100000, 2000), chunks=(10000, 2000))
AttributeError: module 'dask' has no attribute 'array'

'import dask' works fine so I know it's there.


Andrew D.:

Thank you for posting the example. I seem to have an issue where running "python -m TBB bench.py" provides only a modest (19.9 sec vs. 21.9 sec) performance improvement. But there is no evidence that Intel Python is actually using the Phi cores - micsmc does not show a change in core utilization.

micinfo, micheck seem to show that all is OK and I am using the intel Python and dask. 

Any thoughts?


gaston-hillar:

I do believe this feature is really great for Python developers that want to unleash parallel performance. It would be great to have more examples like this one.

Clay B.:

Great example, Anton.

My first run of this ran into an OpenMP "problem." I put that in quotes because Anton reminded me that I should have expected it, since Intel Python uses MKL in NumPy. In the example, the Dask array library creates its own threads and then makes calls into the threaded MKL. This all leads to a nested threading situation with multiple OpenMP parallel regions working concurrently. I was running on a platform with 88 logical threads; there were 88 Python threads, and each MKL instance also started 88 threads! Too many threads (88*88 = 7744) for the machine resources.

I can control this by setting OMP_NUM_THREADS and/or MKL_NUM_THREADS to reasonable values. But, as Anton demonstrated, it is much easier to add "-m TBB" to the execution command and let TBB handle the nesting than it is to remember to set and reset a couple of environment variables.
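For reference, a minimal sketch of that workaround (the cap of 4 threads is arbitrary, chosen here just for illustration; the key point is that these variables must be set before NumPy/SciPy is first imported):

```python
import os

# Must be set before NumPy/SciPy (and thus MKL/OpenMP) are first imported,
# since the runtimes read these variables at initialization time.
os.environ["OMP_NUM_THREADS"] = "4"  # cap OpenMP threads (value is illustrative)
os.environ["MKL_NUM_THREADS"] = "4"  # cap MKL threads

# import numpy as np  # MKL would now use at most 4 threads per call
print(os.environ["OMP_NUM_THREADS"], os.environ["MKL_NUM_THREADS"])
```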

This will teach me to pay attention and read things more carefully in future. Thanks again for the help, Anton.
