In the Beta release of Intel® Distribution for Python* I am proud to introduce something new and unusual for the Python world: an experimental module that unlocks additional performance for multi-threaded Python programs by enabling threading composability between two or more thread-enabled libraries.
Threading composability can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when there are more active software threads than available hardware resources.
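To make the problem concrete, here is a toy sketch (not taken from the module itself) of how oversubscription arises: an outer thread pool sized to the core count whose tasks each open their own pool of the same size, multiplying the number of software threads well past the hardware limit.

```python
import os
from multiprocessing.pool import ThreadPool

ncpu = os.cpu_count() or 4

def inner_task(i):
    # Each outer task opens its own pool, so the total number of
    # software threads is roughly ncpu * ncpu -- far more than the
    # ncpu hardware threads available. That is oversubscription.
    with ThreadPool(ncpu) as inner:
        return sum(inner.map(lambda x: x * x, range(10)))

with ThreadPool(ncpu) as outer:  # one worker per core
    results = outer.map(inner_task, range(ncpu))
print(results)  # each entry is 285, the sum of squares 0..9
```

This is exactly the shape of a Dask or Joblib program calling into MKL-parallelized NumPy functions, where the inner threads are created by the native library rather than by Python.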
The biggest improvement is achieved when a task pool such as the ThreadPool from the standard library, or a library like Dask or Joblib (used in multi-threading mode), executes tasks that call compute-intensive functions of NumPy/SciPy/PyDAAL, which in turn are parallelized using Intel® MKL and/or Intel® Threading Building Blocks (Intel® TBB).
The module implements a Pool class with the standard interface using Intel® TBB, which can be used to replace Python's ThreadPool. Thanks to the monkey-patching technique implemented in the Monkey class, no source code changes are needed to unlock the additional speedups.
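For illustration, this is the kind of ordinary code that benefits without modification. Everything below is standard library; as described above, the module's Pool mirrors this interface, so running the script via the module (rather than plain `python`) transparently substitutes the TBB-backed pool.

```python
from multiprocessing.pool import ThreadPool

def work(n):
    # A stand-in for a compute-intensive task.
    return n * n

# Ordinary use of the standard Pool interface. No changes are needed
# here for the monkey-patching to take effect at launch time.
with ThreadPool(4) as pool:
    squares = pool.map(work, range(8))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```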
Let’s try it!
Assuming you have installed Intel® Distribution for Python*, we also need to install the Dask library, which makes parallelism very simple in Python:
source <path to Intel® Distribution for Python*>/bin/pythonvars.sh
conda install dask
Now, let's write a simple program (a QR verification) in bench.py that exploits nested parallelism and prints the time spent on the computation, like the following:
import dask, time
import dask.array as da

x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()
q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert(test.compute())  # compute(get=dask.threaded.get) by default
print(time.time() - t0)
Here, Dask splits the array into chunks and processes them in parallel using multiple threads. But each Dask task executes expensive operations that are accelerated using Intel® MKL under the hood and are thus multi-threaded by themselves. The result is nested parallelism, which is handled best with Intel® TBB.
To run it as is (baseline):

python bench.py
And to unlock additional performance:
python -m TBB bench.py
That's it! Depending on the machine configuration, and in particular on the number of processors, you can get about a 50% reduction in compute time for this particular example, or even more on machines with more processors.
Disclaimers: the TBB module does not work well for blocking I/O operations; it is applicable only to tasks that do not block in the operating system. This version of the TBB module is experimental and might not be sufficiently optimized or verified across different use cases.
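As a generic illustration of the blocking-I/O caveat (a sketch, not specific to the TBB module): a task that waits in the operating system occupies a worker thread without using the CPU, so a pool sized to the core count stalls on such tasks, and no user-level task scheduler can reclaim the parked threads.

```python
import time
from multiprocessing.pool import ThreadPool

def blocking_task(i):
    time.sleep(0.1)  # stands in for blocking I/O: the thread waits in the OS
    return i

# Eight blocking tasks on a 4-worker pool run in two waves, so wall
# time is about 0.2 s even though essentially no CPU work is done.
with ThreadPool(4) as pool:
    t0 = time.time()
    results = pool.map(blocking_task, range(8))
    elapsed = time.time() - t0
print(results, round(elapsed, 1))
```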
For additional details on how to use the TBB module, please refer to the built-in documentation, e.g. by running `pydoc TBB`.
This module is also available in source form as a preview feature of the Intel® TBB 4.4 Update 5 release.
We would greatly appreciate your feedback! Please get back to me, especially if you are interested enough to use it in your production or everyday environment.