In the Beta release of Intel® Distribution for Python* we introduced an experimental module which unlocks additional performance for multi-threaded Python programs by enabling threading composability between two or more thread-enabled libraries.
Threading composability can accelerate programs by avoiding inefficient thread allocation, known as oversubscription, which occurs when there are more active software threads than available hardware resources.
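To see why oversubscription hurts, consider some illustrative thread arithmetic (the counts below are hypothetical examples, not measurements): an outer task pool that spawns one thread per core, where each task then calls a kernel that itself spawns one thread per core.

```python
# Illustrative thread math for oversubscription (numbers are hypothetical).
cores = 8                 # physical cores on an example machine
outer_threads = cores     # threads created by the task pool (e.g. ThreadPool)
inner_threads = cores     # threads each task spawns inside an MKL/TBB kernel

total_threads = outer_threads * inner_threads
print(f"{total_threads} software threads competing for {cores} cores")
# 64 threads on 8 cores: the OS time-slices them, spending time on
# context switches and cache thrashing instead of useful work.
```

A composable scheduler coordinates both levels so that the total number of running threads stays close to the number of cores.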
The biggest improvement is achieved when a task pool such as ThreadPool from the standard library, or a library like Dask or Joblib, executes tasks that call compute-intensive functions of NumPy/SciPy/pyDAAL and others, which in turn are parallelized using Intel® MKL and/or Intel® Threading Building Blocks (Intel® TBB).
The module implements a Pool class with the standard interface on top of Intel® TBB, which can be used to replace Python’s Pool and ThreadPool. Thanks to the monkey-patching technique, no source code changes are needed to unlock the additional speedups.
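The monkey-patching idea can be sketched in plain Python (this is an illustration of the general technique, not the actual tbb module implementation, and TracingThreadPool is a hypothetical name): rebinding the class attribute inside multiprocessing.pool makes every later import pick up the substitute, with no change to user code.

```python
import multiprocessing.pool

class TracingThreadPool(multiprocessing.pool.ThreadPool):
    """Hypothetical drop-in replacement; tbb substitutes its own Pool here."""
    def map(self, func, iterable, chunksize=None):
        # A real replacement would hand the work to the TBB scheduler;
        # here we just delegate to the original implementation.
        return super().map(func, iterable, chunksize)

# Monkey-patch: code that imports ThreadPool after this point gets the
# substitute without any source change.
multiprocessing.pool.ThreadPool = TracingThreadPool

from multiprocessing.pool import ThreadPool  # picks up the substitute
pool = ThreadPool(4)
print(pool.map(lambda x: x * x, range(5)))  # [0, 1, 4, 9, 16]
pool.close()
pool.join()
```

Running the whole program under `python -m tbb` applies the real substitution before the user script starts, which is why no edits to the script are required.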
Everything is already included in Intel® Distribution for Python. You can also install tbb4py (starting from version 2018.0.4, or just tbb before that) from the intel channel on anaconda.org.
Let’s try it!
For our example, we need the Dask library, which makes parallelism very simple in Python:
source <path to Intel® Distribution for Python*>/bin/pythonvars.sh
conda install dask
Now, let’s write a simple program (a QR factorization check) in bench.py that exploits nested parallelism and prints the time spent on the computation, like the following:
import dask, time
import dask.array as da

x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()
q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert(test.compute())  # compute(get=dask.threaded.get) by default
print(time.time() - t0)
Here, Dask splits the array into chunks and processes them in parallel using multiple threads. However, each Dask task executes expensive operations that are accelerated by Intel® MKL under the hood and are thus multi-threaded themselves. The result is nested parallelism, which is best handled with Intel® TBB.
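The nested structure can be mimicked with the standard library alone (a sketch with made-up work; in the real program the inner level of parallelism lives inside MKL, not in Python):

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.pool import ThreadPool

def inner_kernel(i):
    # Stands in for one unit of work inside a multi-threaded MKL kernel.
    return i * i

def outer_task(chunk):
    # Each outer task (like a Dask chunk) spawns its own inner thread pool,
    # mimicking a kernel that is itself parallelized.
    with ThreadPoolExecutor(max_workers=4) as ex:
        return sum(ex.map(inner_kernel, chunk))

chunks = [range(k, k + 10) for k in range(0, 40, 10)]
with ThreadPool(4) as pool:  # outer level: the task pool
    partials = pool.map(outer_task, chunks)
print(sum(partials))  # 20540, the sum of squares 0..39
```

Without coordination, the outer and inner pools each size themselves to the machine, which is exactly the oversubscription pattern the TBB module is designed to avoid.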
To run it as is (baseline):
python bench.py
And to unlock additional performance:
python -m tbb bench.py
That's it! Depending on the machine configuration, and in particular on the number of processors, you can get about a 50% reduction in compute time for this particular example, or even more on machines with more processors.
Disclaimers: The TBB module does not work well for blocking I/O operations; it is applicable only to tasks that do not block in the operating system. This version of the TBB module is experimental and may not be sufficiently optimized or validated across different use cases.
For additional details on how to use the TBB module, please refer to the built-in documentation, e.g. run `pydoc tbb` and `python -m tbb --help`.
This module is also available in source form as a preview feature of the Intel TBB 4.4 Update 5 release.
We’d greatly appreciate your feedback! Please get in touch, especially if you are interested in using it in your production or everyday environment.