In the Beta release of Intel® Distribution for Python* we have introduced an experimental module, which unlocks additional performance for Python programs by composing threads coming from different Python modules better, i.e. by improving threading composability of compute-intensive modules.
Better threading composability can accelerate programs by avoiding inefficient threads allocation (called oversubscription) and related performance issues when there are more software threads than available hardware resources.
The biggest improvement is achieved when a task pool like the ThreadPool from standard library or libraries like Dask or Joblib execute tasks calling compute-intensive functions of Numpy/Scipy/PyDAAL and others, which in turn are parallelized using Intel® MKL or/and Intel® Threading Building Blocks (Intel® TBB).
The module implements a Pool class with the standard interface using Intel® TBB, which is used to replace Python’s Pool and ThreadPool implementations. Thanks to the monkey-patching technique, no source code modification is needed in order to enable TBB-based parallelism.
Everything is included into Intel® Distribution for Python already. You can also install tbb4py (starting from 2018.0.4 version or just tbb before that) from intel, conda-forge, and default channels using conda install or from PyPi using pip install - just make sure that you are using Numpy with Intel® MKL. Also, oversubscription starts to impact performance with a big enough number of available CPUs, laptops are not usually affected.
For our example, we need Dask library, which makes nested parallelism implicit and natural for Python:
conda install dask
Now, let’s write a simple program (QR verification) in bench.py, which exploits nested parallelism and prints time spent for the computation, like the following:
import dask, time import dask.array as da x = da.random.random((100000, 2000), chunks=(10000, 2000)) t0 = time.time() q, r = da.linalg.qr(x) test = da.all(da.isclose(x, q.dot(r))) assert(test.compute()) # compute(scheduler="threads") by default print(time.time() - t0)
Here, Dask splits the array into chunks of specified sizes and processes them in parallel using P worker threads, where P = number of CPUs. Each Dask thread executes expensive numpy operations, which are accelerated using Intel® MKL under the hood and thus are multi-threaded on their own. It results in nested parallelism, with up to P2 threads in total when running with default OpenMP settings.
To run it as is (baseline):
And to unlock additional performance using TBB-based composable threading:
python -m tbb bench.py
That's it! On my machine with 48 logical CPUs, I can get up to 2x difference in time. Depending on particular machine configuration and the number of processors, this number can be significantly different for you.
Disclaimers: TBB module does not work well for blocking I/O operations, it is applicable only for tasks, which do not block in the operating system. This version of TBB module is experimental and might be not sufficiently optimized and might not support all the use-cases.
For additional details on how to use the TBB module, please refer to built-in documentation, e.g. run `pydoc tbb` and `python -m tbb --help`.
We’ll greatly appreciate your feedback and issues! Please get back to me, especially if you are interested enough to use it in your production/every-day environment.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804