Overlapping parallel_for with CUDA
Hello,

I'm implementing a heterogeneous matrix multiply on the CPU + GPU, using TBB (parallel_for) and MKL on the CPU and CUDA on the GPU. My code works well when the multiply is done completely on the CPU or completely on the GPU. However, I'm having trouble getting the system to do work on both devices at once: CUDA and TBB/MKL refuse to overlap. It looks like parallel_for does not return until it is complete, so I'm organizing my code like so:

    copyDataToGpu()
    launchMatrixMultiplyOnGpu()   // <- non-blocking kernel call, returns instantly
    launchMatrixMultiplyOnCpu()   // <- uses TBB parallel_for
    synchronize()                 // <- waits for GPU code to finish
    copyDataFromGpu()

However, it seems like the parallel_for call is blocking CUDA from continuing its work, since the total time is exactly the sum of the times CUDA and TBB take to do their work individually. Is parallel_for known to block CUDA from working? I can overlap CUDA with simple CPU work like for loops just fine, but I would really like to use TBB.

Alternatively, is there a non-blocking form of parallel_for that I can launch at the beginning of the computation (ideal, but it does not seem likely)?

Thank you for any help!
| |
Please refer to task::enqueue or simple parallel_invoke to describe your overlapping (but in the second case you'll need to initialize TBB for at least two threads, otherwise it will execute tasks sequentially on a unicore machine).
| |
But why did it not work as presented? Does the real communication with the GPU only occur at synchronize() time, perhaps?
| |
I replaced the TBB parallel_for with Windows threads and that does indeed overlap fine with the GPU. I'm still curious why TBB blocks though.
| |
Could you tell us if there's something in the synchronize() that's required to send and/or kick off the GPU work as well as the actual synchronisation (waiting for the results)? Or maybe only part of the work is done?
| |
Try the following pseudo code
    int SplitPointRow = nRows / 2; // for starters
    parallel_invoke(
        [&](){ doCPUpart(A, B, C, nRows, nCols, 0, SplitPointRow); },
        [&](){ doGPUpart(A, B, C, nRows, nCols, SplitPointRow, nRows); }
    );
    ............
    void doGPUpart(double** A, double** B, double** C, int nRows, int nCols, int RowBegin, int RowEnd)
    {
        copyDataToGPU(...)
        launchGPUMatrixMultiply(...)
        synchronize(...)
        copyDataFromGPU(...)
    }
    void doCPUpart(...)
    {
        // do subset of matrix
    }
As others have pointed out, you will likely need to oversubscribe your threads by at least one thread, because the thread issuing the synchronize() is going to stall. If you are on a single-core system, this will result in serialization. On a single-core system you would want something like
    int SplitPointRow = nRows / 2; // for starters
    copyDataToGPU(...)
    launchGPUMatrixMultiply(...)
    doCPUpart(A, B, C, nRows, nCols, 0, SplitPointRow);
    synchronize(...)
    copyDataFromGPU(...)
Of course, you can code two paths: one for single core, one for multi-core.
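A minimal sketch of the two-path idea, using only the C++ standard library so it runs without CUDA or TBB. The names OverlapPath and runBothParts are hypothetical, and the three callables are placeholders for the CPU share, the asynchronous GPU launch (copy in plus kernel call), and the blocking GPU wait (synchronize plus copy out).

```cpp
#include <thread>

// Which of the two code paths was taken (hypothetical names, for illustration).
enum class OverlapPath { SingleCoreInline, DedicatedGpuThread };

// cpuPart   : the CPU share of the work (e.g. a parallel_for over some rows)
// gpuLaunch : the non-blocking part of the GPU work
// gpuWait   : the blocking part of the GPU work
template <typename CpuPart, typename GpuLaunch, typename GpuWait>
OverlapPath runBothParts(unsigned hardwareThreads,
                         CpuPart cpuPart, GpuLaunch gpuLaunch, GpuWait gpuWait) {
    if (hardwareThreads < 2) {
        // Single core (or unknown core count): launch the GPU work, do the
        // CPU share on this thread, and only then block on the GPU.
        gpuLaunch();
        cpuPart();
        gpuWait();
        return OverlapPath::SingleCoreInline;
    }
    // Multi-core: an extra thread drives the GPU and is allowed to stall in
    // gpuWait() while this thread does the CPU share.
    std::thread gpuThread([&] {
        gpuLaunch();
        gpuWait();
    });
    cpuPart();
    gpuThread.join();
    return OverlapPath::DedicatedGpuThread;
}
```

One way to call it is runBothParts(std::thread::hardware_concurrency(), ...). Since hardware_concurrency() may return 0 when the core count is unknown, this sketch falls back to the single-core path in that case.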
*** Also, check your CUDA implementation for multi-threaded programming issues.
Jim Dempsey
Blog: The Parallel Void
www.quickthreadprogramming.com | |
TBB is agnostic of CUDA (as well as of any other third-party library except for language support RTLs), so it does not consciously do anything that would prevent your asynchronous CUDA computation from running. Honestly, I have no idea why the setup you described does not work as expected. Out of curiosity, what happens if you replace the parallel_for call with a long do-nothing loop?
The idea of using a separate thread to make the CUDA call makes a lot of sense to me. As others noted, since that thread will presumably block, you had better "oversubscribe" the system. I think the most natural way to do that, however, is not tbb::task::enqueue() or tbb::parallel_invoke() but std::thread (it's available in TBB in case your compiler does not yet support this C++11 feature). In this case, you don't have to oversubscribe TBB because its workers are not impacted.
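A minimal sketch of the dedicated-thread idea, using only the C++11 standard library so it runs without CUDA or TBB. multiplyRows and heterogeneousMultiply are hypothetical names. The lambda handed to std::thread stands in for the copy/launch/synchronize/copy sequence on the GPU, and the call on the calling thread stands in for the tbb::parallel_for over the CPU share.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Computes rows [rowBegin, rowEnd) of C = A * B for n x n row-major matrices.
// Used here as a host-side stand-in for both the GPU and the CPU path.
void multiplyRows(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n,
                  std::size_t rowBegin, std::size_t rowEnd) {
    for (std::size_t i = rowBegin; i < rowEnd; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

std::vector<double> heterogeneousMultiply(const std::vector<double>& A,
                                          const std::vector<double>& B,
                                          std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    const std::size_t split = n / 2;  // for starters, an even split

    // Dedicated thread: in real code this is where the CUDA calls would go.
    // It may block while waiting for the GPU without occupying a TBB worker.
    std::thread gpuThread([&] { multiplyRows(A, B, C, n, split, n); });

    // Calling thread: in real code this would be the parallel_for over the
    // CPU share. The two halves write disjoint rows of C.
    multiplyRows(A, B, C, n, 0, split);

    gpuThread.join();  // wait for the "GPU" half before returning C
    return C;
}
```

The join() plays the role of the final wait: the result is only read after both halves have finished.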
| |
Side question (really, not a rhetorical one): why don't you use OpenCL for the CPU as well as the GPU?
| |
The synchronize() call isn't strictly required; it happens implicitly whenever a memcpy is done between host and GPU memory. Nothing besides the kernel call is required to kick off work on the GPU. I even tested launching some kernels on the GPU without any synchronization calls or memcpy afterwards, and from my performance monitor it looks like they are still launched.
| |
Great! Thanks for the advice / code. I'm currently working on a new project and will try this for my GPU + CPU work splitting. I'll let you know how it works as soon as it's done (hopefully a week or two).
| |
I replaced the parallel_for with a long for loop that just incremented a counter, and it had effectively no impact on the run time until I forced the for loop to take longer than the CUDA work. I set the loop to increment a counter and then print the counter after the whole run to make sure nothing was being optimized out.
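A rough sketch of that kind of experiment, assuming nothing beyond the C++ standard library. busyLoop, secondsTaken and looksOverlapped are hypothetical helper names, and the rule used to decide "overlapped" is only a simple heuristic: is the combined time closer to the longer of the two parts than to their sum?

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Do-nothing CPU work. The volatile counter keeps the compiler from
// removing the loop; returning it lets the caller print it as a check.
std::uint64_t busyLoop(std::uint64_t iterations) {
    volatile std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        counter = counter + 1;
    }
    return counter;
}

// Wall-clock seconds taken by an arbitrary callable.
template <typename Work>
double secondsTaken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Heuristic: with perfect overlap the combined time approaches
// max(cpuAlone, gpuAlone); with no overlap it approaches their sum.
bool looksOverlapped(double cpuAlone, double gpuAlone, double together) {
    const double ideal = std::max(cpuAlone, gpuAlone);
    const double serial = cpuAlone + gpuAlone;
    return (together - ideal) < (serial - together);
}
```

Timing the CPU part alone, the GPU part alone, and both together, then passing the three numbers to looksOverlapped, separates "the time is the sum of the two" from "the time is set by the slower of the two".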
| | |