The new MPI-3 non-blocking collectives offer potential improvements to application performance. These gains can be significant for the right application. But for some applications, you could end up lowering your performance by adding non-blocking collectives. I'm going to discuss what the non-blocking collectives are and show a kernel which can benefit from using MPI_Iallreduce.
Non-blocking collectives are new versions of collective functions that can return immediately to your application code. These versions can perform the collective operation in the background (as long as your MPI implementation supports this) while your application performs other work. If your application is structured such that you can begin a collective operation, perform some local work, and get the results from that collective operation later, then your application might benefit from using non-blocking collectives.
In order to see a benefit to non-blocking collectives, your application must be able to do enough work between when the collective begins and when the collective must be completed to offset the additional overhead of checking for collective completion. Larger message sizes will typically require more computation to offset moving the data to a communication buffer. If you have a small message size relative to the overlapping computation, you could see benefit.
Additionally, you must have sufficient system resources available. If you are already using all available system resources, then the MPI implementation cannot run the communication in parallel with your computation, and you will see no benefit, with possible performance degradation.
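The trade-off described above can be put into numbers with a deliberately simple cost model. This is a sketch, not a measurement: the function names, the `overlap` parameter (how much of the communication can progress while you compute) and the `overhead` parameter (the extra cost of starting and completing a request) are all assumptions made for illustration, and the time units are arbitrary.

```python
def blocking_time(compute, comm):
    """Blocking collective: communication and computation run back to back."""
    return compute + comm


def nonblocking_time(compute, comm, overhead, overlap=1.0):
    """Non-blocking collective under a simple, assumed overlap model.

    overlap  -- fraction of the communication that can progress while the
                computation runs (1.0 = spare resources, 0.0 = none)
    overhead -- extra cost of starting the request and checking completion
    """
    # Only as much communication as fits behind the computation is hidden.
    hidden = min(compute, comm * overlap)
    return compute + comm - hidden + overhead


if __name__ == "__main__":
    # (label, compute, comm, overhead, overlap) -- hypothetical values
    cases = [
        ("plenty of work, spare resources", 10.0, 4.0, 0.5, 1.0),
        ("very little work to overlap", 0.2, 4.0, 0.5, 1.0),
        ("no spare resources", 10.0, 4.0, 0.5, 0.0),
    ]
    for label, compute, comm, overhead, overlap in cases:
        b = blocking_time(compute, comm)
        n = nonblocking_time(compute, comm, overhead, overlap)
        print(f"{label}: blocking={b:.1f} non-blocking={n:.1f}")
```

In this model the non-blocking version only wins when the hidden communication exceeds the overhead. The second and third cases come out slower than the blocking version, which matches the two failure modes described above: too little work to overlap, and no resources left to progress the communication.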
Identifying how to improve your application's performance using non-blocking collectives takes several steps. The first is to determine how much of your application's total time is spent in collectives. If very little time is spent in collectives, there is little of the overall runtime left to improve, and switching to non-blocking collectives likely isn't worth the investment. You can look at the Summary Page in Intel® Trace Analyzer to quickly see if any of your top functions are collectives.
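To get a feel for how much is "sufficient", an Amdahl-style bound helps: if a fraction of the runtime is spent in collectives and you manage to hide some share of it, the rest of the runtime is unchanged. The sketch below assumes exactly that and nothing more; `best_case_speedup` and its parameters are hypothetical names, and the bound ignores any overhead added by the non-blocking calls.

```python
def best_case_speedup(collective_fraction, hidden_fraction=1.0):
    """Upper bound on whole-application speedup from hiding collective time.

    collective_fraction -- share of total runtime spent in collectives (0..1)
    hidden_fraction     -- share of that collective time that gets overlapped
    """
    if not 0.0 <= collective_fraction <= 1.0:
        raise ValueError("collective_fraction must be between 0 and 1")
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must be between 0 and 1")
    # Runtime that is left after the hidden communication is removed.
    remaining = 1.0 - collective_fraction * hidden_fraction
    return float("inf") if remaining == 0.0 else 1.0 / remaining


if __name__ == "__main__":
    for fraction in (0.02, 0.10, 0.30):
        bound = best_case_speedup(fraction)
        print(f"{fraction:.0%} in collectives -> at most {bound:.2f}x")
```

With 2% of the runtime in collectives the bound is about 1.02x even if all of it is hidden, while 30% gives a bound of about 1.43x. That is the kind of difference the profile on the Summary Page lets you estimate before you change any code.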
Once you have determined that there is sufficient time in collectives, you need to check if your application's workflow allows for non-blocking collectives. For example, if you calculate a dataset, immediately use that in a collective, and use the collective results immediately following the collective, you will need to rework your application flow before you can see any benefit to non-blocking collectives. But if you can calculate the dataset early, pass it into the collective as soon as it is calculated, and don't need the results until later, your application is a candidate for non-blocking collectives.
Let's assume a code kernel with three arrays, each distributed across multiple ranks. The kernel computes the average of the first array and uses it to scale the second array. The minimum and maximum values of the second array are then found and used, along with the average of the first array, to modify the third array. Finally, the third array is reduced to a single sum across all ranks. Pseudo-code:
    MPI_Allreduce(A1,sumA1temp,MPI_SUM)
    avgA1=sum(sumA1temp(:))/(elements*ranks)
    A2(:)=A2(:)*avgA1
    MPI_Allreduce(A2,minA2temp,MPI_MIN)
    MPI_Allreduce(A2,maxA2temp,MPI_MAX)
    A3(:)=A3(:)+avgA1
    minA2=minval(minA2temp(:))
    maxA2=maxval(maxA2temp(:))
    A3(:)=A3(:)*(minA2+maxA2)*0.5
    MPI_Allreduce(A3,sumA3temp,MPI_SUM)
    finalsum=sum(sumA3temp(:))
This kernel could gain performance by switching from MPI_Allreduce to MPI_Iallreduce for the minimum and maximum reductions on the second array. Pseudo-code:
    MPI_Allreduce(A1,sumA1temp,MPI_SUM)
    avgA1=sum(sumA1temp(:))/(elements*ranks)
    A2(:)=A2(:)*avgA1
    MPI_Iallreduce(A2,minA2temp,MPI_MIN,req2min)
    MPI_Iallreduce(A2,maxA2temp,MPI_MAX,req2max)
    A3(:)=A3(:)+avgA1
    MPI_Wait(req2min)
    minA2=minval(minA2temp(:))
    MPI_Wait(req2max)
    maxA2=maxval(maxA2temp(:))
    A3(:)=A3(:)*(minA2+maxA2)*0.5
    MPI_Allreduce(A3,sumA3temp,MPI_SUM)
    finalsum=sum(sumA3temp(:))
Exact improvements vary based on many factors, but the reduction in kernel runtime can exceed 50%. In a test on a dual-socket system with Intel® Xeon® E5-2697 v2 processors, running 12 ranks (all using shared memory for communications) with the Intel® MPI Library Version 5.0, the non-blocking version took 54% less time to complete the kernel with a 10000 element array of randomly generated doubles. This is because the communication for the two collectives overlaps with the computation, increasing parallelism in the application as a whole.
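The reordering in the non-blocking pseudo-code is only valid because the statement placed between the start of the reductions and the waits, A3(:)=A3(:)+avgA1, reads neither A2 nor the reduction results. The sketch below checks that dependency argument in plain Python. It is a serial simulation, not MPI: the "ranks" are just lists, `allreduce` is a hypothetical stand-in for MPI_Allreduce, and the overlapped variant simply delays the reductions until after the independent update. It says nothing about timing.

```python
import random


def allreduce(per_rank, op):
    """Element-wise reduction across simulated ranks.

    per_rank holds one equally sized array per rank. In real MPI every
    rank would receive its own copy of the returned array.
    """
    return [op(values) for values in zip(*per_rank)]


def kernel(a1, a2, a3, overlap):
    """Run the kernel on copies of the per-rank arrays; return finalsum."""
    ranks, elements = len(a1), len(a1[0])

    avg_a1 = sum(allreduce(a1, sum)) / (elements * ranks)
    a2 = [[x * avg_a1 for x in row] for row in a2]

    if overlap:
        # Non-blocking order: the A3 update runs before the min/max
        # reductions are completed. It touches neither A2 nor the results.
        a3 = [[x + avg_a1 for x in row] for row in a3]
        min_tmp = allreduce(a2, min)
        max_tmp = allreduce(a2, max)
    else:
        # Blocking order: both reductions finish before A3 is touched.
        min_tmp = allreduce(a2, min)
        max_tmp = allreduce(a2, max)
        a3 = [[x + avg_a1 for x in row] for row in a3]

    scale = (min(min_tmp) + max(max_tmp)) * 0.5
    a3 = [[x * scale for x in row] for row in a3]
    return sum(allreduce(a3, sum))


if __name__ == "__main__":
    random.seed(0)
    ranks, elements = 12, 1000  # sizes chosen only to keep the demo fast

    def data():
        return [[random.random() for _ in range(elements)] for _ in range(ranks)]

    a1, a2, a3 = data(), data(), data()
    blocking = kernel(a1, a2, a3, overlap=False)
    overlapped = kernel(a1, a2, a3, overlap=True)
    print("same result:", blocking == overlapped)
```

Both orderings perform the same arithmetic on every element, so the final sums match. If the statement between the start and the wait had read minA2 or maxA2, or had written to A2 while the reductions were still in flight, the reordering would not be safe.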