Get Started with Intel® Trace Analyzer and Collector

ID 758510
Date 9/27/2021

Improve Your Application Performance by Changing Communications

Improve the performance of the MPI application by replacing blocking communications with non-blocking ones.
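To see the pattern in isolation before applying it to the sample application, consider the following minimal program. It is an illustrative sketch, not part of the sample application: the one-dimensional neighbor layout, the tags, and the buffer names are assumptions. Each rank posts all of its sends and receives with MPI_Isend/MPI_Irecv and completes them with a single MPI_Waitall, so no rank blocks waiting for a neighbor to reach its matching call:

    /* Minimal sketch of a non-blocking neighbor exchange (illustrative
     * only). Compile with an MPI C compiler and run with several ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* 1-D decomposition: edge ranks get MPI_PROC_NULL, so the same
       * code works without special-casing the boundaries. */
      int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      double send_up = rank, send_down = rank, recv_up = -1.0, recv_down = -1.0;
      MPI_Request req[4];

      /* Post all four operations at once; none of them blocks, so no
       * rank waits on its neighbor's progress. */
      MPI_Isend(&send_up,   1, MPI_DOUBLE, up,   100, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(&recv_down, 1, MPI_DOUBLE, down, 100, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(&send_down, 1, MPI_DOUBLE, down, 200, MPI_COMM_WORLD, &req[2]);
      MPI_Irecv(&recv_up,   1, MPI_DOUBLE, up,   200, MPI_COMM_WORLD, &req[3]);

      /* The buffers may not be reused or read until all requests complete. */
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

      printf("rank %d: recv_up=%.0f recv_down=%.0f\n", rank, recv_up, recv_down);
      MPI_Finalize();
      return 0;
    }

Because the buffers must stay untouched until MPI_Waitall returns, the wait is the correctness anchor of the non-blocking version. The steps below apply the same transformation to the sample application.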

  1. In your code, replace the serializing blocking calls (MPI_Send/MPI_Recv) with non-blocking communication: MPI_Isend and MPI_Irecv. For example:

    Original code snippet:

    // boundary exchange
    // left_col and right_col are scratch arrays of length gr->lrow+2,
    // declared elsewhere in the sample sources
    void exchange(para* p, grid* gr)
    {
      int i;
      MPI_Status status_100, status_200, status_300, status_400;
      // send down first row
      MPI_Send(gr->x_new[1], gr->lcol+2, MPI_DOUBLE, gr->down, 100, MPI_COMM_WORLD);
      MPI_Recv(gr->x_new[gr->lrow+1], gr->lcol+2, MPI_DOUBLE, gr->up, 100, MPI_COMM_WORLD, &status_100);
      // send up last row
      MPI_Send(gr->x_new[gr->lrow], gr->lcol+2, MPI_DOUBLE, gr->up, 200, MPI_COMM_WORLD);
      MPI_Recv(gr->x_new[0], gr->lcol+2, MPI_DOUBLE, gr->down, 200, MPI_COMM_WORLD, &status_200);
      // copy the left column to a tmp array and send it left
      if (gr->left != MPI_PROC_NULL)
      {
        for (i = 0; i < gr->lrow+2; i++)
        {
          left_col[i] = gr->x_new[i][1];
        }
        MPI_Send(left_col, gr->lrow+2, MPI_DOUBLE, gr->left, 300, MPI_COMM_WORLD);
      }
      if (gr->right != MPI_PROC_NULL)
      {
        MPI_Recv(right_col, gr->lrow+2, MPI_DOUBLE, gr->right, 300, MPI_COMM_WORLD, &status_300);
        // copy the received column into the right ghost cells, then
        // copy the right column into the tmp array
        for (i = 0; i < gr->lrow+2; i++)
        {
          gr->x_new[i][gr->lcol+1] = right_col[i];
          right_col[i] = gr->x_new[i][gr->lcol];
        }
        // send the right column right
        MPI_Send(right_col, gr->lrow+2, MPI_DOUBLE, gr->right, 400, MPI_COMM_WORLD);
      }
      if (gr->left != MPI_PROC_NULL)
      {
        MPI_Recv(left_col, gr->lrow+2, MPI_DOUBLE, gr->left, 400, MPI_COMM_WORLD, &status_400);
        for (i = 0; i < gr->lrow+2; i++)
        {
          gr->x_new[i][0] = left_col[i];
        }
      }
    }
    

    Updated code snippet:

    MPI_Request req[7];
    // send down first row
    MPI_Isend(gr->x_new[1], gr->lcol+2, MPI_DOUBLE, gr->down, 100, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(gr->x_new[gr->lrow+1], gr->lcol+2, MPI_DOUBLE, gr->up, 100, MPI_COMM_WORLD, &req[1]);
    .....
    // none of the exchanged buffers may be reused or read
    // until all requests complete
    MPI_Waitall(7, req, MPI_STATUSES_IGNORE);
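
    The elided part posts the remaining row and column operations. For reference, one possible complete non-blocking version of the exchange is sketched below. It is an illustrative sketch, not the product sample: the para and grid types come from the sample sources, the column scratch arrays are declared here as C99 variable-length arrays, and, unlike the snippet above, it completes the row exchange before packing the columns so that the just-received ghost corners are valid when they are copied.

    // Possible complete non-blocking exchange (illustrative sketch)
    void exchange(para* p, grid* gr)
    {
      int i;
      double left_col[gr->lrow+2], right_col[gr->lrow+2];
      double recv_left[gr->lrow+2], recv_right[gr->lrow+2];
      MPI_Request req[4];

      // rows are contiguous in memory, so exchange them in place;
      // post all four operations before waiting so that neighbor
      // pairs cannot serialize against each other
      MPI_Isend(gr->x_new[1], gr->lcol+2, MPI_DOUBLE, gr->down, 100, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(gr->x_new[gr->lrow+1], gr->lcol+2, MPI_DOUBLE, gr->up, 100, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(gr->x_new[gr->lrow], gr->lcol+2, MPI_DOUBLE, gr->up, 200, MPI_COMM_WORLD, &req[2]);
      MPI_Irecv(gr->x_new[0], gr->lcol+2, MPI_DOUBLE, gr->down, 200, MPI_COMM_WORLD, &req[3]);
      // the column pack below reads the ghost rows received here,
      // so the row exchange must complete first
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

      // columns are strided, so pack them into contiguous scratch arrays
      for (i = 0; i < gr->lrow+2; i++)
      {
        left_col[i] = gr->x_new[i][1];
        right_col[i] = gr->x_new[i][gr->lcol];
      }
      // sends to and receives from MPI_PROC_NULL complete immediately,
      // so edge ranks need no special-casing here
      MPI_Isend(left_col, gr->lrow+2, MPI_DOUBLE, gr->left, 300, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(recv_right, gr->lrow+2, MPI_DOUBLE, gr->right, 300, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(right_col, gr->lrow+2, MPI_DOUBLE, gr->right, 400, MPI_COMM_WORLD, &req[2]);
      MPI_Irecv(recv_left, gr->lrow+2, MPI_DOUBLE, gr->left, 400, MPI_COMM_WORLD, &req[3]);
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

      // unpack the received columns; a receive from MPI_PROC_NULL leaves
      // its buffer untouched, so skip the unpack on edge ranks
      if (gr->right != MPI_PROC_NULL)
        for (i = 0; i < gr->lrow+2; i++)
          gr->x_new[i][gr->lcol+1] = recv_right[i];
      if (gr->left != MPI_PROC_NULL)
        for (i = 0; i < gr->lrow+2; i++)
          gr->x_new[i][0] = recv_left[i];
    }

    Posting each phase's sends and receives together before waiting is what removes the dependency chain that serialized the blocking version.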
    

    Once corrected, a single iteration of the revised application no longer shows the serialized communication pattern in the event timeline.

  2. Use the Intel Trace Analyzer Comparison View to compare the trace of the serialized application with the trace of the revised one: go to View > Compare and select the second trace file. The Comparison View shows the two event timelines one above the other.

    In the Comparison View, you can see that the non-blocking communication removes the serialization and reduces the time the processes spend communicating.

NOTE:
For more information about the node-level performance of your application, see the documentation for the respective tools: Intel® VTune™ Profiler MPI Code Analysis and Analyzing Intel® MPI applications using Intel® Advisor.