Performance degradation with MPI_THREAD_MULTIPLE

Hello,

We've been facing scalability issues with our MPI-based distributed application, so we wrote a simple microbenchmark that mimics the same "all-to-all, pairwise" communication pattern, in an attempt to better understand where the performance degradation comes from. Our results indicate that increasing the number of threads severely impacts the overall throughput we can achieve. This is a rather surprising conclusion, so we wanted to check with you whether this is to be expected, or whether it might be a performance issue in the library itself.

So, what we're trying to do is to hash-partition some data from a set of producers to a set of consumers. In our microbenchmark, however, we've completely eliminated all computation, so we're just sending the same data over and over again. We have one or more processes per node, with t "sender" threads and t "receiver" threads. Below is the pseudocode for the sender and receiver bodies (initialization and finalization code excluded for brevity):

Sender:

for (iter = 0; iter < num_iters; iter++) {
    //pre-process input data, compute hashes, etc. (only in our real application)
    for (dest = 0; dest < num_procs * num_threads; dest++) {
        active_buf = iter % num_bufs;
        MPI_Wait(&req[dest][active_buf], MPI_STATUS_IGNORE);
        //copy relevant data into buf[dest][active_buf] (only in our real application)
        MPI_Issend(buf[dest][active_buf], buf_size, MPI_BYTE,
                   dest / num_threads, dest, MPI_COMM_WORLD,
                   &req[dest][active_buf]);
    }
}

Receiver:

for (iter = 0; iter < num_iters; iter++) {
    MPI_Waitany(num_bufs, req, &active_buf, MPI_STATUS_IGNORE);
    //process data in buf[active_buf] (only in our real application)
    MPI_Irecv(buf[active_buf], buf_size, MPI_BYTE,
                MPI_ANY_SOURCE, receiver_id, MPI_COMM_WORLD,
                &req[active_buf]);
}
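For completeness, a sketch of the initialization we excluded (our assumption of the minimum needed for the loops above to be correct): sender requests presumably start out as MPI_REQUEST_NULL so that the first MPI_Wait completes immediately, and each receiver pre-posts one MPI_Irecv per buffer:

```
// Sender init: MPI_Wait on MPI_REQUEST_NULL returns immediately,
// so the first iteration doesn't block on a never-posted send.
for (dest = 0; dest < num_procs * num_threads; dest++)
    for (b = 0; b < num_bufs; b++)
        req[dest][b] = MPI_REQUEST_NULL;

// Receiver init: pre-post one MPI_Irecv per buffer so MPI_Waitany
// has pending requests to complete.
for (b = 0; b < num_bufs; b++)
    MPI_Irecv(buf[b], buf_size, MPI_BYTE, MPI_ANY_SOURCE,
              receiver_id, MPI_COMM_WORLD, &req[b]);
```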

For our experiments, we always used the same 6 nodes (2 CPUs x 6 cores, Infiniband interconnect) and kept the following parameters fixed:

  • num_iters = 150
  • buf_size = 128 KB
  • num_bufs = 2, for both senders and receivers (note that senders have 2 buffers per destination, while receivers have 2 buffers in total)

Varying the number of threads and the number of processes per node, we obtained the following results:

total data size    1 process per node     4 threads per process   1 thread per process
--------------------------------------------------------------------------------------
     0.299 GB        4t : 13.088 GB/s       6x1p : 13.498 GB/s     6x 4p : 11.838 GB/s
     1.195 GB        8t : 12.490 GB/s       6x2p : 12.502 GB/s     6x 8p : 10.203 GB/s
     2.689 GB       12t :  6.861 GB/s       6x3p : 12.199 GB/s     6x12p : 10.018 GB/s
     4.781 GB       16t :  4.824 GB/s       6x4p : 11.808 GB/s     6x16p :  9.750 GB/s
     7.471 GB       20t :  3.950 GB/s       6x5p : 11.534 GB/s     6x20p :  9.485 GB/s
    10.758 GB       24t :  3.610 GB/s       6x6p :  9.784 GB/s     6x24p :  9.225 GB/s

Notes:

  • by t threads we actually mean t sender threads and t receiver threads, so each process consists of 2*t+1 threads in total
  • we realize that we're oversubscribing, as our machines have only 12 cores, but we thought it shouldn't be an issue in our microbenchmark, as we're barely doing any computation and are completely network bound
  • we noticed with "top" that there is always one core with 100% utilization per process

As you can see, there is a significant difference between e.g. running 1 process per node with 20 threads (~4 GB/s) and 5 processes per node with 4 threads each (~11.5 GB/s). It looks to us like the implementation of MPI_THREAD_MULTIPLE relies on some shared resources, which become very inefficient as contention increases.

So our questions are:

  1. Is this performance degradation to be expected / a known fact? We couldn't find any resources that would indicate so.
  2. Are our speculations anywhere close to reality? If so, is there a way to increase the number of shared resources in order to decrease contention and maintain the same throughput while still running 1 process per node?
  3. In general, is there a better, more efficient way we could implement this n-to-n data partitioning? I assume this is quite a common problem.

Any feedback is highly appreciated and we are happy to provide any additional information that you may request.

Regards,
- Adrian


We recognize that it is quite a time-consuming task for people to take a close look at what we have done in order to provide meaningful answers, but any general idea or remark would be highly appreciated!

Can you attach the benchmark code you are using?

Sure. Here it is: http://pastie.org/private/vafgyugr31pjr8ykqd6gzw

Many thanks in advance and looking forward to your reply.

Best regards,

Adrian

I'm still investigating, but I've noticed a few things thus far:

Your function void* send(void*) appears to collide with a symbol in DAPL*.  I kept getting segfaults, but when I renamed it the segfaults stopped.

On line 180, you have an MPI_Ssend that is not correctly received.  It gets picked up by the MPI_Irecv on line 216, with a mismatch in data type and buffer size.

The timing measurement you are using is highly variable.  I noticed at least 10% variation in the times between one run and the next, with the same executable, arguments, and nodes.

I'll take another look at it tomorrow, but my guess is that you're just using up your resources.

Thank you for your initial investigation, James!

The MPI_Ssend was indeed supposed to be caught by that particular MPI_Irecv, but there was clearly a mistake in that the data types and message size did not match. Thank you for pointing it out. However, I think in this case it was benign, with no influence on the program's performance, as we're still experiencing the same behaviour after fixing it.

You're also right that there is quite a lot of variation in the times we're measuring. However, 10% is not enough to explain the differences we're seeing between the "1 process per node" and "4 threads per process" columns.

Could you please elaborate on "you're just using up your resources"?

Looking forward to your feedback after you take a second look at this.

Best regards,

Adrian

Using up resources:  When you have too many threads, your system must switch which thread gets to run, instead of all threads being able to run.  As such, some threads will be delayed.  Depending on the communication fabric you use, additional communication threads could be spawned beyond what your program is using.

I need some clarification on what you're comparing.  When you say 1 process per node vs. 4 threads per process, what exactly are you comparing?  I'm currently running a set of tests on your code, which I think mimic what you're looking for.  Set 1 consists of 48 execution units (ranks * threads), all on one node, with varying numbers of ranks and threads.  Set 2 is similar, except with 24 execution units instead of 48 (to measure full subscription vs. half subscription).  I'm testing with both shared memory and DAPL* (independently of each other).  Data per execution unit (set by your -dpp option) is 0.25 GB.

Actually, one change to that.  The cluster I'm testing on was recently upgraded, so I'm not fully subscribing the node.  I'm only using 48 of 56 cores.  I'll run an oversubscribed test once these are done to measure impact.

So, let's say we have N nodes with C cores each and X streams (distributed evenly over the N nodes) of random input data that we want to (hash-)partition into X output streams.

What I was actually trying to understand is: All else being equal, is it more efficient to run...

  • (A) one process per node, with X/N threads processing those streams, or
  • (B) X/N processes per node, with one thread each?

I experimented with N = 6, C = 12 and increasing values of X: 24, 48, 72, 96, 120, 144. The latter correspond to the rows of our table. (A) corresponds to the "1 process per node" column and (B) to "1 thread per process". The "4 threads" column is some middle ground between these two extremes.

Results show that the performance of (A) is rapidly degrading as X increases.

The subscription ratio, in these terms, is equal to (X/N)*2 / C. The 2 there is due to the fact that we run one thread for the input stream and one for the output stream. Therefore, with X=48 (2nd row), we are already oversubscribed, so the performance drop is understandable. However, it is interesting that it is so much worse for (A) than for (B).

Unfortunately, in our use-case only approach (A) is feasible. Is there any hope?

Here's what I've got so far.  The trend varies depending on what fabric you are using.  With shared memory, the performance drops until you get down to 1 rank, then it comes back up.  With DAPL*, the performance first increases as you switch to threads, then decreases, with another increase back at 1 rank.  All of my testing thus far has been done on a single node.

Looking at the trace of the reproducer, it appears to be serializing.  Each successive rank is slower than the one before it.  This could be the cause of the performance degradation, as the effect is more pronounced with more threads.  Is it possible to use MPI_Send and MPI_Isend instead of MPI_Ssend and MPI_Issend?  Or (and I don't think this will work based on your descriptions) switch to a collective?

I'm going to pass this to our developers, because my guess is that we have optimizations that are targeted at multiple ranks rather than multiple threads.

As a note, I don't know how practical this would be in your real application, but you'll get better performance by sending a single, larger message rather than multiple smaller messages.  For the reproducer here, I set the number of send and receive buffers to 1, and the buffer size to the full data size.

Changing to MPI_(I)send would make our program incorrect, unless we make some radical design changes. Switching to collective communication would be even harder, if not impossible. Adjusting the size and number of buffers is something that we can do, though.

Thank you for following through on this, James - we really appreciate it. Looking forward with interest to the development team's answer.

Curious to know if there is any update on this so far... Thanks in advance!

No updates yet.

Running the microbenchmark program with 20 sender and 20 receiver threads and message size 256 KB (that's when we notice severe performance degradation), we've noticed that the calls to MPI_Issend / MPI_Irecv take much longer (we measured them with clock()):

avg_issend: 656.000; avg_irecv: 1717.310

compared to running with 2 receiver and 2 sender threads:

avg_issend: 20.000; avg_irecv: 39.139

For what it's worth, below you can find a perf report, showing lots of time spent in some unknown function called by Irecv.

-  43.98%  dxhs_sem  [.] MPIR_Sendq_forget
   - MPIR_Sendq_forget
      - 99.98% MPIR_Request_complete
         + 98.81% PMPI_Wait
         + 1.19% PMPI_Waitall                   
-  38.24%  dxhs_sem  [.] 0x0000000000260f23     
   - 0x7fb9566d3f23        
   - 0x7fb9566d4e79        
   - rtc_register          
   - MPIDI_CH3I_GEN2_Prepare_rndv               
      - 56.50% MPIDI_gen2_Prepare_rndv_cts      
           MPID_nem_gen2_module_iStartContigMsg
         - MPIDI_CH3_iStartMsg                  
            - 86.45% MPIDI_CH3_RecvRndv         
                 MPID_nem_lmt_RndvRecv          
                 MPID_Irecv
                 PMPI_Irecv
                 mpi_irecv
                 recv_fcn  
                 start_thread                   
            - 13.55% MPIDI_CH3_PktHandler_RndvReqToSend
                 MPID_nem_netmod_handle_pkt     
                 0x7fb95677cdb6                 
                 0x7fb95677be6e
                 0x7fb9567774a8
               + MPID_nem_gen2_module_poll
      + 43.50% 0x7fb95677cc21
   + 0x7fb9566d4bbf        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d42dd        
   + 0x7fb9566d4c1c        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d4bd5        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d4c07        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d432b        
   + 0x7fb9566d3f40        
   + 0x7fb9566d4e79        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d3f27        
   + 0x7fb9566d4e79        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d4319        
   + 0x7fb9566d42e8        
   + 0x7fb9566d4bf9        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d42c6        
   + 0x7fb9566d4bc7        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d4c0e        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d4c78        
   + rtc_register          
   + MPIDI_CH3I_GEN2_Prepare_rndv
   + 0x7fb9566d4c25       

...

Adrian,

What is the full application that is impacted by this?  If you would prefer it not be public, you can either send me a private message here or email me directly (PM for email address).

Our full application is a commercial distributed RDBMS engine and the microbenchmark program simulates the component that hash-partitions the data.

I will ask my superiors if it is possible to share the source code with you (under some kind of an agreement or something), but I am not sure how it will help with the investigation, given the additional level of complexity that it will introduce.

Is there anything in particular that you are after? I will try to assist to the best of my knowledge. Or is there maybe a way that we can work together on this in a more streamlined fashion?

If you want to go the route of sharing your source code with us, I would highly recommend filing this on Intel® Premier Support (https://businessportal.intel.com), rather than going through our forums.  All communications on Intel® Premier Support are covered under NDA.  I don't know how necessary it would be, but our developers can more easily find the root cause with access to the source code.

If you do file an issue, mention this thread and my name to get it routed to me.

Can you profile your real application?  The easiest way to get a quick profile is with

I_MPI_STATS=ipm

You can also get a trace of your application with Intel® Trace Analyzer and Collector.  Our developer would like to get more information about your application to help pinpoint the performance degradation.

Also, if you can provide your source code along with how to build and run it, he can directly work with it himself.

I have run a series of tests with our real application that are very similar to the following configurations for the benchmark program:

constant: total data size (-d) = 24 GB; message size (-m) = 128 KB; num nodes = 4
variable: number of sender/receiver threads (-t) = A) 2 B) 8 C) 16 D) 24

The times that I'm showing are for a particular sender and a particular receiver thread, so that's why they do not fully add up to the wallclock time.

A) 2 sender + 2 receiver threads

wallclock: 17.11s

Sender-side processing: 11.60s
MPI_Issend: 1.16s
MPI_Wait on sender side: 0.41s

MPI_Irecv: 0.62s
MPI_Wait on receiver side: 16.25s

---

B) 8 sender + 8 receiver threads

wallclock: 12.70s

Sender-side processing: 4.37s
MPI_Issend: 0.833s
MPI_Wait on sender side: 5.33s

MPI_Irecv: 0.81s
MPI_Wait on receiver side: 11.66s

---

C) 16 sender + 16 receiver threads

wallclock: 21.25s

Sender-side processing: 4.28s
MPI_Issend: 1.958s
MPI_Wait on sender side: 13.08s

MPI_Irecv: 1.87s
MPI_Wait on receiver side: 19.29s

---

D) 24 sender + 24 receiver threads

wallclock: 43.89s

Sender-side processing: 3.79s
MPI_Issend: 19.70s
MPI_Wait on sender side: 18.5s

MPI_Irecv: 13.20s
MPI_Wait on receiver side: 30.54s

Note the huge times for MPI_Issend / MPI_Irecv in the last case. Part of the explanation may be the fact that I've oversubscribed the system by a factor of 4 at this point, as our nodes only have 12 cores.

However, comparing B) and C), it appears that the difference comes from MPI_Wait(), which seems to be about twice as slow for C). This is the most puzzling thing in my opinion. Do you have any explanation for this?
