mixing Intel MPI and TBB

Hyokun Y.:

I have been using a mixture of MPICH2 and TBB very successfully:
MPICH2 for machine-to-machine communication and TBB for intra-machine thread management.

Now I am trying the very same code on a system which uses Intel MPI instead of MPICH2,
and I am observing very odd behavior: some messages sent with MPI_Ssend are never received
at the destination, and I am wondering whether it is because Intel MPI and TBB do not work well
together.

The following document

http://software.intel.com/en-us/articles/intel-mpi-library-for-linux-pro...

says that the environment variable I_MPI_PIN_DOMAIN has to be set properly when
OpenMP and Intel MPI are used together. When TBB is used with Intel MPI instead of
OpenMP, is there anything I should be careful about? Is this combination
guaranteed to work?
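
Incidentally, one way to check which of these pinning-related variables each rank actually sees is a small diagnostic along the following lines; this is only a sketch using std::getenv and the variable names discussed in this thread, nothing Intel-specific is assumed:

// Diagnostic sketch: print the pinning-related environment variables each rank sees.
#include <cstdlib>
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // variable names taken from this discussion; not an exhaustive list
    const char *vars[] = { "I_MPI_PIN", "I_MPI_PIN_DOMAIN", "I_MPI_FABRICS" };
    for (const char *name : vars) {
        const char *value = std::getenv(name);
        std::cout << "rank " << rank << ": " << name << "="
                  << (value ? value : "(unset)") << std::endl;
    }

    MPI_Finalize();
    return 0;
}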

Thanks,
Hyokun Yun 

 

Hyokun Y.:

I have attached a simple test program which mixes TBB with Intel MPI. It worked perfectly fine on the previous cluster, which uses MPICH2, but on a new cluster with Intel MPI some messages are never delivered, so the blocking send never completes.

Attachment: tbb-test.cpp (3 KB)
Hyokun Y.:

#include <iostream>
#include <utility>
#include <algorithm>
#include <mpi.h>
#include "tbb/tbb.h"
#include "tbb/scalable_allocator.h"
#include "tbb/tick_count.h"
#include "tbb/spin_mutex.h"
#include "tbb/concurrent_queue.h"
#include "tbb/pipeline.h"
#include "tbb/compat/thread"
#include <boost/format.hpp>

using namespace std;
using namespace tbb;

int main(int argc, char **argv) {
    // initialize TBB with the default number of worker threads
    tbb::task_scheduler_init init;

    // initialize MPI, requesting full multithreading support
    int numtasks, rank, hostname_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    int mpi_thread_provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_thread_provided);
    if (mpi_thread_provided != MPI_THREAD_MULTIPLE) {
        cerr << "MPI multiple thread not provided!!! "
             << mpi_thread_provided << " " << MPI_THREAD_MULTIPLE << endl;
        return 1;
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Get_processor_name(hostname, &hostname_len);

    cout << boost::format("processor name: %s, number of tasks: %d, rank: %d\n")
                % hostname % numtasks % rank;

    // run the program for 10 seconds
    const double RUN_SEC = 10;
    // size of each message in bytes
    const int MBUFSIZ = 100;

    tick_count start_time = tick_count::now();

    // receive thread: keep receiving messages from any source
    thread receive_thread([&]() {
        int monitor_num = 0;
        double elapsed_seconds;

        int data_done;
        MPI_Status data_status;
        MPI_Request data_request;
        char recvbuf[MBUFSIZ];

        MPI_Irecv(recvbuf, MBUFSIZ, MPI_CHAR,
                  MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &data_request);

        while (true) {
            elapsed_seconds = (tbb::tick_count::now() - start_time).seconds();

            // heartbeat: report roughly once per second that this thread is alive
            if (monitor_num < elapsed_seconds + 0.5) {
                cout << "rank: " << rank << ", receive thread alive" << endl;
                monitor_num++;
            }

            // keep receiving 5 extra seconds beyond the send window
            if (elapsed_seconds > RUN_SEC + 5.0) {
                break;
            }

            MPI_Test(&data_request, &data_done, &data_status);
            if (data_done) {
                cout << "rank: " << rank << ", message received!" << endl;
                MPI_Irecv(recvbuf, MBUFSIZ, MPI_CHAR,
                          MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &data_request);
            }
        }

        // a cancelled request still has to be completed
        MPI_Cancel(&data_request);
        MPI_Wait(&data_request, &data_status);
        cout << "rank: " << rank << ", recv thread dying!" << endl;
    });

    // send thread: send one (meaningless) message to (rank + 1) every second
    thread send_thread([&]() {
        int monitor_num = 0;
        double elapsed_seconds;

        char sendbuf[MBUFSIZ];
        fill_n(sendbuf, MBUFSIZ, 0);

        while (true) {
            elapsed_seconds = (tbb::tick_count::now() - start_time).seconds();

            if (monitor_num < elapsed_seconds) {
                cout << "rank: " << rank << ", start sending message" << endl;
                monitor_num++;
                MPI_Ssend(sendbuf, MBUFSIZ, MPI_CHAR,
                          (rank + 1) % numtasks, 1, MPI_COMM_WORLD);
                cout << "rank: " << rank << ", send successfully done!" << endl;
            }

            if (elapsed_seconds > RUN_SEC) {
                break;
            }
        }

        cout << "rank: " << rank << ", send thread dying!" << endl;
    });

    receive_thread.join();
    send_thread.join();

    MPI_Finalize();
    return 0;
}

Tim Prince:

Did you take care to adjust the environment settings according to your intended method of scheduling?  I'm guessing with MPICH2 you have left scheduling to the OS.  If using MPI_THREAD_FUNNELED mode, you can easily set the Intel environment variable to get multiple hardware threads per rank, not relying on MPI to understand TBB as it does OpenMP.  I believe then you can explicitly affinitize the TBB threads to that rank, but I don't know the details.  I suppose if you have forced threads from different ranks to use the same hardware resources, deadlock should not be a surprise.
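
To make the funneled pattern concrete, here is a rough sketch of the general idea, not a tested recipe; the parallel_for body and the reduction are placeholders. In MPI_THREAD_FUNNELED mode only the thread that called MPI_Init_thread makes MPI calls, while the TBB worker threads stay inside the compute region:

// Sketch of the MPI_THREAD_FUNNELED pattern: only the main thread touches MPI,
// TBB threads only do computation. Assumes C++11, TBB, and MPI; the work is a placeholder.
#include <vector>
#include <mpi.h>
#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    // FUNNELED only requires that the thread which initialized MPI makes all MPI calls.

    int rank, numtasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    tbb::task_scheduler_init init;          // default: one worker per hardware thread

    std::vector<double> local(1000, rank);
    // Compute phase: TBB worker threads, no MPI calls in here.
    tbb::parallel_for(size_t(0), local.size(), [&](size_t i) {
        local[i] = local[i] * 2.0 + 1.0;    // placeholder work
    });

    // Communication phase: back on the main thread only.
    double local_sum = 0.0, global_sum = 0.0;
    for (double x : local) local_sum += x;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}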

Hyokun Y.:

Thank you very much for the response!

Quote:

TimP (Intel) wrote:

If using MPI_THREAD_FUNNELED mode, you can easily set the Intel environment variable to get multiple hardware threads per rank, not relying on MPI to understand tbb as it does OpenMP.  I believe then you can explicitly affinitize the tbb threads to that rank, but I don't know the details.  

I am using MPI_THREAD_MULTIPLE, but I guess what you are saying is still relevant? I tried I_MPI_PIN=off, but it did not help.

Quote:

TimP (Intel) wrote:

I suppose if you have forced threads from different ranks to use the same hardware resources, deadlock should not be a surprise. 

Actually, I am using a Linux cluster and every rank is assigned to a different machine. But do you still think a deadlock could happen under a particular setting of the environment variables? Note that in the example above I am using only 2 threads, which do MPI_Ssend and MPI_Irecv respectively. Therefore there are at most only 4 threads running, and I have 16 cores, so I thought I had enough resources.

I have just implemented an OpenMP version of the code, and I am experiencing the same problem; it works fine with MPICH2 but not with Intel MPI (I can share the code if anyone wants). I tried I_MPI_PIN_DOMAIN=omp and I_MPI_PIN=off, but they did not help. Is there any other environment variable to adjust? Any comments are appreciated!
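
For reference, the OpenMP version has the same structure as the attached TBB test; a rough sketch of that structure (not the exact code I ran) looks like this:

// Minimal sketch of the OpenMP variant (structure only): one section receives,
// one section sends, mirroring the two TBB threads in the attached test.
#include <cstring>
#include <iostream>
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank, numtasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    const double RUN_SEC = 10.0;
    const int MBUFSIZ = 100;
    double start = MPI_Wtime();

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section          // receiver
        {
            char recvbuf[MBUFSIZ];
            MPI_Request req;
            MPI_Status status;
            int done;
            MPI_Irecv(recvbuf, MBUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, 1,
                      MPI_COMM_WORLD, &req);
            while (MPI_Wtime() - start < RUN_SEC + 5.0) {
                MPI_Test(&req, &done, &status);
                if (done) {
                    std::cout << "rank " << rank << ": message received" << std::endl;
                    MPI_Irecv(recvbuf, MBUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, 1,
                              MPI_COMM_WORLD, &req);
                }
            }
            MPI_Cancel(&req);
            MPI_Wait(&req, &status);
        }
        #pragma omp section          // sender: roughly one message per second
        {
            char sendbuf[MBUFSIZ];
            std::memset(sendbuf, 0, MBUFSIZ);
            int sent = 0;
            while (MPI_Wtime() - start < RUN_SEC) {
                if (sent < MPI_Wtime() - start) {
                    MPI_Ssend(sendbuf, MBUFSIZ, MPI_CHAR,
                              (rank + 1) % numtasks, 1, MPI_COMM_WORLD);
                    ++sent;
                    std::cout << "rank " << rank << ": send done" << std::endl;
                }
            }
        }
    }

    MPI_Finalize();
    return 0;
}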

James Tullos (Intel):

Hi Hyokun,

I compiled and ran your test program here using Intel® MPI Library Version 4.1.0.030, and Intel® Threading Building Blocks Version 4.1.3.163.  I am not seeing any deadlocks, with all settings at default.  What is the output you get from

env | grep I_MPI

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hyokun Y.:

Hi James, 

Thanks for the response! Does my program terminate properly for you? On the several clusters I tried, it does not terminate, since only one thread is active at a time. Note that (approximately) 10 messages have to be sent by each machine by the end of the execution.

By default I have only 

I_MPI_FABRICS=shm:tcp

I tried I_MPI_PIN=off but it did not help. I am using impi/4.1.0.024/ and composer_xe_2013.3.163 (icpc 13.1.1.163).

Thanks,
Hyokun Yun 

James Tullos (Intel):

Hi Hyokun,

How many nodes are you using?  I was only using two nodes, and testing up to 64 ranks per node (in case oversubscribing led to the problem).  I'm going to try with TCP and see if that causes it to hang.  I'm also going up to 8 nodes.

As far as I can tell, your program completed successfully.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hyokun Y.:

James, thanks for taking care of this seriously.

Quote:

James Tullos (Intel) wrote:

How many nodes are you using?

I tried using 2 nodes and 4 nodes.

I reproduced this problem on two different clusters at different institutions, so I think this is not a hardware-specific issue. Would you please let me know which versions of Intel MPI and TBB you are using? (Probably the most recent?)

In another post on the TBB forum, another person was able to reproduce the problem, and he told me there is an issue with multithreading in Intel MPI: http://software.intel.com/en-us/forums/topic/392226 Do you happen to know anything about this issue? I was wondering whether you were using a fixed version.

Thanks,
Hyokun Yun 

James Tullos (Intel):

Hi Hyokun,

Let me check with Roman to find out which fix he is talking about, and I'll test with that build.

In my testing, I try to stick to versions that are publicly released, so as to more accurately reproduce what a customer would see.  In this case, I am using IMPI 4.1.0.030 and the latest TBB released with the 2013.3 Composer XE.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

James Tullos (Intel):

Hi Hyokun,

Please try upgrading to Intel® MPI Library 4.1.0.030. This update should correct the problem you are seeing.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hyokun Y.:

Switching from 4.1.0.024 to 4.1.0.030 indeed fixed the problem. Thanks very much!

Best,
Hyokun Yun 

James Tullos (Intel):

Hi Hyokun,

Great!  I'm glad it's working now.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
