Problem Using Intel MPI with distributed CnC

I am using Intel mpirun to launch distributed programs and I observe the following strange behavior: a StarPU distributed code runs on the cluster without any problems, but a CnC distributed code produces InfiniBand errors.

This command runs fine

mpirun -n 8 -ppn 1 -hostfile ~/hosts -genv I_MPI_DEBUG 5  ./dist_floyd_starpu.exe -n 10 -b 1 -starpu_dist

This command generates spew

mpirun -n 8 -ppn 1 -hostfile ~/hosts -genv I_MPI_DEBUG 5  -genv DIST_CNC=MPI ./dist_floyd_cnc.exe -n 10 -b 1 -cnc_dist

[0] MPI startup(): Intel(R) MPI Library, Version 4.1.0 Build 20130116
[0] MPI startup(): Copyright (C) 2003-2013 Intel Corporation. All rights reserved.
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[3] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[4] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[3] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[4] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[1] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[0] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[5] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[5] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[7] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[7] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node3.local:7764: open_hca: device mlx4_0 not found
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node3.local:7764: open_hca: device mlx4_0 not found
node7.local:7c3a: open_hca: device mlx4_0 not found
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node7.local:7c3a: open_hca: device mlx4_0 not found
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node6.local:5f98: open_hca: device mlx4_0 not found
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
[5] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[5] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
CMA: unable to open /dev/infiniband/rdma_cm
node4.local:75c: open_hca: device mlx4_0 not found
node5.local:1ce2: open_hca: device mlx4_0 not found
librdmacm: couldn't read ABI version.
node6.local:5f98: open_hca: device mlx4_0 not found

Both executables link to the same libraries

ldd dist_floyd_cnc.exe
libcnc.so => /opt/intel/cnc/0.7/lib/intel64/libcnc.so (0x00002b4292bc0000)
libtbb.so.2 => /home1/intel/composer_xe_2013.1.117/tbb/lib/intel64/libtbb.so.2 (0x00002b4292cff000)
libtbbmalloc.so.2 => /home1/intel/composer_xe_2013.1.117/tbb/lib/intel64/libtbbmalloc.so.2 (0x00002b4292e4b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039e7800000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002b4292fbe000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003550200000)
libibumad.so.2 => /usr/lib64/libibumad.so.2 (0x0000003550e00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003550600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003552200000)
libm.so.6 => /lib64/libm.so.6 (0x00002b42931c8000)
libiomp5.so => /home1/intel/composer_xe_2013.1.117/compiler/lib/intel64/libiomp5.so (0x00002b429344b000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003559800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003555e00000)
libc.so.6 => /lib64/libc.so.6 (0x000000354fe00000)
/lib64/ld-linux-x86-64.so.2 (0x000000354fa00000)

ldd dist_floyd_starpu.exe
libstarpu-1.0.so.1 => /usr/local/lib/libstarpu-1.0.so.1 (0x00002aaef6b6b000)
libhwloc.so.5 => /usr/local/lib/libhwloc.so.5 (0x00002aaef6dea000)
libstarpumpi-1.0.so.1 => /usr/local/lib/libstarpumpi-1.0.so.1 (0x00002aaef7014000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039e7800000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002aaef7252000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003550200000)
libibumad.so.2 => /usr/lib64/libibumad.so.2 (0x0000003550e00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003550600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003552200000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaef745c000)
libiomp5.so => /home1/intel/composer_xe_2013.1.117/compiler/lib/intel64/libiomp5.so (0x00002aaef76df000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003559800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003555e00000)
libc.so.6 => /lib64/libc.so.6 (0x000000354fe00000)
libglpk.so.0 => /usr/lib64/libglpk.so.0 (0x00002aaef79e3000)
libelf.so.1 => /usr/lib64/libelf.so.1 (0x0000003551200000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002aaef7c79000)
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x000000355cc00000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00002aaef7e7f000)
/lib64/ld-linux-x86-64.so.2 (0x000000354fa00000)

Any help or inputs would be appreciated.


Hi,
very strange.

This doesn't look familiar to me, but let's try to find the cause of the issue.

  • Does it work when you use pure sockets (DIST_CNC=SOCKETS, CNC_SOCKET_HOST=...)?
  • Does it work when using TCP over MPI?
    • Use TCP by setting environment variables (at runtime) I_MPI_FABRICS=shm:tcp and I_MPI_TCP_NETMASK=ib and I_MPI_DYNAMIC_CONNECTION=0
  • Does it help if you, at runtime, set environment variables I_MPI_DAPL_TRANSLATION_CACHE=off and I_MPI_DAPL_CREATE_CONN_QUAL=0?
  • Can you run one of the distCnC samples which come with the distro?
  • Can you upgrade to 0.8? If you do, does it still happen?

I am sorry you are having problems. I am confident we can find a fix.

frank

Hi,

Thanks for the reply. I have not found the actual problem yet, but setting I_MPI_FABRICS=ofa works for now. I tried points 2 and 3, but they do not help. I am using the latest CnC 0.8. I am trying to compare runtimes such as CnC and StarPU for an article my group is working on. I have a few questions regarding the performance aspects; is the forum the right place for such questions as well? I want to know whether the way I am doing the task distribution, etc., is the right approach.

Glad this works with OFA, so you have a working version. I'll try to get some information about the error nevertheless.

Sounds like a very interesting project! This is a good forum to ask questions of the kind you mentioned. However, if you have data which you don't want to post in a public forum or you have other preferences, we can also switch to a different medium.

Looking forward to your questions!

frank

One of the benchmarks I am evaluating is Floyd-Warshall. I am doing a 2D block distribution as detailed here: http://www.mcs.anl.gov/~itf/dbpp/text/node35.html. I get decent scaling on shared memory for my code, but I see slowdowns with the distributed version. I have specified the tuners compute_on, get_count and consumed_on, which tell CnC the communication pattern. I think one of the problems is the hashmap_tuner; I want to use a vector_tuner for my code, but I could not find any documentation on how to use one. I am uploading the CnC implementation, but I can also give you access to my Bitbucket or GitHub repository so that you can run it. Basically, the problem I am facing is that the task-insertion step is consuming a disproportionate amount of time.
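
For reference, the tuner hooks in my code look roughly like this; the Triple tag, the owner_of() function and the constants below are simplified placeholders rather than the exact code from the attachment:

#include <cnc/cnc.h>

struct Triple { int k, i, j; };           // placeholder tag type
struct floyd_context;                     // the real CnC context (not shown)

static const int NPROCS = 8;              // placeholder; normally queried at runtime
inline int owner_of( int i, int j )       // placeholder 2D block owner function
{
    return ( i + j ) % NPROCS;
}

struct block_tuner : public CnC::hashmap_tuner
{
    // how many gets happen on block (k,i,j) before it can be recycled
    int get_count( const Triple & ) const { return 3; /* placeholder value */ }
    // which process consumes block (k,i,j); simplified to its owner here
    int consumed_on( const Triple & t ) const { return owner_of( t.i, t.j ); }
};

struct block_step_tuner : public CnC::step_tuner<>
{
    // execute step (k,i,j) on the process that owns block (i,j)
    int compute_on( const Triple & t, floyd_context & ) const { return owner_of( t.i, t.j ); }
};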

Attachments: 

floyd-cnc.cpp (18.47 KB)
cnc-common.h (3.54 KB)

Here are a few things you could try to debug the DAPL issue.

  • The output of ‘ibv_devices’ and ‘ibv_devinfo’ may be useful.
  • OFED might not be running. The full debug output (‘-genv I_MPI_DEBUG 6’) of the starpu run might give us a hint.
  • Please check that the /etc/dat.conf files are the same on all nodes. You can also specify the DAPL provider explicitly with I_MPI_DAPL_PROVIDER (e.g. ofa-v2-ib0).

Wow, very cool!

If the shared-mem CnC version scales ok, then it is unlikely that the hashmap is a problem on distributed memory. Is the dist-mem version slower on shared-mem than the shared-mem one?

As a first hint, you can use the timing feature to get an impression of the time spent in gets/puts compared to the step body. The results are produced on each process individually. Use CnC::debug::init_timer(true), CnC::debug::finalize_timer(file) and CnC::debug::time(m_steps) to get the timing information. More details are here: http://software.intel.com/sites/landingpage/icc/api/struct_cn_c_1_1debug...
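
Roughly, the wiring looks like this (a minimal sketch; the context, step and member names are placeholders, not taken from your code):

#include <cnc/cnc.h>
#include <cnc/debug.h>

struct timing_context;
struct compute_step { int execute( const int & tag, timing_context & c ) const; };

struct timing_context : public CnC::context< timing_context >
{
    CnC::step_collection< compute_step > m_steps;
    CnC::tag_collection< int >           m_tags;

    timing_context() : m_steps( *this ), m_tags( *this )
    {
        m_tags.prescribes( m_steps, *this );
        CnC::debug::init_timer( true );   // cumulative timing, one set of numbers per process
        CnC::debug::time( m_steps );      // record time spent in each step instance
    }
};

int compute_step::execute( const int &, timing_context & ) const { return CnC::CNC_Success; }

// ...after ctxt.wait(), on each process:
//     CnC::debug::finalize_timer( "cnc_timing.log" );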

If it turns out to be an issue, I'd be happy to assist you in getting the vector-tuner running. It's straightforward, but it only works if you can store the full tag information in a single (long long) integer and if the range of resulting values is dense.
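
For example, a Triple(k,i,j) tag could be packed densely into a single integer like this (NB, the number of blocks per dimension, is a placeholder):

// dense packing of (k,i,j) into one integer; values lie in [0, NB*NB*NB)
inline long long pack( int k, int i, int j, int NB )
{
    return ( (long long)k * NB + i ) * NB + j;
}

The inverse mapping is the usual div/mod arithmetic.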

As the code is incomplete, I can't compile it. But I'd be more than interested in playing with it.

I am sort of cheating with the shared memory version, in the sense that I am not actually putting the blocks in an item collection and doing gets/puts on it; I directly access shared memory in the task. Below are the results for shared memory, the distributed-memory code run on shared memory, and an actual distributed run. The first run shows where CnC beats a simple OpenMP implementation handsomely. The cluster has 32 nodes with 8 cores each; the shared memory runs are on a single node. The thing I was happy about is that on shared memory CnC shows linear scaling :) , although I had to fiddle a little with the scheduling options. To answer your question: yes, the distributed version runs slower on 8 nodes * 8 cores than on a single node with 8 cores.

 CNC_SCHEDULER=FIFO_PRIOR_STEAL ./dist_floyd_cnc.exe -n 2048 -b 128 -cnc -seq -omp

Warning: DIST_CNC environment variable not specified; proceeding in shared-memory mode.
Sequential Time 30.283031s seconds
OpenMP Time 11.320426s seconds
[CnC] Using FIFO_PRIOR_STEAL scheduler.
Num Tasks (2048 * 16 * 16) = 524288
Shared CnC Item tag creation time 0.000192s seconds
Shared CnC task creation time 0.984864s seconds
Shared CnC task execution wait time 3.173542s seconds
Shared Memory CnC Time 4.457898s seconds
Done

CNC_SCHEDULER=FIFO_PRIOR_STEAL ./dist_floyd_cnc.exe -n(problem size) 2048 -b(block size) 128 -cnc_dist
Warning: DIST_CNC environment variable not specified; proceeding in shared-memory mode.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
Distributing ROW_CYCLIC
Num Tasks (2048 * 16 * 16) = 524288
Dist CnC tile creation time 0.045088s seconds
Dist CnC task creation time 0.954301s seconds
Dist CnC task execution wait time 6.663991s seconds
Block Collection Size 512
Dist CnC data gather time 0.034785s seconds
Dist CnC cleanup time 0.000001s seconds
Distirbuted CnC Time 7.666560s seconds
Done

mpirun -n 8 -ppn 1 -hostfile ~/hosts_all -env DIST_CNC=MPI -env I_MPI_FABRICS=ofa -env CNC_SCHEDULER=FIFO_PRIOR_STEAL ./dist_floyd_cnc.exe -n 2048 -b 128 -cnc_dist

Num Tasks (2048 * 16 * 16) = 524288
[CnC] Using FIFO_PRIOR_STEAL scheduler.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
[CnC] Using FIFO_PRIOR_STEAL scheduler.
Dist CnC tile creation time 0.175876s seconds
Dist CnC task creation time 3.844828s seconds
Dist CnC task execution wait time 13.750184s seconds
Block Collection Size 512
Dist CnC data gather time 0.035928s seconds
Dist CnC cleanup time 0.000000s seconds
Distirbuted CnC Time 17.756664s seconds

Do you know whether this algorithm can actually scale on distributed memory, e.g. do you have, say, an MPI version of it that performs well?

Performance aside, I think your CnC implementation might be incorrect. Both the shared and the distributed memory versions use tricks to avoid memory allocation for in-place updates. In some cases this is a valid thing to do, e.g. if it is semantically guaranteed that the re-use only occurs on single-use items (more precisely, on the last use). I might be mistaken, but as far as I understand the code and the algorithm, it is overwriting memory which might be used by other steps concurrently (i.e. there is no semantic single/last-use guarantee). The optimizations you made create the difficulties that CnC tries to protect programmers from :)

If my suspicion is right, I recommend writing a clean version first, and then see where optimization is needed. E.g., your shared-mem version needs a 3d-array to store the intermediate results (and I guess any parallel version does if no explicit synchronization is done) and the tiled version needs to create a fresh tile for every output the step creates. Am I missing something?

Yes, I do have a manual MPI implementation of Floyd-Warshall which scales well, but the distribution is not two-dimensional; the two-dimensional distribution is supposed to do better. I agree the shared memory implementation is not exactly what you would expect from a CnC user, but for shared memory I was more interested in the point-to-point synchronization abilities in a task-parallel environment; OpenMP has unnecessary barriers which prevent it from scaling well. Nevertheless it is correct, since I put block tags and get them instead of getting complete blocks (you can see this in the code: there are gets on block tags before touching the appropriate region), so I have a guarantee that sequential consistency is respected.
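
Schematically, the synchronization works like the sketch below; to keep it self-contained I pack the Triple tag into a single int here, and update_block_in_place() stands in for the code that touches the global D array, so the names are not the exact ones from the attachment:

#include <cnc/cnc.h>

struct floyd_context;
struct compute_block { int execute( const int & tag, floyd_context & c ) const; };

static const int NB = 16;                                 // placeholder: blocks per dimension
inline int tag_of( int k, int i, int j ) { return ( k * NB + i ) * NB + j; }
inline void update_block_in_place( int, int, int ) { /* touches the global D array in the real code */ }

struct floyd_context : public CnC::context< floyd_context >
{
    CnC::step_collection< compute_block > m_steps;
    CnC::tag_collection< int >            m_tags;
    CnC::item_collection< int, int >      m_ready;  // "block tags": 1 == block is ready

    floyd_context() : m_steps( *this ), m_tags( *this ), m_ready( *this )
    {
        m_tags.prescribes( m_steps, *this );
    }
};

int compute_block::execute( const int & tag, floyd_context & c ) const
{
    const int k = tag / ( NB * NB ), i = ( tag / NB ) % NB, j = tag % NB;
    int dummy;
    // block (k,i,j) may only run once blocks (i,k), (k,j), (i,j) of iteration k-1 are done
    if( k > 0 ) {
        c.m_ready.get( tag_of( k - 1, i, k ), dummy );
        c.m_ready.get( tag_of( k - 1, k, j ), dummy );
        c.m_ready.get( tag_of( k - 1, i, j ), dummy );
    }
    update_block_in_place( k, i, j );          // in-place update of shared memory
    c.m_ready.put( tag_of( k, i, j ), 1 );     // signal completion of this block
    return CnC::CNC_Success;
}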

As for the distributed memory version, I do have complete blocks in the item collection, and it is three-dimensional: Triple(k, i, j) is the (i,j)th block in the kth iteration. In the distributed case I am sticking to what CnC expects the user to do; let me know if this is not the case. I did start from a really simple and basic implementation :).

Attaching the complete implementation. make cnc should work if Intel MPI and CnC are installed and the var scripts are sourced. The results can be compared with the sequential version by giving the -check option.

Attachments: 

floyd.tar.gz (6.98 KB)

I seem to be missing something.

When the two steps write the new min-value, they use either global memory (shared-mem) or re-use memory that was retrieved through a get (dist-mem). What I don't understand is why no other step can read from the memory that is currently being written. What is the semantics that excludes the race? If you have a quick answer, I'd appreciate it. Otherwise I will try to figure it out tomorrow.

Thanks for the code, I'll look at it tomorrow.

frank

Ah, I see what you are asking; yes, that is a little subtle.

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][k] + D[k][j], D[i][j]);

Basically, the thing to observe is that neither D[i][k] nor D[k][j] will be written during a particular iteration of k: for D[i][k] to be overwritten we would need D[i][k] + D[k][k] < D[i][k], i.e. D[k][k] < 0, which never happens as long as there are no negative cycles (and the same argument holds for D[k][j]). This is not so obvious, so I think I should avoid relying on it: for distributed memory I will create a new tile, and for shared memory I will use stencil-like code D[k%2][i][j] to avoid any confusion like this. But thanks for pointing it out; I just reused the code from my prior OpenMP implementation, and it took me quite a bit of time to reason this out again :(.
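
For the shared-memory fix, something along these lines (a sketch only, assuming a flattened two-plane buffer rather than my actual tile layout):

#include <algorithm>
#include <vector>

// D is a two-plane buffer, D[p][i*n + j]; plane 0 holds the input distances.
void floyd_double_buffered( std::vector<double> D[2], int n )
{
    for ( int k = 0; k < n; ++k ) {
        const int cur = k % 2, nxt = ( k + 1 ) % 2;   // read plane / write plane
        for ( int i = 0; i < n; ++i )
            for ( int j = 0; j < n; ++j )
                D[nxt][i*n + j] = std::min( D[cur][i*n + k] + D[cur][k*n + j],
                                            D[cur][i*n + j] );
    }
    // the final result ends up in plane n % 2
}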

Ok, I understand. Thanks.

I can run your example on our cluster. No problem with DAPL.

I am excited to see that this example makes use of priorities and they actually help (at least on shared memory)!

With respect to distributed memory, I think the fundamental problem is that row and column k are an almost serializing bottleneck, but in your version they are accompanied by additional computation and data. E.g. the entire tile is computed and transferred before dependent tiles can start, even though they depend only on parts of it. I used Intel(R) Trace Analyzer and Collector to visualize the behavior, and that "serializing" effect became very apparent.

Does your MPI implementation actually ship only the row/column that is needed on the other process?

A solution could be similar to what the (RTM-3dfd) stencil code does: have special data items which represent only the row/column k-pieces of a tile, and ship/access those alone (for the (i,k) and (k,j) values). Unfortunately this is a little more work than I would like it to be.
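
Very roughly, the data split could look like this (all names are hypothetical; in real code the tag type also needs hashing and serialization support):

#include <cnc/cnc.h>
#include <vector>

// Placeholder tag; a real version needs operator==, a cnc_hash specialization
// and (for distCnC) a serializer.
struct Triple { int k, i, j; };

typedef std::vector< double > tile_t;    // a full b x b tile
typedef std::vector< double > strip_t;   // one row or one column of a tile

struct split_context : public CnC::context< split_context >
{
    CnC::item_collection< Triple, tile_t >  m_tiles;       // full tiles, consumed only by their owner
    CnC::item_collection< Triple, strip_t > m_row_strips;  // the k-row piece of tile (k,i,j)
    CnC::item_collection< Triple, strip_t > m_col_strips;  // the k-column piece of tile (k,i,j)

    split_context()
        : m_tiles( *this ), m_row_strips( *this ), m_col_strips( *this )
    {}
};

// A step for tile (k,i,j) would then get only m_row_strips/m_col_strips for its
// (i,k)/(k,j) inputs, so only those small pieces cross process boundaries.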

Does this make any sense to you?

I was happy that the priorities worked out nicely; they also help on distributed memory, but the basic scaling problems overshadow it :).

You are right, the MPI version broadcasts the row required by every node after each iteration of k, so it does not do unnecessary communication. But my 2D CnC code does do unnecessary communication; I was avoiding the precise-communication version because, like you said, it gets complicated. But it looks like I cannot avoid it anymore, so I will try to implement a 1D distribution as well as precise communication with the 2D distribution; this is necessary for the comparison we are doing. The timing statistics for the steps really helped, thanks. By the way, I only get context statistics (I mean scheduler statistics, messages sent, received, etc.) for node 0 and not for the other nodes. Am I missing something here?

I got rid of the sequential step-tag generation. Each node now generates only the steps it is going to compute, which eliminated a lot of the messages node 0 was sending to transfer step tags to the appropriate nodes.

Is there a free academic version of Intel(R) Trace Analyzer that I can use? Also, for the RTM stencil, is there a simple sequential code that you used as a reference while implementing it in CnC? I want to run the sequential code to report speedups relative to it.

I looked into the statistics issue and found two problems:

  1. There is a bug in the CnC runtime which prevents it from printing the remote stats. The fix is already in place and will be part of the next release. If you want a pre-release with that fix, please send me an email (I have sent you my address in a personal message here on IDZ). 
  2. You need to enable statistics within the context constructor, otherwise you enable them on the host process only (see the sketch below). I might consider allowing this more flexibly, but that's not available right now.
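
A minimal sketch of what I mean, assuming the scheduler-statistics hook is CnC::debug::collect_scheduler_statistics (please double-check the exact name in cnc/debug.h); the context and its members are placeholders:

#include <cnc/cnc.h>
#include <cnc/debug.h>

struct stats_context;
struct some_step { int execute( const int & tag, stats_context & c ) const; };

struct stats_context : public CnC::context< stats_context >
{
    CnC::step_collection< some_step > m_steps;
    CnC::tag_collection< int >        m_tags;

    stats_context() : m_steps( *this ), m_tags( *this )
    {
        m_tags.prescribes( m_steps, *this );
        // enabling it here means it is enabled on every process, because the
        // context constructor runs on all processes in distCnC
        CnC::debug::collect_scheduler_statistics( *this );
    }
};

int some_step::execute( const int &, stats_context & ) const { return CnC::CNC_Success; }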

Trace Analyzer&Collector are part of the Intel(R) Cluster Studio. You can get a free 30-day trial license (http://software.intel.com/en-us/intel-cluster-studio-xe-evaluation-options/) and there is a special program for academia (http://software.intel.com/en-us/intel-education-offerings/).

The serial tag generation is primarily a scalability issue, mostly at start-up time. In the RTM-stencil (3dfd) example there is a pretty generic implementation of a multi-dimensional blocked_range and a partitioner, which works nicely with the tag-range feature. The benefit is that the recursive range partitioning works in parallel and even gets distributed. Instead of putting one tag at a time, one puts a full range and lets the runtime (or a custom partitioner) partition the tag-range (which is treated more like a set than a contiguous range).
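
In its simplest one-dimensional form this looks roughly like the sketch below; I am writing it from memory, so the exact tag_tuner template parameters may differ between releases, and the 3dfd sample shows the real multi-dimensional range and partitioner:

#include <cnc/cnc.h>
#include <tbb/blocked_range.h>

struct range_context;
struct my_step { int execute( const int & tag, range_context & c ) const; };

struct range_context : public CnC::context< range_context >
{
    CnC::step_collection< my_step > m_steps;
    // a tag collection whose tuner accepts ranges instead of single tags
    CnC::tag_collection< int, CnC::tag_tuner< tbb::blocked_range< int > > > m_tags;

    range_context() : m_steps( *this ), m_tags( *this )
    {
        m_tags.prescribes( m_steps, *this );
    }
};

int my_step::execute( const int &, range_context & ) const { return CnC::CNC_Success; }

// Instead of a loop of individual puts, the environment puts one range and the
// runtime (or a custom partitioner) splits and distributes it:
//     ctxt.m_tags.put_range( tbb::blocked_range< int >( 0, num_tags ) );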

I'll get back to you about the RTM-3dfd matter later.

Curious to see how well the new version will perform! Good luck!

Attached is a sequential version of the same stencil as shipped with the CnC kit.
We also have a CnC shared-memory-only version of this if you are interested.

Attachments: 

sequential-stencil.cpp (5.67 KB)
util-time.h (2.18 KB)
