Vtune performance analysis results on Phi


Hello,

After some effort, I have managed to port my simulation code to the Xeon Phi. It usually runs with MPI on a cluster of PCs, but I wanted to see what kind of performance I would get from the Phi. The code relies quite a lot on MKL routines (ScaLAPACK, the MKL implementation of FFTW, a few GSL routines and so on).

I first ran the code as single-threaded MPI (as I do on my cluster) with 10-60 MPI processes. I then tried to improve performance by adding threading through compiler options (my thinking was that although my code itself isn't threaded, the MKL routines might benefit from running multiple threads inside an MPI process, and that the compiler could also add some parallelism). This brought some performance gain.

However, I am still far from the speed I get on the cluster. I then ran Vtune Amplifier to see if it could help me find out where the bottlenecks are. I would have thought that some of my own functions would appear as the culprits, and that I could start improving from there. But no, the main bottlenecks are the MPI and threading libraries, vmlinux and mkl_core (see attached screen capture). I have tried to play with affinity and such, but it hasn't brought me much. I did optimize the ratio of threads to MPI processes, and that improved the performance somewhat.

So what does this mean ? Are the MPI calls not very efficient, and should I try to replace the MPI calls with threading or re-think my parallelization scheme ? How do I figure out which MPI calls are the most time consuming ? Or should I just concentrate on the functions which appear below them (like phase2psfcube_float_function), and will improving those also have an impact on the library calls above ?

Thanks in advance for your help and ideas !

Miska

Attachment: vtune-closeloop.gif (213.28 KB)

Yes, the trace you provided does show activity dominated by the OpenMP and MPI libraries.  Some of your routines may actually rank among the hottest functions, but because VTune Amplifier didn't have access to the relevant libraries through its search directories, it lumps all the counts for such modules together, and the accumulated modules often rest at the top of the hot spot list.  Adding additional library paths to the Search Directories in VTune Amplifier will enable it to dissect those library calls into their appropriate functions, rank those functions in the proper order w.r.t. your functions, and also maybe give some hints about what your code is doing, deduced indirectly from the function names that pop up.  Perhaps you could try adding some of the following?

/lib/firmware/mic                              for vmlinux functions
/usr/linux-k1om-4.7/linux-k1om/lib64           for libc and other generic user libraries
/opt/intel/composer_xe_2013/lib/mic            for Intel OpenMP, Intel Cilk and other compiler-related libraries
/opt/intel/composer_xe_2013/mkl/lib/mic        for Intel Math Kernel Library functions
/opt/intel/composer_xe_2013/mpi/lib/mic        for Intel MPI (I don't know if this one actually works, since MPI is not installed on this machine, but it would be the logical place for it to appear)
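
If you are collecting or re-resolving results from the command line rather than the GUI, the same directories can be passed there as well. A hedged sketch, assuming the option names from the amplxe-cl command-line help (the result directory name r000hs is just a placeholder for one of your existing results; verify the exact spellings with amplxe-cl -help):

# re-finalize an existing result with additional symbol search paths
amplxe-cl -finalize -r ./r000hs \
    -search-dir /lib/firmware/mic \
    -search-dir /opt/intel/composer_xe_2013/mkl/lib/mic \
    -search-dir /opt/intel/composer_xe_2013/lib/mic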

It takes a lot of work for most applications to find a good balance between the number of threads and the number of MPI ranks.  The optimum may be fewer ranks with more threads each than you might guess, e.g. 4 MPI processes of 45 or 60 threads each.

When using MKL threading, it's important to set OMP_NUM_THREADS or MKL_NUM_THREADS within the number of hardware threads available per MPI process.  If using more than 1 core per MPI process, with fewer than 4 threads per core, it's important to set environment variables to spread the threads evenly among the cores of each rank.
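
A hedged illustration of what that could look like for one possible split (say 10 ranks sharing 60 cores, i.e. 6 cores per rank with 3 threads per core; the numbers are only an example, not a recommendation):

# per-rank thread count: 6 cores x 3 threads/core = 18 threads for this rank
export OMP_NUM_THREADS=18       # or MKL_NUM_THREADS=18 if only the MKL calls should thread
export KMP_AFFINITY=balanced    # spread the threads evenly over the cores of the rank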

Looking at the events 1 thread at a time will give you a better idea of the impact on elapsed time (of that thread) than looking at the total events of all threads.  In my experience, the total OpenMP events are misleadingly high, but the MPI events (with a small number of ranks) may be smaller than their actual impact. 

To see the MPI events labeled by function,  add the host directory of your mic .so libraries under the MPI installation to the list of search directories which Robert quoted.

Hello,

Thank you both for the ideas and comments.

I first added the suggested directories into the vTune search path. It does light up some MPI and OpenMP calls, though I am not sure I understand much more about what's going on. From the function names, I can guess that there is shared memory congestion, and that creating the OpenMP threads adds quite an overhead. But is there a way to see where in my code the calls to MPID_nem_sshm_poll come from ? Or is this simply a general MPI function that basically happens at each MPI communication ?

Any other ideas on how I may "guess" where I should start the optimization process ? It would be really helpful to know, for example, which MPI calls take the most time, so I could start by optimizing those.

As to the placement of threads and MPI ranks, I did play with the following parameters:

export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=fine,balanced
export PHI_KMP_PLACE_THREADS=60c,4t
export PHI_OMP_NUM_THREADS=160

For the affinity, I tried (fine,compact), (fine,scatter) and (fine,balanced). For the thread placement, I tried from (60c,4t) down to (60c,1t). For the number of OMP threads, I went from 20 to 240.  The configuration above is pretty close to the optimal, using 40 MPI ranks. Do you think this covers the parameter space ?

I am now trying to get the HugePages hack to work, to see if that could help me get more performance, but I think I am now close to where I need to start changing things in my code. For the moment, I have tried to improve performance by "external" means (threading, number and placement of threads etc.)

Thanks again for your help.

Miska

Attachment: vtune2.gif (207.45 KB)

Ok, something's fishy in my OMP settings. Just after I finished writing the previous message, I noticed that my tests with the number of threads were *actually* run with:

export OMP_NUM_THREADS=4
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=fine,balanced
export PHI_KMP_PLACE_THREADS=60c,4t
export PHI_OMP_NUM_THREADS=160
mpiexec.hydra -n 40 -hostfile cluster_file_mic.txt ./close_loop > close1.txt

Now, I had somehow understood that
export OMP_NUM_THREADS=4
was the number of threads per MPI rank
and that:
export PHI_OMP_NUM_THREADS=160
was the total number of threads (so 40 MPI ranks * 4 threads/rank = 160). But now I am not sure anymore.

So I am not sure I understand at all how the OMP distributes threads in an MPI environment...

Can someone shed some light on this ? Most of the examples I see are pure OpenMP, so I guess my confusion comes from trying to mix MPI with OpenMP.

Thanks !

Miska

I don't think you can use KMP_PLACE_THREADS this way; I would not use PLACE_THREADS for MPI ranks running on coprocessor.

 KMP_AFFINITY=balanced should work.

If your MPI processes are running on the host, using offload to run on the coprocessor, you must list each rank separately with the number of cores and offset, e.g. :

-env PHI_KMP_PLACE_THREADS=15C,45t,0O

-env PHI_KMP_PLACE_THREADS=15C,45t,15O

..

That is, you specify for each rank which cores it will use on the coprocessor.

Apparently, the only place this is documented is in the book by Reinders et al. which is for sale.
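
For what it's worth, a hedged sketch of how those per-rank settings might be strung together using the colon-separated (per-rank) form of mpiexec.hydra; the core counts and offsets are just the example values above, ./your_app is a placeholder, and the exact -env syntax should be checked against mpiexec.hydra -help:

mpiexec.hydra -n 1 -env PHI_KMP_PLACE_THREADS 15C,45t,0O  ./your_app : \
              -n 1 -env PHI_KMP_PLACE_THREADS 15C,45t,15O ./your_app : \
              -n 1 -env PHI_KMP_PLACE_THREADS 15C,45t,30O ./your_app : \
              -n 1 -env PHI_KMP_PLACE_THREADS 15C,45t,45O ./your_app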

Ok. One thing I don't understand though. It seems I get a different behaviour, if I do:

export OMP_NUM_THREADS=4
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=fine,balanced
export PHI_KMP_PLACE_THREADS=60c,4t
export PHI_OMP_NUM_THREADS=160
mpiexec.hydra -n 40 -hostfile cluster_file_mic.txt ./close_loop > close1_1jobs.txt

(which works fine, best performance I get) and if I do (after having unset the above variables):

export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=fine,balanced
export PHI_KMP_PLACE_THREADS=60c,4t
export PHI_OMP_NUM_THREADS=160
mpiexec.hydra -n 40 -hostfile cluster_file_mic.txt ./close_loop > close1_1jobs.txt

So omitting the first variable, OMP_NUM_THREADS. In that case the code is very slow (I even stopped it at some point; it seemed it was hanging, or just excruciatingly slow)...

Is this "normal" ? I thought that maybe PHI_OMP_NUM_THREADS was just the Phi-specific version of OMP_NUM_THREADS. But no ? Do they have different meanings ?

Sorry to be so confused, but it seems MPI and OMP together is not the main mode of operation...

Thanks so much for your help, and have a nice week-end,

Miska

As I tried to say, you should tell us whether you are running MPI on the host only and accessing the coprocessor by offload only, as your use of MIC_ENV_PREFIX implies.  We can get into more alternatives than you want to know about.

Did you use micsmc to get a picture of how the cores are used on your coprocessor?

Quote:

As I tried to say, you should tell us whether you are running MPI on the host only and accessing the coprocessor by offload only, as your use of MIC_ENV_PREFIX implies.  We can get into more alternatives than you want to know about.

Sorry, I didn't understand. I am running 100% on the Phi. So I launch the mpiexec on the host, but the machinefile only contains the card.

I see I understand even less than I thought, since I naively assumed MIC_ENV_PREFIX was something different. However, if I ssh to mic0 and do a top, I do see my job appearing, and I see that when threading is enabled, each process is taking more than 100%. I also don't see any job appearing in top on the local host. But I definitely want to run my code 100% on the Phi, no offloading at all...

This is a complicated beast...

Quote:

Did you use micsmc to get a picture of how the cores are used on your coprocessor?

Yes. The load I see does get significantly higher when I use more threads / ranks. However, the correlation between high load and good speed is not necessarily obvious (although it's there). I can use a lot of cores, but still get bad performance in terms of execution time. I suppose that it's the MPI / threading overhead...

Using top, I can also see that sometimes a process just takes 0.1% (for a short time) and then can jump to almost 300% or more. So I think I need to understand what phase of the code causes the idle times. The problem is that they are quite short. Probably understanding the vTune output better can help me there too.

I hope this clears things up ? I am actually quite confused :-)

Miska

Just to add... I guess my questions are now more general.

- How do I control how many threads each MPI rank is allowed to use ?

- How do I tell how the ranks / threads are physically placed on the card ? This also implies (I assume) some level of tweaking, so what are the variables that I need to iterate on to be fairly sure that my configuration is optimal ?

I thought I had that figured out. But I now think I might have it all wrong...

Miska

The environment variables you set by MIC_ENV_PREFIX are ignored for MPI ranks which run entirely on the coprocessor.  Here, you want the environment variables without prefix, communicated either by setting them as -env options in mpiexec.hydra, or by putting them in a script you launch on the coprocessor for each rank from mpiexec.hydra, which in turn launches the application for that rank.  In either case, you would specify OMP_NUM_THREADS to be the number of threads per rank.  Intel MPI will divide the cores as evenly as possible among the ranks by default.
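
A hedged sketch for your native case (the hostfile and binary names are the ones from your own command lines; 6 threads per rank is just an example value, and the exact -env syntax should be checked with mpiexec.hydra -help):

# unprefixed OpenMP variables passed per rank; no MIC_ENV_PREFIX needed for native ranks
mpiexec.hydra -n 40 -hostfile cluster_file_mic.txt \
    -env OMP_NUM_THREADS 6 \
    -env KMP_AFFINITY balanced \
    ./close_loop > close1.txt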

Hello,

Ok, I have tested further, just playing with the number of ranks and threads after cleaning the offload-related settings out of my environment variables. The optimum seems to be ~20 ranks, 4 threads. In that case, my test case runs in 5 minutes 21 s. The worst result is 120 ranks, 2 threads (7:28). However, for example, 40 ranks, 6 threads gives 5:44 (and the micsmc GUI shows a load of around 80%). So I am not sure I see a 100% correlation between MIC load and execution time. I am a bit surprised that the spread in running time is not larger. I suppose it just indicates that I am limited by the OpenMP / MPI overheads.

I have attached here the results of micsmc shortly after the code has started. This is the "optimal" configuration.

I have also attached two screen grabs of the vTune output. One with 20 ranks, 4 threads, the other one with 20 ranks and 8 threads, just to see the difference - I don't know if this can help determine how to progress from here.

Another path I have tried is to run the code only on the host and use loopprofileviewer.sh to analyze the loop profiles. The results kinda make sense (when loopprofileviewer works; the XML output seems to get confused by the use of MPI), as the same functions in my code appear both in the vTune results and as the main time hogs in loopprofileviewer. Of course loopprofileviewer only shows my own functions, not all the system stuff.

Any more ideas on where to start tinkering with the code to get improvements ? I suspect I have exhausted the "external" possibilities (MPI ranks, threads, compiler options, ...). Is the best strategy just to start with the regions of my code showing up in vTune ?

Thanks so much for your help and insight...

Miska

Attachments: 

I'd have guessed that 8 threads per rank might run better at 14 or 15 ranks. Did you open up the micsmc view by core to see how well balanced the work is across cores (open up card 0 and click on the little icon in the upper right corner of that display)?

You do appear to have so much MPI overhead that would be the first thing to improve.  None of my applications have run well with more than 5 ranks on the coprocessor.

Here is an example of the core load. It seems relatively well balanced, but there is also obvious structure.

Ok, I will have a look at reducing the MPI overhead. The slightly surprising thing is that reducing the number of MPI ranks from 20 to 10, for example (and correspondingly increasing the number of threads), doesn't improve performance that much.

Attachment: load-mic0-20ranks-4threads.gif (75.53 KB)

Here another example, with 10 ranks, and 20 threads. The load goes higher, but the code doesn't execute any faster :-(

Attachment: load-mic0-10ranks-20threads.gif (79.04 KB)

Ideally, these charts would show the same number of threads active in each core.  You show a majority of the cores running 1 thread each, but a significant number running 2 threads.  This will give you less performance than if all cores were running either 1 or 2 threads,  unless you have managed to adjust the work per thread just right so that the single threads perform more work than an individual thread of the paired threads, but less work than the total assigned to each pair.

In a relatively efficient demonstration, we have 4 ranks each running 45 or 60 threads, all threads getting the same amount of work, with all bars in the chart close to the same height.

With VTune itself, you could examine individual threads and see how much time they are stalled with OpenMP waits on account of work imbalance.  I haven't had success with fancier tools for analyzing these data.

Ok, thanks - I will try more balanced configurations, and also a Vtune analysis looking at a specific thread.

A few more questions:

- Does the Intel MPI implementation on the Phi use busy waits ? So while the code is waiting for MPI (or OpenMP) communications to happen, does the process show as active or not ? Reading your description, it sounds like waiting doesn't show as "load".

- I am still trying to understand whether I need to concentrate on a few areas in my code (basically those lighting up in the vTune general analysis), or whether the problem is deeper than that (i.e. all MPI calls are evil and should all be replaced by threading). Is there any way to figure this out from the vTune analysis ? So is there a way to find out whether the delays are peppered throughout the code or whether a targeted intervention is sufficient ?

My problem is that currently the code runs at least 5 times faster on a (relatively fast and expensive) multi-CPU Xeon machine than on the Phi. And since I don't have many resources to spend on rewriting major parts of the code (I could still do a few tune-ups here and there, but anything major seems unrealistic), I am pondering how much effort I should still spend on trying to get the Phi to perform better (versus just sticking with the CPU / cluster approach, and possibly getting a student to explore the Phi further at some later point).

Thanks !

Miska

OpenMP has a default busy-wait loop time, after which the thread sleeps and checks less frequently for wakeup.  The usual Linux sched_yield function is used.
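
For reference, a hedged note (not necessarily something you need to change): the length of that busy wait is controlled by KMP_BLOCKTIME, for example:

# KMP_BLOCKTIME sets how long (in milliseconds) an idle OpenMP thread busy-waits before sleeping
export KMP_BLOCKTIME=0       # yield almost immediately; worth testing when MPI and OpenMP compete for cores
# export KMP_BLOCKTIME=200   # the default value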

MPI latencies are significantly higher in real time (perhaps not in clock cycles) than on the host.  In addition, multiple MPI ranks per core aren't useful.  So full performance depends on the right combination of MPI and OpenMP, as you already seemed to figure out.

Successful cases of running on host and coprocessor simultaneously by MPI tend to involve closer to equal performance between host and coprocessor than you quote.  So, one of the issues to work on is finding where your application locally doesn't show sufficient performance on the coprocessor, e.g. because of insufficient vectorization or threaded parallel scaling.  Either of those might come about from problem sizes departing from the ideal (the largest arrays which fit on the coprocessor).
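
On the vectorization side, a hedged pointer (the source file name is just a placeholder, and the option spelling may differ between compiler versions): the Composer XE 2013 compilers can report which loops were vectorized, e.g.:

# -mmic builds natively for the coprocessor; -vec-report2 lists vectorized and non-vectorized loops
icc -mmic -O3 -vec-report2 -c my_hot_function.c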

Ok, thanks. So I will try to progress on 2 fronts:

- Seeing which MPI calls take the most time. I will use vTune for this

- Seeing where I could benefit from vectorization. This seems easier, as the vTune and loopprofiler point towards the same functions...

Final (?) question: does Intel (or someone else) organize formal training courses on precisely these kinds of topics (optimizing one's code for the Phi, with of course also information on the internal workings of the card) ? I would prefer something hands-on (as opposed to webinars, training videos etc.), and am willing to travel (but having something in Germany could also be interesting).

Thanks for your time,

Miska

By the way, Intel MPI also has some busy-wait loops, with the time before sched_yield adjustable by environment variable.  If you are successful in profiling with the ITAC profiler, you will capture all these delays with the assistance of the mpi_dbg libraries.  For certain kinds of profiling outside of ITAC, it's useful to increase the delays so as to see all the delay as a busy loop.
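
A hedged sketch of what those knobs can look like (the variable names are from memory, so check the Intel MPI reference manual for your version before relying on them):

# assumed: I_MPI_SPIN_COUNT controls how long the library spins before yielding the core
export I_MPI_SPIN_COUNT=1000000
# assumed: I_MPI_WAIT_MODE=enable switches to a non-busy wait mode instead
# export I_MPI_WAIT_MODE=enable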

The training schedule is heavy on webinars, and videos prepared for them.

Check the agenda for ISC conference.

Argh ! Every time I think I have all the elements in hand, something new pops up. So, the Intel Trace Analyzer and Collector. I didn't even know of the existence of this tool before you mentioned it. Shows how much I still have to learn...

So in my application, which mixes everything (threads, MPI, Phi), what does the ITAC profiler (which it seems comes with the cluster toolkit, which I purchased) bring compared to vTune (I am now running in the 30-day trial period) ? Is the ITAC profiler the MPI-compatible version of loopprofiler.sh ? Are the three tools complementary to each other, or is there a hierarchy between them (i.e. vTune does everything, and if you don't have it, parts can be done with ITAC; or does every tool bring something specific, and then how do I best combine their information) ?

I completely under-estimated the difference between gcc / MPICH / CPU and icc / Intel MPI / Phi. The number of tools, options and possibilities provided by the Intel world is staggering, but also confusing. I have difficulty seeing how the different tools fit together (reading the manual of each of them individually is ok). But I guess I am looking for trouble with my code, which happily mixes all the possible difficulties...

I will have a look at the ISC conference agenda.

Miska

Hi Miska,

Though loopprofiler and VTune overlap, ITAC has capability (focused upon MPI) that neither loopprofiler nor VTune have.

VTune provides all the information that loopprofiler does. The argument for using loopprofiler is threefold: it comes with the compiler, is simpler to use, and is more focused on what it does than VTune is. VTune has over an order of magnitude greater capability than loopprofiler, but sometimes all that capability just gets in the way for your initial analysis.

For more information on ITAC, see http://software.intel.com/en-us/intel-trace-analyzer.

Regards
--
Taylor

Thanks for the clarification Taylor !
