Host shared memory?

Hi,

One application I have requires the solution of a complex PDE that uses a significant amount of memory (up to 1 GB per process).  On standard MPI clusters, the large memory requirement is not a problem, as nodes typically have many GB of memory, so I can run as many 1 GB processes as there are cores on an HPC cluster.  However, the Intel 5110P Phi card, which functions as a single MPI node, only has ~8 GB of memory, so I can't run 60 processes at 1 GB each.

Q: Can the host share memory that the Phi card can access?  (And I'd like to do this using Fortran.)  I've seen various compiler directives and pragmas that suggest some form of sharing is possible, but I have yet to see what I need (I've been busy getting the Phi card up and running, which it is now).

BTW, I see no way to search this forum; I would think this question has already been answered.

Thanks!

-joe

The high ratio of hardware threads to RAM is among the reasons for using MPI_THREAD_FUNNELED mode (OpenMP threads inside MPI ranks), preferably with no more than 4 or 5 ranks on the coprocessor.  As server CPUs grow to 12 cores or more, even with relatively large RAM, the same thing happens: the best efficiency comes from running more threads than MPI ranks, at least when applications can be written to share data in cache.

The simulated shared memory schemes offered for the C offload model aren't extended to Fortran, in part due to the better efficiency of MPI.

Among the ways to search software.intel.com is to use any of the usual search engines with: your_search_string site:software.intel.com

Thanks Tim!

I was thinking about adding some OpenMP directives that could run on the coprocessor, but I didn't know specifically about MPI_THREAD_FUNNELED, so I'll look into that.  Thanks.

Ya, I could do the site: search, but my focus is purely on MIC right now (software.intel.com is rather large), and the other forums hosted here support forum-specific searches.  No big deal, though; I'm finding more than enough to read :)

Some influential customers insist on being able to run in MPI_THREAD_FUNNELED mode (e.g. OpenMP inside MPI processes) without making the corresponding change from MPI_Init to MPI_Init_thread.  In my experience this has worked with Intel MPI and other widely used MPI implementations, although our MPI expert colleagues point out that the standard doesn't guarantee it.  Either way, funneled mode is an effective way to use the coprocessor, whether or not you make the MPI_Init_thread change.
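For reference, a minimal Fortran sketch of the hybrid setup described above (the program name and output are made up for illustration):

    program hybrid_funneled
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, provided, rank, nranks, tid

      ! Request FUNNELED: only the master thread will make MPI calls.
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      !$omp parallel private(tid)
      tid = omp_get_thread_num()
      print '(4(a,i0))', 'rank ', rank, ' of ', nranks, &
            ', thread ', tid, ' of ', omp_get_num_threads()
      !$omp end parallel

      call MPI_Finalize(ierr)
    end program hybrid_funneled

With FUNNELED, the MPI communication calls stay on the master thread (outside parallel regions or inside !$omp master), while the compute loops use OpenMP.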

Unlike other MPI implementations, Intel MPI sets affinity by default for OpenMP under MPI, so you need only set OMP_NUM_THREADS (counted per MPI process) and the number of processes.  OMP_NUM_THREADS evidently needs to be larger on the coprocessor than on the host.
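To make that concrete (the binary name and counts here are made up, and the exact options depend on your Intel MPI setup), a native run of 4 ranks with 30 threads each on the card might be launched with something like

    export I_MPI_MIC=enable
    mpirun -host mic0 -np 4 -env OMP_NUM_THREADS 30 ./solver.mic

which comes to 120 threads, i.e. 2 threads per core on the 60-core 5110P.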

Thanks again Tim!  

Since there are four hardware threads per core on the coprocessor, and 60 cores on the 5110P, should one shoot for ~240 threads per card?  When I ran a simple test using MPI, I did not see very good scaling with 240 processes compared to just 60.  I get the impression that filling the cores matters more than using the extra threads per core, which gave no advantage with MPI, at least in the way I implemented it.  I suspect I will just need to test the particulars of my implementation.

MIC logical threads (4 per core) aren't entirely analogous to host hyperthreads.  Many applications which run best on the host with HT disabled can scale to 3 or 4 threads per core on MIC, provided you don't run into cache capacity problems.  If you have private arrays, stack and L2 cache usage can grow significantly with additional threads, putting a premium on tiling for better L2 locality.  You do get the automatic advantage of more copies of L1 by using more threads.  So it is important to try several choices of OMP_NUM_THREADS and use the micsmc core/thread view to help visualize whether the work is balanced.
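As a generic sketch of the kind of tiling meant here (the routine, array names, and tile size are placeholders, not from this thread):

    subroutine jacobi_tiled(u, unew, nx, ny)
      ! Strip-mine the outer loop so each thread works on a block of rows
      ! at a time, keeping its working set closer to L2 size.
      implicit none
      integer, intent(in) :: nx, ny
      double precision, intent(in)    :: u(nx, ny)
      double precision, intent(inout) :: unew(nx, ny)
      integer, parameter :: tile = 64   ! tuning parameter, not a recommendation
      integer :: i, j, jj

      !$omp parallel do private(i, j, jj) schedule(dynamic)
      do jj = 2, ny - 1, tile
         do j = jj, min(jj + tile - 1, ny - 1)
            do i = 2, nx - 1
               unew(i, j) = 0.25d0 * (u(i-1, j) + u(i+1, j) &
                                    + u(i, j-1) + u(i, j+1))
            end do
         end do
      end do
      !$omp end parallel do
    end subroutine jacobi_tiled

The private arrays mentioned above would add per-thread stack on top of this, which is why OMP_NUM_THREADS and tile sizes usually need to be tuned together.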

In my tests on a 61-core KNC, cases with 2 processes of 120 threads, 3x80, 4x45 or 4x60, and 5x36 have worked well.  I don't know why increasing the number of MPI processes causes performance with 4 threads per core to drop off.  Even with the same application and different data sets, the best number of threads varies.

At least 2 threads per core are required to get 90% use of the floating-point units on MIC.  Note the compiler option -opt-threads-per-core=3, which is likely to be needed for good performance across a range of threads-per-core choices.  Since that option was added, the default is 4, which maximizes the ratio of 4-thread performance to 3-thread performance.
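For example, a hypothetical native build command (the source and output names are made up; flag spellings follow the compilers current when this thread was written):

    ifort -mmic -openmp -opt-threads-per-core=3 -o solver.mic solver.f90

where -mmic produces a native coprocessor binary and -openmp enables the OpenMP directives.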

As core 0 is likely to get significant additional load from MPI and the OS (and performance profilers), you will need to try choices of OMP_NUM_THREADS and -np which don't use up all the cores.  Some applications may run best using 56 cores, for example; this would be a reason for the existence of the 57- and 61-core versions of KNC.  KMP_AFFINITY=balanced (or scatter) may help spread threads evenly across the cores assigned to each process, so that you can evaluate the performance of 2 or 3 threads per core.
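To make the arithmetic concrete (the split below is only an example): on the 60-core 5110P, reserving 4 cores for the OS, MPI, and profilers leaves 56 cores, so

    56 cores / 4 MPI ranks        = 14 cores per rank
    14 cores x 2 threads per core = 28 threads  ->  OMP_NUM_THREADS=28

with KMP_AFFINITY=balanced spreading each rank's 28 threads over its 14 cores.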
