Is it possible for the CPU core that offloaded the computation to MIC to work concurrently with MIC?

Hi All,

I am trying to use both CPU and MIC in parallel to compute a function.

I know about asynchronous data transfer, but I have not come across any example of asynchronous computation.

For example, is it possible to do something like this:

#pragma offload_transfer target(mic:0) wait (UU, VV) out(XX: alloc_if(0) free_if(0)) signal (XX)

{

            nm=(BASEMIC<<1);
            cilk_spawn FuncDM(nm, 0,0, 0,0 , 0,0, 0,0);

            cilk_spawn FuncDM(nm, 0,  nm, 0,0 , 0 ,nm, 0, 0 );

            FuncDM( nm, 0 +nm, 0, nm, 0 , 0, 0 , 0, 0);

            cilk_sync;

}
register int nn=(n<<1);
FuncD(nn, xi + nn , xj + nn ,ui +nn, uj, vi ,vj + nn , wi,wj);
#pragma offload_wait target(mic:0) wait (XX)

If anyone knows about this, please share with me.

Thanks in advance.

Best Regards,

Jesmin


Asynchronous computation is possible on the Intel Xeon Phi coprocessor. If you add a signal clause to the offload pragma, the pragma becomes non-blocking (asynchronous): the host thread continues past the offloaded block without waiting for it to finish. You can then synchronize later, either with a wait clause on a subsequent offload pragma or with a standalone offload_wait pragma. Please refer to the compiler reference for more information: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/GUID-EAB414FD-40C6-4054-B094-0BA70824E2A2.htm

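One more note: you cannot use the offload_transfer pragma for computation. It is a standalone pragma with no code block attached, and it is to be used strictly for data transfers and memory allocation/free. To run the FuncDM calls on the coprocessor, use an offload pragma with a signal clause instead. I hope this answers your question.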
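As a rough illustration of the signal/wait pattern described above, here is a minimal sketch. The names scale and run_both are hypothetical and are not taken from the original post; the array pointer is used as the signal tag, which is one common convention. It assumes the Intel compiler with offload support. Other compilers ignore the offload pragmas, so everything simply runs on the host in order.

```c
/* Hypothetical kernel, compiled for both the host and the coprocessor. */
#pragma offload_attribute(push, target(mic))
static void scale(double *v, int n, double f)
{
    int i;
    for (i = 0; i < n; i++)
        v[i] *= f;
}
#pragma offload_attribute(pop)

/* Start work on the coprocessor, keep the host busy, then synchronize. */
void run_both(double *mic_part, double *host_part, int n)
{
    /* signal() makes this offload non-blocking; mic_part serves as the tag. */
    #pragma offload target(mic:0) inout(mic_part : length(n)) signal(mic_part)
    {
        scale(mic_part, n, 2.0);
    }

    /* The host thread reaches this line while the offload is still running. */
    scale(host_part, n, 3.0);

    /* Block until the offload tagged with mic_part has completed. */
    #pragma offload_wait target(mic:0) wait(mic_part)
}
```

Until the offload_wait completes, the host should not read or write mic_part, since its contents are only copied back when the offload finishes.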

Thanks a lot, Sumedh. Thanks for clearing up the confusion.

Best Regards,

Jesmin

MPI is frequently used for parallel computation on MIC and host. I don't know whether this will remain a useful choice if 12-core host CPUs are released and the BIOS for those doesn't support MIC.

A suggestion I heard today is to submit the offload in an omp parallel task, running in parallel with a task running on the host.

Or consider having the host run omp parallel sections, with one section submitting the offload and the other section(s) performing work on the host.
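The sections idea could look roughly like the sketch below. The names sum_squares and split_sum are hypothetical, and the even split of the array is an arbitrary choice for illustration. Here the offload itself is an ordinary blocking one; the overlap comes from a second host thread doing its own share of the work in the other section. It assumes the Intel compiler with OpenMP and offload enabled; without them the pragmas are ignored and the two halves run one after the other on the host.

```c
/* Hypothetical kernel, compiled for both the host and the coprocessor. */
#pragma offload_attribute(push, target(mic))
static double sum_squares(double *v, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += v[i] * v[i];
    return s;
}
#pragma offload_attribute(pop)

/* First half goes to the coprocessor, second half stays on the host. */
double split_sum(double *v, int n)
{
    int half = n / 2;
    int rest = n - half;
    double *lo = v;          /* part sent to the coprocessor */
    double *hi = v + half;   /* part kept on the host */
    double mic_sum = 0.0, host_sum = 0.0;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            /* This host thread blocks here until the offload returns. */
            #pragma offload target(mic:0) in(lo : length(half)) inout(mic_sum)
            {
                mic_sum = sum_squares(lo, half);
            }
        }
        #pragma omp section
        {
            /* Meanwhile another host thread works on the remainder. */
            host_sum = sum_squares(hi, rest);
        }
    }
    return mic_sum + host_sum;
}
```

In practice the split ratio would be tuned so that the host and the coprocessor finish at about the same time.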

This is the same concept as what TimP suggests.

Here is a question I do not know the answer to; maybe TimP can answer:

When offloading via an omp task or section, is the offload to MIC required to be run from the main thread (team member 0 of the outermost nest)?
If not (i.e. any omp thread can offload to MIC), then can multiple concurrent offloads be issued?

Jim Dempsey

www.quickthreadprogramming.com

It's possible to issue multiple concurrent offloads. When you do that, the offloads will conflict in their core assignments unless you arrange for each to get a KMP_PLACE_THREADS assignment to its own group of cores. I've done this under MPI, giving each rank a separate offload KMP_PLACE_THREADS environment setting. I'm told this is written up in the Reinders book, which I haven't seen yet. If you are doing it all within OpenMP, I suppose you must use set_affinity calls inside the application. The KMP_AFFINITY machinery may not be up to the job of controlling nested OpenMP, either on the host or, as just suggested, with offload OpenMP inside host OpenMP.

The desire to do this comes up, for example, with MKL offload, where the individual MKL jobs aren't big enough to need all the cores. Unfortunately, this may mean that the ratio of offloaded computation to data transfer is unfavorable. Still, it is possible in principle, for example, to offload several DGEMM jobs in parallel which are each big enough to get significant benefit. This could also serve as a poor man's way of overlapping data transfer and computation, without the complication of assigning a smaller number of cores to carry out the data transfer and then putting a larger number of cores to work on the computation.

What TimP says is correct.
