difference between omp target and native mode

Now that OpenMP 4 is fairly well supported on the Fortran side (except for simd reduction), I've been able to set up an example which can run in host, MIC native, and offload mode, simply by changing compile options.

I've arranged the benchmark to minimize the share of time spent on data transfers between host and coprocessor by running the test loop thousands of times between transfers, yet the offload performance doesn't approach MIC native performance.  A small part of the problem is that the offload mode peaks at 59 threads (about double the performance of the default number of threads), while the native mode shows gains up to 177 threads.

Another small part of the problem is that compiler directives surrounded by #ifdef __MIC__ are used only for -mmic compilation.  I would use them also for offload target mode if I knew the incantation.  I mean directives like !dir$ novector, for cases where vectorization is slow due to ineffective software prefetch, and !dir$ unroll(0).

According to the vecanalysis tool

http://software.intel.com/en-us/articles/vecanalysis-python-script-for-a...

not only are the conditional directives not used for omp target mode, in every test some vector operations which are reported as "lightweight" in the native mode are reported as "medium" for omp target, and some "medium" are promoted to "heavyweight."  Examination of the .s files doesn't show any difference to account for the difference in reports.  The MIC .s files are difficult to read as there appears to be no way to suppress a debug symbol showing prior to each instruction.

Also, the native mode compilation vecanalysis reports no peeled vectorized loops and several vectorized remainder loops, opposite to the omp target mode.  That's another problem which should be only minor.

My C++ version gives similar performance to the Fortran in MIC native mode, but isn't sufficiently stable in omp target mode.  The old problem of reporting buffer overlaps when transferring explicitly more than 64MB remains, among others.  The only suggestion I've received about that is that the current MPSS may not be supporting the earlier KNC coprocessors (apparently all current production models have more than 4GB RAM).  It's strange that the problem is solved for ifort but not for icpc.

There's also a remaining conflict between target map and target update when both are needed in the same application; this was fixed in ifort.  gcc seems to be copying the lack of support for target update.  I filed a bug report with gcc about omp simd reduction being accepted (even where icc rejects it) but disabling the optimization which occurred without the directive; the report was confirmed.

Attachment: lcd_omp4.tgz (15.19 KB)

Hi Tim,

I am still reviewing all the details in your post and in the attachment and will comment on that again soon. For now I just wanted to share that the __MIC__ predefine appears active for the offload target mode compilation also from what I see from using that in smaller offload target examples.

export MIC_OMP_SCHEDULE=auto

makes a slight improvement in offload mode.  This is the only case I have seen where auto is best.  Perhaps it's not surprising that offload needs a different schedule as well as a smaller number of threads than MIC native.

I found Intel's description of OMP_SCHEDULE=auto a little confusing, where it says the schedule is to be determined by the compiler.  This originally made me wonder if the same effect is available by schedule(runtime), as the standard would seem to require, and that does appear to be the case.  What auto actually does seems to be a secret; usually it doesn't act much differently from guided.

I was wondering if offload mode restricts memory or stack available to the application, and whether there is a way to adjust stack (for example, under MPI we must run a script including ulimit -s unlimited for each rank).  I noticed the documentation about MIC_STACKSIZE; I'm not certain the results are repeatable, but optimum may be around 90M if it's doing anything.

export MIC_OMP_SCHEDULE=auto

makes a slight improvement in offload mode.  This is the only case I have seen where auto is best.  Perhaps it's not surprising that offload needs a different schedule as well as a smaller number of threads than MIC native.

Unless you are using a schedule(runtime) clause in your code, OMP_SCHEDULE will have no effect, so if you're seeing one, it's the result of something else (like random variation).

The offload code automatically sets the thread affinity mask on the card so that when the OpenMP runtime starts it will automagically avoid the core on which the offload daemon is running. So, if you don't force OMP_NUM_THREADS or KMP_PLACE_THREADS you'll get the right number of threads to fill the available hardware resources. Of course, if you do force them and don't realize that you should avoid the last core then you'll have oversubscription and, probably, bad performance. (This is classic OpenMP behaviour. It's a sharp tool with which it's easy to cut yourself. If you tell the runtime to do something it will, even when you're asking it to do something which is normally stupid :-)).

p.s. If you look in the runtime source (available from openmprtl.org) you can see that "auto" uses the "guided analytic chunked" schedule.

Thanks for the helpful comment.  I take it that schedule(auto) is referred to internally at Intel as guided-analytical, resembling guided but with the chunk size calculated by some algorithm depending on number of threads and number of remaining iterations.  Apparently, "monotonous" is being used with the intended meaning monotonic, as one of the posters on the OpenMP site guessed over 2 years ago.

The attached source code sets schedule(runtime) for the examples which have triangular matrices, where maximum work per outer loop iteration is twice the average.  The alternative of calculating explicitly which outer loop iterations to assign to each thread doesn't work as well on MIC as on dual CPU Xeon host (or even a recent single CPU laptop), presumably due to better tolerance of non-NUMA placement on MIC and worse overhead for simple-minded calculation of thread chunk limits using sqrt().  Default static scheduling could at most double the run time over some optimum.

I've already mentioned that MIC_KMP_PLACE_THREADS allows for double the performance of the default.  The main objections I've seen are that documentation isn't easily found and that it's not supported on host; maybe also that it's complicated to use with MPI (although simpler than the proposal in the Jeffers, Reinders book).  Offload mode does reserve one core for MPSS etc. but the visual in micsmc-gui is clearer than with native mode when specialization is achieved with even division of application work across most cores with system activity on the others.

The openmp source code side-tracked me into wondering whether it would work with my gfortran and gcc builds. There are interesting complaints about not accepting gnu perl, compilers, or ld, and not allowing Intel compilers with the location,link option to specify the Intel-compatible linker.

Apparently, "monotonous" is being used with the intended meaning monotonic, as one of the posters on the OpenMP site guessed over 2 years ago.

It may just be reflecting the reality that the OpenMP runtime is pretty monotonous :-)

For operations where the loop iteration costs scale as n (your triangular case may be like this), it's worth also trying a "schedule(static,1)". *If* it doesn't break the cache behaviour completely (and you may be able to fix that by using a suitable chunk size instead of 1), it should improve the work allocation since it uses a blocked-cyclic allocation.

Consider 

#pragma omp parallel for
for (int i=0; i<100; i++)
    for (int j=0; j<i; j++)
        /* ... do something ... */;

running on ten threads.

With the default schedule(static), thread zero will get iterations 0..9 and thread nine iterations 90..99, so thread zero does a total of 45 pieces of work, whereas thread nine does 945. With a schedule(static,1), though, thread zero gets iterations 0,10,20,...,90, and thread nine gets 9,19,29,...,99. So thread zero executes 450 pieces of work, and thread nine 540.

Still not perfect, but much better, and it's still a static, cheap to calculate scheduling scheme.

I've already mentioned that MIC_KMP_PLACE_THREADS allows for double the performance of the default.  

That makes little sense to me. All KMP_PLACE_THREADS does is allow you to restrict which hardware threads you are able to use, making it much easier to get 1, 2, 3 or 4 threads/core in subsets of the machine. But that's all it does. It doesn't affect the other affinity choices ("scatter", "compact"). So, unless something else (like forced over-subscription) is going on, I don't see how it can have any effect beyond making it easier to get the distribution you wanted. (So, if you were to force a distribution by hand [yeuch], it will have the same performance as one created using KMP_PLACE_THREADS.)

The main objections I've seen are that documentation isn't easily found 

I think that can only come from someone with little Google-fu. Assuming you know what you're looking for, Google homes in on it pretty fast. (It suggests kmp_place_threads as the search I want when I type kmp_place.) The top hits are then all relevant.

The openmp source code side-tracked me into wondering whether it would work with my gfortran and gcc builds. There are interesting complaints about not accepting gnu perl, compilers, or ld, and not allowing Intel compilers with the location,link option to specify the Intel-compatible linker.

We test building for gcc, and it should all "just work". (Not using gcc for KNC, though).

static,3 is slightly better than auto.  Trying this on mic offload hadn't occurred to me.

Cache locality certainly is a factor in the ability to improve performance up to 177 threads in MIC native mode, as well as in the use of explicit programmed scheduling for multi-cpu host.  I've assumed this is also a factor in why Cilk(tm) Plus on MIC doesn't give useful performance scaling beyond 59 workers in comparisons with OpenMP which does scale further.

Certainly, KMP_PLACE_THREADS is useful primarily as a shortcut way to improve on defaults.  In the case of 3 threads/core , which is optimum for MIC native on this and several other applications (or in the case of reserving groups of cores for each host process, covered by Jeffers & Reinders), it's much easier than listing the required hardware threads.  I'm certainly in favor of its use, but I've been getting the impression others aren't.

Certainly, KMP_PLACE_THREADS is useful primarily as a shortcut way to improve on defaults.  In the case of 3 threads/core , which is optimum for MIC native on this and several other applications (or in the case of reserving groups of cores for each host process, covered by Jeffers & Reinders), it's much easier than listing the required hardware threads.  I'm certainly in favor of its use, but I've been getting the impression others aren't.

Since it was my idea, I like it. I had just seen too many codes that were explicitly forcing exactly the worst possible affinity on KNC (forcing the OpenMP serial thread onto logical CPU zero...) so having some simpler way of getting sensible logical CPU allocations seemed essential. 
