Parallel offload and memory retention in OpenMP 4.0

Hi,
I have two questions:

1) Is there any way to have memory retention with #pragma omp target device(0), similar to #pragma offload target(mic:0) in(p: length(size) align(align) alloc_if(i==0) free_if(i==reps-1))?
2) Why is parallel offload without memory retention serialized?

Let's take a look at the following code:

#include <stdio.h>
#include <stdlib.h>     // atoi
#include <omp.h>
#include <mm_malloc.h>  // _mm_malloc / _mm_free

#if TEST==1
#define OFFLOAD offload_transfer target(mic:m) in(p: length(size) align(align) alloc_if(i==0) free_if(i==reps-1))
#elif TEST==2
#define OFFLOAD offload_transfer target(mic:m) in(p: length(size) align(align))
#elif TEST==3
#define OFFLOAD omp target device(m) map(to:p[0:size])
#endif

int main(int argc, char** argv){

    int mics = 1;
    if (argc>1) mics = atoi(argv[1]);

    int reps = 3;
    if (argc>2) reps = atoi(argv[2]);

    int align = 64;
    if (argc>3) align = atoi(argv[3]);

    for (size_t size = 1L; size < 1L<<34; size *= 2){
        char * data[mics];
        for(int m=0; m<mics; m++) 
            data[m]=(char*) _mm_malloc(size, align);
        for (int i = 0; i<reps ; i++){
            {
                for(int m=0; m<mics; m++) 
                    data[m][0:size] = i;  // Cilk Plus array notation (Intel compiler)
                double time = 0.0;
                double bw = 1.0;
#pragma omp parallel for reduction(+:bw) reduction(max:time)
                for(int m = 0; m < mics; m++){
                    char * p = data[m];
                    const double t1 = omp_get_wtime();
#pragma OFFLOAD
                            { }
                    time = omp_get_wtime() - t1;
                    bw = (size / time)/ (1L<<20);
                }
                printf("out: %6d\t%6d\t%12ld\t%9.6f\t%9.3f\n", 
                        mics, i, size, time, bw);
            }
        }
        for(int m = 0; m < mics; m++)
            _mm_free(data[m]);
    }
}

I'm getting the following result for serial offload to one Intel Xeon Phi coprocessor:


The first offload with both #pragma offload and #pragma omp target is very slow, with a maximum bandwidth close to only 0.5 GB/s. This can be improved by using huge 2MB pages: export MIC_USE_2MB_BUFFERS=0, which increases the bandwidth to 1.2 GB/s (see the next plot).

Regular offload pragmas allocate and deallocate memory on the Xeon Phi coprocessor on each offload call. The maximum bandwidth for the second and all later offload calls is 2.2 GB/s.

By using memory retention we can reach the PCIe v2 bandwidth limit, which is ~6.3 GB/s.

Using 2MB pages we can improve initial offload of new data for both #pragma offload and #pragma omp target:

But I think the most interesting part is parallel offload. Let's take a look at how offload bandwidth scales when we use 1, 2, 3, and 4 Intel Xeon Phi coprocessors. Data transfers should run in parallel, because we are using up to 4 OpenMP threads on the host system.

This result is a little bit confusing. Regular parallel #pragma offload and #pragma omp target without memory retention showed almost the same bandwidth as serial offload to one device. And only offload with memory retention scales linearly with the number of devices.

Using a modified version of the Intel PCM tool (https://software.intel.com/en-us/articles/intel-performance-counter-monitor) and also monitoring the memory of the Intel Xeon Phi coprocessors, I've been able to plot a timeline of memory utilization and data transfer over the PCIe buses on Socket 0 and Socket 1 of the host system. 8 GB of data is transferred to each of 4 Intel Xeon Phi coprocessors 5 times. We compare regular parallel offload with #pragma offload and #pragma omp target (top) against parallel offload with memory retention (bottom):

It is obvious from this timeline plot that for regular parallel #pragma offload and #pragma omp target, the data transfer to each Xeon Phi coprocessor is serialized. Only when we use memory retention with the alloc_if(0)/free_if(0) clauses do we observe simultaneous data transfer to each device. Therefore, I'm wondering if it is possible to have memory retention with the OpenMP 4.0 target pragma. And why are we observing this serialization at all? Maybe there are implementation limits in the offload runtime library? Or maybe it's a bug? Can anyone explain what is going on behind the scenes? Any help will be appreciated. Thank you!


Attachments:
- offload_1.cpp (text/x-c++src, 3.17 KB)
- fig1.png (116.65 KB)
- fig2_3.png (119.27 KB)
- fig3a.png (181.37 KB)
- timeline2MB.png (135.56 KB)
Best Reply

OpenMP currently supports memory retention within a lexically structured scope: omp target data creates the region, and omp target update performs the transfers. OpenMP is currently working on adding unstructured data creation. See the example below.

Regarding the serial behavior: earlier MPSS used a mutex lock for programming DMA, and this has been addressed in newer MPSS.

If you have additional questions you can contact me at ravi.narayanaswamy@intel.com


   for (size_t size = 1L; size < 1L<<34; size *= 2){
        char * p = (char*) _mm_malloc(size, align);
#pragma omp target data device(0) map(tofrom:p[0:size])
        {
            for (int i = 0; i<reps ; i++){
                double time = 0.0;
                double bw = 1.0;
                const double t1 = omp_get_wtime();
#pragma omp target update to(p[0:size])
#pragma omp target update from(p[0:size])
                time = omp_get_wtime() - t1;
                bw = (size / time)/ (1L<<20);
                printf("out: %6d\t%6d\t%12ld\t%9.6f\t%9.3f\n",
                        mics, i, size, time, bw);
            }
        }
        _mm_free(p);  // free the host buffer only after the target data region ends
    }
}
 

Thank you very much Ravi, your code works perfectly. I've modified it slightly for the parallel offload:

#include <stdio.h>
#include <stdlib.h>     // atoi
#include <omp.h>
#include <mm_malloc.h>  // _mm_malloc / _mm_free


int main(int argc, char** argv){

  int mics = 1;
  if (argc>1) mics = atoi(argv[1]);

  int reps = 3;
  if (argc>2) reps = atoi(argv[2]);

  int align = 64;
  if (argc>3) align = atoi(argv[3]);

  omp_set_num_threads(mics);
  //for (size_t size = 1L; size < 1L<<34; size *= 2){
  for (size_t size = 1024L; size < 1L<<34; size *= 1L<<23){
#pragma omp parallel
    {
      const int m = omp_get_thread_num();
      char * p;
      p =(char*) _mm_malloc(size, align);
#pragma omp target data device(m) map(to:p[0:size])
      {
        for (int i = 0; i<reps ; i++){
          {
            double time = 0.0;
            double bw = 1.0;
            p[0:size] = i;
#pragma omp barrier
            const double t1 = omp_get_wtime();
#pragma omp target update device(m) to(p[0:size])
#pragma omp barrier
            time = omp_get_wtime() - t1;
            bw = (size*mics / time)/ (1L<<20);
#pragma omp master
            printf("out: %6d\t%6d\t%12ld\t%9.6f\t%9.3f\n",
                mics, i, size, time, bw);
          }
        }
      }
      _mm_free(p);  // free the host buffer only after the target data region ends
    }
  }
}

This code shows 25 GB/s for the parallel offload:

[vadim@c001-n003 ~]# MIC_USE_2MB_BUFFERS=0 ./a.out 4 5
out:      4	     0	        1024	 0.000691	    5.654
out:      4	     1	        1024	 0.000587	    6.655
out:      4	     2	        1024	 0.000629	    6.211
out:      4	     3	        1024	 0.000586	    6.666
out:      4	     4	        1024	 0.000575	    6.793
out:      4	     0	  8589934592	 1.277172	25656.683
out:      4	     1	  8589934592	 1.277156	25657.009
out:      4	     2	  8589934592	 1.277130	25657.531
out:      4	     3	  8589934592	 1.277096	25658.211
out:      4	     4	  8589934592	 1.277349	25653.130

[vadim@c001-n003 ~]# micinfo | grep MPSS
		MPSS Version		: 3.3

Based on the OFFLOAD_REPORT=3 output, data is transferred to the Xeon Phi coprocessors at line 23 as well. And this transfer is still serialized, as is the memory allocation on the MICs, even though I have the latest MPSS 3.3 installed on my machine.

 

Attachments:
- target_0.png (74.13 KB)

Can you change the map(to to map(alloc to see if allocation is the problem.

Also I am trying to confirm which version of MPSS has the mutex lock fix.

#pragma omp target data device(m) map(alloc:p[0:size])

Yep, with just the alloc clause there's no data transfer, only allocation of the space, as indicated by OFFLOAD_REPORT=2. But this allocation is still serial.

[Offload] [MIC 0] [File]            openmp4.cpp
[Offload] [MIC 0] [Line]            24
[Offload] [MIC 0] [Tag]             Tag 28
[Offload] [MIC 1] [File]            openmp4.cpp
[Offload] [MIC 1] [Line]            24
[Offload] [MIC 1] [Tag]             Tag 29
[Offload] [MIC 2] [File]            openmp4.cpp
[Offload] [MIC 2] [Line]            24
[Offload] [MIC 2] [Tag]             Tag 30
[Offload] [MIC 3] [File]            openmp4.cpp
[Offload] [MIC 3] [Line]            24
[Offload] [MIC 3] [Tag]             Tag 31
[Offload] [HOST]  [Tag 30] [CPU Time]        33.293318(seconds)
[Offload] [MIC 2] [Tag 30] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 2] [Tag 30] [MIC Time]        0.000037(seconds)
[Offload] [MIC 2] [Tag 30] [MIC->CPU Data]   8 (bytes)

[Offload] [HOST]  [Tag 28] [CPU Time]        33.293800(seconds)
[Offload] [MIC 0] [Tag 28] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 28] [MIC Time]        0.000036(seconds)
[Offload] [MIC 0] [Tag 28] [MIC->CPU Data]   8 (bytes)

[Offload] [HOST]  [Tag 31] [CPU Time]        33.293359(seconds)
[Offload] [MIC 3] [Tag 31] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 3] [Tag 31] [MIC Time]        0.000037(seconds)
[Offload] [MIC 3] [Tag 31] [MIC->CPU Data]   8 (bytes)

[Offload] [HOST]  [Tag 29] [CPU Time]        33.293559(seconds)
[Offload] [MIC 1] [Tag 29] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 1] [Tag 29] [MIC Time]        0.000032(seconds)
[Offload] [MIC 1] [Tag 29] [MIC->CPU Data]   8 (bytes)

[Offload] [MIC 2] [File]            openmp4.cpp
[Offload] [MIC 2] [Line]            33
[Offload] [MIC 2] [Tag]             Tag 32
[Offload] [MIC 3] [File]            openmp4.cpp
[Offload] [MIC 3] [Line]            33
[Offload] [MIC 3] [Tag]             Tag 34
[Offload] [MIC 1] [File]            openmp4.cpp
[Offload] [MIC 1] [Line]            33
[Offload] [MIC 1] [Tag]             Tag 33
[Offload] [MIC 0] [File]            openmp4.cpp
[Offload] [MIC 0] [Line]            33
[Offload] [MIC 0] [Tag]             Tag 35
[Offload] [HOST]  [Tag 35] [CPU Time]        1.249994(seconds)
[Offload] [MIC 0] [Tag 35] [CPU->MIC Data]   8589934600 (bytes)
[Offload] [MIC 0] [Tag 35] [MIC Time]        0.000033(seconds)
[Offload] [MIC 0] [Tag 35] [MIC->CPU Data]   8 (bytes)

[Offload] [HOST]  [Tag 32] [CPU Time]        1.253751(seconds)
[Offload] [MIC 2] [Tag 32] [CPU->MIC Data]   8589934600 (bytes)
[Offload] [MIC 2] [Tag 32] [MIC Time]        0.000036(seconds)
[Offload] [MIC 2] [Tag 32] [MIC->CPU Data]   8 (bytes)

[Offload] [HOST]  [Tag 34] [CPU Time]        1.275117(seconds)
[Offload] [MIC 3] [Tag 34] [CPU->MIC Data]   8589934600 (bytes)
[Offload] [MIC 3] [Tag 34] [MIC Time]        0.000035(seconds)
[Offload] [MIC 3] [Tag 34] [MIC->CPU Data]   8 (bytes)

[Offload] [HOST]  [Tag 33] [CPU Time]        1.276195(seconds)
[Offload] [MIC 1] [Tag 33] [CPU->MIC Data]   8589934600 (bytes)
[Offload] [MIC 1] [Tag 33] [MIC Time]        0.000034(seconds)
[Offload] [MIC 1] [Tag 33] [MIC->CPU Data]   8 (bytes)

out:      4	     0	  8589934592	 1.276205	25676.124

Attachments:
- target_0.png (71.69 KB)

Vadim,

Have you tried adding the signal(n) specifier?

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

Have you tried adding the signal(n) specifier?

Hi Jim,

Correct me if I'm wrong: there's no signal(n) clause for #pragma omp target (https://software.intel.com/en-us/node/524532)

If we're talking about regular asynchronous offload with #pragma offload target(mic:m) signal(&data[m]), it can be done from a single thread on the host. BTW, this topic will be covered in our next white paper (research.colfaxinternational.com). Basically, it looks like asynchronous offload from a single thread uses only one memory controller on the host, and the bandwidth limit of this parallel asynchronous offload (to multiple cards) is equal to the bandwidth of one memory controller, close to 9-10 GB/s.

Hi Vadim,

This seems to indicate otherwise:

https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-F66EEDA2-2FB9-4952-A8FC-E997F92DDF0A.htm

!?!...

Ah, I see now, it is "omp target" not "offload target"

In reading the link you provided, I have to ask myself:

Why the heck is there a "#pragma omp ..." when, for all intents and purposes, it is a single-thread context?

To my "weird" way of looking at things this is equivalent to:

#pragma omp critical
{
#pragma offload target...
{
...
}
}

I see no benefit to "#pragma omp target ..." if all the offload clauses are not supported.

This said, the "omp target" is not intended to be used only with Intel MIC devices.

Jim Dempsey

Oops

The pseudo code snip should have used

#pragma omp critical(HiddenGlobalNameForAllOmpOffloads_possibly_one_each_target)

Jim

You have a parallel loop and an offload inside the parallel loop with each thread offloading to a different card.

The default behavior of the offload library is that each thread on the host tries to initialize all the cards, not just the card it offloads to. To initialize the cards we use a separate lock for each device, so the 1st thread takes the lock to initialize device 0 while all other threads wait for it. When the 1st thread releases that lock and takes the lock for device 1, the next thread gets the lock for device 0, sees the card is already initialized, and then waits to initialize device 1, since the 1st thread is holding that lock. This pattern repeats for all host threads and devices.

Change the default behavior by setting the environment variable OFFLOAD_INIT=on_offload. Each host thread will then initialize only the device it is actually using; since your code has each host thread using a different device, they will all run in parallel.
