Offload SegFault

I'm trying to run a simple offload example performing a matrix-matrix multiplication from http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Intel-Xeon-Phi.pdf, but I'm running into a segmentation fault. The code I have is

#include <stdlib.h>

int main(void)
{
   double *a, *b, *c;
   int i, j, k, ok, n = 100;

   ok = posix_memalign((void**)&a, 64, n*n*sizeof(double));
   ok = posix_memalign((void**)&b, 64, n*n*sizeof(double));
   ok = posix_memalign((void**)&c, 64, n*n*sizeof(double));

   for ( i = 0; i < n; i++)
   {
      for ( j = 0; j < n; j++)
      {
         a[i*n+j] = i + 0.1;
         b[i*n+j] = i + 0.2;
         c[i*n+j] = 0.0;
      }
   }

   #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
   {
      #pragma omp parallel for
      for( i = 0; i < n; i++ ) 
      {
         for( k = 0; k < n; k++ ) 
         {
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) 
            {
               c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
            }
         }
      }
   }
}

And I am compiling with

icc -openmp -O3 -vec-report=3 offmem.c

And OFFLOAD_REPORT=3 gives

[Offload] [HOST]  [State]   Initialize logical card 0 = physical card 0
[Offload] [HOST]  [State]   Initialize logical card 1 = physical card 1
[Offload] [HOST]  [State]   Initialize logical card 2 = physical card 2
[Offload] [HOST]  [State]   Initialize logical card 3 = physical card 3
[Offload] [MIC 0] [File]            offmem.c
[Offload] [MIC 0] [Line]            20
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [State]   Start Offload
[Offload] [HOST]  [Tag 0] [State]   Initialize function __offload_entry_offmem_c_20mainicc1301341592gVvZIa
[Offload] [HOST]  [Tag 0] [State]   Create buffer from Host memory
[Offload] [HOST]  [Tag 0] [State]   Create buffer from MIC memory
[Offload] [HOST]  [Tag 0] [State]   Create buffer from Host memory
[Offload] [HOST]  [Tag 0] [State]   Create buffer from MIC memory
[Offload] [HOST]  [Tag 0] [State]   Create buffer from Host memory
[Offload] [HOST]  [Tag 0] [State]   Create buffer from MIC memory
[Offload] [HOST]  [Tag 0] [State]   Send pointer data
[Offload] [HOST]  [Tag 0] [State]   CPU->MIC pointer data 240000
[Offload] [HOST]  [Tag 0] [State]   Gather copyin data
[Offload] [HOST]  [Tag 0] [State]   CPU->MIC copyin data 16 
[Offload] [HOST]  [Tag 0] [State]   Compute task on MIC
[Offload] [HOST]  [Tag 0] [State]   Receive pointer data
[Offload] [HOST]  [Tag 0] [State]   MIC->CPU pointer data 80000
[Offload] [MIC 0] [Tag 0] [State]   Start target function __offload_entry_offmem_c_20mainicc1301341592gVvZIa
[Offload] [MIC 0] [Tag 0] [Var]     a  IN
[Offload] [MIC 0] [Tag 0] [Var]     b  IN
[Offload] [MIC 0] [Tag 0] [Var]     c  INOUT
[Offload] [MIC 0] [Tag 0] [Var]     i  INOUT
[Offload] [MIC 0] [Tag 0] [Var]     n  INOUT
[Offload] [MIC 0] [Tag 0] [Var]     k  INOUT
[Offload] [MIC 0] [Tag 0] [Var]     j  INOUT
[Offload] [MIC 0] [Tag 0] [State]   Scatter copyin data
Segmentation fault (core dumped)

Any ideas what's going on?

Thanks in advance!


The seg-fault is related to an unaligned access to array b within the offload code. The program runs successfully when the #pragma vector aligned directive is commented out or unrolling is disabled (-unroll 0). Let me inquire with Development about what is going on.

The alignment assertion must fail on b[] for some values of k unless n is a multiple of 8.  For c[], it looks like n would need to be a multiple of 8 times the number of threads.

Note that MKL sets a far larger minimum size for offloading matrix multiplication, so as to have reasonable expectation of out-performing host.

Kevin, thank you for the suggestion! It works with the -unroll 0 option. I am interested in timing now. The overall program looks like

t_start = dtime(); 

//... doing small things here ...

#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
   {
      #pragma omp parallel for
      for( i = 0; i < n; i++ ) 
      {
         for( k = 0; k < n; k++ ) 
         {
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) 
            {
               c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
            }
         }
      }
   }

//... doing more small things here ...

t_finish = dtime();
total_time = t_finish - t_start;
if (total_time > 0.0)
{
   printf("%10.3lf\n", total_time);
}

Where dtime() is taken from Intel Xeon Phi Coprocessor High-Performance Programming by Jim Jeffers and James Reinders. Explicitly, it is

#include <sys/time.h>

static double dtime()
{
   double tseconds = 0.0;
   struct timeval mytime;
   gettimeofday(&mytime, (struct timezone *) 0);
   tseconds = (double)(mytime.tv_sec + (double)mytime.tv_usec*1.0e-6);
   return(tseconds);
}

So the interesting thing is that when I run this code, I get a runtime of 3.999. However, the offload report tells me

[Offload] [MIC 0] [File]            offmem.c
[Offload] [MIC 0] [Line]            209
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        3.980781(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   437859976 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        1.385721(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   1059976 (bytes)

So the total time (CPU Time + MIC Time) is 5.366502. What's up with the discrepancy? I'm most interested in the time it takes to move the data back and forth between CPU->MIC and MIC->CPU.

You're timing the total host wall clock time, part of which is spent running on MIC.  Since you're using OpenMP, omp_get_wtime() seems more straightforward, but ought to get you a similar result.  If you want total time spent among all threads, host + MIC, you might use clock() both on host and on MIC (just once, not once per thread), but that has no relation to finding out data transfer times.
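
As a rough sketch of that (using the same a, b, c, and n as in the code above; omp_get_wtime() only needs omp.h and returns wall-clock seconds):

#include <omp.h>

double t0 = omp_get_wtime();

#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
   /* matrix-matrix multiply loops as above */
}

double t1 = omp_get_wtime();
printf("offload region wall time: %10.3lf s\n", t1 - t0);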

-unroll 0 is probably appropriate for a benchmark which isn't big enough to be worth running on MIC, but it's not a fix for alignment problems.

Among other things, you seem to be asking the same questions discussed at

https://software.intel.com/en-us/forums/topic/345586

Probably at least 0.5 sec is spent allocating data structures, time which might not be spent again if repeating the same benchmark.

If you don't optimize number of threads and affinity, your actual MIC runtime is probably a lot larger than necessary; you might be able to get that part down to negligible compared with data allocation and transfer. 

Tim, same base problem, different issues. I allocate with

double *a = (double *)_mm_malloc(sizeof(REAL)*ELEMENTCOUNTP, 64);

where ELEMENTCOUNTP pads the element count to a multiple of the 64-byte MIC cache line with

#define PAD64 1
#if PAD64
   #define ELEMENTCOUNTP ((((ELEMENTCOUNT*sizeof(REAL))+63)/64)*(64/sizeof(REAL)))
#else
   #define ELEMENTCOUNTP ELEMENTCOUNT
#endif

and when passing the arrays into the function where the matrix-matrix multiplication takes place, I have

__assume_aligned(a, 64);

What else can I do to fix whatever alignment problems are going on?

I also only time computation, not allocation. My question now is: how do I get the time it takes to offload the data? I know how to calculate the theoretical data transfer speeds, but according to the book I mentioned, "We can only expect to achieve an effective peak on the order of 50 or 60 percent of the specified maximum transfer rate in even ideal circumstances."

 

main()
{
   double *a, *b, *c;
   int i,j,k, ok, n=100; // n=100

   ok = posix_memalign((void**)&a, 64, n*n*sizeof(double));
   ok = posix_memalign((void**)&b, 64, n*n*sizeof(double));
   ok = posix_memalign((void**)&c, 64, n*n*sizeof(double));

   for ( i = 0; i < n; i++)
   {
      for ( j = 0; j < n; j++)
      {
         a[i*n+j] = i + 0.1;
         b[i*n+j] = i + 0.2;
         c[i*n+j] = 0.0;
      }
   }

   #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
   {
      #pragma omp parallel for
      for( i = 0; i < n; i++ ) 
      {
         for( k = 0; k < n; k++ ) 
         {
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) 
            {
//*** ask yourself what happens when i=1, j=0?
               c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
// *** [1*100+0] = [100]
// *** offset from c is 800 bytes
// *** 800%64 == 32 *** not vector aligned
            }
         }
      }
   }
}

See the *** comments and you will understand what Tim is getting at.

Jim Dempsey

Note, if n were a multiple of 8, then your code would work.

(8*sizeof(double) = 64 bytes = the width of a MIC vector register)

By using "#pragma vector aligned" you made a contract with the compiler that the for(j loop would begin at vector aligned offsets for a, b, c. you violated the contract. Using multiple of 8 for n (MIC), or multiple of 2 (SSE) or 4 (AVX) would permit you to assert to the compiler that the vectors were aligned (and when using aligned allocations as you did).

Jim Dempsey
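
For illustration, one way to honor that contract without changing n itself is to pad the leading dimension of each matrix; this is only a sketch, and ldn is a new name that is not in your code:

int ldn = (n + 7) & ~7;   /* round the row length up to a multiple of 8 doubles (64 bytes) */

ok = posix_memalign((void**)&a, 64, n*ldn*sizeof(double));
ok = posix_memalign((void**)&b, 64, n*ldn*sizeof(double));
ok = posix_memalign((void**)&c, 64, n*ldn*sizeof(double));

/* the initialization loops would index with i*ldn+j as well */

#pragma offload target(mic) in(a,b:length(n*ldn)) inout(c:length(n*ldn))
{
   #pragma omp parallel for private(j,k)   /* j and k are declared outside, so make them private */
   for( i = 0; i < n; i++ )
      for( k = 0; k < n; k++ )
      {
         #pragma vector aligned   /* now every row starts on a 64-byte boundary */
         #pragma ivdep
         for( j = 0; j < n; j++ )
            c[i*ldn+j] = c[i*ldn+j] + a[i*ldn+k]*b[k*ldn+j];
      }
}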

 

So if this were a non-square problem, m x n, would m*n need to be a multiple of 8? And if so, would rounding m*n up to the nearest multiple of 8 for the allocation, while keeping the loops over the original m and n, work? (I'll check for myself in the meantime!)

Edit: answered my own question. (yes)

Now onto timing. Is it possible to get (time up to the matrix multiplication) + (time to offload a, b, c) + (MIC time) + (time to offload c back) + (time from the matrix multiplication to the finish)?
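
Something along these lines is what I have in mind, using the dtime() above and the offload_transfer pragma with persistent buffers (alloc_if/free_if); it is just a sketch, so I'm not sure it's the right approach:

double t0, t1, t2, t3;

t0 = dtime();
/* copy a, b and c to the card and keep the card-side buffers allocated */
#pragma offload_transfer target(mic:0) in(a, b, c : length(n*n) alloc_if(1) free_if(0))
t1 = dtime();   /* t1 - t0 : CPU->MIC transfer (plus card-side allocation) */

/* compute on the card, reusing the buffers that are already there */
#pragma offload target(mic:0) nocopy(a, b, c : length(n*n) alloc_if(0) free_if(0))
{
   /* matrix-matrix multiply loops as above */
}
t2 = dtime();   /* t2 - t1 : offloaded compute */

/* copy the result back and free the card-side copy of c */
#pragma offload_transfer target(mic:0) out(c : length(n*n) alloc_if(0) free_if(1))
t3 = dtime();   /* t3 - t2 : MIC->CPU transfer */
/* a and b are left allocated on the card here, which is fine for a one-shot test */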

Matt,

Also keep in mind that the first offload has the additional overhead of transferring the MIC side of the application over to the MIC, and the first offload with OMP region has the overhead of establishing the OpenMP thread pool (this may also be the first offload).

If you are looking for the best matrix-multiply performance (rather than a learning exercise), then use MKL (or another mature library). Intel has spent a large amount of effort highly tuning this library.
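
For reference, the equivalent dgemm call for the row-major n x n arrays in this thread would look something like the following sketch (link with -mkl; my understanding is that MKL's Automatic Offload can then be enabled for large problems with the MKL_MIC_ENABLE=1 environment variable, without changing the call):

#include <mkl.h>

/* c = 1.0*a*b + 1.0*c, matching c[i*n+j] += a[i*n+k]*b[k*n+j] above */
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, n, n, 1.0, a, n, b, n, 1.0, c, n);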

As a general rule of thumb, the size of the matrices to multiply, and the advantage gained, has to outweigh the cost of doing the offloads. Tim Prince, if he is reading this, would be more authoritative on this than I, but my guess is in excess of 32x32. There are other factors at play here as well. When the matrix multiply (offload) occurs within the KMP_BLOCKTIME interval of the last offload, the lower limit would be small. If your offloads are further apart in time than KMP_BLOCKTIME, then there is the additional latency of waking up the sleeping threads on the MIC, and therefore the lower limit would be larger.

Jim Dempsey

We've already seen that the time to initialize data structures when transferring to MIC may exceed the time the host would require to complete the problem, for a matrix size 100x100.  You will see in this reference

https://software.intel.com/en-us/articles/intel-math-kernel-library-inte...

that offloading matrix multiplication is advertised as a desirable capability only for 4096x4096 and larger, and it depends on using MKL to avoid excessive development effort.

As Jim hinted, you could experiment with raising the values of KMP_BLOCKTIME and MIC_KMP_BLOCKTIME to see whether these improve the performance of repeat runs. I wouldn't be surprised to see repeat runs performing better than the first run even when BLOCKTIME expiration is involved. You would likely need to optimize MIC_KMP_PLACE_THREADS for your problem size before you could draw any conclusions.
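
For example, something like this (placeholder values, not recommendations; as I understand it, setting MIC_ENV_PREFIX=MIC makes the offload runtime forward the MIC_-prefixed variables to the coprocessor with the prefix stripped):

export MIC_ENV_PREFIX=MIC
export MIC_KMP_BLOCKTIME=200
export MIC_KMP_PLACE_THREADS=60c,3t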

Given a large enough problem size, MKL can use all the thread contexts effectively at default settings.  In earlier versions of MKL, MKL would not perform as well on MIC as your own compiled code until the minimum matrix dimension exceeded 32, but that limitation should have been resolved in the 11.2 version.  It still doesn't make offloading competitive until the minimum dimension exceeds some unspecified value, probably at least 300.  There have been a few posts about how to partition the MIC so as to perform fairly well when running multiple simultaneous matrix multiplication problems.

Problem size is not an issue. I'm working with matrices about 436800000 bytes in size (54600000 double-precision elements). With MKL, I haven't seen that the MIC is all that well optimized for them, despite the claims. Compared to the CPU, the MIC runs MKL between 2 and 8 times slower (for dgemm and dsyrk, respectively).
