Xeon Phi Segmentation Fault Simple Offload

I have this simple matrix multiply for offload on the Phi, but I get an offload error (SIGSEGV) when I run the program below:

#include <stdlib.h>
#include <math.h>

void main()
{
    double *a, *b, *c; 
    int i,j,k, ok, n=100;

    // allocate memory on the heap, aligned to a 64 byte boundary
    ok = posix_memalign((void**)&a, 64, n*n*sizeof(double)); 
    ok |= posix_memalign((void**)&b, 64, n*n*sizeof(double)); 
    ok |= posix_memalign((void**)&c, 64, n*n*sizeof(double));

    // initialize matrices 
    for(i=0; i<n; i++)
    {
        a[i] = (int) rand();
        b[i] = (int) rand();
        c[i] = 0.0;
    }
    
    //offload code 
    #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n)) 
    
    //parallelize via OpenMP on MIC 
    #pragma omp parallel for 
    for( i = 0; i < n; i++ ) 
        for( k = 0; k < n; k++ ) 
            #pragma vector aligned 
            #pragma ivdep 
            for( j = 0; j < n; j++ ) 
                //c[i][j] = c[i][j] + a[i][k]*b[k][j]; 
                    c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

}

What am I doing wrong?

I read in a previous post that there might be a known bug in this release. Could that be the cause?

Here is the program output:

[Offload] [MIC 0] [File]            matmul_offload.cpp
[Offload] [MIC 0] [Line]            19
[Offload] [MIC 0] [Tag]             Tag 0
offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

 >> c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];

n = 100
When i=1 and j=0 (the start of the inner loop), c[i*n+j] is not 64-byte aligned, contrary to what you have declared with #pragma vector aligned. Do not make false declarations.
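
A quick check of the row offsets shows why (a minimal illustration, assuming 8-byte doubles):

[code]
/* Row i of a 100x100 double matrix starts i*n*sizeof(double) bytes past
   the 64-byte-aligned base returned by posix_memalign. */
#include <stdio.h>

int main()
{
    int n = 100, i;
    for (i = 0; i < 4; i++)
        printf("row %d: byte offset %4d, offset %% 64 = %2d\n",
               i, i*n*8, (i*n*8) % 64);
    /* row 1: offset 800, and 800 % 64 = 32, so it is NOT 64-byte aligned */
    return 0;
}
[/code]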

Jim Dempsey

www.quickthreadprogramming.com

If you use "#pragma vector aligned" on Xeon Phi then, in addition to using an aligned allocator, you have to pad the inner loop dimension (in your case, "n") to a multiple of 8 elements in double precision or a multiple of 16 in single precision, so that every row starts on a 64-byte boundary. Otherwise, as Jim Dempsey explained above, your declaration becomes false for some values of i > 0.

Thanks Andrey - it's my first offload code for Xeon Phi. I usually compile code for native runs.

Could you kindly give me an example, or point me to a resource?

Much thanks

Dave

Best Reply

Hi Dave,

in order to fix your code, you can do something like the example below.

A nice paper about it is http://software.intel.com/en-us/articles/data-alignment-to-assist-vector...  . A comprehensive resource with practical examples that addresses vectorization, data alignment and optimization on Xeon Phi in general is http://www.colfax-intl.com/nd/xeonphi/book.aspx . Of course, asking me about resources is like asking Ronald McDonald to point out a good burger place in town.

Andrey

 

[code]
#include <stdlib.h>
#include <math.h>

void main()
{
    double *a, *b, *c;
    int i,j,k, ok, n=100;
    // pad the row length up to the next multiple of 8 doubles (64 bytes)
    int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );

    // allocate memory on the heap, aligned to a 64 byte boundary
    ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
    ok |= posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));

    // initialize matrices (fill the whole padded arrays, not just the first n elements)
    for(i=0; i<n*nPadded; i++)
    {
        a[i] = (int) rand();
        b[i] = (int) rand();
        c[i] = 0.0;
    }

    //offload code
    #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))

    //parallelize via OpenMP on MIC
    #pragma omp parallel for
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ )
                //c[i][j] = c[i][j] + a[i][k]*b[k][j];
                c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
}
[/code]
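
For reference, the offload version is built as a normal host binary (no -mmic); the compiler generates the coprocessor side from the offload pragmas. Something along these lines should work, assuming the Intel compiler environment is sourced (the OpenMP flag is -openmp or -qopenmp depending on the compiler version):

[code]
# build the offload version on the host (file name assumed to be matmul_offload.cpp)
icc -O3 -qopenmp matmul_offload.cpp -o matmul_offload

# optional: print per-offload details (data transfers, MIC time) at run time
export OFFLOAD_REPORT=2
./matmul_offload
[/code]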

I am using that code to find out whether the Xeon Phi gives better performance than the Xeon alone. I commented out the #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded)), #pragma vector aligned and #pragma ivdep directives to run on the Xeon only, and uncommented them to run on the Xeon Phi, but the Xeon-only performance is still better than the Xeon Phi. To compile I use icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul.mic -mmic for the Xeon Phi and icc -O3 -qopenmp matrixmatrix_mul.c -o matrixmatrix_mul for the Xeon only. Could you please help me with a simple example where, using parallelization and vectorization, the Xeon Phi performance is better than the Xeon only?
