Phi asynchronous offload from host openMP parallel region -II


This question is, in a way, a continuation of a question asked before (https://software.intel.com/en-us/forums/topic/509845). In my code, I need to use offload in a slightly more involved fashion. The main body of my code is a sequential for loop with a variable workload distribution over its iterations; hence, only when the workload is large enough does it offload part of the computation to the MIC. The output of the computation is not fetched back to the host CPU immediately, but postponed until an iteration where it is absolutely essential to have a certain portion of the output on the CPU. While the whole code is very long, I am providing a small program that simulates what I'm trying to achieve.

#include <stdio.h>
#include <stdlib.h>   /* rand(), RAND_MAX */
#include <assert.h>
#include <math.h>
#include <omp.h>
#define ONCE alloc_if(1) free_if(1)
#define N 100
#define M 1000

int main_works()
{
    printf("Offload testing 102\n");
    double A[N],B[N],C[N],D[N];
    
    for (int i = 0; i < N; ++i)
    {
        A[i] = i;
        B[i] = i*i;
        C[i] = 0.0;
        D[i] = 0.0;   /* D is read-modified below, so initialize it */
    }
    
    int last_offload =-1;

    for (int k = 0; k < M; ++k)
    {
        /*offload for even numbers : Just a random condition for offload*/
        int offload_condition = (k%2==0);

        if (offload_condition)
        {
            /*wait for any previous offload to finish before starting new offload*/
            if (last_offload!=-1)
            {    
                #pragma offload_wait target(mic:1) wait(last_offload)    
                
                last_offload =-1;
            }

            printf("Offloading in \t%d th iteration \n",k );
            last_offload =k;

            #pragma offload target(mic:1) signal(last_offload) in(k) in(A[0:N] :ONCE) in(B[0:N] : ONCE) inout(C[0:N] : ONCE)
            {
                for (int i = 0; i < N; ++i)
                {
                    C[i] += (A[i]+B[i]);
                }
            }
            
        }

        int fetch_condition = ( (double) rand()/(double) RAND_MAX >0.5);
                
        if (fetch_condition)
        {
            printf("Fetching in  \t%d th iteration\n",k);
            if (last_offload!=-1)
            {
                
                #pragma offload_wait target(mic:1) wait(last_offload)
            }

            
            last_offload =-1;    

        }
        
        #pragma omp parallel default(shared)
        {

            // adding some dummy openMP work here
            #pragma omp for schedule(dynamic) nowait
            for (int i = 0; i < N; ++i)
            {
                D[i] -= A[i]*A[i]-1;
            }

            #pragma omp for schedule(dynamic) 
            for (int i = 0; i < N; ++i)
            {
                D[i] -= A[i]*A[i]-1;
            }


            
        }

    }

    if (last_offload!=-1)
    {
        #pragma offload_wait target(mic:1) wait(last_offload)    

        last_offload =-1;
        
    }    


    for (int i = 0; i < N; ++i)
    {

        if( fabs(C[i] - (A[i]*(A[i]+1)* (double)( M/2 )) ) > 0.001 ) 
        {
            printf(" i %d C[i] %lf A[i] %lf A[i]*(A[i]+1) %lf, M/2 %d   \n",i,C[i],A[i], A[i]*(A[i]+1), M/2 );
            // exit(0);
        }

        
    }
    printf("Returned successfully\n");
    return 0;
}

int main()
{
    printf("Offload testing 102\n");
    double A[N],B[N],C[N],D[N];
    
    for (int i = 0; i < N; ++i)
    {
        A[i] = i;
        B[i] = i*i;
        C[i] = 0.0;
        D[i] = 0.0;   /* D is read-modified below, so initialize it */
    }
    
    
    int s1 = 1;

    int last_offload =-1;

    for (int k = 0; k < M; ++k)
    {
        /*offload for even numbers : Just a random condition 
        to simulate  my code*/
        int offload_condition = (k%2==0);

        if (offload_condition)
        {
            /*wait for any previous offload to finish before starting new */
            if (last_offload!=-1)
            {    
                #pragma omp parallel default(shared)
                {
                    #pragma omp master 
                    {
                        #pragma offload_wait target(mic:1) wait(last_offload)    
                    }
                }
                last_offload =-1;
            }

            printf("Offloading in \t%d th iteration \n",k );
            last_offload =k;

            #pragma omp parallel default(shared)
            {
                #pragma omp master
                {
                    #pragma offload target(mic:1) signal(last_offload) in(k) in(A[0:N] :ONCE) in(B[0:N] : ONCE) inout(C[0:N] : ONCE)
                    {
                        for (int i = 0; i < N; ++i)
                        {
                            C[i] += (A[i]+B[i]);
                        }
                    }
                }
            }
        }

        
        #pragma omp parallel default(shared)
        {

            #pragma omp master
            {
                /*on some random iterations get back C*/
                int fetch_condition = ( (double) rand()/(double) RAND_MAX >0.5);
                
                if (fetch_condition)
                {
                    printf("Fetching in  \t%d th iteration\n",k);
                    if (last_offload!=-1)
                    {
                        #pragma offload_wait target(mic:1) wait(last_offload)
                    }
                    last_offload =-1;    
                }

            }

            // adding some dummy work
            #pragma omp for schedule(dynamic) nowait
            for (int i = 0; i < N; ++i)
            {
                D[i] -= A[i]*A[i]-1;
            }

            #pragma omp for schedule(dynamic) 
            for (int i = 0; i < N; ++i)
            {
                D[i] -= A[i]*A[i]-1;
            }
    
        }

    }

    if (last_offload!=-1)
    {
        #pragma offload_wait target(mic:1) wait(last_offload)    

        last_offload =-1;
        
    }    


    for (int i = 0; i < N; ++i)
    {

        if( fabs(C[i] - (A[i]*(A[i]+1)* (double)( M/2 )) ) > 0.001 ) 
        {
            printf(" i %d C[i] %lf A[i] %lf A[i]*(A[i]+1) %lf, M/2 %d   \n",i,C[i],A[i], A[i]*(A[i]+1), M/2 );
            // exit(0);
        }

        
    }
    printf("Returned successfully\n");
    return 0;
}

There are two functions, main_works and main. The first, main_works, is a working reference that demonstrates what I'm trying to achieve.

Let's first look at the main_works() function.

for (int k = 0; k < M; ++k)

is the main sequential for loop in my code. On certain iterations it offloads the computation:

int offload_condition = (k%2==0);

When offload_condition is 1, the code first waits for any previous offload to finish, then sets the last_offload variable to the present iteration and offloads the computation to the MIC. The last_offload variable keeps track of whether (and when) a computation was offloaded to the MIC. The intention behind using last_offload is to maintain the maximum amount of asynchronicity between the CPU and the MIC.

On some iterations, it fetches data back from the MIC to the host CPU (simulated here as follows):

int fetch_condition = ( (double) rand()/(double) RAND_MAX >0.5);

The above code works as expected. However, I intend to make all the offload calls from an OpenMP parallel region.

The problem arises in the main() function, where the offloads are done from OpenMP parallel regions. While the code given here doesn't illustrate why it is important to offload from a parallel region, the idea is that while the master thread is busy offloading, sending data, and waiting for output from the MIC, the rest of the threads can stay busy in a work-sharing construct such as an OpenMP for loop or do other useful work. In this case I get the following error:

offload error: device 1 does not have a pending signal for wait((nil))
Aborted (core dumped)

In the previous question, the suggested solution was to use "#pragma omp master" for the MIC offloads. However, for some reason that trick doesn't work here. I would like to understand why this happens and how I can resolve the issue.

It is also quite possible that there is a better way to do what I am trying to do with "last_offload"; if so, I would like to know how it can be done differently while still achieving this asynchronicity between the CPU and the MIC.

Sincerely,

Piyush 


Try using:

signal(last_offload + &last_offload)

and

wait(last_offload + &last_offload)

The error message says "signal for wait((nil))" and not "signal for wait((0))", implying that the runtime may expect the address of an arbitrary, valid pointer.

Have you been able to determine whether there are reserved values for the "unique integer value" used by signal/wait?

Jim Dempsey

www.quickthreadprogramming.com

Setting ``last_offload + &last_offload'' as the signal didn't work for me. Did it work for you? I compile it using

icc -O0 -o offload_test -openmp -std=c99 offload_test.c

I have not checked whether there are any reserved signal values; it isn't mentioned anywhere in the documentation (is it?).

Can you add shared(last_offload) to all the omp parallel constructs in which last_offload is set or used? The compiler is treating last_offload as private, which we will fix. It is also best to use the address of last_offload in signal/wait.

Thanks. It worked for the given example.

To get my own code running, I had to make all the variables appearing in the offload pragma (but not in the MIC function body) explicitly shared, including the signal and the lengths of the arrays.

Many Thanks.

The issue Ravi noted is being tracked under the internal tracking id noted below. We will keep the post updated regarding the availability of the fix in a future release.
(Internal tracking id: DPD200255992)
(Resolution Update on 09/11/2014): This defect is fixed in the Intel® Composer XE 2013 SP1 Update 4 release (2013.0.4.211 - Linux)  -AND- the Intel® Parallel Studio XE 2015 Initial Release (2015.0.090 - Linux)

Ravi, Kevin,

The failing main was using default(shared); are you saying the variables need to be explicitly declared shared?

Jim Dempsey


Ravi can correct/revise, but no, that's not what I understand we are saying. The defect is that the signal was not treated as shared, as it should have been given that main used default(shared). Instead, it was treated as private.

The associated defect discussed in this post is fixed in the Intel® Composer XE 2013 SP1 Update 4 release (2013.0.4.211 - Linux)  -AND- the Intel® Parallel Studio XE 2015 Initial Release (2015.0.090 - Linux). Both are available from the Intel Registration Center.

Great. Thanks.
