Crash on second call of offload region

Hello,

I am experiencing a problem where an offload happens within a function, and the function crashes only if I call it twice. The code is this:

std::cout<<"BeforeOffload1.0,pInBuf="<<(std::size_t)(pInBuf)<<",len="<<lInBufLen<<std::endl;
#pragma offload target(mic:0) in(pInBuf:length(lInBufLen)) in(lInBufLen) out(lOutBufLen) 
{
std::cout<<"inOffload1.0"<<std::endl;
...
}

The first call of the function works very nicely, but the second call (with identical parameters) miserably crashes between the #pragma and the std::cout in the offload region, as shown in the log below. The question is: What can I do to further debug what is going on? Is there some environment variable that gives more detail on what is going on? Can I debug core dumps on the MIC? Other ideas?

Thanks for the help,

Georg

BeforeOffload1.0,pInBuf=139974661923544,len=407
[Offload] [MIC 0] [File] xxx.cpp
[Offload] [MIC 0] [Line] 404
[Offload] [MIC 0] [Tag] Tag0
inOffload1.0

...

[Offload] [MIC 0] [CPU Time] 0.000000 (seconds)
[Offload] [MIC 0] [CPU->MIC Data] 415 (bytes)
[Offload] [MIC 0] [MIC Time] 13.240497 (seconds)
[Offload] [MIC 0] [MIC->CPU Data] 8 (bytes)

... (Starting second call....)

BeforeOffload1.0,pInBuf=139973789507864,len=407
[Offload] [MIC 0] [File] xxx.cpp
[Offload] [MIC 0] [Line] 404
[Offload] [MIC 0] [Tag] Tag2
offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

With OFFLOAD_REPORT=3 (only crashing call):

BeforeOffload1.0,pInBuf=139925269953304,len=407
[Offload] [MIC 0] [File] xxx.cpp
[Offload] [MIC 0] [Line] 404
[Offload] [MIC 0] [Tag] Tag2
[Offload] [HOST] [Tag 2] [State] Start Offload
[Offload] [HOST] [Tag 2] [State] Initialize function __offload_entry_xxx
[Offload] [HOST] [Tag 2] [State] Create buffer from Host memory
[Offload] [HOST] [Tag 2] [State] Create buffer from MIC memory
offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

...

I was unable to reproduce the error. Could you post a reproducer? This is the code I tried:

#pragma offload_attribute(push,target(mic))
#include <iostream>
#pragma offload_attribute(pop)
using namespace std;
__attribute__((target(mic))) float *pInBuf;
void func(void)
{
    int lInBufLen=(4*1024*1024);
    int lOutBufLen=(4*1024*1024);
    std::cout<<"BeforeOffload1.0,pInBuf="<<(std::size_t)(pInBuf)<<",len="<<lInBufLen<<std::endl;
    #pragma offload target(mic:0) in(pInBuf:length(lInBufLen)) in(lInBufLen) out(lOutBufLen)
    {
        std::cout<<"inOffload1.0"<<std::endl;
    }
    cout<<"Out";
}
int main()
{
    pInBuf=(float*)_mm_malloc(4*1024*1024*sizeof(float),4096);
    func();func();
    _mm_free(pInBuf);
    return 0;
}

So far I have been unable to create a short reproducer :-( That's why I asked whether there are additional ways to get information about what is going on. Apparently the problem happens after "[Offload] [HOST] [Tag 2] [State] Create buffer from MIC memory", which is somewhere inside the magic done by the MIC runtime system.

Exactly the same code runs fine when I just remove the "pragma offload", so there is some reason to believe that there is not a major flaw in my part of the software (but frankly it is still possible). It is just that I am out of ideas on how to debug this.

Georg

Let me ask around and get back to you. :) 

Sumedh,

thanks for taking care of this. Here is another observation that kind of puzzles me:

  • micsmc reports 263 MBytes memory consumption for an idle card.
  • I am starting my program with OFFLOAD_INIT="on_start", so the first offload does not spend time loading code etc. When the program is started, micsmc reports 569 MBytes memory consumption. Since I have not yet called any offload code, this seems to be downloaded code.
  • While the program is actually executing, memory consumption increases to several GBytes, as expected.
  • After running the offload code but before terminating the host-side program, micsmc reports 269 MBytes again. It appears the downloaded code has been released. If this is the case, it is no surprise if a second offload fails.

I have no idea what is going on here...

Georg

Hi Georg,

We (Sumedh and I) are working to root-cause the failure. Some questions have come up:

  • Can you confirm that the C++ std includes are inside a push/pop target(mic) pragma?
  • Can you confirm you are using compiler version 13.1.0.146 (icpc -V)?
  • What is your version of MPSS?
  • In the output snippets shown, there is a “Tag 0” and a “Tag 2” suggesting there is another offload not accounted for that would be “Tag 1”. Is there another offload between the two calls to the function?  Can you share any details about it?
  • Since you mentioned it in your last post, have you tried the default by not setting OFFLOAD_INIT?

Thank you in advance.

 

Kevin,

  • the relevant files are compiled with compiler option -offload-attribute-target=mic, which should be equivalent to push/pop for the whole file.
  • I am using icc 13.1.0.146 Build 20130121
  • Using micinfo, the MPSS version is MPSS Version 2.1.4982-15. I did not upgrade to the version released in early March because we are currently in the process of doing benchmarks.
  • Yes, there are 2 offload regions. It is my mechanism for returning a variable length result. Attempts to produce a similar problem with a small code snippet have failed, so I guess the general code is ok. Below is a code snippet that should help you understand the structure of the code.
  • Unsetting OFFLOAD_INIT does not make a difference, and also results in a crash on the second call.
  • BTW: Does the fact that the second call reports "Tag2" mean something to you? After all, it is really the same region as the one with tag0.

Code snippet below.

Georg

Host code calls SendComputeToMic(). There, parameters are serialized and passed to the offload computation, which is done in ReceiveComputeOnMic(). When ReceiveComputeOnMic() returns, the length of the result is returned from the offload region, a suitable buffer is allocated, and the result is transferred to this buffer as part of the second offload. This two-step mechanism is necessary because we don't know the length of the result when doing the first offload. The whole file is compiled with -offload-attribute-target=mic.

namespace {
  /// On MIC: unpack data sent from SendComputeToMic() via pInBuf, simulate, encode results in pOutBuf, and return length of pOutBuf.
  ///
  /// Ownership of the malloc'ed pOutBuf goes to caller.
  std::size_t ReceiveComputeOnMic(const char * const pInBuf_p, std::size_t lInBufLen_p, char * &rpBuf_p)
  { .... // do computation, which sets lOutBufLen and pBuffer. pBuffer is deallocated when method is exited (it is part of a std::string
    // that has been created within the method), and therefore is transferred to rpBuf_p
    rpBuf_p=static_cast<char *>(malloc(lOutBufLen));
    memcpy(rpBuf_p,pBuffer,lOutBufLen);
    return lOutBufLen;
  }
  /// output data on offload side
  __attribute__((target(mic))) static char *pOffloadBufOut=NULL;
  void SetPtr(char * ptr_p)
  {
    pOffloadBufOut=ptr_p;
  }
  char *GetPtr()
  {
    return pOffloadBufOut;
  }
  /// On host: Transfer of Simulator on MIC side. Compute is run on MIC side via offload of ReceiveComputeOnMic()
  result_t
  SendComputeToMic( input_t input_p)
  {
    ... // serialize input_p into pBuffIn, with length lInBufLength
    ...
    __attribute__((target(mic))) const char *pInBuf=poSofStream->Get_Buf_Ptr(); // buffer for input data
    __attribute__((target(mic))) std::size_t lInBufLen=poSofStream->Get_Buf_Size(); // length of input data

    __attribute__((target(mic))) std::size_t lOutBufLen=0; // length of output data
    __attribute__((target(mic))) char *pBufOut=NULL; // output data host side
    // we need two offload regions here because we only know the length of the result after running. The offload
    // pragmas require knowledge of data length before running offloaded code.
    // same code also works if we disable actual offload, which is helpful for debugging.
    ICC_PRAGMA_OFFLOAD(offload target(mic:0) in(pInBuf:length(lInBufLen)) in(lInBufLen) out(lOutBufLen) )
    {
      SetPtr(NULL);
      lOutBufLen=ReceiveComputeOnMic(pInBuf,lInBufLen,ptr);
      SetPtr(ptr);
    }
    // copy back result data, and free on MIC
    pBufOut=static_cast<char *>(malloc(lOutBufLen));
    ICC_PRAGMA_OFFLOAD(offload target(mic:0) in(lOutBufLen) out(pBufOut:length(lOutBufLen)))
    {
      memcpy(pBufOut,GetPtr(),lOutBufLen);
      std::cout<<"inOffload2-3"<<std::endl;
      free(GetPtr());
    }
    // unpack results
    ...
    return res;
  }
}
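
To make the two-step hand-off easier to follow, here is a stripped-down sketch of the same pattern that also builds without offload support. It is only an illustration under assumptions: the definition of ICC_PRAGMA_OFFLOAD shown here is a guess (the real macro is not part of the snippet above), and Compute(), g_devBuf and TwoStepTransfer() are hypothetical stand-ins for ReceiveComputeOnMic(), pOffloadBufOut and SendComputeToMic(). The "computation" simply duplicates its input so that the result length is not known to the caller in advance.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Assumed definition (not shown in the thread): emit the pragma only when the
// Intel compiler has offload enabled, otherwise run the block on the host.
#if defined(__INTEL_OFFLOAD)
#define ICC_PRAGMA_OFFLOAD(...) _Pragma(#__VA_ARGS__)
#else
#define ICC_PRAGMA_OFFLOAD(...)
#endif

namespace {
// Stand-in for pOffloadBufOut: keeps the result alive between the two regions.
// In a real offload build this needs the target(mic) attribute, for example
// via -offload-attribute-target=mic.
char *g_devBuf = nullptr;

// Hypothetical placeholder for ReceiveComputeOnMic(): duplicates the input.
// Ownership of the malloc'ed buffer passes to the caller.
std::size_t Compute(const char *in, std::size_t inLen, char *&out)
{
    std::string result(in, inLen);
    result += result;
    out = static_cast<char *>(std::malloc(result.size()));
    std::memcpy(out, result.data(), result.size());
    return result.size();
}
} // namespace

// Assumes a non-empty input, so that the result length is greater than zero.
std::string TwoStepTransfer(const std::string &input)
{
    const char *pInBuf = input.data();
    std::size_t lInBufLen = input.size();
    std::size_t lOutBufLen = 0;

    // Step 1: compute on the target, return only the result length.
    ICC_PRAGMA_OFFLOAD(offload target(mic:0) in(pInBuf:length(lInBufLen)) in(lInBufLen) out(lOutBufLen))
    {
        char *ptr = nullptr; // local to the region, so it is never transferred
        lOutBufLen = Compute(pInBuf, lInBufLen, ptr);
        g_devBuf = ptr;
    }

    // Step 2: the length is known now, so the host buffer can be sized.
    char *pBufOut = static_cast<char *>(std::malloc(lOutBufLen));
    ICC_PRAGMA_OFFLOAD(offload target(mic:0) in(lOutBufLen) out(pBufOut:length(lOutBufLen)))
    {
        std::memcpy(pBufOut, g_devBuf, lOutBufLen);
        std::free(g_devBuf);
        g_devBuf = nullptr;
    }

    std::string result(pBufOut, lOutBufLen);
    std::free(pBufOut);
    return result;
}
```

Calling TwoStepTransfer() twice in a row mirrors the failing scenario: the second call goes through both regions again and relies on the target-side pointer having been freed and reset by the first call.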

Thank you for the additional details. Compiler/MPSS versions used are fine. We just wanted to be in sync. Let us consider the new details and we’ll update again soon.

Each offload is assigned a sequential numeric “Tag” (starting with 0) in the OFFLOAD_REPORT, so “Tag 2” meant the offload region shown in your original post was the third offload that executed (on the second call to the function) which meant an additional offload (i.e. “Tag 1”) executed before the failure in the “Tag 2” offload.  Knowing the sequencing we’re pondering what conditions on the card might be different between offloads “Tag 0” (which succeeds) and “Tag 2” (which fails) that could contribute to the failure.
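
As a rough illustration of that numbering (a toy model only, not the offload runtime itself; TagsForCalls is a made-up helper), a single counter that increases with every executed offload gives the sequence seen in the report when the function contains two offload regions and is called twice:

```cpp
#include <string>
#include <vector>

// Toy model: one global counter, incremented once per executed offload.
std::vector<std::string> TagsForCalls(int calls, int offloadsPerCall)
{
    std::vector<std::string> log;
    int tag = 0;
    for (int call = 1; call <= calls; ++call) {
        for (int region = 1; region <= offloadsPerCall; ++region) {
            log.push_back("call " + std::to_string(call) +
                          ", region " + std::to_string(region) +
                          ": Tag " + std::to_string(tag));
            ++tag;
        }
    }
    return log;
}
```

With two calls and two regions per call, the first region of the second call ends up as the third entry and carries Tag 2, which matches the failing offload in the report.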

Kevin,

did you consider the behaviour outlined in my post dated Tue, 03/19/2013 - 09:04? Looking at the memory consumption, the Xeon Phi card seems to be empty after the first function call is completed.

Georg

Yes, we are still considering it.

Development replied that they cannot tell where “ptr” (used in line 39) is declared and that it becomes an implicit in/out for the offload construct at line 36. They asked if you can compile with -opt-report-phase=offload and provide that output and run using an option I can communicate privately.

36 ICC_PRAGMA_OFFLOAD(offload target(mic:0) in(pInBuf:length(lInBufLen)) in(lInBufLen) out(lOutBufLen) )
37  {
38 SetPtr(NULL);
39 lOutBufLen=ReceiveComputeOnMic(pInBuf,lInBufLen,ptr);
40  SetPtr(ptr);
41 }

Adding to my last post, the Developers also would like to know how ptr is assigned a value on line 39.

Quote:

Kevin Davis (Intel) wrote:

Development replied that they cannot tell where “ptr” (used in line 39) is declared and that it becomes an implicit in/out for the offload construct at line 36. They asked if you can compile with -opt-report-phase=offload and provide that output and run using an option I can communicate privately....

Below is the output of compiling the file where the offload is done with -opt-report-phase=offload.

Georg

xxx.cpp(403-403):OFFLOAD:SendComputeToMic: Offload to target MIC <expr>
Data sent from host to target
pInBuf, pointer to string
lInBufLen, scalar size 8 bytes
Data received by host from target
lOutBufLen, scalar size 8 bytes

xxx.cpp(423-423):OFFLOAD:SendComputeToMic: Offload to target MIC <expr>
Data sent from host to target
lOutBufLen, scalar size 8 bytes
Data received by host from target
pBufOut, pointer to string

xxx.cpp(403-403):OFFLOAD:SendComputeToMic: Outlined offload region
Data received by target from host
pInBuf, pointer to string
lInBufLen, scalar size 8 bytes
Data sent from target to host
lOutBufLen, scalar size 8 bytes

xxx.cpp(423-423):OFFLOAD:SendComputeToMic: Outlined offload region
Data received by target from host
lOutBufLen, scalar size 8 bytes
Data sent from target to host
pBufOut, pointer to string

Quote:

Kevin Davis (Intel) wrote:

Development replied that they cannot tell where “ptr” (used in line 39) is declared and that it becomes an implicit in/out for the offload construct at line 36. ...

The missing declaration of ptr is a copy/paste error. In the real code, it is declared in line 37b (within the offload region) as "char *ptr;". It should therefore be local to the offload block. It is set as part of the call to ReceiveComputeOnMic(pInBuf,lInBufLen,ptr) in line 9.

Georg
