Memory limit for "into" modifier on offload pragma

I'm trying to use the "into" modifier on an offload pragma, and appear to be running into some limit. I have put together the following test program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

//Succeeds for numBufs = 54, fails for numBufs = 55.

int main(int argc, char** argv)
{
	size_t len = 5000000;
	int** Array1 = new int*[224];
	int numBufs = 1;
	if (argc > 1)
	{
		numBufs = atoi(argv[1]);
		if (numBufs > 224) numBufs = 224;
	}
	printf("Creating %d buffers\n",numBufs);
	//Allocate the buffers on the card and keep them allocated.
	for (int i=0; i<numBufs; ++i)
	{
		Array1[i] = new int[len];
		int* a1p = Array1[i];

#pragma offload target (mic : 0)  \
	in(i) \
	in(a1p[0:len] : ALLOC)
		{
			if (i == 0)
				printf("Allocated %d %p\n",i,(void*)a1p);
		}
	}

	//Try to use the first buffer.
	int* a1Data = new int[446400];
	int* a1p = Array1[0];
	a1Data[76] = 24;

	//Copy a1Data into the card buffer already allocated for a1p.
#pragma offload target (mic : 0) \
	in(a1Data[0:446400] : REUSE into(a1p[0:446400]))
	{
		printf("Offloaded into %p\n",(void*)a1p);
		if (a1p[76] != 24)
		{
			printf("\n\nFAILED: Value after offload = %d\n\n",a1p[76]);
		}
	}
	return 0;
}


You can see from the comment that this program works for numBufs=54 but fails for numBufs=55. The pointer to the buffer is the same between the allocation and the offload, but the data isn't copied correctly. Is this a bug, or is this expected behavior? The host is running Windows.

Another thing that surprises me about this sample code is how long it takes to perform the allocating offloads. It takes ~5 seconds to allocate 55 buffers (~20 MB each). Is this reasonable?
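
For reference, the sizes involved follow directly from the test program (len = 5000000 ints per buffer). The sketch below only does that arithmetic; it assumes a 4-byte int and takes the ~5 seconds quoted above as the elapsed time. The helper names are made up for illustration and are not part of any offload API.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in one buffer of `len` ints (len = 5000000 in the test program).
std::size_t bufferBytes(std::size_t len) {
    return len * sizeof(int);
}

// Total bytes transferred when allocating `numBufs` such buffers.
std::size_t totalBytes(std::size_t len, int numBufs) {
    assert(numBufs >= 0);
    return bufferBytes(len) * static_cast<std::size_t>(numBufs);
}

// Effective rate in MB/s (1 MB = 10^6 bytes) for `bytes` moved in `seconds`.
double rateMBps(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / 1e6 / seconds;
}
```

With a 4-byte int, bufferBytes(5000000) is 20,000,000 bytes, totalBytes(5000000, 54) is 1,080,000,000 bytes, and totalBytes(5000000, 55) is 1,100,000,000 bytes. At the reported ~5 seconds, 55 buffers work out to roughly 220 MB/s for the allocating offloads as a whole.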

I reproduced the incorrect results with 55 buffers and see execution time ~5-6 seconds also. Let me discuss this w/Development and I'll post again when I know more.

The incorrect results appear to be a defect, which I logged in our internal tracking system (see the internal tracking id below); the Developers are investigating. I'm not sure whether you looked at the OFFLOAD_REPORT=3 output. From it I can see that the first card allocation takes considerably longer than subsequent allocations. There is some history of the first allocation being slower, so that may be what you are experiencing. From past internal reports about slow initial allocation, there may have been some changes to MPSS in the 3.2 time-frame to help improve this. I'll dig some more and see what else I can find about the slow allocation.
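
Independently of OFFLOAD_REPORT, one way to check whether the first allocation dominates is to time each loop iteration on the host. The following is a minimal sketch using only the standard library; timeIterations and reportFirstVsRest are hypothetical helper names, and the callable passed in stands in for whatever work one iteration of the allocation loop does.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Runs body(i) for i in [0, count) and returns each iteration's
// wall-clock time in milliseconds, measured on the host.
template <typename Body>
std::vector<double> timeIterations(int count, Body body) {
    assert(count >= 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        auto start = std::chrono::steady_clock::now();
        body(i);
        auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Prints the first iteration's time separately from the mean of the rest.
void reportFirstVsRest(const std::vector<double>& ms) {
    assert(!ms.empty());
    double rest = 0.0;
    for (std::size_t i = 1; i < ms.size(); ++i) rest += ms[i];
    std::printf("first: %.2f ms\n", ms[0]);
    if (ms.size() > 1)
        std::printf("rest (mean): %.2f ms\n", rest / static_cast<double>(ms.size() - 1));
}
```

If wrapping the offload in a callable is inconvenient, the same steady_clock calls can be placed directly before and after the offload region inside the allocation loop. A first iteration that is much slower than the mean of the rest would be consistent with the slow initial allocation described above.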

(Internal tracking id: DPD200256784)

Our offload developers are now consulting w/MPSS (COI) developers regarding the incorrect result, which appears to involve a bad copy from a host buffer into the MIC (target) buffer. Our Developers were unable to reproduce this until I found it only reproduces under MPSS 3.2 (on Windows and Linux). At least for Linux, the issue does not reproduce under the earlier MPSS 3.1.2 release. I did not have an MPSS earlier than 3.2 on Windows to check whether the same is true there, but I'm pretty sure it would not occur under an earlier MPSS on Windows either.

I opened another internal tracking report with the MPSS team (as noted below) and will let you know more as I learn it. We were unable to find a workaround other than the one you already noted, limiting the number of buffers, which may not be usable in your real application.

I am sorry I do not have better news at this time. I will keep you updated as I learn more.

(Internal tracking id: HSD # 4869195)
