I have code running on the CPU, and I offload one function, which is called many times inside a nested loop, to an Intel Xeon Phi coprocessor to accelerate it. To get good performance out of the offloading, I transfer the main data structures to the Xeon Phi and allocate them there once, before the loop, and deallocate them once the loop (and with it the offloaded section) has finished. These data are not modified inside the offloaded function, so this should save the time otherwise spent copying them back and forth to the coprocessor memory.
However, performance is still bad, and I found that the cost comes from the data being implicitly transferred again each time the offloaded function starts. I therefore added a nocopy clause for these data to the offload pragma of the offloaded function, so that they would not be copied to the Xeon Phi memory again. With that change the run fails with a segmentation fault (offload error: process on the device 1 was terminated by signal 11 (SIGSEGV)).
So it seems that the Xeon Phi needs the data to be copied every time I offload the function to the coprocessor, even though the data do not change during the calculations inside the offloaded function.
Could anyone please help me with this issue?
BTW, the code is written in both C and Fortran. The first offload transfer and the allocation of the data structures happen in the C code; the Fortran code later does the calculation through the offloaded function, which is itself written in Fortran.
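To make the setup concrete, here is a rough, C-only sketch of the pattern I am describing. It is not my actual code: `table`, `sum_scaled` and `run_sketch` are hypothetical names, the kernel is a placeholder, and in the real program the offloaded call inside the loop is in Fortran rather than C. The sketch assumes the Intel compiler's offload pragmas with explicit alloc_if/free_if modifiers; a compiler without offload support ignores the pragmas and simply runs everything on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Mark the kernel for the coprocessor only when Intel offload is enabled. */
#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC
#endif

/* Hypothetical kernel: reads the table but never modifies it. */
ON_MIC static double sum_scaled(const double *table, int n, double scale)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += scale * table[i];
    return s;
}

/* Hypothetical driver: allocate once, reuse in every offload, free once. */
double run_sketch(int n, int iterations)
{
    assert(n > 0);
    double *table = malloc((size_t)n * sizeof *table);
    if (table == NULL)
        return -1.0;
    for (int i = 0; i < n; ++i)
        table[i] = 1.0;

    /* Copy the data once and keep the device allocation alive. */
    #pragma offload_transfer target(mic:0) in(table : length(n) alloc_if(1) free_if(0))

    double total = 0.0;
    for (int it = 0; it < iterations; ++it) {
        double partial = 0.0;
        /* Reuse the existing device buffer: no copy, no alloc, no free. */
        #pragma offload target(mic:0) nocopy(table : length(n) alloc_if(0) free_if(0)) in(n, it) out(partial)
        {
            partial = sum_scaled(table, n, (double)(it + 1));
        }
        total += partial;
    }

    /* Release the device allocation after the loop. */
    #pragma offload_transfer target(mic:0) nocopy(table : length(n) alloc_if(0) free_if(1))

    free(table);
    return total;
}
```

Calling `run_sketch(4, 3)` exercises the whole sequence: one transfer in, three offloads that reuse the buffer, one release. The segmentation fault appears in my real code at the step that corresponds to the nocopy offload inside the loop.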
Thanks very much.