I'm testing my code on MIC using the pragma offload + OpenMP model, and I have two questions so far.
1. When we use OFFLOAD_REPORT for timing, is it true that the data transfer time can be derived by subtracting "MIC time" from "CPU time"? In other words, does "CPU time" in this context refer to the total time of the offloaded code as seen from the host, including data transfer?
2. Some SIMT accelerators have high global memory access latency, and accesses have to be coalesced to prevent over-fetch. Do these two problems exist on MIC, and what should I pay attention to for efficient global data access? Specifically, my code is embarrassingly parallel but not vectorized (threads execute totally different instructions most of the time), and its global memory accesses follow a random, scattered pattern. Is this likely to cause a performance penalty?
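To make the access pattern in question 2 concrete, here is a minimal sketch in C. It is not my real code: `scattered_sum`, `table`, and `idx` are made-up names for illustration, and I left out the offload pragma so that it builds with any OpenMP compiler. In the real code a loop like this would sit inside the offloaded region.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: every iteration is independent, but each one reads
 * table[] at an unpredictable index, so consecutive iterations touch
 * unrelated memory locations. */
double scattered_sum(const double *table, size_t table_len,
                     const size_t *idx, long n)
{
    double total = 0.0;
    long i;

    /* Ignored when OpenMP is disabled; the loop then simply runs serially. */
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i) {
        assert(idx[i] < table_len);   /* guard against out-of-range indices */
        total += table[idx[i]];       /* random, scattered read */
    }
    return total;
}
```

For example, with `table = {1, 2, 3, 4}` and `idx = {3, 0, 3, 1}` the reads jump around the array instead of walking through it in order, which is the behaviour I'm asking about.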
Thanks for any clarification!