Attached is a plot of execution time on an Intel Phi for a varying number of threads. The same program was run in both native and offload modes.
The Phi device has 60 cores.
1) Why don't the timing steps occur at multiples of the number of cores (i.e., multiples of 60 threads)?
2) Why does the time drop substantially around 248 threads and then increase again (i.e., at more than 4 x 60 = 240 threads)?
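
To make the expectation behind both questions concrete, here is a minimal sketch of the arithmetic. It assumes 4 hardware threads per core (which is what the "4x60" figure implies) and that steps would track whole multiples of the core count. The function names are hypothetical and only illustrate the counting; they say nothing about how the runtime actually places threads.

```python
# Hypothetical sketch: where would timing steps fall if they tracked the core count?
CORES = 60             # from the question
THREADS_PER_CORE = 4   # assumption: 4 hardware threads per core, implied by "4x60"


def expected_step_points(cores=CORES, threads_per_core=THREADS_PER_CORE):
    """Thread counts that are whole multiples of the core count."""
    return [cores * k for k in range(1, threads_per_core + 1)]


def threads_beyond_hardware(n_threads, cores=CORES, threads_per_core=THREADS_PER_CORE):
    """Number of threads exceeding the hardware thread contexts (0 if none)."""
    return max(0, n_threads - cores * threads_per_core)


if __name__ == "__main__":
    print(expected_step_points())        # [60, 120, 180, 240]
    print(threads_beyond_hardware(248))  # 8
```

Under these assumptions the steps would be expected at 60, 120, 180 and 240 threads, and a run with 248 threads would have 8 more threads than hardware thread contexts, which is why the drop near 248 looks surprising.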