This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
Intel® Graphics Technology uses multiple SIMD execution units, each capable of running several threads simultaneously by sharing part of its resources. Running many simultaneous threads on each execution unit helps hide the latency of some operations. Significantly increasing the number of threads adds to the cost of offloading code, but as a rule, a reasonably large number of threads delivers the best results.
The maximum number of target threads used to parallelize nests of _Cilk_for loops is controlled by the GFX_MAX_THREAD_COUNT environment variable, which must be set before the heterogeneous application starts. The default value is -1, which means the runtime determines the maximum thread count automatically. The actual thread count for a particular offload execution can be lower than the maximum; the offload runtime chooses it based on the actual iteration space of that execution.
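For logging or diagnostics, a host application may want to report which limit is in effect. The following is a minimal sketch in standard C, not part of the offload runtime: the helper names parse_max_thread_count and read_gfx_max_thread_count are hypothetical, and treating empty, non-numeric, or non-positive values as -1 (automatic) is an assumption made here for illustration only.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: interpret the text of GFX_MAX_THREAD_COUNT.
 * Returns a positive thread limit, or -1 for "let the runtime decide".
 * Assumption: anything missing, malformed, or < 1 is treated as -1. */
long parse_max_thread_count(const char *text)
{
    if (text == NULL || *text == '\0')
        return -1;                      /* variable not set or empty */

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    if (errno != 0 || *end != '\0' || value < 1)
        return -1;                      /* out of range, trailing junk, or non-positive */

    return value;
}

/* Hypothetical helper: read the variable from the process environment. */
long read_gfx_max_thread_count(void)
{
    return parse_max_thread_count(getenv("GFX_MAX_THREAD_COUNT"));
}
```

Calling read_gfx_max_thread_count() only reports what the environment holds. Changing the variable from inside the running application is not a substitute for setting it before the application starts.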
The iteration space of the outermost _Cilk_for loop may be too small to fully exploit the parallelism of Intel® Graphics Technology, especially if this loop is also vectorized. In other words, the outermost loop of the offloaded loop nest may have fewer iterations than the number of target threads that delivers the best performance. Collapsing a loop nest into a single loop by hand in the source code can be inconvenient, so the compiler tries to collapse perfectly nested _Cilk_for loops under the offload pragma and parallelize the larger combined iteration space.
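To illustrate what collapsing means, here is a minimal sketch in plain C. It uses ordinary for loops rather than _Cilk_for and the offload pragma so that it compiles with any C compiler. The function names, array sizes, and the scaling operation are assumptions chosen for illustration, and the code shows the manual transformation, not the compiler's internal one.

```c
#include <assert.h>

enum { ROWS = 4, COLS = 256 };

/* Nested form: parallelizing only the outer loop exposes just
 * ROWS (4) iterations, far fewer than a GPU can run as threads. */
void scale_nested(float out[ROWS][COLS], float in[ROWS][COLS], float factor)
{
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            out[i][j] = in[i][j] * factor;
}

/* Collapsed form: a single loop over ROWS * COLS (1024) iterations.
 * The original indices are recovered by division and remainder. */
void scale_collapsed(float out[ROWS][COLS], float in[ROWS][COLS], float factor)
{
    for (int k = 0; k < ROWS * COLS; ++k) {
        int i = k / COLS;   /* outer index */
        int j = k % COLS;   /* inner index */
        assert(i < ROWS && j < COLS);
        out[i][j] = in[i][j] * factor;
    }
}
```

Both functions compute the same result. The collapsed version presents one iteration space of ROWS * COLS elements, which is the larger space the text above refers to, at the cost of the index arithmetic that makes hand-collapsing inconvenient.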
The compiler also generates a host version of the _Cilk_for loop, but that version is not parallelized. This can significantly reduce performance when offload execution does not happen and the host version runs instead.