This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
Scheduling many threads helps hide the latency of some operations. However, too many threads can increase the cost of offloading and degrade performance. By default, the runtime attempts to schedule as many threads as the trip count of the parallelized loop nest, but in some cases the best results are achieved with a much smaller thread count, such as 2 to 4 times the hardware thread count. The hardware thread count depends on the hardware platform. To see the number of available hardware threads, run your program with the GFX_LOG_OFFLOAD environment variable set to 1.
To monitor the real thread count, set the GFX_LOG_OFFLOAD environment variable to 2.
To monitor the real thread count and the distribution of the parallel loop nest's iteration space, set the GFX_LOG_OFFLOAD environment variable to 3.
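The three logging levels described above can be selected as in the following minimal sketch. It assumes a POSIX-style shell such as bash on Linux; the program name my_offload_app is hypothetical and is left commented out.

```shell
# Level 1: report the number of available hardware threads.
export GFX_LOG_OFFLOAD=1
# ./my_offload_app   # hypothetical program name

# Level 2: also report the real thread count.
export GFX_LOG_OFFLOAD=2
# ./my_offload_app

# Level 3: also report how the parallel loop nest's iteration space is distributed.
export GFX_LOG_OFFLOAD=3
# ./my_offload_app
```

In the Windows command prompt, the equivalent of export NAME=value is set NAME=value.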
To override the default maximum number of threads, use the GFX_MAX_THREAD_COUNT environment variable. For example, to limit the runtime to at most 32 threads, set GFX_MAX_THREAD_COUNT to 32 before you run your program.
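As a minimal sketch of the override, assuming a POSIX-style shell such as bash on Linux (the program name my_offload_app is hypothetical and is left commented out):

```shell
# Cap the number of threads the offload runtime schedules at 32.
export GFX_MAX_THREAD_COUNT=32

# Run the offloading program in the same shell so it inherits the setting.
# ./my_offload_app   # hypothetical program name
```

In the Windows command prompt, the equivalent is set GFX_MAX_THREAD_COUNT=32.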
Two values have special meaning:
-1 (the default): The runtime creates one thread for each parallel loop nest iteration. Loops can be vectorized, so one iteration executed by a thread may include several original loop iterations.
0: The thread count is N*hardware_thread_count, where N is:
    4 for one-dimensional loop nests.
    8 for loop nests of two or more dimensions.
  Try using the value 0 when tuning your program for performance.
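The rule for the value 0 can be worked through by hand. The sketch below only reproduces the arithmetic stated above; it does not query the runtime. The value of hw_threads is a hypothetical placeholder (read the real number from the GFX_LOG_OFFLOAD=1 output), and loop_dims is an illustrative variable, not a runtime setting.

```shell
# Hypothetical hardware thread count; take the real value from the
# output produced when GFX_LOG_OFFLOAD is set to 1.
hw_threads=56

# Number of dimensions in the parallel loop nest (illustrative).
loop_dims=2

# N is 4 for one-dimensional loop nests and 8 for two or more dimensions.
if [ "$loop_dims" -eq 1 ]; then
  n=4
else
  n=8
fi

# Thread count the runtime would use with GFX_MAX_THREAD_COUNT=0.
thread_count=$((n * hw_threads))
echo "$thread_count"   # prints 448 for the placeholder values above
```

With the same placeholder and a one-dimensional loop nest, the result would be 4 * 56 = 224 threads.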