Programming Guide

Control Binary Execution on Multiple CPU Cores

Environment Variables

The following environment variables control the placement of DPC++ threads on multiple CPU cores during program execution.
DPCPP_CPU_CU_AFFINITY
Set thread affinity to CPU cores. The values and their meanings are as follows:
  • close - threads are pinned successively to the available CPU cores.
  • spread - threads are spread across the available cores.
  • master - threads are placed on the same cores as the master thread. If DPCPP_CPU_CU_AFFINITY is set, the master thread is pinned as well; otherwise, the master thread is not pinned.
This environment variable is similar to the OMP_PROC_BIND variable used by OpenMP.
Default: Not set
DPCPP_CPU_SCHEDULE
Specify the algorithm used for scheduling work-groups. DPC++ currently uses TBB for scheduling, and the value selects the partitioner used by the TBB scheduler. The values and their meanings are as follows:
  • dynamic - TBB auto_partitioner. It performs sufficient splitting to balance load.
  • affinity - TBB affinity_partitioner. It improves cache affinity over auto_partitioner through its choice of mapping subranges to worker threads.
  • static - TBB static_partitioner. It distributes range iterations among worker threads as uniformly as possible. The TBB partitioner relies on a grain size to control chunking; the grain size is 1 by default, meaning every work-group can be executed independently.
Default: dynamic
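
These three values map directly onto oneTBB's partitioner classes. For reference, the following standalone sketch exercises each partitioner through the public oneTBB API; it is an illustration only, not DPC++ runtime code, and the array size and loop body are placeholders:

    #include <oneapi/tbb/parallel_for.h>
    #include <oneapi/tbb/blocked_range.h>
    #include <oneapi/tbb/partitioner.h>
    #include <vector>

    int main() {
        std::vector<float> v(1 << 20, 1.0f);
        auto body = [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) v[i] *= 2.0f;
        };
        tbb::blocked_range<std::size_t> range(0, v.size());

        // "dynamic": auto_partitioner splits just enough to balance load.
        tbb::parallel_for(range, body, tbb::auto_partitioner{});

        // "affinity": affinity_partitioner remembers the subrange-to-thread
        // mapping across calls to improve cache reuse; it must be an lvalue.
        tbb::affinity_partitioner ap;
        tbb::parallel_for(range, body, ap);

        // "static": static_partitioner distributes iterations uniformly
        // across worker threads, with no work stealing.
        tbb::parallel_for(range, body, tbb::static_partitioner{});
        return 0;
    }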
DPCPP_CPU_NUM_CUS
Set the number of threads used for kernel execution. To avoid oversubscription, the maximum value of DPCPP_CPU_NUM_CUS should be the number of hardware threads. If DPCPP_CPU_NUM_CUS is 1, all work-groups are executed sequentially by a single thread, which is useful for debugging.
This environment variable is similar to the OMP_NUM_THREADS variable used by OpenMP.
Default: Not set. The value is determined by TBB.
DPCPP_CPU_PLACES
Specify the places to which affinities are set. The value is one of { sockets | numa_domains | cores | threads }.
This environment variable is similar to the OMP_PLACES variable used by OpenMP.
If the value is numa_domains, the TBB NUMA API is used. This is analogous to OMP_PLACES=numa_domains in the OpenMP 5.1 specification. A TBB task arena is bound to each NUMA node, and the SYCL nd-range is uniformly distributed across the task arenas.
DPCPP_CPU_PLACES should be used together with DPCPP_CPU_CU_AFFINITY.
Default: cores
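
The runtime's internal handling of numa_domains is not exposed, but the oneTBB NUMA API it refers to can be sketched as follows. This is a standalone illustration, not the actual runtime implementation; the iteration count and per-node work are placeholders:

    #include <oneapi/tbb/info.h>
    #include <oneapi/tbb/task_arena.h>
    #include <oneapi/tbb/parallel_for.h>
    #include <vector>

    int main() {
        // Query the NUMA nodes visible to TBB. If topology detection is
        // unavailable, this returns a single entry with the value -1.
        std::vector<tbb::numa_node_id> nodes = tbb::info::numa_nodes();

        // Bind one task arena to each NUMA node.
        std::vector<tbb::task_arena> arenas(nodes.size());
        for (std::size_t i = 0; i < nodes.size(); ++i)
            arenas[i].initialize(tbb::task_arena::constraints{}.set_numa_id(nodes[i]));

        // Hand each arena an equal slice of the iteration space, mimicking
        // the uniform distribution of the nd-range described above. Note that
        // execute() blocks, so a real runtime would submit to all arenas
        // concurrently rather than in this sequential loop.
        const int total = 1 << 20;
        const int slice = total / static_cast<int>(arenas.size());
        for (std::size_t i = 0; i < arenas.size(); ++i) {
            arenas[i].execute([&, i] {
                int begin = static_cast<int>(i) * slice;
                tbb::parallel_for(begin, begin + slice, [](int /*idx*/) {
                    /* work for one slice of the nd-range */
                });
            });
        }
        return 0;
    }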
See the Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference for more information about all supported environment variables.
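
The placement examples below apply to any binary that runs an nd-range kernel on the CPU device. A minimal sketch of such a program follows; the sizes and names are illustrative:

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        sycl::queue q{sycl::cpu_selector_v};  // select the CPU device
        std::vector<float> data(4096, 1.0f);
        {
            sycl::buffer<float, 1> buf{data.data(), sycl::range<1>{data.size()}};
            q.submit([&](sycl::handler& h) {
                sycl::accessor acc{buf, h, sycl::read_write};
                // 16 work-groups of 256 work-items. The DPCPP_CPU_* variables
                // control which TBB worker threads execute these work-groups
                // and how those threads are pinned to cores.
                h.parallel_for(sycl::nd_range<1>{{4096}, {256}},
                               [=](sycl::nd_item<1> it) {
                                   acc[it.get_global_linear_id()] *= 2.0f;
                               });
            });
        }  // buffer destruction waits for the kernel and writes back to data
        return 0;
    }

Compiled with icpx -fsycl, running the binary under different settings, for example DPCPP_CPU_NUM_CUS=16 DPCPP_CPU_PLACES=sockets DPCPP_CPU_CU_AFFINITY=close ./app, changes only where the work-group threads execute, as shown in Example 1 below.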

Example 1: Hyper-threading Enabled

Assume a machine with 2 sockets, 4 physical cores per socket, and 2 hyper-threads per physical core (16 hardware threads in total).
  • S<num> denotes the socket number; each socket's 8 logical cores are shown in a list.
  • T<num> denotes the TBB thread number.
  • "-" means an unused core.
export DPCPP_CPU_NUM_CUS=16

export DPCPP_CPU_PLACES=sockets
DPCPP_CPU_CU_AFFINITY=close:  S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T2 T4 T6 T8 T10 T12 T14] S1:[T1 T3 T5 T7 T9 T11 T13 T15]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]

export DPCPP_CPU_PLACES=cores
DPCPP_CPU_CU_AFFINITY=close:  S0:[T0 T8 T1 T9 T2 T10 T3 T11] S1:[T4 T12 T5 T13 T6 T14 T7 T15]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T8 T2 T10 T4 T12 T6 T14] S1:[T1 T9 T3 T11 T5 T13 T7 T15]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]

export DPCPP_CPU_PLACES=threads
DPCPP_CPU_CU_AFFINITY=close:  S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T2 T4 T6 T8 T10 T12 T14] S1:[T1 T3 T5 T7 T9 T11 T13 T15]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]

export DPCPP_CPU_NUM_CUS=8
DPCPP_CPU_PLACES=sockets, cores, and threads have the same bindings:
DPCPP_CPU_CU_AFFINITY=close:  S0:[T0 - T1 - T2 - T3 -] S1:[T4 - T5 - T6 - T7 -]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 - T2 - T4 - T6 -] S1:[T1 - T3 - T5 - T7 -]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[]

Example 2: Hyper-threading Disabled

Assume the same machine with 2 sockets and 4 physical cores per socket, but with hyper-threading disabled (8 hardware threads in total).
  • S<num> denotes the socket number; each socket's 4 cores are shown in a list.
  • T<num> denotes the TBB thread number.
  • "-" means an unused core.
export DPCPP_CPU_NUM_CUS=8
DPCPP_CPU_PLACES=sockets, cores, and threads have the same bindings:
DPCPP_CPU_CU_AFFINITY=close:  S0:[T0 T1 T2 T3] S1:[T4 T5 T6 T7]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T2 T4 T6] S1:[T1 T3 T5 T7]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3] S1:[T4 T5 T6 T7]

export DPCPP_CPU_NUM_CUS=4
DPCPP_CPU_PLACES=sockets, cores, and threads have the same bindings:
DPCPP_CPU_CU_AFFINITY=close:  S0:[T0 - T1 -] S1:[T2 - T3 -]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 - T2 -] S1:[T1 - T3 -]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3] S1:[- - - -]

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.