Developer Guide

Avoiding Register Spills

Registers and Performance

Registers are the fastest storage in the memory hierarchy, so keeping data in registers as long as possible is critical to performance. Register space, however, is limited and much smaller than memory. The current generation of Intel® GPUs, for example, provides each EU thread with 128 general-purpose registers, each 32 bytes wide, by default. Though the compiler tries to assign as many variables to registers as possible, the limited number of registers can hold only a small set of variables at any point during execution. A given register can hold different variables at different times, because not all variables are live at the same time. When there are not enough registers to hold all live variables, registers spill: some variables currently in registers are moved to memory to make room for others.
In DPC++, the compiler allocates registers to private variables in work items. Multiple work items in a sub-group are packed into one EU thread. By default, the compiler uses register pressure as one of the heuristics to choose the SIMD width, or sub-group size. High register pressure can result in a smaller sub-group size (for example, 8 instead of 16) if a sub-group size is not explicitly requested, in register spilling, or in certain variables not being promoted to registers.
The hardware may not be fully utilized if the sub-group size or SIMD width is not the maximum the hardware supports. Register spilling can cause significant performance degradation, especially when spills occur inside hot loops. When variables are not promoted to registers, accesses to them significantly increase memory traffic.
Though the compiler uses intelligent algorithms to avoid register spills, or to minimize them when spilling is unavoidable, optimizations by developers can help the compiler do a better job and often make a big performance difference.

Optimization Techniques

A few techniques can be applied to reduce register pressure:
  • keep the distance between loading a value into a variable and using that variable as short as possible
    Though the compiler schedules instructions and optimizes these distances, in some cases moving the load and the use of the same variable closer together, or removing certain dependencies in the source, can help the compiler do a better job.
  • avoid excessive loop unrolling
    Loop unrolling exposes opportunities for instruction scheduling optimization by the compiler and thus can improve performance. However, temporary variables introduced by unrolling may increase pressure on register allocation and cause register spilling. It is always a good idea to compare the performance with and without loop unrolling, and with different unroll factors, to decide whether and how many times a loop should be unrolled.
  • recompute cheap-to-compute values on demand that would otherwise be held in registers for a long time
  • avoid big arrays or large structures if possible
  • choose smaller data types if possible
  • share registers in the sub-group
  • use shared local memory
This list is not exhaustive.
The rest of this chapter shows how to apply these techniques, especially the last three, in real examples.
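The recompute-on-demand technique can be sketched on the host in plain C++ (illustrative names, not taken from this chapter's samples): a cheap expression whose two uses are far apart can be recomputed at the second use site instead of being held in a register across a register-hungry region.

```cpp
#include <vector>

// Version that pins the cheap value `x * x + 1` in a register across the
// whole loop (a stand-in for a high-register-pressure region on the GPU).
unsigned long with_long_live_range(const std::vector<unsigned> &a, unsigned x) {
  unsigned scale = x * x + 1;  // lives from here...
  unsigned long acc = 0;
  for (unsigned v : a) acc += v;  // register-hungry region (stand-in)
  return acc * scale;             // ...to its distant second use
}

// Version that recomputes the cheap value on demand, keeping its live
// range short; both versions compute the same result.
unsigned long with_recompute(const std::vector<unsigned> &a, unsigned x) {
  unsigned long acc = 0;
  for (unsigned v : a) acc += v;
  return acc * (x * x + 1);  // recomputed at the use site
}
```

On the host an optimizing compiler may undo this by rematerializing or hoisting the value itself; the point of the sketch is only the shape of the transformation.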

Choosing Smaller Data Types

constexpr int blockSize = 256;
constexpr int NUM_BINS = 32;

std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);

auto e = q.submit([&](auto &h) {
  sycl::accessor macc(mbuf, h, sycl::read_only);
  auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
  h.parallel_for(
      sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
      [=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        sycl::ONEAPI::sub_group sg = it.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int sgGroup = sg.get_group_id()[0];
        // histogram bins take too much storage to be promoted to registers
        unsigned long histogram[NUM_BINS];
        for (int k = 0; k < NUM_BINS; k++) {
          histogram[k] = 0;
        }
        for (int k = 0; k < blockSize; k++) {
          unsigned long x =
              sg.load(macc.get_pointer() + group * gSize * blockSize +
                      sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
          for (int i = 0; i < 8; i++) {
            unsigned int c = x & 0x1FU;
            histogram[c] += 1;
            x = x >> 8;
          }
        }
        for (int k = 0; k < NUM_BINS; k++) {
          hacc[k].fetch_add(histogram[k]);
        }
      });
});
This example computes a histogram with 32 bins. Each work item keeps 32 private bins of unsigned long type. Because of the large storage required, the private bins cannot fit in registers, resulting in poor performance.
With blockSize of 256, each work item performs at most blockSize × 8 = 2048 increments, so the value of each private histogram bin cannot exceed the maximum value of an unsigned integer. Using unsigned int instead of unsigned long for the private bins halves their storage and reduces register pressure so the private bins can fit in registers. This simple change makes a significant performance difference.
constexpr int blockSize = 256;
constexpr int NUM_BINS = 32;

std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);

auto e = q.submit([&](auto &h) {
  sycl::accessor macc(mbuf, h, sycl::read_only);
  auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
  h.parallel_for(
      sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
      [=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        sycl::ONEAPI::sub_group sg = it.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int sgGroup = sg.get_group_id()[0];
        // histogram bins take less storage with smaller data type
        unsigned int histogram[NUM_BINS];
        for (int k = 0; k < NUM_BINS; k++) {
          histogram[k] = 0;
        }
        for (int k = 0; k < blockSize; k++) {
          unsigned long x =
              sg.load(macc.get_pointer() + group * gSize * blockSize +
                      sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
          for (int i = 0; i < 8; i++) {
            unsigned int c = x & 0x1FU;
            histogram[c] += 1;
            x = x >> 8;
          }
        }
        for (int k = 0; k < NUM_BINS; k++) {
          hacc[k].fetch_add(histogram[k]);
        }
      });
});

Sharing Registers in Sub-group

Now we increase the histogram bins to 256:
constexpr int blockSize = 256;
constexpr int NUM_BINS = 256;

std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);

auto e = q.submit([&](auto &h) {
  sycl::accessor macc(mbuf, h, sycl::read_only);
  auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
  h.parallel_for(
      sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
      [=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        sycl::ONEAPI::sub_group sg = it.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int sgGroup = sg.get_group_id()[0];
        // histogram bins take too much storage to be promoted to registers
        unsigned int histogram[NUM_BINS];
        for (int k = 0; k < NUM_BINS; k++) {
          histogram[k] = 0;
        }
        for (int k = 0; k < blockSize; k++) {
          unsigned long x =
              sg.load(macc.get_pointer() + group * gSize * blockSize +
                      sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
          for (int i = 0; i < 8; i++) {
            unsigned int c = x & 0x1FU;
            histogram[c] += 1;
            x = x >> 8;
          }
        }
        for (int k = 0; k < NUM_BINS; k++) {
          hacc[k].fetch_add(histogram[k]);
        }
      });
});
With 256 histogram bins, performance degrades even with the smaller unsigned int data type: the storage for the private bins in each work item is too large for registers.
Each Work Item Has 256 Private Histogram Bins
If the sub-group size is 16 as requested, 16 work items are packed into one EU thread. Work items in the same sub-group can communicate and share data with each other very efficiently. If the work items in a sub-group share the private histogram bins, only 256 bins are needed for the whole sub-group, or 16 private bins per work item.
Sub-group Has 256 Private Histogram Bins
To share the histogram bins in the sub-group, each work item broadcasts its input data to every work item in the same sub-group. The work item that owns the corresponding histogram bin does the update.
constexpr int blockSize = 256;
constexpr int NUM_BINS = 256;

std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);

auto e = q.submit([&](auto &h) {
  sycl::accessor macc(mbuf, h, sycl::read_only);
  auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
  h.parallel_for(
      sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
      [=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        sycl::ONEAPI::sub_group sg = it.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int sgGroup = sg.get_group_id()[0];
        // 16 bins per work item; the 256 bins are shared across the sub-group
        unsigned int histogram[NUM_BINS / 16];
        for (int k = 0; k < NUM_BINS / 16; k++) {
          histogram[k] = 0;
        }
        for (int k = 0; k < blockSize; k++) {
          unsigned long x =
              sg.load(macc.get_pointer() + group * gSize * blockSize +
                      sgGroup * sgSize * blockSize + sgSize * k);
          // sub-group size is 16
#pragma unroll
          for (int j = 0; j < 16; j++) {
            unsigned long y = broadcast(sg, x, j);
#pragma unroll
            for (int i = 0; i < 8; i++) {
              unsigned int c = y & 0xFF;
              // (c & 0xF) is the work item in which the bin resides
              // (c >> 4) is the bin index
              if (sg.get_local_id()[0] == (c & 0xF)) {
                histogram[c >> 4] += 1;
              }
              y = y >> 8;
            }
          }
        }
        for (int k = 0; k < NUM_BINS / 16; k++) {
          hacc[16 * k + sg.get_local_id()[0]].fetch_add(histogram[k]);
        }
      });
});

Using Shared Local Memory

If the number of histogram bins grows larger, for example to 1024, there will not be enough register space for the private bins even when they are shared within the sub-group. To reduce memory traffic, the local histogram bins can be allocated in shared local memory and shared by the work items in the same work-group. The "Shared Local Memory" chapter shows how this is done in its histogram example.
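As a preview, the pattern can be sketched on the host with plain C++ threads and atomics standing in for work items and shared local memory (illustrative only; the actual kernel would use a local accessor, with one atomic add per bin into the global histogram at the end):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// One histogram shared by a "work-group", updated with atomics by its
// "work items", then read back. Sizes and names are illustrative.
std::vector<unsigned> group_histogram(const std::vector<unsigned> &input,
                                      unsigned num_bins, unsigned num_items) {
  std::vector<std::atomic<unsigned>> shared(num_bins);  // stands in for SLM
  std::vector<std::thread> items;
  for (unsigned t = 0; t < num_items; ++t)
    items.emplace_back([&, t] {
      // Each "work item" strides over the group's input block.
      for (std::size_t i = t; i < input.size(); i += num_items)
        shared[input[i]].fetch_add(1, std::memory_order_relaxed);
    });
  for (auto &th : items) th.join();

  // The real kernel would flush these counts to the global histogram;
  // here we simply copy them out.
  std::vector<unsigned> out(num_bins);
  for (unsigned b = 0; b < num_bins; ++b) out[b] = shared[b].load();
  return out;
}
```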

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.