__private memory that is allocated to registers is typically very efficient to access. If the private memory does not fit in registers, however, performance can be very poor. Each work-item has its own spill space for __private memory, so __private accesses have no locality across work-items, and each work-item frequently touches a unique cache line on every access. For this reason, accesses to __private data that has not been allocated to registers are very slow. In most cases, the compiler can map statically indexed private arrays into registers. In some cases it can also map dynamically indexed private arrays into registers, but such code will perform slightly worse than code that uses statically indexed private arrays. As such, a common optimization is to modify code so that private arrays are statically indexed.
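The difference between the two indexing styles can be sketched as follows. This is a minimal illustration, not code from any particular kernel: the function names pick_dynamic and pick_static, the array name taps, and the array size of 4 are all hypothetical. It is written as plain C so that it can be compiled on a host; inside an OpenCL C kernel the same function-scope arrays would live in the __private address space by default. Whether the rewrite actually improves performance depends on the compiler and device.

```c
/* Dynamically indexed private array: the index `sel` is only known at run
   time, so the compiler may be unable to keep `taps` in registers. */
float pick_dynamic(const float *in, int sel)
{
    float taps[4];
    for (int i = 0; i < 4; ++i)
        taps[i] = in[i];
    return taps[sel & 3];          /* run-time index into the private array */
}

/* Statically indexed variant (hypothetical rewrite): every array index is a
   compile-time constant, and the run-time choice is made with selects. */
float pick_static(const float *in, int sel)
{
    float taps[4];
    taps[0] = in[0];
    taps[1] = in[1];
    taps[2] = in[2];
    taps[3] = in[3];

    sel &= 3;                      /* same wrap-around as the dynamic version */
    float r = taps[0];
    r = (sel == 1) ? taps[1] : r;
    r = (sel == 2) ? taps[2] : r;
    r = (sel == 3) ? taps[3] : r;
    return r;
}
```

Both functions return the same value for any sel. The static version trades one indexed load for a short chain of selects, which tends to pay off only for small arrays.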