For bandwidth-limited kernels that operate on data that does
not fit in the last-level cache, a warm-up run does not significantly
improve measurement stability.
For a kernel that executes a small number of instructions over
a small data set, make sure there are enough iterations
so that the kernel run time is at least 20 milliseconds on a CPU device.
Kernels with shorter run times might produce unreliable data, so
artificially increasing the amount of computation can give you more
dependable insight into the hotspots. For example, you can add a loop
inside the kernel or replicate some of its code.
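
The 20-millisecond guideline above can be sketched on the host side as a timing loop that keeps repeating the work until one batch runs long enough. This is a minimal illustration in plain C++, not OpenCL API code: `tiny_kernel`, `TimingResult`, and `time_with_min_duration` are hypothetical names introduced here, and `tiny_kernel` only stands in for a real kernel launch followed by a blocking wait.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for a short kernel over a small data set.
// In a real measurement this would be the kernel launch plus a blocking wait.
inline float tiny_kernel(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += data[i] * 1.0001f;
    return acc;
}

struct TimingResult {
    std::size_t iterations;   // how many times the work was repeated
    double total_ms;          // wall-clock time of the final batch
    double per_iteration_ms;  // total_ms divided by iterations
};

// Doubles the repeat count until a single batch runs for at least
// min_total_ms (20 ms by default), or until max_iterations is reached.
template <typename Work>
TimingResult time_with_min_duration(Work&& work,
                                    double min_total_ms = 20.0,
                                    std::size_t max_iterations = std::size_t{1} << 30) {
    assert(min_total_ms > 0.0);
    using clock = std::chrono::steady_clock;
    std::size_t iterations = 1;
    for (;;) {
        const auto start = clock::now();
        for (std::size_t i = 0; i < iterations; ++i) work();
        const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
        if (elapsed.count() >= min_total_ms || iterations >= max_iterations) {
            return {iterations, elapsed.count(), elapsed.count() / iterations};
        }
        iterations *= 2;  // batch was too short to trust; repeat more times
    }
}
```

To use it, pass a callable that runs the work once and stores its result somewhere the compiler cannot discard (for example, a `volatile` variable), then read `per_iteration_ms` from the returned struct. Doubling the repeat count keeps the calibration cheap, and the cap on iterations guards against work that the compiler optimizes away entirely.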