According to Agner Fog's instruction tables and my own tests, the `pause` instruction is much more expensive in terms of reciprocal throughput on Skylake-X CPUs than on the previous generations (I tested on Sandy Bridge and Broadwell). The difference is an order of magnitude. Agner Fog lists the following reciprocal throughput for the `pause` instruction:
- Sandy Bridge: 11 clocks
- Haswell, Broadwell: 9 clocks
- Skylake-X: 141 clocks
My own tests show the following numbers:
- Sandy Bridge: ~24 clocks
- Broadwell: ~12 clocks
- Skylake-X: ~389 clocks
The exact numbers are not that important to me, and my measurements most likely include some overhead from the test loop itself. What matters is that on Skylake-X `pause` is far more expensive than on the other architectures, to the point that it makes me question whether I should revise the way I use this instruction in various spin loops. So my questions are as follows:
1. Is it still sensible to use `pause` in tight spin loops in order to improve the performance of the sibling hyperthread?
2. Given the difference in cost, should programmers derive the number of spin iterations from the actual cost of `pause`? For instance, if I budget that a particular spin loop should not exceed ~500 clocks, which is a typical estimate of a context switch on Linux, should I calculate the number of iterations from the measured cost of `pause` on the current CPU?
3. If p. 2 is true, are applications expected to benchmark `pause` before deciding on the number of spin iterations? What are the best practices for such benchmarking?
4. Is the Skylake-X order of costs considered "normal" and expected to persist in future generations? Or could it be a CPU bug that will be fixed in the future? I understand that it is not officially known what will be implemented in future products, but the general position on the issue is also of interest.
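To make p. 2 concrete, here is one possible shape for such a calibrated spin loop. This is only a sketch: the lock type, the 500-cycle budget, and the placeholder pause cost are all illustrative, and in practice the cost would be measured once at startup rather than hard-coded.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <x86intrin.h>   /* _mm_pause() */

/* Spin budget in cycles before falling back to a blocking wait;
 * ~500 cycles is the rough context-switch estimate mentioned above. */
#define SPIN_BUDGET_CYCLES 500

/* Per-pause cost, ideally measured once at startup on the current CPU.
 * The value here is a placeholder, not a real measurement. */
static int pause_cost_cycles = 30;

/* Try to acquire a simple test-and-set lock, spinning with pause for
 * at most roughly SPIN_BUDGET_CYCLES worth of iterations. */
bool try_lock_spinning(atomic_flag *lock)
{
    int spins = SPIN_BUDGET_CYCLES / pause_cost_cycles;
    if (spins < 1)
        spins = 1;   /* on Skylake-X one pause may exceed the budget */
    for (int i = 0; i < spins; ++i) {
        if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
            return true;    /* acquired the lock */
        _mm_pause();        /* yield pipeline resources while spinning */
    }
    return false;  /* caller should block instead (futex, condvar, ...) */
}
```

The point of the division is that the wall-clock spin time stays roughly constant across architectures even though the per-`pause` cost varies by an order of magnitude.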