I'm confused by a passage in the Intel Architecture Optimization Manual about load latencies:
22.214.171.124 L1 DCache - Loads
The common load latency is five cycles. When using a simple addressing mode, base plus offset
that is smaller than 2048, the load latency can be four cycles.
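| Data Type/Addressing Mode | Base + Offset > 2048; Base + Index [+ Offset] | Base + Offset < 2048 |
|---------------------------|-----------------------------------------------|----------------------|
| Integer                   | 5                                             | 4                    |
| MMX, SSE, 128-bit AVX     | 6                                             | 5                    |
| X87                       | 7                                             | 6                    |
| 256-bit AVX               | 7                                             | 7                    |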
I'm not sure how to interpret this. Adding parentheses for clarity: is the faster case ((Base + Offset) < 2048), meaning the effective address itself must be below 2048, a condition that user code is unlikely to achieve? Or is it (Base + (Offset < 2048)), meaning only the offset must be below 2048, something that can often be accommodated?
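
To make the two readings concrete, here is a minimal sketch in C of each one as a predicate. The function names are my own (hypothetical), and I'm assuming a non-negative offset for simplicity. This only restates the question; it doesn't say which reading the manual intends.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reading 1: ((Base + Offset) < 2048).
   The effective address itself must be below 2048. */
bool fast_if_address_small(uintptr_t base, uint32_t offset)
{
    return base + offset < 2048;
}

/* Reading 2: (Base + (Offset < 2048)).
   Only the constant offset must be below 2048; the base is unconstrained. */
bool fast_if_offset_small(uintptr_t base, uint32_t offset)
{
    (void)base; /* the base register value plays no role in this reading */
    return offset < 2048;
}
```

For a typical stack or heap pointer with a small offset, say a base of 0x7ffe0000 and an offset of 16, the first predicate is false and the second is true, which is why the distinction matters for ordinary code.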