The Optimization Reference Manual, page 2-19, Table 2-8 (Effect of Addressing Modes on Load Latency) shows that the load latency from the L1 data cache of Sandy Bridge varies from 4 to 7 cycles depending on the data type being loaded and the addressing mode.
http://www.intel.com/Assets/en_US/PDF/manual/248966.pdf#page=55
The Optimization Reference Manual, page 2-16, Table 2-6 (Lookup Order and Load Latency) has a footnote that says "Subject to execution core bypass restriction shown in Table 2-4".
http://www.intel.com/Assets/en_US/PDF/manual/248966.pdf#page=52
1. Does the execution core bypass restriction that causes the load latency to vary by 3 cycles (7-4=3) depending on the data type apply to L2 and LLC or only to L1? The location of the footnote in Table 2-6 suggests it applies only to L1 but someone told me it does apply to L2 and LLC.
2. Does the addressing mode (base+offset with offset<2048 versus any other mode) also affect the load latency from L2 and LLC?
3. Table 2-8 shows the same load latency for X87 and 256-bit AVX (7 cycles) when offset>2048 but different load latencies (6 or 7 cycles) when offset<2048. Is one of these numbers a typo? If having a large offset with base+offset addressing increases the load latency by one cycle for integer, MMX, SSE, 128-bit AVX and X87, why doesn't it increase the load latency by one cycle for 256-bit AVX?
4. Do two 128-bit AVX loads dispatched on the same cycle have less latency than one 256-bit AVX load? Please explain.
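To make questions 2 and 3 concrete, here is a minimal sketch of how the addressing-mode effect on integer loads could be checked with a pointer-chasing loop. Everything in it is my own assumption rather than anything from the manual: the names (build_chain, chase_small, chase_large, ns_per_load), the 16 KiB footprint chosen to stay inside L1, and the two offsets (8 and 4096) picked to fall clearly on either side of the 2048 boundary. It only covers general-purpose register loads, not MMX/SSE/AVX/X87.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS     256                       /* 256 slots * 64 B = 16 KiB, meant to fit in L1 */
#define STRIDE    64                        /* one slot per cache line */
#define BUF_BYTES (SLOTS * STRIDE + 8192)   /* room for the largest offset used below */

/* The offset has to be a compile-time constant. Passing it as a variable
   would let the compiler emit base+index addressing, which is a different
   row of the table. The volatile access keeps every load in the loop. */
#define DEFINE_CHASE(name, OFF)                          \
    static char *name(char *p, long iters) {             \
        while (iters-- > 0)                              \
            p = *(char *volatile *)(p + (OFF));          \
        return p;                                        \
    }

DEFINE_CHASE(chase_small, 8)      /* base+offset, offset < 2048 */
DEFINE_CHASE(chase_large, 4096)   /* base+offset, offset > 2048 */

/* Slot i stores, at byte offset `off` from its base, the base address of
   slot i+1 (wrapping around), so each load depends on the previous one. */
static void build_chain(char *buf, size_t off) {
    for (size_t i = 0; i < SLOTS; i++) {
        char *next = buf + ((i + 1) % SLOTS) * STRIDE;
        *(char **)(buf + i * STRIDE + off) = next;
    }
}

/* Average wall time per dependent load, in nanoseconds. Multiply by the
   core clock in GHz to get an approximate cycle count. */
static double ns_per_load(char *(*chase)(char *, long), char *buf, long laps) {
    long loads = laps * SLOTS;
    clock_t t0 = clock();
    char *end = chase(buf, loads);
    clock_t t1 = clock();
    assert(end == buf);           /* whole laps must end where they started */
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)loads;
}
```

To use it, allocate BUF_BYTES with calloc, call build_chain(buf, 8) and build_chain(buf, 4096) (the two chains do not overlap), then compare ns_per_load(chase_small, buf, laps) with ns_per_load(chase_large, buf, laps) for a large lap count. It needs to be built with optimization (for example -O2) so the loop body is a single dependent load, and the result is only meaningful if the core frequency is held steady. Probing L2 or LLC would need a much larger footprint and a randomized chain so the prefetchers do not hide the latency, which this sketch does not attempt.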
By the way, the second column in Table 2-8 should probably say ">=2048" instead of ">2048"; as written, it is unclear which case an offset of exactly 2048 falls under. (That's my obsessive-compulsive disorder at work!)


