A few new public information about the 1st AVX incarnation in Sandy Bridge in the SF09_ARCS002_FIN.pdf document available here :
http://www.intel.com/idf/technology-tracks/ -> "32 nm Implementation of... " -> "See sessions within this track" -> ARCS002 PDF Icon
1) slide 6, one "AVX HIGH" unit on port 0, and "AVX LOW" on port 1 => it looks like 256-bit AVX will have the same throughput than 128-bit SSE on current cores: 2 clocks forone 256-bit vmulps/pd +one 256-bit vaddps/pd instead of 1 clock for one 128-bit mulps/pd + one 128-bit addps/pd (i.e. same peaksp/dp flops per clock with balanced add/mul), so if Sandy Bridge can't issue in the same clock a mul and an add unlike Conroe, Penryn andNehalem it will be actually less efficient than these previous cores witha lot of legacy 128-bit SSE code ?, it looks rather odd => can someone in the know confirm this ?
2) slide 6, only 128-bit paths from the L1D cache to execution units (I was hoping full featured 256-bit paths), a few consequences :
- the extra load port will help as much legacy 128-bit SSE or 128-bit AVX than 256-bit AVX, same 48 B / clock maximum L1 bandwidth
- loop fission will be probably no more a good optimization if intermediate results are stored in L1D, probably better to overflow the LSD than the L1D, particularly with multiple threads fighting for L1D access
- more incentive to use 64-bit code to have 16 ymm registers instead of 8 to minimize L1D access
3) slide 54, 64 B cache lines (unchanged), so :
- align memory still important (more important than on Nehalem), 1/2 access will incur a cache line split otherwise
4) slide 58, masked moves considered harmful, replace vmaskmovps by vblendvps + vmovaps just like in legacy SSE4 code ?