I just saw that my cases using _mm256_loadu_ps show better performance than _mm_loadu_ps on corei7-4, where the latter was faster on earlier AVX platforms (in part due to the ability of ICL/icc to compile the SSE intrinsic to AVX-128).
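For concreteness, here is a minimal sketch of the kind of loop pair I'm comparing; the add kernel and the function names are illustrative stand-ins, not my actual benchmark:

#include <immintrin.h>

void add_avx128(float *restrict c, const float *a, const float *b, int n)
{
    /* SSE intrinsics; ICL/icc re-encode these as VEX AVX-128 under -QxHost/-xHost */
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    /* remainder loop omitted for brevity */
}

void add_avx256(float *restrict c, const float *a, const float *b, int n)
{
    /* 256-bit unaligned loads, which corei7-4 handles cheaply */
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    /* remainder loop omitted for brevity */
}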
Does this mean that the advice to consider AVX-128 will soon be of only historical value? I'm ready to designate my Westmere and corei7 Linux boxes as historic vehicles.
icc/ICL 14.0.1 apparently corrected the behavior (dating back to the introduction of CEAN) where run-time versioning based on vector alignment never took the AVX-256 vector branch, in certain cases where CEAN notation produced effective AVX-128 code. It now seems that plain C code can match the performance of CEAN, provided equivalent pragmas are applied.
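To illustrate what I mean by matching CEAN with pragmas, here is a hedged sketch (the kernel is a stand-in, not one of my actual cases):

#define N 1000
float a[N], b[N], c[N];

void cean_version(void)
{
    /* CEAN array notation: the compiler owns the vectorization strategy */
    a[0:N] = b[0:N] + c[0:N];
}

void c_version(void)
{
    /* plain C; #pragma simd tells icc to vectorize unconditionally,
       bypassing the run-time alignment versioning described above */
    #pragma simd
    for (int i = 0; i < N; ++i)
        a[i] = b[i] + c[i];
}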
A key to getting an advantage from AVX-256 on corei7-4 appears to be reducing the unroll factor. In my observation, ICL/icc don't apply automatic unrolling to loops written with intrinsics, while gcc does. When not using intrinsics with ICL, I found the option 'ICL -Qunroll2' helpful. ICL used to unroll insufficiently; now it tends to unroll excessively by default for corei7-4, though the default is probably OK for earlier CPUs.
The gcc equivalent is '-funroll-loops --param max-unroll-times=2'.
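Where the global option is too blunt, icc also accepts a per-loop pragma; the loop body here is just a placeholder:

void scale(float *restrict x, float s, int n)
{
    #pragma unroll(2)  /* cap this loop's unroll factor at 2 */
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}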
I'm hoping to use last year's "VecAnalysis Python Script..." to see the differences between CEAN and C with pragmas:
icl -O3 -Qstd=c99 -Qopenmp -Qansi_alias -QxHost -Qunroll2 -Zi -Qvec-report7 -c loopsv.c 2>&1 | ../vecanalysis/vecanalysis.py --annotate
reports, for one of the cases where CEAN vectorizes, 1 heavy-overhead vector operation [due to variable stride] and 4 lightweight ones, while the corresponding C code is reported as not vectorized (yet performs better, up to loop count 1000).
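For reference, the variable-stride case has roughly this shape; this is a hedged reconstruction using CEAN's array[start:length:stride] sections, not the literal source:

void strided_scale(float *restrict a, const float *restrict b,
                   int n, int stride)
{
    /* the non-unit run-time stride is what triggers the
       heavy-overhead vector operation in the report */
    a[0:n:stride] = 2.0f * b[0:n:stride];
}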