Since ICL 10.0, /O1 disables vectorization; I mention this in case your loops are too short for vectorization to be useful. Version 9.1 did vectorize at /O1, but without the extra unrolling, so it gave vector performance on shorter loops than /O2 did.
ICL vectorization typically takes loop iterations in groups of 8, with adjustments for 16-byte alignment before and after. It seldom pays off for loops shorter than 16 iterations plus the adjustments, and you will see performance peak at loop lengths spaced 8 apart.
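To make the grouping concrete, here is a rough sketch of how an n-iteration loop might divide into leading alignment iterations, a vector body in groups of 8, and a trailing remainder. The names loop_split and split_loop are made up for illustration, and I am assuming a single array of 4-byte floats; what the compiler actually generates is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how one loop of n iterations is divided up. */
typedef struct {
    size_t peel;      /* scalar iterations before the first 16-byte boundary */
    size_t vec;       /* iterations executed in groups of 8 */
    size_t remainder; /* scalar iterations left over at the end */
} loop_split;

/* addr is the address of the first float touched by the loop (assumption:
   one array of 4-byte floats, so addr must be a multiple of 4). */
loop_split split_loop(uintptr_t addr, size_t n)
{
    loop_split s;
    size_t misalign = (size_t)(addr % 16);

    assert(addr % sizeof(float) == 0);

    /* Iterations needed to reach the next 16-byte boundary. */
    s.peel = misalign ? (16 - misalign) / sizeof(float) : 0;
    if (s.peel > n)
        s.peel = n;

    /* Whole groups of 8 that fit in what is left. */
    s.vec = ((n - s.peel) / 8) * 8;
    s.remainder = n - s.peel - s.vec;
    return s;
}
```

For a loop of 16 iterations starting 4 bytes past a 16-byte boundary, this gives 3 alignment iterations, one group of 8, and 5 remainder iterations, so less than half the work runs in the vector body, which is why such short loops rarely gain.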
In typical C or C++ code, unless arrays are declared with fixed size local to the function, it's nearly impossible for the compiler to pick up information to change the default assumption that optimization should be for loop length 100.
If you know that no alignment adjustment is required at the beginning of the loop to make all data 16-byte aligned, but it's not visible to the compiler,
#pragma vector aligned
should speed up the loop, but it will break if your assertion is wrong. This pragma also overrides the compiler's cost/benefit analysis, in which it decides whether vectorization is likely to be a gain.
#pragma novector
would prevent vectorization of a loop.
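As a minimal sketch of where the two pragmas go, each is placed immediately before the loop it applies to. The function names saxpy_aligned and saxpy_short are just for illustration, and the assert is my own addition to catch a wrong alignment claim in debug builds; it is not something the pragma does for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[i] += a * x[i]; the caller promises x and y are both 16-byte aligned. */
void saxpy_aligned(float *y, const float *x, float a, size_t n)
{
    size_t i;

    /* Debug-build check of the promise made by the pragma below. */
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0);

#pragma vector aligned
    for (i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Same loop, for callers where n is known to be too small to benefit. */
void saxpy_short(float *y, const float *x, float a, size_t n)
{
    size_t i;

#pragma novector
    for (i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Other compilers will normally just ignore these pragmas, possibly with a warning, so the code stays portable even though the hints are ICL-specific.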
Vectorization of loops of length 60 to 3000 should more than double the performance. When combined with OpenMP or similar parallelization, the combined gain is better on the current Core i7 or Xeon 5500 CPUs than on the earlier ones. Still, it is common to find a loop of length 1000 where either vectorization or parallelization gives good speedup, but there is no use in combining the optimizations, unless the parallelization can take place at a higher level.
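By "parallelization at a higher level" I mean something like the following sketch, where OpenMP splits an outer loop across threads and each thread keeps a full-length inner loop for the vectorizer. The function name scale_rows and the row-major layout are assumptions for the example; whether the combination actually pays off depends on your row length and CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply each row of a row-major rows x cols matrix by its own factor. */
void scale_rows(float *m, const float *factor, int rows, int cols)
{
    int r, c;

    assert(rows >= 0 && cols >= 0);

    /* Threads share out whole rows; c must be private to each thread. */
#pragma omp parallel for private(c)
    for (r = 0; r < rows; ++r) {
        float *row = m + (size_t)r * (size_t)cols;
        const float f = factor[r];

        /* Inner loop keeps its full length, so it remains a
           candidate for vectorization within each thread. */
        for (c = 0; c < cols; ++c)
            row[c] *= f;
    }
}
```

If the OpenMP pragma were put on the inner loop instead, a loop of length 1000 would be cut into short per-thread pieces, and the alignment and remainder overhead of vectorization would eat much of the gain.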