You are concealing one of the most important pieces of information from us: your typical loop lengths. If you are also concealing that from the compiler, it can easily make bad decisions. Where the source code gives no clear information, the standard compiler assumption is a loop length of 100 (with code that is also suitable for several times that length). Many of the ICL loop optimizations aren't suitable for shorter loops, while VC9 doesn't bother optimizing for longer ones. How's that for a generalization almost as flagrant as yours?
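One way to stop concealing the loop length from the compiler is a trip-count hint. The following is only a sketch: scale_short and the count of 8 are made up for illustration, and the pragma spelling is the one I believe the Intel compilers document, so check it against your version. The guard keeps other compilers from complaining about an unknown pragma.

```c
#include <stddef.h>

/* Hypothetical kernel: scale a short vector.
   The typical trip count of 8 is an assumption for this sketch;
   substitute the lengths your application actually sees. */
void scale_short(float *dst, const float *src, size_t n, float k)
{
    size_t i;
#ifdef __INTEL_COMPILER
#pragma loop_count(8)   /* a hint only; the loop stays correct for any n */
#endif
    for (i = 0; i < n; ++i)
        dst[i] = k * src[i];
}
```

If the length is truly fixed, making it a compile-time constant in the loop bound gives any compiler the same information without a pragma.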
_intel_new_memset() and _intel_new_memcpy() contain branches to optimize several different cases of CPUs, alignments, and loop lengths. If you were to write all those cases into your source code, you would likely lose instruction cache locality, and you would lose time on all the case selection if your loops are never long enough to need the long-loop optimizations.
It's dead simple to write artificial cases where these special functions will beat VC9 by a big margin, but those cases may be nothing like your application.
ICL should avoid the automatic memset and memcpy substitutions when you place multiple array moves in a loop.
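To make the substitution concrete, here is a rough sketch of the loop shapes in question. The function names are invented for illustration, and whether a given compiler version actually substitutes a library call for copy_one is something to confirm in its own reports or generated code.

```c
#include <stddef.h>
#include <string.h>

/* A single array move: the kind of loop that is a candidate for
   automatic memcpy substitution. */
void copy_one(double *dst, const double *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Roughly what the substitution amounts to, written by hand.
   Only equivalent to copy_one when dst and src do not overlap. */
void copy_one_memcpy(double *dst, const double *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Two array moves in one loop: per the point above, this should be
   left as a loop rather than turned into library calls. */
void copy_two(double *d1, const double *s1,
              double *d2, const double *s2, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        d1[i] = s1[i];
        d2[i] = s2[i];
    }
}
```

For short loops, the call overhead and case selection inside the library routine can cost more than the plain loop, which is why the typical length matters.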
VC9 can easily match ICL performance when there is no benefit from vectorization. ICL has flags available to match your VC9 flags; ICL /O1 /fp:source might come close to matching CL /O2 /fast.
VC9's occasionally better optimization of loops with opportunities for loop-carried scalar replacement can sometimes be matched by writing the scalar replacements into your source code.
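As an illustration of writing a scalar replacement by hand, consider a simple recurrence. This is a sketch with invented names, and it assumes a and b do not overlap; if they did, the two versions could give different results.

```c
#include <stddef.h>

/* As written, a[i - 1] is loaded from memory on every iteration unless
   the compiler performs loop-carried scalar replacement itself. */
void recur_plain(double *a, const double *b, size_t n, double c)
{
    size_t i;
    for (i = 1; i < n; ++i)
        a[i] = a[i - 1] * c + b[i];
}

/* Scalar replacement written into the source: the value carried from
   one iteration to the next lives in a local variable. */
void recur_scalar(double *a, const double *b, size_t n, double c)
{
    size_t i;
    double prev;
    if (n == 0)
        return;
    prev = a[0];
    for (i = 1; i < n; ++i) {
        prev = prev * c + b[i];
        a[i] = prev;
    }
}
```

The hand-written form also tells the compiler that the store to a[i] cannot change the carried value, which it may be unable to prove by itself when aliasing is possible.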
VC9 observes parentheses faithfully, while ICL treats them in K&R fashion (free to regroup across them) unless you set options such as /fp:source. If you don't rely on the compiler performing algebraic simplification across parentheses, in violation of language standards, the VC9 treatment is superior.
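A standard case where the parentheses matter is compensated (Kahan) summation. The sketch below uses an invented function name; the point is the marked line, which is algebraically zero and survives only if the compiler honors the parentheses, for example under /fp:source.

```c
#include <stddef.h>

/* Kahan compensated summation. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0;             /* running compensation for lost low-order bits */
    size_t i;
    for (i = 0; i < n; ++i) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;      /* simplifies to 0 if parentheses are ignored */
        sum = t;
    }
    return sum;
}
```

If the compiler regroups that expression, c is always zero and the routine degrades to a plain sum, so small terms added to a large total can be lost entirely.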