Compile the attached source file with '-O3' and either '-xSSE4.2' or '-xAVX'. ifort 13.0 vectorizes the k-loop but also generates an unneeded scalar version of it. Since the loop count is 4, that scalar version can never be executed.
By the way, the compiler also seems to be too aggressive in vectorization. It generates simulated gathers for the accesses to the o array. Just to be able to use VPSLLD, it spends three instructions packing four integers into a vector and then another 7 instructions unpacking them into four GPRs. It would have been better to keep the values in GPRs from the start and use SHL/LEA instead of VPSLLD.
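
To make the second point concrete, here is a rough sketch of the kind of code I would have expected for the indexed accesses. It is not the attached source: it is written in C rather than Fortran, and the function name gather_sum_scalar, the index array idx, the element type, and the shift amount of 2 are all assumptions made up for illustration. The only things taken from the report are the trip count of 4 and the indirect accesses to o.

```c
#include <assert.h>
#include <stdint.h>

/* Trip count from the report; everything else below is hypothetical. */
enum { TRIP_COUNT = 4 };

/* Assumed shape of the kernel: four indirect loads from o, where each
   index is scaled by a power of two (here << 2) before the load. */
double gather_sum_scalar(const double *o, const int32_t *idx)
{
    double sum = 0.0;
    for (int k = 0; k < TRIP_COUNT; ++k) {
        /* The index never leaves a general-purpose register, so the
           scaling can be a plain SHL or be folded into an LEA or an
           addressing mode. No pack into a vector register and no
           unpack back into GPRs is needed. */
        int32_t scaled = idx[k] << 2;
        sum += o[scaled];
    }
    return sum;
}
```

With only four indices to scale, the shift itself is trivial work, so moving the indices into a vector register and back cannot pay for itself here.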