Intel's current auto-vectorization works only if the GlobalWorkSize is divisible by the length of the vector to be vectorized. This restriction is similar to the one on memory alignment. Although allocating an aligned vector is not difficult, it can easily require an additional copy operation, which nullifies the possible gains from proper alignment and increases memory usage. Please consider the following enhancement to address this issue:
1.) Process the elements of the unaligned array up to the point of alignment, and use the faster code from there on.
2.) Process the elements after the final aligned address with the slower code.
I tried doing this manually and achieved pretty good timings compared to the case with an aligned address and size, but the code is long and tedious to write within the same kernel.
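
To make the two steps concrete, here is a rough sketch of the split in plain host-side C rather than in an OpenCL kernel. It is only an illustration, not the code I actually timed: `VEC_WIDTH`, `peel_count` and `scale_in_place` are hypothetical names, the vector width of 4 floats is an assumption, and the inner fixed-length loop merely stands in for the vectorized part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_WIDTH 4                            /* assumed vector width in floats */
#define VEC_BYTES (VEC_WIDTH * sizeof(float))  /* assumed alignment in bytes */

/* Hypothetical helper: number of leading elements that lie before the
   first VEC_BYTES-aligned address (capped at n). Assumes p is at least
   aligned to sizeof(float). */
static size_t peel_count(const float *p, size_t n)
{
    size_t mis = (size_t)((uintptr_t)p % VEC_BYTES);
    if (mis == 0)
        return 0;
    size_t peel = (VEC_BYTES - mis) / sizeof(float);
    return peel < n ? peel : n;
}

/* Hypothetical example operation: multiply every element by factor. */
static void scale_in_place(float *data, size_t n, float factor)
{
    assert(data != NULL || n == 0);

    size_t i = 0;
    size_t peel = peel_count(data, n);

    /* 1.) Slow scalar code up to the point of alignment. */
    for (; i < peel; ++i)
        data[i] *= factor;

    /* Fast path: aligned start, whole blocks of VEC_WIDTH elements.
       This is the part a vectorizer could turn into vector operations. */
    size_t main_end = peel + ((n - peel) / VEC_WIDTH) * VEC_WIDTH;
    for (; i < main_end; i += VEC_WIDTH)
        for (size_t k = 0; k < VEC_WIDTH; ++k)
            data[i + k] *= factor;

    /* 2.) Slow scalar code after the final aligned block. */
    for (; i < n; ++i)
        data[i] *= factor;
}
```

Calling `scale_in_place(buf + 1, 17, 2.0f)` on a float buffer exercises all three parts whenever `buf + 1` is not 16-byte aligned. In a kernel the same three-way split has to be expressed through work-item indices, which is where it becomes tedious to write by hand.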