Hi!
Currently, auto-vectorization is enabled only if WorkSize is a multiple of 4. Would it be possible to implement an optimization that vectorizes the first (WorkSize / 4) * 4 items and processes only the remainder as scalar? The same could be done (in addition or instead) for the beginning of the range, which would also remove the constraint for 256-byte alignment.
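
To make the request concrete, here is a rough sketch in plain C of the kind of splitting I mean. It is only an illustration and not a claim about how the compiler works internally: the names `scale`, `peel_count` and `VEC_WIDTH` are made up for this example, the multiply is a placeholder for the real kernel body, and the fixed-size inner loop merely stands in for one 4-wide vector operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_WIDTH 4 /* assumed vector width, matching the "multiple of 4" rule */

/* Hypothetical helper: how many leading elements must be handled as scalar
 * before p reaches an `align`-byte boundary (capped at n).
 * Assumes p is at least aligned to sizeof(float). */
static size_t peel_count(const float *p, size_t n, size_t align)
{
    assert(align > 0);
    size_t misalign = (size_t)((uintptr_t)p % align);
    if (misalign == 0)
        return 0;
    size_t elems = (align - misalign) / sizeof(float);
    return elems < n ? elems : n;
}

/* Hypothetical kernel: multiply every element by `factor`. */
static void scale(float *data, size_t n, float factor, size_t align)
{
    size_t i = 0;

    /* 1. Scalar prologue: run until the pointer is aligned. */
    size_t head = peel_count(data, n, align);
    for (; i < head; ++i)
        data[i] *= factor;

    /* 2. Main body: the largest multiple of VEC_WIDTH that still fits.
     *    The inner loop stands in for a single vector instruction. */
    size_t vec_end = i + ((n - i) / VEC_WIDTH) * VEC_WIDTH;
    for (; i < vec_end; i += VEC_WIDTH)
        for (size_t k = 0; k < VEC_WIDTH; ++k)
            data[i + k] *= factor;

    /* 3. Scalar epilogue: at most VEC_WIDTH - 1 leftover elements. */
    for (; i < n; ++i)
        data[i] *= factor;
}
```

With this split, a WorkSize of 10 on an already aligned buffer would run 8 items through the vector body and 2 as scalar, instead of falling back to scalar code for all 10.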
Thanks!
Atmapuri


