Enforcing Loop Vectorization with Array Notations

I wonder what might be preventing the compiler from vectorizing the innermost loop in the following function:

```cpp
template <typename T>
inline void MatrixVectorProduct(const matrix<T>& m,
                                const std::vector<T>& rhs,
                                std::vector<T>& lhs)
{
    size_t cols = m.cols();
    const T* restrict pcol = &(*rhs.begin());
    // outer loop (/Qvec-report:3): nonstandard loop is not a
    // vectorization candidate (fine!)
    _Cilk_for(size_t i = 0; i < m.rows(); ++i)
    {
        const T* prow = &(*(m.begin() + i * cols));
        // inner loop (/Qvec-report:3): modifying order of operation
        // not allowed under given switches (?)
        lhs[i] = __sec_reduce_add(prow[0:cols] * pcol[0:cols]);
    }
}
```

Under the switches /O3 /Qstd=c99 /Qopenmp /Qfp-speculation:safe /Qrestrict /arch:SSE2, this function's performance approaches that of Intel MKL's cblas_dgemv().

Cheers,
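For reference, the reduction that the array notation expresses is equivalent to the following plain scalar loop. This is a sketch using raw pointers and an assumed contiguous, row-major layout, since the original `matrix` class is not shown in the thread:

```cpp
#include <cstddef>

// Scalar equivalent of lhs[i] = __sec_reduce_add(prow[0:cols] * pcol[0:cols]).
// Assumes m points to rows*cols elements stored contiguously in row-major order.
template <typename T>
void MatrixVectorProductScalar(const T* m, const T* rhs, T* lhs,
                               std::size_t rows, std::size_t cols)
{
    for (std::size_t i = 0; i < rows; ++i) {
        const T* prow = m + i * cols;   // start of row i
        T sum = T(0);
        for (std::size_t j = 0; j < cols; ++j)
            sum += prow[j] * rhs[j];    // this left-to-right accumulation order is
                                        // exactly what the vectorizer must reassociate
        lhs[i] = sum;
    }
}
```

Vectorizing the inner `sum +=` loop requires splitting it into several partial sums that are combined at the end, which changes the accumulation order; that is the "modifying order of operation" the report refers to.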


When I tried this, instantiated with both float and double and using both the Windows and Linux versions of the compiler, the vectorization report I get says that the inner loop is vectorized. What does your matrix class look like? Are you using the released version of the compiler, or a beta, or something else?

- Pablo

I think we may have figured out the trigger here. Assuming you're building out of the IDE, is /fp:precise specified by default? Try changing to /fp:fast if it is.

The question I'm following up on is whether this behavior of the vectorizer makes sense in the context of array notations.

Indeed, I intentionally specify /fp:precise.

After switching to /fp:fast, the loop is vectorized. However, it crashes at runtime, with the thread's call stack stalled right at the loop.

My matrix class uses contiguous storage and row-major layout. I am using the Intel Composer 2011 XE Update 1 (12.1.127).

Jorge,

If you turn on /W4, do you get any remarks like the following?

remark #18009: A temporary array is allocated to resolve data dependencies

If so, I think you might have a stack overflow caused by some of the array notation code. Let me know - I have an open problem report on this that I can link this thread to.

Brandon,

After turning on /W4, I found no remarks. Under /fp:fast the MatrixVectorProduct() function (thread #1) builds and runs.

On the other hand, the following function works under /fp:precise (without innermost-loop vectorization), whereas under /fp:fast the innermost loop is vectorized but the program crashes at runtime with an unhandled access violation:

```cpp
template <typename T>
inline void MatrixProduct(const matrix<T>& m,
                          const matrix<T>& rhs,
                          matrix<T>& lhs)
{
    // assert(...) on all dimensions
    size_t mcols = m.cols();
    size_t ncols = rhs.cols();
    const T* pcol = &(*rhs.begin()); // restrict pointer candidate
    _Cilk_for(size_t i = 0; i < m.rows(); ++i)
    {
        const T* prow = &(*(m.begin() + i * mcols));
        for (size_t j = 0; j < ncols; ++j)
        {
            // access violation when vectorized
            lhs[i][j] = __sec_reduce_add(prow[0:mcols] * pcol[j:mcols:ncols]);
        }
    }
}
```

Compiler:

/c /O2 /Ob2 /Oi /Ot /Oy /Qipo /I "C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\include\ia32" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /Gy /arch:SSE2 /fp:fast /Fo"Release/" /Fd"Release/vc90.pdb" /W4 /nologo /Zi /Qopenmp /Quse-intel-optimized-headers /Qstd=c99 /Qrestrict /Qvec-report3

Linker:

mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /INCREMENTAL:NO /nologo /LIBPATH:"C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\lib\ia32" /NODEFAULTLIB:"libcmt.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /DYNAMICBASE /NXCOMPAT /MACHINE:X86

Cheers,
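For readers unfamiliar with strided sections: `pcol[j:mcols:ncols]` starts at `pcol[j]`, takes `mcols` elements, and steps by `ncols`, i.e. it walks column j of the row-major rhs matrix. A scalar sketch of the inner reduction (the names and the free-function form are mine, not from the thread):

```cpp
#include <cstddef>

// Scalar equivalent of __sec_reduce_add(prow[0:mcols] * pcol[j:mcols:ncols]):
// dot product of row prow (mcols elements, unit stride) with column j of a
// row-major matrix whose rows are ncols elements wide.
template <typename T>
T DotRowColumn(const T* prow, const T* pcol,
               std::size_t j, std::size_t mcols, std::size_t ncols)
{
    T sum = T(0);
    for (std::size_t k = 0; k < mcols; ++k)
        sum += prow[k] * pcol[j + k * ncols]; // strided read: column j, row k
    return sum;
}
```

The strided (non-unit-stride) read on `pcol` is what makes this inner loop harder for the vectorizer than the unit-stride case in MatrixVectorProduct().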

Hi Jorge,

This definitely looks like a compiler issue from what you've sent me. The vectorizer is doing something improperly, I think. I've created a problem report for our vectorizer team, and I'll update the thread as their investigation proceeds.

Brandon,

I think I've found an answer to our follow-up question in a related article:
http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

The behavior makes sense in this context: the /fp:precise model allows only value-safe optimizations, and the reduction loop in __sec_reduce_add() implies reassociation of the sums, which is value-unsafe.

The question remains why it fails under /fp:fast, though.

Regards,

Hi Jorge,

Correct. Because /fp:precise is specified, the compiler can't safely vectorize the array notation reduction. However, the code crashing after vectorization still looks like a real issue to me.

Brandon,

I agree, and a very important one indeed. I look forward to hearing about it.

Cheers,

Hi Jorge,

We've put a fix for this issue into Update 3. Please try Update 3 and let me know if you still have problems.
