Enforcing Loop Vectorization with Array Notations

Enforcing Loop Vectorization with Array Notations

I wonder what would be preventing the compiler from vectorizing the innermost loop in the following function (e.g.): template inline void MatrixVectorProduct(const matrix& m, const std::vector& rhs, std::vector& lhs) { size_t cols = m.cols();
const T* restrict pcol = &(*rhs.begin()); //outer loop (/Qvec-report:3): nonstandard loop is not a vectorization candidate (Fine!) _Cilk_for(size_t i=0; i { const T* prow = &(*(m.begin() + i * cols)); //inner loop(/Qvec-report:3): modifying order of operation not allowed under given switches (?) lhs[i] = __sec_reduce_add( prow[0:cols] * pcol[0:cols] ); } }Under the switches: /O3 /Qstd=c99 /Qopenmp /Qfp-speculation:safe /Qrestrict /arch:SSE2, this function's performance approaches Intel MKL's 'cblas_dgemv()'.Cheers,

13 post / 0 nuovi
Ultimo contenuto
Per informazioni complete sulle ottimizzazioni del compilatore, consultare l'Avviso sull'ottimizzazione

When I tried this, instantiated with both float and double and using both the Windows and Linux versions of the compiler, the vectorization report I get says that the inner loop is vectorized. What does your matrix class look like? Are you using the released version of the compiler, or a beta, or something else?

- Pablo

I think we may have figured out the trigger here. Assuming you're building out of the IDE, is /fp:precise specified by default? Try changing to /fp:fast if it is.

The question I'm following up on is whether this behavior of the vectorizer makes sense in the context of array notations.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support: http://premier.intel.com

Software Product Support info: http://www.intel.com/software/support

Indeed, I intentionally specify /fp:precise.

After switching to /fp:fast, the loop is vectorized. However, it crashes at runtime with thread/call stack stalled right at the loop.

My matrix class uses contiguous storage and row-major layout. I am using the Intel Composer 2011 XE Update 1 (12.1.127).

Jorge,

If you turn on /W4, do you get any remarks like the following?

remark #18009: A temporary array is allocated to resolve data dependencies

If so, I think you might have a stack overflow caused by some of the array notation code. Let me know - I have an open problem report on this that I can link this thread to.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support: http://premier.intel.com

Software Product Support info: http://www.intel.com/software/support

Brandon,After turning on /W4, I found no remarks. Under /fp:fast the MatrixVectorProduct() function (thread #1) builds and runs.On the other hand, the following function works under /fp:precise (w/o innermost loop vectorization) whereas under /fp:fast the innermost loop is vectorized but it crashes at runtime due to an unhandled access violation. template inline void MatrixProduct(const matrix& m, const matrix& rhs, matrix& lhs) { //assert(...) on all dimensions size_t mcols = m.cols(); size_t ncols = rhs.cols(); const T* pcol = &(*rhs.begin());//restrict pointer candidate _Cilk_for(size_t i=0; i { const T* prow = &(*(m.begin() + i * mcols)); for(size_t j=0; j { lhs[i][j] = __sec_reduce_add(prow[0:mcols] * pcol[j:mcols:ncols]);//acc violation on vect } }}Compiler:/c /O2 /Ob2 /Oi /Ot /Oy /Qipo /I "C:\\Program Files (x86)\\Intel\\ComposerXE-2011\\mkl\\include\\ia32" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /Gy /arch:SSE2 /fp:fast /Fo"Release/" /Fd"Release/vc90.pdb" /W4 /nologo /Zi /Qopenmp /Quse-intel-optimized-headers /Qstd=c99 /Qrestrict /Qvec-report3Linker:mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /INCREMENTAL:NO /nologo /LIBPATH:"C:\\Program Files (x86)\\Intel\\ComposerXE-2011\\mkl\\lib\\ia32" /NODEFAULTLIB:"libcmt.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /DYNAMICBASE /NXCOMPAT /MACHINE:X86Cheers,

Hi Jorge,

This definitely looks like a compiler issue from what you've sent me. The vectorizer is doing something improperly, I think. I've created a problem report for our vectorizer team, and I'll update the thread as their investigation proceeds.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support: http://premier.intel.com

Software Product Support info: http://www.intel.com/software/support

Brandon,I think I've found an answer to our follow-up question in a related article at:http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/It seems that the behavior makes sense on this context due to the fact that the /fp:precise model allows only value-safe optimizations. The reduction loop in __sec_reduce_add() implies sums reassociation, making it value-unsafe.Question remains on why it does fail under /fp:fast though.Regards,

Hi Jorge,

Correct. Because /fp:precise is specified, the compiler can't safely vectorize the array notation reduction. However, the code crashing after vectorization is still an issue it seems to me.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support: http://premier.intel.com

Software Product Support info: http://www.intel.com/software/support

Brandon,I agree. A very important one indeed.I look forward to hearing from that.Cheers,

Hi Jorge,

We've put a fix in on update 3 for this issue. Try update 3, and let me know if you still have problems.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support: http://premier.intel.com

Software Product Support info: http://www.intel.com/software/support

Lascia un commento

Eseguire l'accesso per aggiungere un commento. Non siete membri? Iscriviti oggi