Would I be better off using an MKL dot product call or relying on ICC to optimise a dot product function?

I have some code that spends most of its time in dot product calls. From a performance perspective, would I be better off replacing these calls with an MKL dot product call, or relying on ICC to optimise the dot product function? The dot product code is very simple and could have restrict qualifiers added to it. The target CPU supports the SSE4 instructions, so it can make use of compiler vectorisation.


What are the typical array sizes in your tasks?

In principle, if the array sizes aren't large enough to benefit from a combination of vector and threaded parallel reduction, the compiler's in-line optimization could out-perform MKL dot product.
For array sizes around 1000, I would expect similar performance either way. Smaller problems should run faster with the compiler's in-line code.
Unfortunately, standards-compatibility options such as "icc -fp-model source" disable the compiler's optimization of the dot product, so in that case you would be more likely to consider MKL.
Also, you must take care to write the source code in a way that lets the compiler optimize it. You may need to accumulate the sum in a local scalar, or possibly add restrict qualifiers, to eliminate aliasing concerns. A BLAS function call implicitly prevents aliasing.
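As a rough illustration of the advice above, here is a minimal sketch of a dot product loop written so that a vectorizing compiler has a fair chance with it. The function name `dot_restrict` is hypothetical, and `__restrict` is a compiler extension (accepted by ICC, GCC, Clang and MSVC) rather than standard C++; in C99 the keyword is `restrict`. Whether the loop actually vectorizes depends on the compiler options in use.

```cpp
#include <cstddef>

// Hypothetical helper, not taken from the thread.
// __restrict is a compiler extension that promises the two arrays
// are not modified through any other pointer during the call.
double dot_restrict(const double* __restrict a,
                    const double* __restrict b,
                    std::size_t n)
{
    double sum = 0.0;  // local scalar accumulator, never written through a pointer
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];  // unit-stride access to both arrays
    }
    return sum;
}
```

Keeping the accumulator in a local variable means the compiler does not have to assume that a store to the running sum could change `a` or `b`. Vectorizing the reduction reorders the additions, which is why strict floating-point models can block it.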
The STL's std::inner_product(), if applicable, avoids the time which the BLAS function would spend checking which method is appropriate, as it supports only unity strides.
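For reference, a minimal sketch of the std::inner_product() route might look like the following. The wrapper name `dot_stl` and the use of std::vector are assumptions for illustration; the original code may well use raw arrays.

```cpp
#include <numeric>
#include <vector>

// Hypothetical wrapper, not taken from the thread.
// Assumes b holds at least as many elements as a.
double dot_stl(const std::vector<double>& a, const std::vector<double>& b)
{
    // The initial value must be 0.0 (a double); passing the integer 0
    // would make the accumulation happen in int.
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}
```

Since the iterators walk contiguous storage, the compiler sees the same unit-stride reduction as a hand-written loop and can inline it at the call site.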
SSE4 would be needed only for non-unity strides. I don't know whether MKL would implement both unity and non-unity strided vectorized versions (taking additional time to choose among them).

The biggest dataset I'm using in the model code is 1000x40000. The Institute is currently processing 3500x50000 arrays and expects to be processing 8000x50000 arrays next month.

In this case, try the MKL routines first for data sets of this size. -Gennady

Thanks for the reply.

Could you give some pointers on what benefits MKL might deliver for these large arrays, compared to small arrays, relative to roll-your-own code built with ICC?

Regards

David
