daxpby doesn't scale

I've discovered a strange performance problem: replacing a daxpy() call with daxpby() significantly degrades program scalability. I've written a simple test case (hopefully attached) which shows that, while daxpy() performance increases as I increase the number of threads, daxpby() performance remains completely unchanged. It's as if daxpby() isn't parallelized!

I'm using dense, aligned, million-element vectors of doubles. The vectors are uninitialized, but that shouldn't matter, and I'm using the same vectors for both calls.

I see the same behavior with mkl_2013.0.079 and mkl_2013_sp1.1.106.

Here's my program's output:

Threads=1; 1450.095253 daxpy() calls/sec
Threads=1; 1391.271988 daxpby() calls/sec
Threads=2; 2810.048711 daxpy() calls/sec
Threads=2; 1387.726834 daxpby() calls/sec
Threads=3; 4056.252211 daxpy() calls/sec
Threads=3; 1371.165840 daxpby() calls/sec
Threads=4; 5288.407756 daxpy() calls/sec
Threads=4; 1390.385973 daxpby() calls/sec
Threads=5; 6248.746533 daxpy() calls/sec
Threads=5; 1394.361390 daxpby() calls/sec
Threads=6; 7418.605490 daxpy() calls/sec
Threads=6; 1388.381384 daxpby() calls/sec
Threads=7; 8796.486642 daxpy() calls/sec
Threads=7; 1384.350895 daxpby() calls/sec
Threads=8; 9896.086698 daxpy() calls/sec
Threads=8; 1387.321742 daxpby() calls/sec


Note that the number of daxpy() calls per second increases as the number of threads increases, while the daxpby() throughput remains constant.

Attachment: daxpby_bug.c (986 bytes)

Thanks, Peter, for the reproducer and for bringing this matter to our attention.  You are right – daxpy is parallelized, while daxpby is not.  We will look into this matter.  Any further information you can provide as to use cases (number of elements, number of threads, platforms used, etc.) would be helpful.  Thank you!

Hi Murat,

I've tested this on a Linux Xeon E5-2660 system, a Xeon X5690 system running Linux kernel 3.4.63, and a Xeon Phi coprocessor. All three demonstrate the same problem.

I believe that I can work around the problem by chunking the vector manually, iterating over the chunks in an OpenMP parallel for loop, and calling daxpby for each chunk.

In general, we expect daxpby performance to be identical to daxpy performance when b = 1.0, and we expect performance to scale similarly for b != 1.0. (In fact, I'm a bit surprised that daxpy isn't simply a wrapper around daxpby.)

Hi Peter,

Thank you for the additional information! Your expectations are reasonable. We will consider parallelization of daxpby in our next planning meeting.

Hi Peter,

Please check the threaded daxpby implementation in the latest versions, MKL 11.2 and MKL 11.1 Update 4, which were released recently.


