minimum / optimal block size for ScaLAPACK and BLAS?

ScaLAPACK arrays are distributed in a block-cyclic fashion over the process "grid".  ScaLAPACK then uses the PBLAS and BLACS to perform BLAS-like operations in a distributed SPMD fashion, which more or less amounts to a mix of communication between processes and BLAS operations within each process.
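To make the block-cyclic layout concrete, here is a minimal sketch of the index mapping along one dimension of the process grid. The function names are hypothetical (they are not ScaLAPACK or MKL routines); `local_count` is meant to play the same role as ScaLAPACK's NUMROC tool routine, and `src` is assumed to be the process that holds the first block.

```python
def owner_and_local(g, nb, nprocs, src=0):
    """Map a 0-based global index g to (process coordinate, 0-based local index)
    along one dimension of a block-cyclic distribution with block size nb."""
    block = g // nb                          # global block that contains g
    proc = (src + block) % nprocs            # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb  # full local blocks before it, plus offset
    return proc, local


def local_count(n, nb, iproc, nprocs, src=0):
    """Number of the n global indices that land on process iproc."""
    dist = (iproc - src) % nprocs        # distance from the source process
    nblocks = n // nb                    # number of complete blocks
    count = (nblocks // nprocs) * nb     # every process gets this many at least
    extra = nblocks % nprocs             # leftover complete blocks
    if dist < extra:
        count += nb                      # one more complete block
    elif dist == extra:
        count += n % nb                  # the trailing partial block, if any
    return count


if __name__ == "__main__":
    # 10 indices, block size 2, 3 processes along this dimension
    for g in range(10):
        print(g, owner_and_local(g, nb=2, nprocs=3))
    print([local_count(10, 2, p, 3) for p in range(3)])
```

A 2-D distribution applies the same mapping independently to rows (over the process rows) and columns (over the process columns).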

So the block size is going to affect the performance of both the communication and the BLAS calls, but the degree to which it does depends on the implementation.  The MKL implementation is a black box to the end user (me), and I don't have an ATLAS-like search tool to point me toward the block size I should be using, especially when the parameters are things like { gigabit Ethernet vs 10G InfiniBand vs ... } and { Westmere vs Sandy/Ivy Bridge vs Haswell }, etc.

So... are there any guidelines for choosing the block size when using MKL ScaLAPACK, LAPACK, and BLAS?

  • E.g. is it important for the ScaLAPACK block size to be a cache-friendly size (e.g. no larger than 1/2 of L1 or L2, etc.)?
  • Or alternatively does the ScaLAPACK block size affect primarily the load balancing as an operation that works on successively smaller areas of a matrix as many of the algorithms do?  And is it not relevant to the efficiency of block-matrix multiplies at the BLAS level?
  • Perhaps the MKL Level-3 BLAS calls are themselves made less sensitive to large block sizes?  (E.g. because there is re-blocking within gemm() etc, anyway where threads are exploited by OpenMP etc ... maybe the MKL BLAS is already subdividing (re-blocking) to be as efficient as it can, given that it gets a large enough block?)
  • If I want to avoid such hypothesized re-blocking, because for some reason there are places where I can manage this block size "for free" as a side effect of the way my code is structured, is there an optimal block size for MKL Level-3 BLAS calls?
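The load-balancing concern in the second bullet can be illustrated with a small sketch. It assumes a square n x n matrix on an nprow x npcol grid and an algorithm that works on the trailing submatrix starting at row/column `lo`; it only counts how many trailing elements each process owns and says nothing about how MKL actually schedules work. The function names are made up for this illustration.

```python
def owned(lo, n, nb, p, nprocs):
    """Count the global indices in [lo, n) that a 1-D block-cyclic layout
    with block size nb assigns to process p."""
    return sum(1 for g in range(lo, n) if (g // nb) % nprocs == p)


def imbalance(lo, n, nb, nprow, npcol):
    """Largest local share of the trailing (n-lo) x (n-lo) submatrix,
    divided by the ideal (perfectly even) share. 1.0 means perfect balance."""
    rows = max(owned(lo, n, nb, p, nprow) for p in range(nprow))
    cols = max(owned(lo, n, nb, q, npcol) for q in range(npcol))
    ideal = (n - lo) ** 2 / (nprow * npcol)
    return rows * cols / ideal


if __name__ == "__main__":
    n, nprow, npcol = 4096, 2, 2
    for nb in (32, 64, 256, 1024):
        ratios = [imbalance(lo, n, nb, nprow, npcol) for lo in range(0, n, n // 8)]
        print(nb, ["%.2f" % r for r in ratios])
```

In this toy model, larger blocks leave the trailing submatrix concentrated on fewer processes as the algorithm progresses, which is the trade-off against whatever the local BLAS gains from bigger blocks.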

Sorry for some sloppiness in my writing.  If I knew how to edit my post, I would make these changes:

  • [in bullet 2] "load balancing as an operation" -> of an operation
  • [in bullet 3] "because there is ... OpenMP etc" -> because maybe there is  ... OpenMP etc?
  • [in bullet 4] "because for some reason" -> because if for some reason

For example, in the case of an Ivy Bridge system such as:

Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846

Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )

an optimal block size depends on the sizes of these caches, so you need to take the L1/L2/L3 sizes of your own system into account.
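As a rough illustration of that arithmetic, the sketch below computes the largest square block of doubles that fits in a given fraction of each cache listed above. The "half of the cache" fraction is the hypothesis raised in the original question, not a documented MKL guideline, and the function name and the `n_operands` parameter are assumptions made for this example.

```python
import math


def max_square_block(cache_bytes, fraction=0.5, elem_bytes=8, n_operands=1):
    """Largest nb such that n_operands blocks of nb x nb elements
    (elem_bytes each, 8 for double precision) fit in `fraction` of the cache."""
    budget = cache_bytes * fraction / (n_operands * elem_bytes)  # elements per block
    return math.isqrt(int(budget))


KB = 1024
caches = {
    "L1 data, per core": 32 * KB,
    "L2, per core": 256 * KB,
    "L3, shared": 8 * 1024 * KB,
}

if __name__ == "__main__":
    for name, size in caches.items():
        one = max_square_block(size)                  # a single block resident
        three = max_square_block(size, n_operands=3)  # e.g. A, B and C blocks of a gemm
        print(f"{name}: nb <= {one} (one block), nb <= {three} (three blocks)")
```

These numbers are only upper bounds from capacity; the block size that is actually fastest still has to be found by timing on the target machine.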

Also, there was a recent post reporting that, on a Haswell system, the minimal block size for some memory-bound processing, such as copying from memory location A to location B, is 1920 bytes ( 64 * 30 ); it was selected after a series of tests.
