AFAIK and in order to invoke MKL's cblas_dgemm I need three different matrices namely A,B,C:
C = alpha*A*B + beta*C
To compute A=A*B I seem to have to allocate a separate copy of A into C which is a great inconvenience for the algorithm I'm trying to implement (blocked accumulated Givens). This API constraints forces me to continuosly copy patches out of the main matrix and apply the transformation and copy the patch back to the matrix. So I am afraid that the gain due to the BLAS3 and blocking: parallellism. locality, vectorization of the MMM might not pay off due to the many copying around I'm forced to by using MKL's cblas_dgemm.
Is there a way to circumvent this? when I am done with it, I plan to post the algorithm for possible feedback on how to improve it.
The algorithm takes as input a matrix in upper hessenberg form with 1-off diagonal set of elements that are non-zero and need to be annihilated. Instead of invoking dlartg and dlasr in a loop (very bad locality and does not scale well w.r.t. to the problem sizes) I try to work on top of blocks of size NB patches over the affected diagonal, apply plane rotation on top of the NB block diagonal and at the same time accumulate the rotations into a Q orthogonal matrix which is then applied at once to the row band next to the main working patch seating on the diagonal. Hope is somewhat clear.