I often need to calculate the sum of a set of matrices or submatrices of a dataset. Unfortunately, the two matrices do not always have the same stride, because I am selectively using a subset of a large dataset. This means I have to resort to calculating the sum by hand. Alternatively, I could call vkadd or similar once per row, but I'm not sure how much overhead is implied by calling vkadd 500 or 1000 times for a 500x500 matrix.

I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays cannot overlap, which means I would need an extra temporary matrix. While I would assume that calculating A = A + m * B works in place when the matrices are not transposed, I cannot use that approach unless this is guaranteed for all future versions.
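For reference, here is a minimal sketch of the by-hand fallback for A = A + m * B when the two submatrices have different strides. It assumes row-major storage and double precision; the function name `submatrix_axpy` and the parameters `lda` and `ldb` (elements between the starts of consecutive rows) are hypothetical names chosen for illustration, not part of MKL.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): in-place A = A + m * B on
 * row-major submatrices whose leading dimensions lda and ldb may differ.
 * a and b point at the first element of each submatrix. */
void submatrix_axpy(size_t rows, size_t cols, double m,
                    const double *b, size_t ldb,
                    double *a, size_t lda)
{
    for (size_t i = 0; i < rows; ++i) {
        const double *b_row = b + i * ldb;  /* start of row i in B */
        double *a_row = a + i * lda;        /* start of row i in A */
        for (size_t j = 0; j < cols; ++j)
            a_row[j] += m * b_row[j];       /* only the submatrix is touched */
    }
}
```

The inner loop is the same operation that a BLAS axpy call (y = alpha * x + y) performs on one contiguous row, so the per-row library variant would replace it with one call per row. Whether that is faster than the plain loop at this size is exactly the overhead question above.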

Are there any other functions I have missed that could be used for this calculation?