I need to compute a matrix cross-product of the form B = A' * A. The result B is symmetric, so it should be possible to compute only the upper or lower triangle of B and mirror it to fill the other half, saving roughly 50% of the computation time. However, I cannot find a method or option that does this.
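To make the expected saving concrete, here is a rough operation count. It assumes A is a real m-by-n matrix (the symbols m and n are introduced here for illustration) and counts each entry of B as one inner product of length m, i.e. about 2m floating-point operations:

$$
\text{full product: } n^2 \cdot 2m = 2mn^2,
\qquad
\text{one triangle: } \frac{n(n+1)}{2} \cdot 2m = mn(n+1),
\qquad
\frac{mn(n+1)}{2mn^2} = \frac{n+1}{2n} \longrightarrow \frac{1}{2} \ \text{as } n \to \infty .
$$

So the 50% figure is an upper bound on the arithmetic saved, approached only for large n, and it ignores the cost of mirroring the triangle.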
I have tried implementing this manually by multiplying row vectors or row blocks of A' by A and storing the results in the corresponding blocks of B. However, depending on the block size, the overhead of the multiple calls either makes performance worse (for very small blocks) or limits the gain to less than 50%.
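A minimal sketch of the blocked approach described above, written in Python with NumPy purely for illustration; it is not the original code. The function name `crossprod_upper_blocked` and the default block size of 64 are hypothetical, and A is assumed to be real, so that A' is a plain transpose.

```python
import numpy as np

def crossprod_upper_blocked(A, block=64):
    """Compute B = A.T @ A, evaluating only blocks on or above the diagonal.

    Assumes A is a real 2-D array. `block` is an illustrative tuning knob,
    not a recommended value.
    """
    n = A.shape[1]
    B = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, block):
        Ai = A[:, i:i + block]
        for j in range(i, n, block):
            # Block (i, j) of B; blocks with j < i are never computed.
            B[i:i + block, j:j + block] = Ai.T @ A[:, j:j + block]
    # Mirror the strict upper triangle into the lower triangle.
    return np.triu(B) + np.triu(B, 1).T

# Usage: compare against the full product on a small random matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 17))
B = crossprod_upper_blocked(A, block=5)
print(np.allclose(B, A.T @ A))
```

The trade-off mentioned above is visible in the loop structure: a smaller block means more separate multiplication calls, while a larger block means each diagonal block redundantly computes both of its triangles, so the arithmetic saved falls further short of 50%.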
Alternatively, what block size would be optimal to minimize the overhead of the multiple calls and of spinning up threads? Is any information available on how the multiplication algorithm partitions the data across threads internally?