I've been experimenting with the MKL Summary Statistics (VSL SS) library for computing correlation matrices. I'm using a large sample matrix (10000x6000) of random double values. The VSL_SS_METHOD_FAST algorithm computes the correlation matrix reasonably quickly. However, it never seems to use more than 6 threads, no matter how I tweak it. I'd expect a matrix this large to benefit from more threads, but even on the Intel Xeon Phi I can't persuade MKL to use more than 6.
(I've determined that 6 threads are in use both by watching my test case in top and by setting KMP_AFFINITY=verbose and inspecting its output.)
I call mkl_set_num_threads(240) and mkl_set_dynamic(0) before running my test case. Afterwards, I compute the dot product of a large vector with itself using ddot. That call uses 240 threads, which shows that MKL itself can run that many; only the VSL SS functions seem to be restricted to 6.
I've confirmed that the problem occurs with both VSL_SS_METHOD_FAST and VSL_SS_METHOD_1PASS, and with both VSL_SS_MATRIX_STORAGE_ROWS and VSL_SS_MATRIX_STORAGE_COLS.
How can I fully exploit available processors (particularly on the Phi) to compute correlation matrices?
(I've tried attaching a test case to this, but I keep getting an AJAX HTTP error 550 from the forum.)