I'm porting some Matlab code to C, both to make it standalone and to try to get it running faster. The computational bottleneck is 1000x1000 double-precision complex matrix-vector multiplications.
So far I'm using MKL's cblas_zgemv with column-major matrices, compiling with gcc-4.0, and comparing against Matlab. My MKL version is 30-50% slower than Matlab. As best I understand, Matlab uses MKL under the hood, and some people I've talked with say Matlab customizes its MKL implementation in some way, runs it parallelized, and can't be beat. Is that so? Can I get closer to matching it?
How can I tell whether Matlab's implementation is running parallelized on my Core 2 Duo (OS X 10.5)? In OS X's Activity Monitor, the CPU readings don't look like both cores are being maxed out at the same time; it looks like the demand is trading back and forth between them.
If I understand the MKL docs correctly, zgemv is not parallelized, so I shouldn't expect to see it running in parallel. Is that correct? I suppose I could parallelize it manually with Intel TBB by processing the matrix as two or more chunks. It's something to try, unless there's an obvious problem with that, like thread overhead.
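To make the chunking idea concrete, here is a rough sketch of what I have in mind. It is not TBB: to stay in plain C it uses an OpenMP pragma instead, which assumes an OpenMP-capable compiler (ICC, or gcc 4.2 and later; my gcc-4.0 would just ignore the pragma and run the chunks serially). The inner loop is a naive stand-in for cblas_zgemv, and the names zgemv_rows and zgemv_chunked are made up for illustration.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Naive stand-in for cblas_zgemv on one block of rows:
   y[row0..row1) = A(row0:row1, :) * x, with A stored column-major. */
static void zgemv_rows(size_t row0, size_t row1, size_t n,
                       const double complex *a, size_t lda,
                       const double complex *x, double complex *y)
{
    for (size_t i = row0; i < row1; ++i)
        y[i] = 0.0;
    /* Walk one column at a time so accesses to A stay contiguous. */
    for (size_t j = 0; j < n; ++j) {
        const double complex *col = a + j * lda;
        const double complex xj = x[j];
        for (size_t i = row0; i < row1; ++i)
            y[i] += col[i] * xj;
    }
}

/* Compute y = A * x by splitting the m rows into nchunks blocks. */
void zgemv_chunked(size_t m, size_t n, const double complex *a, size_t lda,
                   const double complex *x, double complex *y, int nchunks)
{
    int c;
    assert(nchunks >= 1);
    assert(lda >= m);
    /* Each chunk writes a disjoint slice of y, so no locking is needed.
       If OpenMP is not enabled, the pragma is ignored and this runs serially. */
    #pragma omp parallel for
    for (c = 0; c < nchunks; ++c) {
        size_t row0 = m * (size_t)c / (size_t)nchunks;
        size_t row1 = m * ((size_t)c + 1) / (size_t)nchunks;
        zgemv_rows(row0, row1, n, a, lda, x, y);
    }
}
```

With MKL, I assume the body of zgemv_rows would become a single cblas_zgemv call on the sub-block, passing row1 - row0 as the row count, a + row0 as the matrix pointer with the same lda, and y + row0 as the output. Whether two chunks actually beat one call is exactly what I'd want to measure, since the thread start-up cost is paid on every mat-vec.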
The next thing on my list to try is the ICC 11.1 compiler instead of gcc-4.0, in case that affects the MKL library's performance in some way.
Any other suggestions?
Also, how do I properly compare the speed of Matlab vs MKL in C? I'm writing C mex files to use MKL from Matlab. I have some simple scripts to test timing on the mat-vec multiplications: I perform a mat-vec multiplication N times in Matlab (I've tried both a for loop and manually "unrolling" each iteration in the script, with no time difference), then N times in my mex routine, and compare the times reported by Matlab's tic/toc functions. In this test, MKL via C mex is about 2x faster than Matlab. But in my full-on app, Matlab still runs faster for the mat-vec portions, again measured with Matlab's tic/toc. When I skip the mat-vec portions of my app (leaving mainly vector math), my C mex code runs 7x faster than Matlab. I imagine there are some issues with Matlab running less efficiently in certain scripts, or more efficiently in certain memory/caching situations, but I'm wondering overall if there's some better approach to comparing timing.
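For the C side, one thing I'm considering is timing inside the C code itself rather than through tic/toc around the mex call, so the mex call overhead stays out of the measurement. Below is a minimal sketch of that, assuming gettimeofday is good enough as a wall clock. The helper names (wall_seconds, best_time_per_call, count_call) are hypothetical, and count_call is only a placeholder for the real mat-vec call.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, via gettimeofday. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1e-6 * (double)tv.tv_usec;
}

/* Hypothetical stand-in workload: just counts how often it is called.
   Replace with a wrapper around the mat-vec call under test. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}

/* Run fn(ctx) once untimed as a warm-up, then time `trials` batches of
   `reps` calls each and return the best (smallest) time per call. */
double best_time_per_call(void (*fn)(void *), void *ctx, int trials, int reps)
{
    double best = DBL_MAX;
    int t, r;
    assert(trials >= 1 && reps >= 1);
    fn(ctx); /* warm-up: first-touch page faults, cache fill, lazy init */
    for (t = 0; t < trials; ++t) {
        double t0 = wall_seconds();
        double dt;
        for (r = 0; r < reps; ++r)
            fn(ctx);
        dt = (wall_seconds() - t0) / reps;
        if (dt < best)
            best = dt;
    }
    return best;
}
```

Taking the best of several batches is a deliberate choice to filter out interference from other processes. It also means every call after the warm-up reuses the same matrix and vector, which may be exactly why my isolated test looks better than the full app, where the data changes between mat-vecs.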