Hi guys,

I have a part of my code that calls a threaded function; each thread allocates its own set of pointers inside the function and then performs some operations on them.

The problem is that this function runs very slowly. For a bit more detail: before calling the function, I allocate three double arrays. Only one dimension of each is allocated up front, and that dimension is used to assign a "chunk" of each of the three arrays to the respective thread. I also pass in a static two-dimensional double array; call it "data" for the following detail:

From here we enter the function, where I do a few things:

1. Subtract a *specific row* of "data" from N different rows of "data", and store the result in a new array A (vdSub does this without issue). Note that A is allocated *within* the thread.

2. Compute the product AA^T via dgemm. This is stored locally and also copied into one of the global double arrays mentioned above (you'll see why below).

3. Compute the minimum-norm solution to the linear least-squares problem defined by AA^T (i.e. pinv(A)*ones(N,1)), and store this solution in one of the three arrays (not the one mentioned in step 2) that are allocated outside the threaded function. The solve destroys its input, so one of the AA^T copies from step 2 now contains junk we don't care about.

4. Compute the eigenvalues via sytrd/stemr on the remaining AA^T matrix; the eigenvalues and eigenvectors are stored in the respective global arrays.

Now admittedly the actual dimension of A can be quite big: A \in \mathbb{R}^{N \times K}, where K is usually a scalar multiple of N (typically K is about 10 times the size of N). AA^T is usually no larger than ~300x300; however, each entry of AA^T is the result of a length-K dot product, not a single scalar operation.

What I don't understand is how MATLAB manages to perform this entire set of steps much faster than my MKL code for larger N. When I choose a small N, say N=20, my code is way faster than MATLAB. However, when I choose something like N=100, it is WAY slower. Usually N=20 results in each function call taking no longer than 30 seconds, whereas N=100ish takes _HOURS_. (Note that the parameter K does not change between runs; it's fixed.)

For example, with K=900 the function takes only 90 seconds per thread when N=20.

With N=100, however, the function takes 10,000 seconds! Something is fishy!!

Any advice as to what I'm doing wrong? Are my compiler optimizations wrong? I'm running a 12-core Mac Pro with the i7 and 128GB RAM on OS X, and I am quite upset that it is so slow. Are there ways to increase the memory footprint to improve performance? Is it possible I'm using dgemm incorrectly, so that even though it generates the correct output, the incorrect usage leads to a slowdown? Again, any advice would be great!

Here is the compile line:

icpc -g -stdlib=libc++ -std=c++11 -DDEBUG -O3 -no-opt-calloc -vec-guard-write -axssse3 -xssse3 -no-ftz -opt-mem-layout-trans -ansi-alias -parallel -funroll-loops -ipo -mtune=corei7 -o aProgram aProgram.c -I/usr/local/include -I/opt/intel/composerxe/mkl/include -I/usr/local/include -I/usr/include -L/usr/local/fsl/lib -lfslio -L/usr/local/lib -lniftiio -L/usr/lib -L/usr/local/lib -L/opt/intel/composerxe/mkl/lib -openmp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lznz -lm -lz

I'm running MKL/Composer XE with SP1.