MKL is slow compared with my own functions. It seems it is not vectorized or parallelized

Hello, everybody.

I decided to compare MKL functions with my own implementations and was surprised to find that MKL is much slower. I will post my code below; maybe some of you could help me find what I did wrong. I have already been searching for three days, but the answers I found only helped partially. Several questions remain.

-------------------------- Problem No. 1: v3 = a*v1 + v2 -------------------------------------

First of all, I found that cblas_zaxpy is several times slower.
The code is:

MKL part:

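A minimal sketch of the MKL side of such a comparison (not the exact code in question; the vector length, values, and alpha below are placeholders). cblas_zaxpy updates y in place (y := a*x + y), so v2 is copied into v3 first:

// Minimal sketch: v3 = a*v1 + v2 via MKL's cblas_zaxpy (illustrative values).
#include <mkl.h>
#include <complex>
#include <vector>

int main() {
    const MKL_INT n = 1 << 20;                         // placeholder vector length
    std::vector<std::complex<double>> v1(n, {1.0, 2.0});
    std::vector<std::complex<double>> v2(n, {3.0, 4.0});
    std::vector<std::complex<double>> v3 = v2;         // start from v2
    const std::complex<double> a(0.5, -0.5);

    // v3 := a * v1 + v3, i.e. v3 = a*v1 + v2
    cblas_zaxpy(n, &a, v1.data(), 1, v3.data(), 1);
    return 0;
}
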
different results for LAPACKE_dgels between MKL and LAPACKE

Hi guys,

I originally wrote some code that does dgels using LAPACKE. I then also tried to use MKL to do the dgels. My understanding is that, with the exact same code base, simply recompiling/relinking against the MKL library should do the job, or that MKL should also work with the test program built against LAPACKE. The programs run OK, but the final results are different.

Can anyone tell me why?

Below is the test program source code as well as the Makefile lines.

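For context, a minimal sketch of the kind of call involved (not the actual test program; the 3x2 system, row-major layout, and values are made up). With plain LAPACKE the header is <lapacke.h>; when building against MKL, mkl.h declares the same LAPACKE_dgels function.

// Minimal LAPACKE_dgels sketch: least-squares solve of an overdetermined 3x2 system.
#include <lapacke.h>
#include <cstdio>

int main() {
    // Solve min ||A*x - b|| with A stored row-major.
    double a[3 * 2] = { 1.0, 1.0,
                        1.0, 2.0,
                        1.0, 3.0 };
    double b[3]     = { 6.0, 0.0, 0.0 };

    lapack_int info = LAPACKE_dgels(LAPACK_ROW_MAJOR, 'N',
                                    3, 2, 1,   // m, n, nrhs
                                    a, 2,      // A, lda
                                    b, 1);     // b, ldb; on exit b[0..n-1] holds x
    if (info != 0)
        std::printf("LAPACKE_dgels failed: info = %d\n", (int)info);
    else
        std::printf("x = (%f, %f)\n", b[0], b[1]);
    return 0;
}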

tbb 4.4, osx, OpenSubdiv - crash


Hi - my build of OpenSubdiv with TBB 4.4 (actually, the same happens with 4.3) crashes at startup most of the time:


Exception Codes: KERN_INVALID_ADDRESS at 0x00007ff6e57ffff8

VM Regions Near 0x7ff6e57ffff8:
    MALLOC_TINY            00007ff6e5000000-00007ff6e5100000 [ 1024K] rw-/rwx SM=PRV  
    MALLOC_SMALL           00007ff6e5800000-00007ff6e6000000 [ 8192K] rw-/rwx SM=PRV  

VTune Amplifier XE hung up a VM when doing advanced-hotspots profiling

I was trying to do advanced-hotspots profiling on a VM (ESXi 6.0, guest OS: Linux) using VTune Amplifier XE 2016. VTune 2016 works well on bare-metal machines running the same Linux distro, but it hangs the whole VM. The command lines I used are:

amplxe-cl --collect  advanced-hotspots -knob sampling-interval=5 -duration 120


amplxe-cl --collect  advanced-hotspots -knob sampling-interval=5 -duration 120 --target-pid 6268

I heard that the VTune collector now supports driverless collection using perf in version 2016. Is there anything wrong with my commands for the VM?

how to make a tbb::flow::async_node a tbb::flow::serial node?

I want to use a tbb::flow::async_node to execute a function in a dedicated thread. I have a tbb::flow::sequencer_node sending messages (in order) to an async_node, but the async_node lambda function gets called in no specific order! There does not seem to be a way to specify that the async_node is tbb::flow::serial, so the submit calls to the async activity are not sequential.
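
A minimal sketch of the topology I mean (the message type, names, and the synchronous reply are made up; this uses the documented async_node interface, which may differ slightly in the 4.4 preview builds):

#define TBB_PREVIEW_FLOW_GRAPH_NODES 1  // may be needed while async_node is a preview feature
#include <tbb/flow_graph.h>
#include <cstdio>

struct Msg { size_t seq; int payload; };

int main() {
    tbb::flow::graph g;

    // Re-orders messages by sequence number before forwarding them.
    tbb::flow::sequencer_node<Msg> ordered(g, [](const Msg &m) { return m.seq; });

    // The async_node body is where each message would be handed off to the
    // dedicated thread; with unlimited concurrency these body invocations can
    // overlap, which matches the out-of-order submissions I am seeing.
    typedef tbb::flow::async_node<Msg, int> async_t;
    async_t worker(g, tbb::flow::unlimited,
                   [](const Msg &m, async_t::gateway_type &gw) {
                       std::printf("submitting message %zu\n", m.seq);
                       gw.try_put(m.payload);   // reply synchronously for the sketch
                   });

    tbb::flow::make_edge(ordered, worker);

    for (size_t i = 0; i < 8; ++i)
        ordered.try_put(Msg{i, static_cast<int>(i)});
    g.wait_for_all();
    return 0;
}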

Is there a way to make an async_node submit its messages in a tbb::flow::serial fashion?

Thanks !


Team invalidation between consecutive parallel constructs

We are doing some experiments with the EPCC parallel benchmark on an Intel Xeon Phi 7120 coprocessor with 244 threads, compact affinity, the hierarchical barrier, KMP_LIBRARY=turnaround, and KMP_BLOCKTIME=infinite.
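
In case it helps, the pattern we are timing is essentially the EPCC PARALLEL overhead loop: back-to-back parallel regions, each doing only a tiny delay, so the measured time is dominated by fork/join and barrier-release cost. A simplified sketch (the repetition count and delay length are placeholders, not the benchmark's actual parameters):

// Simplified sketch of consecutive parallel constructs, EPCC-style.
#include <omp.h>
#include <cstdio>

void delay(int length) {
    volatile double a = 0.0;            // per-thread busy work
    for (int i = 0; i < length; ++i)
        a += i * 0.5;
}

int main() {
    const int reps = 10000;             // placeholder repetition count
    const int delay_length = 500;       // placeholder delay

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r) {
        #pragma omp parallel
        {
            delay(delay_length);
        }   // implicit barrier + team release here, repeated every iteration
    }
    double t1 = omp_get_wtime();

    std::printf("time per parallel region: %g us\n", (t1 - t0) * 1e6 / reps);
    return 0;
}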

Using VTune, I see that most of the non-waiting time is consumed in __kmp_hierarchical_barrier_release, which makes sense to me. However, inside this function, most of the time is spent in:
