Using micnativeloadex: how do I pass args?


I am a complete beginner with the Intel Phi, and I am trying to do something that may not be possible.

I have this binary file called upcDemo

And I run it this way: upcrun -n 12 upcDemo 

(This will run the program on 12 threads)

I have tried many different syntaxes with micnativeloadex, but I got errors. Here is what I tried:

Cross-thread Stack Access

Hi All,

I'm running an OpenMP application (minimal example):

     DO KR=1,100
            Array_Out(KR) = Array_Input(KR)
     END DO

I verified that I do not have a data race (I get the same results with only one thread). However, when I run the application in Inspector XE 2013, I get a message that I have cross-thread stack accesses.

How can I prevent this behavior, and what is its practical effect if it does not change the results?

Thanks in advance for your replies,

Multidimensional DFT and OpenMP

I'm working on a program that performs several 3 x 3d (N1xN2xN3) DFTs using the MKL DFT algorithm. I'm running most of the program in parallel using OpenMP, and I'd like to get as much parallel performance as possible from the DFT section as well, since it accounts for a significant portion of the program's runtime. However, when I try to increase the number of threads, I find that the performance improvement plateaus at 3 threads, i.e., the number of transforms in each call. If I instead break up the transform into 3xN1 2d transforms, the parallel performance continues to scale beyond 3 threads.
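The plateau described above is what one would expect if the library only distributes whole transforms across threads. The following is a minimal sketch of that arithmetic, not a model of what MKL actually does internally; the function name `ideal_speedup` is hypothetical, and it assumes every transform costs the same and that a single transform is never split between threads.

```python
import math


def ideal_speedup(num_transforms: int, num_threads: int) -> float:
    """Upper bound on speedup when whole transforms are handed to threads.

    Assumption: all transforms take equal time and one transform is never
    split across threads, so the runtime is set by the busiest thread.
    """
    # The busiest thread processes this many transforms one after another.
    rounds = math.ceil(num_transforms / num_threads)
    return num_transforms / rounds


if __name__ == "__main__":
    n1 = 64  # hypothetical value of N1, for illustration only
    for threads in (1, 2, 3, 4, 8):
        print(threads, ideal_speedup(3, threads), ideal_speedup(3 * n1, threads))
```

Under these assumptions, 3 transforms can never run more than 3 times faster no matter how many threads are added, while 3xN1 smaller transforms leave enough independent work for more threads.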

Intel VTune is very slow in finalizing results (Linux)


I'm using Intel VTune Amplifier 2015 (Linux version). The sampling time for my workload is 180 seconds. My software build has debug symbols enabled.

In VTune -> Project Properties, I specified the paths to the build, the source files, and the symbols. When I click Re-resolve, VTune takes more than an hour to finalize and display the results. The progress bar reaches 30% and stays stuck there, showing "finalizing results" for more than an hour.

What is the problem here? Why does it take so long to display the results when I hit Re-resolve?

Blocks of different sizes in ScaLAPACK?

I am performing a Cholesky factorization with Intel MKL, which uses ScaLAPACK. I distributed the matrix based on this example, where the matrix is distributed in blocks of equal size (i.e. Nb x Mb). I tried to give every block its own size, depending on which process it belongs to, so that I can experiment more and maybe get better performance.
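For context, ScaLAPACK's 2D block-cyclic layout uses one block size per dimension, stored in the array descriptor, and the index mapping relies on that size being the same for every block. Below is a minimal sketch of the mapping along one dimension. The function names are hypothetical (they are not ScaLAPACK routines), indices are 0-based, and `src` stands for the process coordinate that holds the first block.

```python
def owner_and_local_index(g: int, nb: int, nprocs: int, src: int = 0):
    """Map a 0-based global index to (owning process coordinate, local index).

    Assumes a 1D block-cyclic distribution with a single block size nb.
    """
    block = g // nb                   # which global block the index falls in
    proc = (block + src) % nprocs     # blocks are dealt out round-robin
    local_block = block // nprocs     # position of that block on its owner
    return proc, local_block * nb + g % nb


def local_count(n: int, nb: int, proc: int, nprocs: int, src: int = 0) -> int:
    """Count how many of the n global indices are stored on process proc."""
    return sum(
        1 for g in range(n)
        if owner_and_local_index(g, nb, nprocs, src)[0] == proc
    )


if __name__ == "__main__":
    # 9 rows, block size 2, 2 process rows: blocks alternate between 0 and 1.
    print([owner_and_local_index(g, 2, 2)[0] for g in range(9)])
    print(local_count(9, 2, 0, 2), local_count(9, 2, 1, 2))
```

Because the owner is computed from `g // nb`, a per-process block size would break this formula, which is why experiments of that kind usually vary the single block size or the process grid shape instead.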

IRET Pseudo-code Bug


I believe that there is a documentation bug in the pseudo-code for the IRET instruction in the current edition of Volume 2A of the Architectures Software Developer's Manual.

The case we're looking at is using IRET to switch from Ring-0 to Ring-3.

The prose for protected mode states:
