OpenMP 4 nested parallelism thread placement

I have a question about the environment variables needed to get thread placement working properly on the Xeon Phi with nested parallelism. I've looked around the Intel website and various other sources, and talked to some people at Intel, but I can't quite get it working.

Essentially, I want to run my code using only 3 threads per core (assuming 60 cores) because testing has shown that this gives the best performance. Without nested parallelism I could get 3 threads per core with these variables:

[LINUX] icc does not recognize #define switch


I am trying to compile an Android kernel and got an error, which seems to stem from the Intel ICC compiler's semantics.

#define for_each_cpu_worker_pool(pool, cpu)				\
	for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0];		\
	     (pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \

A code file then contains this piece:

Graph with multiple roots, multiple leaves


How do we implement an acyclic dependency graph that has multiple roots and/or multiple leaves?

E.g., if we take the graph in https://software.intel.com/en-us/node/506110 and remove the link X0,1-X0,2, we get a tree with two roots.

If we remove the link X2,0-X2,1, we get a tree with two leaves.

What does the code for these two graphs look like?
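
To make the question concrete, here is a minimal sketch of one common approach: give every node a count of unfinished predecessors, start every node whose count is zero (so several roots need no special handling), and keep going until nothing is ready (so several leaves need no special handling either). This is plain standard C++, not the TBB API, and the names `Node`, `Dag`, and `build_example` are hypothetical. The five-node graph is made up for illustration and is not the graph from the linked page.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// One node of the dependency graph (hypothetical type, not part of TBB).
struct Node {
    std::string name;
    std::vector<std::size_t> successors;  // nodes that depend on this one
    int predecessors;                     // number of incoming edges
};

class Dag {
public:
    std::size_t add_node(const std::string& name) {
        nodes_.push_back(Node{name, {}, 0});
        return nodes_.size() - 1;
    }

    // Record that node `to` cannot start before node `from` has finished.
    void add_edge(std::size_t from, std::size_t to) {
        assert(from < nodes_.size() && to < nodes_.size());
        nodes_[from].successors.push_back(to);
        ++nodes_[to].predecessors;
    }

    // Visit the nodes in dependency order and return their names.
    // Every node with zero predecessors is a root and starts out ready.
    std::vector<std::string> run() const {
        std::vector<int> pending;
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < nodes_.size(); ++i) {
            pending.push_back(nodes_[i].predecessors);
            if (pending[i] == 0) ready.push(i);
        }
        std::vector<std::string> order;
        while (!ready.empty()) {
            std::size_t id = ready.front();
            ready.pop();
            order.push_back(nodes_[id].name);  // stands in for the real work
            for (std::size_t s : nodes_[id].successors) {
                if (--pending[s] == 0) ready.push(s);  // last dependency done
            }
        }
        return order;
    }

private:
    std::vector<Node> nodes_;
};

// Made-up example: roots A and B both feed C, and C feeds leaves D and E.
Dag build_example() {
    Dag g;
    std::size_t a = g.add_node("A");
    std::size_t b = g.add_node("B");
    std::size_t c = g.add_node("C");
    std::size_t d = g.add_node("D");
    std::size_t e = g.add_node("E");
    g.add_edge(a, c);
    g.add_edge(b, c);
    g.add_edge(c, d);
    g.add_edge(c, e);
    return g;
}
```

Calling `build_example().run()` visits both roots before C and both leaves after it. In a task-based runtime the `ready` queue would presumably be replaced by spawning a task per ready node and decrementing the counts atomically, and waiting for several leaves would need either a count of finished nodes or an extra sink node that depends on every leaf.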



How to compile code linked with MKL on Windows

Hi friends,

On Windows, I compile my code and link it with MKL, but it still does not work.

>"C:\Program Files (x86)\Intel\Composer XE 2015\bin\compilervars.bat" intel64

>"C:\Program Files (x86)\Intel\Composer XE 2015\mkl\bin\mklvars.bat" intel64

>icl mycode.cpp /Qmkl

.........cannot find fftw3.h.......................

I mostly followed this page: https://software.intel.com/en-us/articles/intel-mkl-111-getting-started

I used MKL 11.2.


uOS Build


When doing OS programming for the Xeon Phi, there is a need to permanently install RPMs.

Is there a systematic solution for determining and generating the changes required to the uOS on a per-RPM basis?

It seems the process would involve an initial analysis of the RPM's contents.

Then the appropriate modifications to the Phi filelist need to be implemented.

Any thoughts on additional requirements?



One option that I've overlooked is to use an NFS share as the root for the Phi.

Performance issues with Intel MPI (RMA Put/Get) on Xeon Phi

I'm getting bad performance with MPI_Put (and MPI_Get) in IMB-RMA All_put_all microbenchmark on this system configuration:

  • Single and multiple Xeon Phi coprocessors
  • Intel MPSS 3.5.1 (June 2015), Linux
  • Intel MPI Library
  • OFED-3.12-1 or OFED-3.18-rc3 (It doesn't really matter.)

Intel MPI runtime environment variables:

VTune remote analysis error


I set up VTune on Windows 8 to run the experiment over SSH on a Linux server.

When I try to create an analysis, the following error message appears:

"remote analysis error "detected Vtune Amplifier build #403110 on target system is incompatible with the build #410668 on the host". Package update on target is required.

Amplifier cannot detect remote machine configuration

What should I do to resolve this incompatibility?




offload error: cannot release buffer memory on device 0 (error code 14)

I need your help.

I tried to run the K-means algorithm on a Xeon Phi using offload mode.

But when I tried to enter the offload region with the clause '#pragma offload ~~' (as in attached pic 1),

I got the error 'offload error: cannot release buffer memory on device 0 (error code 14)'.

I have no idea how to solve this problem, and I cannot find any previous example similar to it on Google.

I looked at the offload report by using 'export OFFLOAD_REPORT=3', but I couldn't get any hints from it.

Please help me!


TaeHyeok, Jang
