Intel Xeon Phi - MPI application


I created a simple "Hello world" application and tried to run it as shown in this article:

But as a result I got the following error:

# bash: /opt/intel//impi/ cannot execute binary file

I use a system with the HOME directory shared between the host and the card.

What is the problem? Thanks for the help.
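
A diagnostic sketch, not a confirmed answer: bash typically prints "cannot execute binary file" when a binary was built for a different architecture than the machine trying to run it. With HOME shared between host and card, a host build and a card build can be easy to mix up. The code below reads the `e_machine` field of an ELF header to tell the two apart. The helper names `elf_machine` and `elf_machine_of_file` are hypothetical, and the assumption that card binaries carry `EM_K1OM` (181) applies to Knights Corner toolchains.

```c
#include <stddef.h>
#include <stdio.h>

/* e_machine values as defined in elf.h */
#define MACHINE_X86_64 62   /* EM_X86_64: host binaries */
#define MACHINE_K1OM   181  /* EM_K1OM: assumed for Xeon Phi (Knights Corner) binaries */

/* Hypothetical helper: return e_machine from the first bytes of a file,
   or -1 if the buffer is not a little-endian ELF header of at least 20 bytes. */
int elf_machine(const unsigned char *hdr, size_t len)
{
    if (len < 20)
        return -1;
    if (hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return -1;
    if (hdr[5] != 1)            /* EI_DATA: 1 means little-endian */
        return -1;
    /* e_ident is 16 bytes, e_type is 2 bytes, so e_machine sits at offset 18 */
    return hdr[18] | (hdr[19] << 8);
}

/* Hypothetical helper: same check, reading the header from a file on disk.
   Returns -1 if the file cannot be opened or is not a little-endian ELF file. */
int elf_machine_of_file(const char *path)
{
    unsigned char hdr[20];
    size_t n;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    n = fread(hdr, 1, sizeof hdr, f);
    fclose(f);
    return elf_machine(hdr, n);
}
```

Calling `elf_machine_of_file` on the executable named in the error and comparing the result with `MACHINE_X86_64` and `MACHINE_K1OM` shows which side the binary was built for. The `file` command reports the same information.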


Problem with global variable declaration

I am trying to use an Intel Xeon Phi coprocessor, but I have a problem with global variable declarations. I declared A, B, and C as global variables, but their values all come out equal: A=5, B=5, C=5, and AA=30. The correct value of AA is 17. Any help would be appreciated. Thanks.


#include <stdio.h>
#include <math.h>
#include <omp.h>
#pragma offload_attribute(push,target(mic))
float *A;
float *B;
float *C;
#pragma offload_attribute(pop)

//__attribute__((target(mic))) float *A,*B,*C;

Runtime Design Documentation


I am starting to dig into the runtime source code and I am wondering whether any information is available about its general organization and design. I am mostly interested in the "task"-related topics, for instance how inter-task dependencies are detected and which scheduling algorithms are implemented.

Thanks in advance.

No speedup with TBB and Cilk Plus sorting algorithms

I cannot get any speedup with the TBB and Cilk Plus sorting algorithms on Xeon Phi, namely tbb::parallel_sort(), cilkpub::cilk_sort_in_place(), and cilkpub::cilk_sort(). I have tried 2, 4, 16, 61, and 122 threads. With the very same program, the speedups on the 16-core Xeon host are excellent. The compiler is the same (Intel 15.0.2); the only differences are the -mmic command line argument and linking against the MIC libraries.

_mm256_add_ps crashes program

Hello,

I am using something like this in my code:

int x, y;

float *TempD = (float*) _mm_malloc(N * sizeof(*TempD), 64);
__m256  *SIMDTempD = (__m256*) TempD;
__m256  *theX = (__m256*) X;
__m256  *theY = (__m256*) Y;
__m256i *theV = (__m256i*) V;
__m256i *theVoronoi = (__m256i*) Vor;

__m256 Xd, Yd, XdSquared, YdSquared;
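
A side note on the casts above, as a sketch only: dereferencing a `__m256 *` or `__m256i *` assumes a 32-byte-aligned address, and a misaligned one can fault at the first SIMD load or store. TempD comes from `_mm_malloc` with 64-byte alignment, but the listing does not show how X, Y, V, and Vor are allocated, so it may be worth checking them before the casts. The helpers `is_aligned` and `all_avx_aligned` are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if p is a multiple of `alignment` bytes.
   `alignment` must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}

/* Hypothetical helper: return 1 only if every pointer in the list is
   32-byte aligned, i.e. safe to view through a __m256 or __m256i pointer. */
int all_avx_aligned(const void *const *ptrs, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        if (!is_aligned(ptrs[i], 32))
            return 0;
    }
    return 1;
}
```

Passing X, Y, V, and Vor through `all_avx_aligned` before the loop would rule alignment in or out as the cause. If any of them fails the check, allocating it with `_mm_malloc(..., 32)` or using unaligned load and store intrinsics are the usual alternatives.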


and then in a loop:
