Intel® Many Integrated Core Architecture

No speedup with TBB and Cilk Plus sorting algorithms

I cannot get any speedup with TBB and Cilk Plus sorting algorithms on Xeon Phi, namely tbb::parallel_sort(), cilkpub::cilk_sort_in_place(), and cilkpub::cilk_sort(). I have tried 2, 4, 16, 61, and 122 threads. With the very same program, the speedups on the 16-core Xeon host are excellent. The compiler is the same (Intel 15.0.2); the only differences are the -mmic command-line argument and linking against the MIC libraries.

_mm256_add_ps crashes program

Hello,

I am using in my code something like:

int x, y;

float   *TempD       = (float *)   _mm_malloc(N * sizeof(*TempD), 64);
__m256  *SIMDTempD   = (__m256 *)  TempD;
__m256  *theX        = (__m256 *)  X;
__m256  *theY        = (__m256 *)  Y;
__m256i *theV        = (__m256i *) V;
__m256i *theVoronoi  = (__m256i *) Vor;

__m256 Xd, Yd, XdSquared, YdSquared;


and then in a loop:

Illegal instruction using _mm512

Hello,

I am using in my code intrinsics.

If I compile with:

icc -std=c99 -g -openmp -qopt-report=2 -o mycode mycode.c

I am receiving: Illegal instruction in line:

__m512 D = _mm512_set1_ps( FLT_MAX );


If I compile with:

icc -std=c99 -g -mavx -openmp -qopt-report=2 -o mycode mycode.c

I am receiving: Illegal instruction in line:

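Both failures have the same shape: an _mm512_* intrinsic compiled into a host binary raises SIGILL on any CPU that lacks those instructions, regardless of other flags such as -mavx. To run the KNC forms they must be compiled with -mmic and executed on the coprocessor; a host build needs a fallback path. A hedged sketch of that guard (the function name is illustrative):

```c
#include <float.h>
#ifdef __MIC__
#include <immintrin.h>   /* KNC _mm512_* intrinsics, only under -mmic */
#endif

/* Keep a scalar fallback so the same source builds for the host and,
   with -mmic, for the coprocessor, where the 512-bit forms are legal. */
float first_min_seed(void)
{
#ifdef __MIC__
    __m512 d = _mm512_set1_ps(FLT_MAX);   /* valid only on the card */
    return _mm512_reduce_min_ps(d);
#else
    return FLT_MAX;                        /* host fallback */
#endif
}
```

The __MIC__ macro is predefined by icc when compiling with -mmic, so the host compile (either of the command lines above) takes the scalar branch and never emits the illegal instruction.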
Significant Scalability and Performance Improvement for Intel® MKL PARDISO on SMP Systems

Intel® MKL 11.3 Beta (released in April 2015) contains significant performance and scalability improvements for the direct sparse solver (a.k.a. Intel MKL PARDISO) on SMP systems. These improvements particularly benefit Intel Xeon Phi coprocessors and Intel Xeon processors with large core counts. As an example, the chart below shows a 1.7x to 2.5x speedup of Intel MKL 11.3 Beta over Intel MKL 11.2 when using PARDISO to solve various sparse matrices on an Intel Xeon Phi coprocessor with 61 cores.

  • Developers
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 10
  • Microsoft Windows* 8.x
  • C/C++
  • Fortran
  • Experts
  • Beginners
  • Intermediate
  • Intel Math Kernel Library (Intel MKL)
  • Development Tools
  • Intel® Many Integrated Core Architecture
    Unknown header type 7f

    I'm running RHEL 7.0 and the system seems to have a problem talking to the Phi card.

    This is what I see in lspci:

    03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev ff) (prog-if ff)

            !!! Unknown header type 7f

            Kernel driver in use: mic


    I've attached the micdebug log.

    Intel Omni-Path Webinar

    The upcoming next-generation Intel Omni-Path Architecture addresses lessons learned, good and bad, from the Intel True Scale Architecture and standard InfiniBand*. To avoid observed pitfalls, Intel approached the architecture of an HPC fabric from a different perspective: the architectures for current products and Intel Omni-Path systems were developed from the ground up for MPI-based HPC clusters to bring out the best possible performance.

    Trouble with Updating MPSS

    My server has 4x Intel Xeon Phi 5110P accelerator cards. It runs CentOS 6.5 with kernel version 2.6.32-431.29.2.el6.x86_64.

    When updating MPSS from 2.1 to either 3.3.4 or 3.4.3, I receive the following error:

    [root@XXXXX mpss-3.3.4]# /usr/bin/micflash -update -device all -smcbootloader
    Error getting SCIF driver version
    failed to open mic'0': /sys/class/mic/mic0/family: Knights Corner: not supported: Operation canceled

    failed to open mic'1': /sys/class/mic/mic1/family: Knights Corner: not supported: Operation canceled

    MPSS 3.5

    Please note that the new MPSS 3.5 has just been released at:


    This new version supports the following operating systems:


    - Linux: RHEL* 6.4, 6.5, 6.6, 7.0 and 7.1, and SUSE SLES* 11 SP3 and SLES 12.

    - Microsoft Windows*: Windows* 7 Enterprise SP1, 8/8.1 Enterprise, Server 2008 R2 SP1, Server 2012 and Server 2012 R2.


    Performance scaling of the Intel Phi MIC


    Attached is a plot of execution time on the Intel Phi with a varying number of threads. The same program runs in native and offload modes.

    The Phi device has 60 cores.

    1) Why don't the timing steps occur at multiples of the number of cores (i.e., multiples of 60)?

    2) Why does the time drop substantially around 248 threads (i.e., > 4x60) and then increase again?
