Intel® Many Integrated Core Architecture

How to allocate MICs equally to all the MPI processes for AO?

Could you please take a look at this problem? My machine has 16 CPUs and 4 MICs (47 coprocessors each), and I run my program with 8 MPI processes (mpi_comm_size = 8). I want to use MKL routines in automatic offload (AO) mode. As you can see in the attached test code, I tried three different methods.
METHOD-1: I assign the 4 MICs to the first 4 CPUs, one each, and let the other CPUs run without a MIC. In this case the program works as expected, and I got the following performance test result when solving zgemm for dense complex matrices of size 5k*5k.
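
For the equal-sharing case asked about in the title, one possible starting point is a round-robin mapping from MPI rank to card, so that with 8 ranks and 4 MICs each card serves two ranks. The sketch below only computes that mapping and is not taken from the attached test code. The function names are hypothetical, and how the result is handed to MKL (for example through `mkl_mic_set_workdivision` or the `OFFLOAD_DEVICES` environment variable) is an assumption that would need checking against the MKL AO documentation.

```python
def card_for_rank(rank, num_cards):
    """Round-robin assignment: rank r is mapped to card r mod num_cards."""
    if num_cards <= 0:
        raise ValueError("num_cards must be positive")
    return rank % num_cards


def ranks_per_card(num_ranks, num_cards):
    """Group all ranks by the card they were assigned to."""
    groups = [[] for _ in range(num_cards)]
    for rank in range(num_ranks):
        groups[card_for_rank(rank, num_cards)].append(rank)
    return groups


if __name__ == "__main__":
    # Setup from the post: 8 MPI ranks, 4 MICs.
    # Assumption: each rank would then restrict AO to its own card,
    # e.g. by giving every other card a work division of 0.
    for card, ranks in enumerate(ranks_per_card(8, 4)):
        print(f"mic{card}: ranks {ranks}")
```

With this mapping, ranks 0 and 4 share the first card, ranks 1 and 5 the second, and so on. Whether two ranks offloading to the same card at once performs well is a separate question that the mapping alone does not answer.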

31S1P problems (MSI-X Enable-, or 4G Decoding, probably)

Hello, everyone.  I've been lurking on the forums for a few days now while I schemed up a cooling solution for my shiny new 31S1P. 

I'm pretty sure I've conquered the cooling requirements.  Check!

However, I cannot get the card to work correctly. I'm using a Z97-WS motherboard with "4G Decoding" enabled in the BIOS settings. The CPU is a Celeron G1820, a cheap little LGA1150 socket CPU that seemed to be enough for this rig. I'm running the latest BIOS (2403, I believe from 2015-06-18 or thereabouts) and the latest version of CentOS 7.1, which is 7.1.1503 (Core).

Intel(R) Manycore Platform Software Stack (MPSS) - Long-Term-Support Archive

On this page you will find the last releases of the Intel(R) Manycore Platform Software Stack (MPSS) Long-Term-Support (LTS) product. The most recent release is found here: and we recommend customers use the latest release wherever possible.
  • Developers
  • Professors
  • Students
  • Linux*
  • Microsoft Windows* 10
  • Microsoft Windows* 8.x
  • Advanced
  • Beginner
  • Intermediate
  • Intel Many Integrated Cores
  • Intel® Many Integrated Core Architecture

    Phi seems not fully support AVX512? Any way to do MATRIX transpose?

    I found in past topics that mm512_unpacklo_* is not supported on Phi. In my own implementation, it seems mm512_permute* and mm512_shuffle* are also not supported. So far, all matrix transpose operations in past posts seem to be implemented with the mm512_swizzle* and mm512_blend* instructions. However, using these two operations requires two times more element movement, which seems inefficient. Are there any other ways to do a matrix transpose?


    Optimization Techniques for the Intel® MIC Architecture: Part 1 of 3

    This is part 1 of a 3-part educational series of publications introducing select topics on optimization of applications for the Intel multi-core and manycore architectures (Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors). In this paper we focus on thread parallelism and race conditions. We discuss the usage of mutexes in OpenMP* to resolve race conditions. We also show how to implement efficient parallel reduction using thread-private storage and mutexes. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. This kind of workload occurs in applications such as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures.
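
The reduction pattern named in the abstract (thread-private storage, with a mutex used only for the final merge) can be sketched roughly as follows. This is not the paper's code: the paper works with OpenMP, while this stand-in uses Python's `threading` module, and the function name, parameters, and chunking scheme are all hypothetical. It shows the structure of the technique, not its performance.

```python
import threading


def bin_particles(coords, num_bins, lo, hi, num_threads=4):
    """Histogram coordinates in [lo, hi) using per-thread private bins."""
    global_bins = [0] * num_bins
    lock = threading.Lock()
    width = (hi - lo) / num_bins

    def worker(chunk):
        # Thread-private storage: each thread fills its own histogram,
        # so the hot loop needs no locking and has no race condition.
        private = [0] * num_bins
        for x in chunk:
            if lo <= x < hi:
                idx = min(int((x - lo) / width), num_bins - 1)
                private[idx] += 1
        # The mutex is taken once per thread, only for the final reduction.
        with lock:
            for i, count in enumerate(private):
                global_bins[i] += count

    # Hypothetical work split: thread t takes every num_threads-th particle.
    chunks = [coords[t::num_threads] for t in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_bins


if __name__ == "__main__":
    print(bin_particles([0.1, 0.2, 0.6, 0.9, 1.5], num_bins=2, lo=0.0, hi=1.0))
```

Locking around every single increment would also be correct, but it serializes the inner loop. Keeping the counts private and merging once per thread is what makes the reduction cheap.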
  • Developers
  • Professors
  • Students
  • Linux*
  • C/C++
  • Code Modernization
  • Intel® Many Integrated Core Architecture
  • Threading