OpenMP not using all processors

I am trying to use the MKL libraries and OpenMP in an MSVS C++ application on Windows 7. The application shows affinity for all 24 logical processors (2 packages × 6 cores, Hyper-Threading enabled), and omp_get_num_procs() also reports 24 processors. However, when I run the program, only 1 package and 6 cores are used. This is confirmed when I use "KMP_AFFINITY=verbose,none": it outputs "OMP: Info #179: KMP_AFFINITY: 1 packages x 6 cores/pkg x 1 threads/core (6 total cores)". I get no compiler or linker complaints.

Intel® Parallel Studio XE 2015 Update 2 Cluster Edition Readme

The Intel® Parallel Studio XE 2015 Update 2 Cluster Edition for Linux* and Windows* combines all Intel® Parallel Studio XE and Intel® Cluster Tools into a single package. This multi-component software toolkit contains the core libraries and tools to efficiently develop, optimize, run, and distribute parallel applications for clusters with Intel processors. This package is for cluster users who develop on and build for IA-32 and Intel® 64 architectures on Linux* and Windows*, as well as customers running on the Intel® Xeon Phi™ coprocessor on Linux*.

    Using L1/L2 cache as a scratchpad memory

    Dear all,

    Explicit cache control is one of the important features of Xeon Phi (MIC). How could I use the L1 or L2 cache as scratchpad memory, and also share data between the cores through it?

    In addition, is there any way to manipulate the MESI state of a cache line in the distributed tag directory (DTD)?
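As far as I know, the hardware-managed caches on Xeon Phi have no architectural scratchpad mode, so the usual software approximation is cache blocking: process the data in tiles small enough to stay resident in a core's L2 while it is being reused. A minimal sketch (the tile size, array, and operations are illustrative assumptions, not a measured optimum):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative cache-blocking sketch: work on the data in tiles sized to fit
// comfortably in a core's L2 (512 KB per core on Xeon Phi), so the second
// pass over a tile hits cache instead of going back to memory.
constexpr std::size_t TILE = 4096;  // 4096 doubles = 32 KB, well under L2

void scale_then_shift(std::vector<double>& a) {
    for (std::size_t base = 0; base < a.size(); base += TILE) {
        std::size_t end = std::min(base + TILE, a.size());
        // Two passes over the same tile: the tile stays cache-resident
        // between them, which is the scratchpad-like behavior wanted here.
        for (std::size_t i = base; i < end; ++i) a[i] *= 2.0;
        for (std::size_t i = base; i < end; ++i) a[i] += 1.0;
    }
}
```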

    Thanks in advance.


    Performance comparison between Intel TBB task_list, openMP task and parallel for

    I am planning to parallelize a hotspot in a project, and I would like your opinion on the relative performance of three approaches: an OpenMP parallel for, an omp single region spawning tasks, and an Intel TBB task_list. I want to compare them both under ideal conditions, where the number of threads equals the number of computation items, and when the number of computations is much greater than the available threads, in order to measure the scheduling overhead and identify the most efficient scheduler. I will write some sample test programs to evaluate this myself, but I also wanted to know whether anybody has made these evaluations before.

    Thanks in advance.

    Further information about different barrier algorithms


    I'm researching barrier algorithms that use SIMD instructions, and I'm trying to understand in depth the different versions included in the RTL.

    I've noticed that there is a new barrier algorithm (hierarchical) since the last time I had a look.

    Where could I find a more detailed description of them? Could someone from Intel provide me with further information?


    Thank you in advance.

    Kind regards.

    an interesting and serious topic

    Hello there:

             I have found an interesting behavior which I cannot explain. Okay, here it goes.

             I use "micsmc" to monitor the offload running state of the MIC. The critical code looks like this:

    #pragma offload target(mic:0) inout(XXXX) in(XXXX)
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < num_cluster; i++) {  // num_cluster ranges from 60 to 300, mostly 90~150
        // do something...
    }

             And then I set the environment variables:

    export OMP_NUM_THREADS=X
    export KMP_AFFINITY=compact
