Xeon Phi not using more than 1 or 2 cores


We have an engineering sample of the Xeon Phi (60-core version). We have installed mpss_gold_update_3-2.1.6720-16-rhel-6.4.


1) We followed the instructions to upgrade the flash, but micinfo is reporting this:

Host: Linux

OS version: 2.6.32-358.el6.x86_64

Driver version: 6720-16

MPSS version: 2.1.6720-16

BUT: flash version, SMC, UOS and device serial number are all reported as "NotAvailable"...


2) micsmc produces error messages every few seconds such as "Warning: mic0 device connection lost!" and "Information: mic0: Device connection restored"


3) We have a C++ native application running on the Phi. It can be configured to run multithreaded (pthreads), and we have taken benchmarks. It runs fractionally more slowly with 240 threads (60 cores x 4 HW threads per core) than with a single thread... (we have accounted for the overhead of starting new threads)

The application runs against a fixed-size data sample as follows:

- when compiled for the host system: 0.2 seconds

- when run on the Phi in a single thread: 0.59 seconds

- when run on the Phi in 240 threads: 0.6 seconds

Repeated runs while micsmc is running show that no more than 2 cores are being used at any one time...


Any help with this would be much appreciated


We greatly increased the run time of our sample code (using 240 threads) and watched performance using top.

All cores are being utilised but less than 1% of each core is being reported by top.

(We simplified the code so it reads through a 1.6 MB memory buffer thousands of times.)
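Roughly, the simplified test looks like this (an illustrative sketch with made-up names, not our exact code):

#include <pthread.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

static const size_t kBufBytes = 1600 * 1024;   // ~1.6 MB shared buffer
static const int    kSweeps   = 1000;          // read it thousands of times
static std::vector<char> gBuf(kBufBytes, 1);

static void* sweep(void*) {
    volatile long sum = 0;                     // volatile so the reads aren't optimised away
    for (int s = 0; s < kSweeps; ++s)
        for (size_t i = 0; i < kBufBytes; ++i)
            sum += gBuf[i];
    return NULL;
}

int main(int argc, char** argv) {
    int nThreads = (argc > 1) ? atoi(argv[1]) : 240;
    std::vector<pthread_t> tid(nThreads);
    for (int t = 0; t < nThreads; ++t) pthread_create(&tid[t], NULL, sweep, NULL);
    for (int t = 0; t < nThreads; ++t) pthread_join(tid[t], NULL);
    printf("done: %d threads x %d sweeps of %zu bytes\n", nThreads, kSweeps, kBufBytes);
    return 0;
}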

So we are thinking the problem is related to memory bandwidth...

Let us know what you think - perhaps you have standard benchmarking software we could try on our card?

Running only a single thread on the coprocessor will gain you access to only half the cycles available on a single core, so performance in that scenario SHOULD be pretty poor.  And running 240 threads on 60 cores may have pushed you beyond any performance elbows that may be present because of the activity (or lack of it) in your test application.  Have you tried collecting data with thread counts anywhere between these extremes?

It also wouldn't hurt if you could show us what your test program is doing.

A program that runs for only a fraction of a second is also troublesome.  Remember, we're running a Linux kernel here, and you're asking the OS to create 240 threads and all the OS data structures associated with a thread.  Even on a full-blown Xeon, that is a time-consuming *serial* process.  A workload with a longer runtime will make it much easier to work out where you are running into an application issue, and where you are just running into the OS having a lot of work to do (a longer program should minimize the overall effect of the OS).

Ooops, sorry, didn't read clearly.  How much longer is your runtime in the threaded code now? 

I agree with Robert - seeing your code would be useful.

Aren't you oversubscribing your data with the large number of threads? As Charles said, for a short-running code you can accumulate an overhead related to creating 240 threads and their data structures. So the cores' hardware threads will partly spend their time creating OS threads for your program.

Another question is related to your application and its ability to be efficiently threaded.

There was a bug in the code which was dividing the workload amongst threads... so that explains a lot. Thanks for all your comments though.

Out of interest, we started all threads, halting them at a thread barrier until they were all ready, then releasing them and starting a timer... The end timer fired after all threads had completed their workload and joined. The only synchronisation point between threads was a __sync_add_and_fetch.
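In outline (an illustrative sketch with made-up names, not the real code), the timing harness looked something like this:

#include <pthread.h>
#include <sys/time.h>
#include <cstdio>

static pthread_barrier_t startBarrier;            // holds workers until all are ready
static long outIndex = 0;                          // shared output slot counter

static void* worker(void*) {
    pthread_barrier_wait(&startBarrier);           // release all workers together
    // ... this thread's share of the work ...
    __sync_add_and_fetch(&outIndex, 1L);           // only cross-thread sync point
    return NULL;
}

int main() {
    const int nThreads = 240;
    pthread_t tid[nThreads];
    pthread_barrier_init(&startBarrier, NULL, nThreads + 1);   // +1 so main can trigger the release
    for (int t = 0; t < nThreads; ++t) pthread_create(&tid[t], NULL, worker, NULL);

    struct timeval t0, t1;
    pthread_barrier_wait(&startBarrier);           // all workers created and waiting
    gettimeofday(&t0, NULL);                       // start timer as they are released
    for (int t = 0; t < nThreads; ++t) pthread_join(tid[t], NULL);
    gettimeofday(&t1, NULL);                       // stop timer after all have joined
    printf("elapsed: %.3f s\n", (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6);
    return 0;
}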

With the bug fixed, we are seeing better performance for much larger inputs, pushing runtimes up to several seconds (60% CPU showing in top) - but there is still some kind of overhead which makes the threaded version look slower at shorter runtimes. Perhaps the join? This is for curiosity only, since our target application will involve streaming very large data volumes.

Can you profile your application with VTune?

>>The only synchronisation point between threads was a __sync_add_and_fetch.

I am assuming the __sync_add_and_fetch was used for a thread-completion count barrier.

Try replacing the counter with an array of done flags (initialized to 0). This replaces the atomic read/modify/write with a simple atomic write. Your barrier code becomes:

// nThreads = total thread count, iThread = this thread's index; _mm_pause is declared in immintrin.h
volatile int DoneFlags[nThreads];
for(int i=0;i<nThreads;++i) DoneFlags[i] = 0;    // reset flags before the barrier is used

DoneFlags[iThread] = 1;   // e.g. iThread = omp_get_thread_num()
for(int i=0;i<nThreads;++i) { while(!DoneFlags[i]) _mm_pause(); }

If the barrier is reached repeatedly within a parallel region, consider changing the done flag to a trip count (and your test for done too).

DoneFlags[iThread] = iTrip;   // iTrip is in the local scope of the thread, initialized to 1
for(int i=0;i<nThreads;++i) { while(DoneFlags[i] != iTrip) _mm_pause(); }
++iTrip;                      // each thread advances its trip count after passing the barrier

I don't have my Xeon Phi yet so I cannot confirm whether you can use byte-sized flags; a 32-bit flag should work though. On Sandy Bridge, the write mask is capable of working in byte-sized units. Flag/trip-count checking can use wider units.

Jim Dempsey


Caution: using an array of "done" flags offset at integer widths across an array of HW threads will likely expose your code to significant "false sharing", where the manipulation of individual flags has cache-line effects that cause thrashing of the neighbouring threads' flags as the cache lines are modified and evicted.
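If the flags do need to be spread out, one way (a minimal sketch, with illustrative names) is to pad each flag to a full 64-byte cache line so each thread writes only its own line:

#include <immintrin.h>   // _mm_pause

enum { kMaxThreads = 240, kCacheLine = 64 };

struct PaddedFlag {
    volatile int flag;
    char pad[kCacheLine - sizeof(int)];   // keep neighbouring flags off this line
};

static PaddedFlag DoneFlags[kMaxThreads];          // zero-initialised at program start

void barrier_arrive_and_wait(int iThread, int nThreads) {
    DoneFlags[iThread].flag = 1;                   // plain store, no locked instruction
    for (int i = 0; i < nThreads; ++i)
        while (!DoneFlags[i].flag) _mm_pause();    // spin until everyone has arrived
}

// Note: this is a one-shot barrier; reset the flags (or switch to a trip count
// as suggested above) before reusing it.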


The Xeon Phi has latency problems with locked XADD (and other interlocked instructions). The direct write was suggested as a means to reduce the latency (though it is not a perfect solution). The user is free to experiment as to whether the flags are packed into adjacent locations or spread across individual cache lines. For a barrier I suggest packed, because the scan of all done flags can be made with fewer cache-line loads. Though the writes are slower, the writes are made at skewed time intervals, so for all but the last thread the extra write latency is absorbed by spin-wait time.

One must make the pudding though.

Jim Dempsey


I can't post any code at the moment but this information might help pinpoint some problems:

- A large reference data file is loaded (161 MB). The code accesses this data in a fairly random order...

- A global input buffer of 1.8 MB is loaded

- A global results buffer of 8 MB is allocated

- The input is divided evenly amongst the threads.

- The threads work independently apart from when writing to the output buffer. The output buffer index is controlled by the __sync_add_and_fetch (a rough sketch of this follows below). Apart from the thread barrier (which only exists for timing purposes), this is the only synchronisation point.

It's not easy to get CPU utilisation figures due to the short runtime, but it only seems to hit around 40% max.
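In outline, the output path looks roughly like this (an illustrative sketch; the record type and sizes are placeholders):

struct Result { char bytes[32]; };                          // placeholder record type

static Result outBuf[8 * 1024 * 1024 / sizeof(Result)];     // ~8 MB global results buffer
static long   outCount = 0;                                 // shared index, the only sync point

void emit_result(const Result& r) {
    // __sync_add_and_fetch returns the new count, so subtract 1 to get this thread's slot
    long slot = __sync_add_and_fetch(&outCount, 1L) - 1;
    outBuf[slot] = r;                                       // each thread writes only its own slot
}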

If the amount of output per segment of input is the same, then you do not need a __sync_add_and_fetch to acquire a slot in the output buffer. (outOffset = (8/1.8)*inOffset*iThread)

Your requirements may differ.

Jim Dempsey


Thanks but the output size is not known beforehand.

We noticed that when timing individual threads, the thread times were averaging 0.004s (the longest being about 0.006s), so there is some large overhead associated with joining the threads, since the total runtime is currently 0.012 seconds (I made recent improvements - the main change was that I forced CPU affinity per thread).
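For reference, the per-thread pinning was done roughly like this (a sketch only; the exact logical-CPU numbering on the card is left to experiment):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // needed for pthread_setaffinity_np / CPU_SET
#endif
#include <pthread.h>
#include <sched.h>

int pin_to_cpu(pthread_t tid, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                                      // one logical CPU per thread
    return pthread_setaffinity_np(tid, sizeof(set), &set);   // returns 0 on success
}

// usage after pthread_create, e.g.: pin_to_cpu(tid[t], t % 240);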

Trying to use the VTune 2013 GUI...

I get an error "Error: Problem accessing the sampling driver"

I have followed the instructions to install the driver but it didn't help - I noticed that I got a file not found error when running sep_micboot_create.sh (I also tried the older version under the vtune_amplifier_xe directory and got the same problem).

It could not find "sep3_8-k1om-*.ko"

sep_micboot_create.sh has been a ghost for at least several releases--at one point it just echoed a message indicating its deprecation.  My usual advice for dealing with this error: uninstall/ream/reinstall.  That is, run the uninstall.sh script in the VTune Amplifier installed location (nominally /opt/intel/vtune_amplifier_xe is a symlink to the current install).  Then rmdir /opt/intel/vtune_amplifier_xe_2013 and any "sep" directories in /opt/intel/mic (to ensure no old .ko files are left around to confuse things).  Then reinstall VTune; the installer should ask whether you want it to install the coprocessor driver--let it do so if possible.  The VTune Amplifier installation should automatically do a service mpss restart (thus the message it spits out that this may take a little time).  If that sequence doesn't clear the message you reported, yours would be the first case.

If best performance in the barrier is required, I would at least try to get the HW threads per core to team together and issue one barrier-reached notice per quad, to reduce snoop traffic on the rings.

Thanks Robert.

I tried this but the install complains that we don't have a supported version of Linux (uname is just showing GNU/Linux but I think it's CentOS).

Everything was reinstalled but no mpss restart was done (I did one manually)

I disabled the NMI watchdog timer on the host as requested (the mic doesn't have one).

No joy... could this unsupported OS problem also cause performance issues when running native MIC programs?

>>If best performance in the barrier is required, I would at least try to get the HW threads per core to team together and issue one barrier-reached notice per quad, to reduce snoop traffic on the rings.

Good advice - don't use __sync_add_and_fetch within the local (core) team, else this will post ring traffic too. Use the multiple flags (4) and have one team member be the one performing the __sync_add_and_fetch on the global barrier.
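A rough sketch of the idea (illustrative names; a one-shot barrier for simplicity):

#include <immintrin.h>   // _mm_pause

enum { kCores = 60, kHwPerCore = 4 };

struct CoreTeam {
    volatile int localDone[kHwPerCore];
    char pad[64 - sizeof(int) * kHwPerCore];     // keep each core's flags on its own cache line
};

static CoreTeam      teams[kCores];
static volatile long coresDone = 0;              // one atomic update per core, not per thread

void barrier(int core, int hw) {
    teams[core].localDone[hw] = 1;               // plain store within the core's own line
    if (hw == 0) {                               // team leader waits for its own core first
        for (int i = 1; i < kHwPerCore; ++i)
            while (!teams[core].localDone[i]) _mm_pause();
        __sync_add_and_fetch(&coresDone, 1L);    // one ring transaction per core
    }
    while (coresDone < kCores) _mm_pause();      // all threads spin on the global count
}

// Reset the flags and counter (or convert to trip counts) before reusing the barrier.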

Jim Dempsey


If it's just CentOS, that should in itself not be a problem.  We have Intel Xeon Phi coprocessors on CentOS machines in our lab that work fine.  Is the warning you get just that, a warning, or does it actually abort the install?  And after the service mpss restart, are you then able to run a micinfo that reveals details from the coprocessor side (rather than getting a bunch of fields marked "NotAvailable")?  With mpss service running, what happens when you run "miccheck"?

The install will proceed despite the 'unsupported OS' and the drivers are reported to be installed. I then run the script to set up the environment, start amplxe-gui, and set it up to run my native mic code using ssh.

When I run a hotspot analysis, I can see my program run to completion in the terminal window, but then a message appears in the bottom box of the GUI saying that no results were collected and that the sampling driver may need to be restarted.

Micinfo is still reporting NotAvailable for all the "version" section fields.

Miccheck reports OK for everything.

I know CentOS is just rebranded Red Hat - I was wondering if somehow fooling the install into thinking it was Red Hat might be a good idea?

BTW - we are getting excellent performance now - the solution in our case was to increase the number of threads greatly (to mask slow memory accesses), but there are just one or two threads spoiling the party by running 5x slower - so it would be great to get VTune working to find out why...


I had no problems running on CentOS 6.4. In order to make a CentOS 6.4 machine appear as Red Hat 6.4, do the following (as root):

  1. rm /etc/redhat-release
  2. echo "Red Hat Enterprise Linux Server release 6.4 (Santiago)" > /etc/redhat-release

I've been told the release version doesn't really matter, as long as it says "Red Hat Enterprise Linux Server ..."

The combination of symptoms you report is confusing me a little bit.  It sounds like you're able to run your native code on the coprocessor, and even see some scaling now that you've upped the thread count, but you still have problems with the VTune Amplifier collector on the coprocessor, and with getting status from the coprocessor via micinfo.  Sounds like Intel MPSS is at least partially running.  If you do a service mpss start/restart (one or the other) and then immediately run micinfo under sudo or some other root enabler, are you still getting NotAvailable for the coprocessor-specific fields?  If those fields are still coming up empty, my first suspicion would be a mismatch between the MPSS version installed and the flash downloaded into the coprocessor.  Getting a clean run of micinfo would be my first priority.  I was hoping that miccheck might show something; the lack of that confirmation is part of what is confusing here.

Yes, I get the NotAvailable message regardless...

I installed the software we have from the following files:



As far as I can tell, everything is working normally apart from the fact that the install won't install the sampling drivers onto the mic.

I tried making the install think it is Red Hat but that hasn't fixed the sampling-drivers problem. I think we probably have an old version of Amplifier which still has the sep_micboot script. It compares the drivers the mic is expecting with the drivers it has - and the mic wants version _38 drivers, which the install doesn't have...


And apart from the fact that you're getting NotAvailable for some of the micinfo fields.  And that could be a problem regardless of the version of VTune Amplifier that you have.  Let me run a little demonstration and see whether any of this has any bearing on what you're seeing.

If I shut down mpss as a service on a machine, then try to run micinfo, I see something like the following:

$ sudo /opt/intel/mic/bin/micinfo
MicInfo Utility Log
Created Tue Sep 24 14:14:00 2013
 System Info
 HOST OS : Linux
 OS Version : 2.6.32-220.el6.x86_64
 Driver Version : 6720-16
 MPSS Version : 2.1.6720-16
 Host Physical Memory : 65923 MB
Device No: 0, Device Name: mic0
 Flash Version : NotAvailable
 SMC Firmware Version : NotAvailable
 SMC Boot Loader Version : NotAvailable
 uOS Version : NotAvailable
 Device Serial Number : NotAvailable
 Vendor ID : 0x8086
 Device ID : 0x225d
 Subsystem ID : 0x3608
 Coprocessor Stepping ID : 2
 PCIe Width : x16

I get NotAvailable values, particularly regarding the details of the flash version and other version parameters associated with the coprocessor.  However, for this machine all I need to do is start mpss running, and the picture changes:

$ sudo service mpss start
[sudo] password for rreed:
Starting MPSS Stack: [ OK ]
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
mic1: online (mode: linux image: /lib/firmware/mic/uos.img)
$ sudo /opt/intel/mic/bin/micinfo
MicInfo Utility Log
Created Tue Sep 24 14:21:53 2013
 System Info
 HOST OS : Linux
 OS Version : 2.6.32-220.el6.x86_64
 Driver Version : 6720-16
 MPSS Version : 2.1.6720-16
 Host Physical Memory : 65923 MB
Device No: 0, Device Name: mic0
 Flash Version :
 SMC Firmware Version : 1.15.4830
 SMC Boot Loader Version : 1.8.4326
 uOS Version :
 Device Serial Number : ADKC30900885
 Vendor ID : 0x8086
 Device ID : 0x225d
 Subsystem ID : 0x3608
 Coprocessor Stepping ID : 2
 PCIe Width : x16

All the NotAvailable fields have been filled in.  Note that at no time did I do anything regarding VTune Amplifier.  If your machine does not behave in a similar fashion, then there may be something more fundamental going on than merely an old copy of VTune Amplifier.  In particular, if the fields you're still seeing as NotAvailable include the flash and SMC firmware versions, micinfo will not reveal their values, and with what we know we cannot determine whether the flash and SMC firmware are in sync with your MPSS version.

I am pretty sure that (at least one of) the problem(s) is that our composer software is old. We installed an evaluation copy of VTune yesterday and it works fine on our native Phi application. This hasn't fixed the micinfo problem (and I have certainly tried restarting mpss). We're going to purchase a license for the latest release, which will take time to be authorised. In the meantime we seem to be getting good performance from the Phi - well within the parameters we were hoping for. Thanks for your help.
