I haven't had time to experiment yet, but the various documents seem to give conflicting information about the x87 (scalar "legacy" FPU) instructions supported on Xeon Phi.

Here is the gist of my question: some of the documents describe the x87 unit as running independently of the vector floating-point hardware. By this I mean not using any of the resources of the vector floating-point system. In other words, it appears that a core could conceivably execute AVX-512 and x87 instructions concurrently. This is unlike the situation on the IA-32 and Intel 64 platforms.

Is the above true?

Jim Dempsey


I don't know what you mean by implying that the IA-32 and Intel 64 architectures can't execute SIMD and x87 instructions together in the pipeline.  Mixing them was normal in code compiled for Pentium III and Pentium M.  Those builds can still run on current CPUs, and the compile options are (I think) still present in the GNU compilers.

Early compilers for Intel(r) Xeon Phi(tm) also used x87 instructions for purposes such as scalar division, and that capability may still be there, although the compilers are now much more aggressive about avoiding them.  When we see x87 instructions in performance profiling today, it's usually a sign that something has gone wrong (e.g. garbage inputs to sincos), but it doesn't prevent the code from running.

If by "resources of the vector floating point system" you mean only the registers, that is a weakness of x87 on any of these architectures: there is no usable way to transfer data between the x87 and vector registers except through cache and memory.


Thanks for replying. By resources I mean that the x87 FPU does not use any portion of the vector divider, multiplier, adder, etc.

For example, an x87 operation issued by one thread within a core would then not interfere with the same operation issued by any other thread in that core. With the core's shared vector unit, the different hardware threads within a core queue up at some point (for a multiply or divide). Non-interfering x87 instructions on one hardware thread could let the otherwise idle fourth thread (often programmed out) do some useful scalar work without interfering with the vector work.

I am not at all up to speed on using the Xeon Phi (at maximum performance). I can get a significant performance improvement (10%-20%) by ganging up the hardware threads within a core. I will have an IDZ blog/technical article on this shortly. In most of the configurations of the test code, three threads per core work best. Adding the fourth thread, even executing non-vector integer instructions (e.g. doing the prefetch for the siblings in the core), usually slows down the process.  This may not be the case for the fourth thread when it issues rather expensive scalar floating point instructions (divide, sqrt, etc.). Someone on this forum may have experimented with this type of program mix (within a core).

Identifying problems that fit this configuration is a different issue. Of course you wouldn't transfer results between FPU and vector other than through memory (and cache).

If the technique is shown to have value, then perhaps a #pragma x87 could be considered to mark a statement or {scope} to be compiled using x87 instructions.

Jim Dempsey

Each Xeon Phi core does have an x87 register file and an x87 FP execution unit, which is coupled to pipeline 0. I think it is up to the compiler to insert the scalar FP code, and then up to the hardware arbiter or scheduler to round-robin the hardware threads, in order to fully utilize the x87 execution stack when there is a constant load on the vector register file and its execution unit(s).
