In December 2017, a colleague asked for some help in porting a synthetic "Peak GFLOPS" code to our Xeon Phi 7250 systems. This code just executes register-to-register VFMADD instructions in an unrolled loop. While trying to understand various idiosyncrasies of the performance characteristics, it became clear that the Xeon Phi x200 cannot execute VPU instructions at more than 6/7 of the nominal peak performance -- i.e., 12 VPU instructions should take 6 cycles, but we observe that 7 cycles are required.
This effect was observed in hundreds of test cases, using VPU instructions of any width, any latency, and any ISA. The effect does *not* apply to ALU or Memory instructions.
My colleague Damon McDougall (now at AMD) made a short presentation on this topic at the IXPUG Fall Conference in September 2018 (https://www.ixpug.org/components/com_solutionlibrary/assets/documents/15...)
A longer write-up, including description of some previously undocumented performance counter masks, is available at https://sites.utexas.edu/jdm4372/2018/01/22/a-peculiar-throughput-limita...
Note that this does not appear to impact the performance of any "real" codes! Even DGEMM is not impacted: about 20% of the instructions in DGEMM are not VPU instructions, so the 2-instruction-per-cycle limit caps the VPU rate at 0.8 x 2 = 1.6 instructions per cycle, which is below the 12/7 (about 1.71) VPU instructions per cycle limit explored here.
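
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It uses only the numbers quoted in this post (12 VPU instructions in 7 cycles, a 2-instruction-per-cycle limit, roughly 80% VPU instructions in DGEMM). The names `ISSUE_WIDTH`, `vpu_rate`, and `is_limited` are made up for illustration and do not come from any tool or from the write-up.

```python
from fractions import Fraction

# Numbers taken from the post; all names here are illustrative only.
ISSUE_WIDTH = 2                        # instructions per cycle (overall limit)
NOMINAL_VPU_RATE = Fraction(12, 6)     # 12 VPU instructions in 6 cycles
OBSERVED_VPU_RATE = Fraction(12, 7)    # 12 VPU instructions in 7 cycles


def vpu_rate(vpu_fraction, issue_width=ISSUE_WIDTH):
    """VPU instructions per cycle if the core issues at full width and
    only `vpu_fraction` of the instruction stream is VPU work."""
    return issue_width * Fraction(vpu_fraction)


def is_limited(vpu_fraction):
    """True if the instruction mix asks for more VPU throughput than the
    observed 12/7 per cycle, i.e. the limit would be visible."""
    return vpu_rate(vpu_fraction) > OBSERVED_VPU_RATE


# Fraction of nominal peak reachable by a pure-VPU loop.
peak_fraction = OBSERVED_VPU_RATE / NOMINAL_VPU_RATE

# DGEMM case from the post: about 80% of instructions are VPU instructions.
dgemm_rate = vpu_rate(Fraction(4, 5))

print("fraction of nominal peak:", peak_fraction)        # 6/7
print("observed VPU limit:", float(OBSERVED_VPU_RATE))
print("DGEMM VPU rate:", float(dgemm_rate))              # 1.6
print("DGEMM limited?", is_limited(Fraction(4, 5)))      # False
print("pure VPU loop limited?", is_limited(1))           # True
```

By the same arithmetic, an instruction mix would have to be more than 6/7 (about 86%) VPU instructions before this limit could show up, assuming the code otherwise sustains the full 2 instructions per cycle.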