What public disclosures has Intel made about Knights Landing?

The Intel® Xeon Phi™ Processor has launched!

Press Kit & Launch Keynote: https://newsroom.intel.com/press-kits/intel-isc/

Official Website: http://www.intel.com/xeonphi

Full Specifications: http://mark.intel.com/products/family/92650/r

Photos: http://download.intel.com/newsroom/kits/xeon/phi/intel-isc-xeon-phi.zip

|----------------------- The below section contains disclosures made prior to product launch -----------------------|


Knights Landing is the codename for Intel's 2nd generation Intel® Xeon Phi™ Product Family, which will deliver massive thread parallelism, data parallelism and memory bandwidth – with improved single-thread performance and Intel® Xeon® processor binary-compatibility in a standard CPU form factor.  Additionally, Knights Landing will offer integrated Intel® Omni-Path fabric technology, and also be available in the traditional PCIe* coprocessor form factor.

The following is a list of public disclosures that Intel has previously made about the forthcoming product:

ISA: binary compatible with Intel® Xeon® Processors with support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) [1]
Code transition: most of today’s parallel optimizations carry forward to KNL
Clustering modes:
(1) All-to-all: addresses uniformly hashed across all distributed directories
(2) Quadrant: chip divided into 4 quadrants, with the directory for an address residing in the same quadrant as the memory location; transparent to software
(3) Sub-NUMA Clustering: each quadrant (cluster) exposed to the O/S as a separate NUMA domain; software visible, analogous to a 4-socket Intel® Xeon® processor
Memory modes: cache, flat (allocation via fast Malloc / FASTMEM) and hybrid (part cache, part flat)
NUMA support: multiple NUMA domain support per socket
Form factor: bootable host processor (hosting O/S) and PCIe coprocessor (PCIe end-point device)
Platform memory: up to 384GB DDR4 using 6 channels, ~90GB/s sustained bandwidth
Reliability: “Intel server-class reliability”
Density: 3+ KNL with fabric in 1U
PCIe*: up to 36 lanes PCIe* Gen 3.0, 2x16 & 1x4
Fabric: 2 Intel® Omni-Path fabric ports
High-performance on-package memory (MCDRAM)
Up to 16GB at launch
NUMA support
Over 5x Energy Efficiency vs. GDDR5 [2]
Over 3x Density vs. GDDR5 [2]
In partnership with Micron Technology
8 memory channels
Transistors: over 8 billion transistors per die based on Intel’s 14 nm process technology
Cores: up to 72 cores (36 tiles)
Core: “Based on Intel® Atom™ core (based on Silvermont microarchitecture) with many HPC enhancements”
4 Threads / Core
2X Out-of-Order Buffer Depth [3]
Gather/scatter in hardware
Advanced Branch Prediction
High cache bandwidth
32KB Icache, Dcache
2 x 64B load ports in Dcache
2x BW between Dcache and L2 [3]
46/48 Physical/Virtual Address bits
Tile: 2 cores per tile with 2 vector processing units (VPU) per core
Architecture: 2D tile mesh architecture; every row and column of tiles is a ring, with messages arbitrated at injection and on turn
DMI: 4 lanes for chipset
Threading: back-to-back fetch & issue per thread, core resources dynamically repartitioned (shared) between threads, thread selection points
TLB (Translation Lookaside Buffer): 1st level uTLB (64 entries for 4K pages), 2nd level dTLB (256 entries for 4K pages, 128 for 2M, 16 for 1G)
L1: prefetcher
L2: 1MB shared between 2 cores in a tile, cache-coherent, prefetcher, 16-way, 1 line read, ½ line write per cycle, coherent across all tiles
CHA: caching/home agent distributed tag directory to keep L2s coherent, MESIF protocol, mesh connections
Cache: fast unaligned and cache-line split support
Icache: 32KB 8-way
Dcache: 32KB 8-way, 2x64B load ports, 1 store port
Buffers: 2-wide decode/rename/retire, 72 inflight uops/core out-of-order buffers, 72-entry ROB & rename buffers, up to 6-wide at execution, int(2x12) and FP(2x20) RS OoO, MEMRS(1x12) inorder with OoO completion, recycle buffer holds memory ops waiting for completion, int and mem RS hold source data while FP RS does not
VPU: 32SP and 16DP, X87, SSE & EMU support
NTB: non-transparent bridge to create PCIe coprocessor (processor commonality)
Performance monitoring reference manual for device driver developers (link)
Peak FLOPS: 3+ TeraFLOPS of double-precision peak theoretical performance per single-socket node (6+ TeraFLOPS of single-precision) [4]
Memory bandwidth: over 5x STREAM vs. DDR4 (over 400 GB/s) [5]
Single-threading: 3x Single-Thread Performance compared to Knights Corner [6]
SPECint*_rate_base2006: at least >~0.6x perf and >~1x perf/watt of 2-socket Intel® Xeon® processor E5-2697v3 (link pending) [7]
SPECfp*_rate_base2006: at least >~0.8x perf and >~1.2x perf/watt of 2-socket Intel® Xeon® processor E5-2697v3 (link pending) [7]
Machine (deep) learning: 2-2.6x single-node AlexNet training performance of 2-socket Intel® Xeon® processor E5-2699v3 (see slides - click on SPCS008) [7]
Machine (deep) learning: 3-4 hours to train OverFeat-FAST Network of 1.3M images of ImageNet-1k (see slides - click on SPCS008) [7]
Preproduction Intel® Xeon Phi™ processors are running in several supercomputing-class systems. Cray has a system currently running multiple customer applications in preparation for the supercomputer deployments at Los Alamos (Trinity system) and NERSC (Cori system). Systems are also installed at CEA (the French Alternative Energies and Atomic Energy Commission) by Atos and Sandia National Laboratories by Penguin Computing.
Knights Landing Developer Access Program (DAP): Pre-order a developer platform TODAY! http://dap.xeonphi.com
Early ship program -- contact your Intel representative to find out more
Intel Adams Pass board (1U half-width) is custom designed for Knights Landing (KNL) and will be available to system integrators for KNL launch; the board is OCP Open Rack 1.0 compliant, features 6 ch native DDR4 (1866/2133/2400MHz) and 36 lanes of integrated PCIe* Gen 3 I/O
Expecting over 50 system providers for the KNL host processor (numerous designs displayed at Supercomputing'15), in addition to many more PCIe*-card based solutions.
>100 Petaflops of committed customer deals to date
Live Knights Landing demos:
@ISC'15 (June 2015 in Frankfurt, GER): 2-node "MODAL" cosmic simulation
@SC'15 (Nov 2015 in Austin, TX, USA): 16-node "HPC for Music" simulation and 8-node "MPAS-O" weather simulation
Cori Supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016
“Trinity” Supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late-2015 and 2016.
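
The peak-throughput figures in the list above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-envelope check, not Intel's methodology. It assumes that "32SP and 16DP" means floating-point operations per cycle per VPU, and that a DDR4 channel moves 8 bytes per transfer (a 64-bit data bus). The 1.4 GHz clock in the usage section is a hypothetical value chosen for illustration; the disclosures above do not state a clock frequency.

```python
# Back-of-envelope peak arithmetic from the disclosed figures.
# Assumption: "VPU: 32SP and 16DP" is read as flops per cycle per VPU.
CORES = 72              # "up to 72 cores (36 tiles)"
VPUS_PER_CORE = 2       # "2 vector processing units (VPU) per core"
DP_FLOPS_PER_VPU = 16   # assumed double-precision flops/cycle/VPU
SP_FLOPS_PER_VPU = 32   # assumed single-precision flops/cycle/VPU


def peak_tflops(clock_ghz, flops_per_vpu):
    """Theoretical peak in TFLOPS for a given core clock in GHz."""
    flops_per_cycle = CORES * VPUS_PER_CORE * flops_per_vpu
    return flops_per_cycle * clock_ghz / 1000.0


def min_clock_ghz(target_tflops, flops_per_vpu):
    """Core clock in GHz needed to reach a target theoretical peak."""
    flops_per_cycle = CORES * VPUS_PER_CORE * flops_per_vpu
    return target_tflops * 1000.0 / flops_per_cycle


def ddr4_peak_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DDR4 bandwidth in GB/s (64-bit channels assumed)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


if __name__ == "__main__":
    # Clock needed for the quoted "3+ TeraFLOPS" double-precision peak.
    print(f"min clock for 3 DP TFLOPS: {min_clock_ghz(3.0, DP_FLOPS_PER_VPU):.3f} GHz")
    # Hypothetical 1.4 GHz clock, for illustration only.
    print(f"DP peak at 1.4 GHz: {peak_tflops(1.4, DP_FLOPS_PER_VPU):.2f} TFLOPS")
    print(f"SP peak at 1.4 GHz: {peak_tflops(1.4, SP_FLOPS_PER_VPU):.2f} TFLOPS")
    # 6 channels of DDR4-2400, the fastest speed listed for the Adams Pass board.
    print(f"DDR4 peak: {ddr4_peak_gbs(6, 2400):.1f} GB/s")
```

Under these assumptions, 72 cores give 2,304 double-precision flops per cycle, so the 3+ TeraFLOPS figure implies a core clock of roughly 1.3 GHz or higher. Six channels of DDR4-2400 give a theoretical peak of 115.2 GB/s, of which the quoted ~90GB/s sustained figure would be a little under 80%.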

The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as a part of the CORAL* program, with a combined value of over $200 million.  Intel is teaming with Cray* on both systems.  Scheduled for 2016, Theta will have greater than 8.5 petaFLOPs and more than 2,500 nodes, featuring the Intel® Xeon Phi™ processor (Knights Landing), Cray* Aries* interconnect and Cray’s* XC* supercomputing platform.  Scheduled for 2018, Aurora is the second and largest system with 180-450 petaFLOP/s and approximately 50,000 nodes, featuring the next-generation Intel® Xeon Phi™ processor (Knights Hill), 2nd generation Intel® Omni-Path fabric, Cray’s* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high bandwidth on-package memory.

Knights Hill is the codename for the 3rd generation of the Intel® Xeon Phi™ product family
Based on Intel’s 10 nanometer manufacturing technology
Integrated 2nd generation Intel® Omni-Path Host Fabric Interface




 *Other names and brands may be claimed as the property of others.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

[1] Binary compatible with Intel® Xeon® Processors v3 (Haswell) with the exception of Intel® TSX (Transactional Synchronization Extensions)
[2] Projected result based on internal Intel analysis comparison of 16GB of ultra high-bandwidth memory to 16GB of GDDR5 memory used in the Intel® Xeon Phi™ coprocessor 7120P.
[3] Compared to the Intel® Atom™ core (based on Silvermont microarchitecture)
[4] Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle.
[5] Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth memory versus DDR4 memory with all channels populated.
[6] Projected peak theoretical single-thread performance relative to 1st Generation Intel® Xeon Phi™ Coprocessor 7120P
[7] See configuration details and disclaimers using provided hyperlink





This question is probably asked numerous times, but I think it is important.  I have a PC laptop and I am wondering if it is possible to add the Xeon Phi coprocessor card using a PCIe to Thunderbolt 3 chassis.  Will this work?





The specification link, http://mark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server seems to be inactive

Thanks for the detailed information.

I know that it supports Linux and Windows Server, but does it also support Windows (7, 8) Enterprise or Professional editions?

Hi Mr. Gardner,
Exciting news, but let me ask you a few semi-lame questions.

Yesterday I saw a video showing how amazing nVIDIA TITAN X is, I wonder how competitive KNL will be, given that both pieces of hardware are built from 8 billion transistors and have 12GB vs 16GB!
From a C programming standpoint, I am curious how KNL will "counter" those 3072 CUDA cores. Are those 240 threads enough to compete in, e.g., an OpenMP vs. CUDA showdown?

One superb piece of software using both OpenMP and CUDA:

I am under the impression that CUDA is more powerful (speaking of sorting) than any Intel CPU. Is that so?

What interests me most is writing a 100% free C etude that implements fuzzy search using 16 or more threads. So far I have written such an etude using OpenMP and scanned the English Wikipedia in 5 hours on 16 cores; naturally I am aiming for a few minutes, not a few hours:

And please add some more photos (like 'KNL die.jpg'), but sharper and at higher resolution. I appreciate close-ups of a fine piece of hardware; they inspire me and make me dream of better, faster text processing.
