The Intel® Xeon Phi™ Processor has launched!
Press Kit & Launch Keynote: https://newsroom.intel.com/press-kits/intel-isc/
Official Website: http://www.intel.com/xeonphi
Full Specifications: http://mark.intel.com/products/family/92650/r
|----------------------- The below section contains disclosures made prior to product launch -----------------------|
Knights Landing is the codename for Intel's 2nd generation Intel® Xeon Phi™ Product Family, which will deliver massive thread parallelism, data parallelism and memory bandwidth – with improved single-thread performance and Intel® Xeon® processor binary compatibility in a standard CPU form factor. Additionally, Knights Landing will offer integrated Intel® Omni-Path fabric technology, and will also be available in the traditional PCIe* coprocessor form factor.
The following is a list of public disclosures that Intel has previously made about the forthcoming product:
|ISA: binary compatible with Intel® Xeon® Processors with support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) [1]|
|Code transition: most of today’s parallel optimizations carry forward to Knights Landing (KNL)|
|Clustering modes: (1) All-to-all: addresses uniformly hashed across all distributed directories (2) Quadrant: chip divided into 4 quadrants, with the directory for an address residing in the same quadrant as its memory location; transparent to software (3) Sub-NUMA Clustering: each quadrant (cluster) exposed as a separate NUMA domain to the O/S; software visible, analogous to a 4-socket Intel® Xeon® processor|
|Memory modes: cache, flat (allocation via fast Malloc / FASTMEM) and hybrid (part cache, part flat)|
|NUMA support: multiple NUMA domain support per socket|
|Form factor: bootable host processor (hosting O/S) and PCIe coprocessor (PCIe end-point device)|
|Platform memory: up to 384GB DDR4 using 6 channels, ~90GB/s sustained bandwidth|
|Reliability: “Intel server-class reliability”|
|Density: 3+ KNL with fabric in 1U|
|PCIe*: up to 36 lanes PCIe* Gen 3.0, 2x16 & 1x4|
|Fabric: 2 Intel® Omni-Path fabric ports|
|High-performance on-package memory (MCDRAM)||
|Transistors: over 8 billion transistors per die based on Intel’s 14 nm process technology|
|Cores: up to 72 cores (36 tiles)|
|Core: “Based on Intel® Atom™ core (based on Silvermont microarchitecture) with many HPC enhancements”||
|Tile: 2 cores per tile with 2 vector processing units (VPU) per core|
|Architecture: 2D tile mesh architecture; every row & column of tiles is a ring with messages arbitrated at injection and on turn|
|DMI: 4 lanes for chipset|
|Threading: back-to-back fetch & issue per thread, core resources dynamically repartitioned (shared) between threads, thread selection points|
|TLB (Translation Lookaside Buffer): 1st level uTLB (64 entries for 4K pages), 2nd level dTLB (256 entries for 4K pages, 128 for 2M, 16 for 1G)|
|L2: 1MB shared between 2 cores in a tile, cache-coherent, prefetcher, 16-way, 1 line read, ½ line write per cycle, coherent across all tiles|
|CHA: caching/home agent distributed tag directory to keep L2s coherent, MESIF protocol, mesh connections|
|Cache: fast unaligned and cache-line split support|
|Icache: 32KB 8-way|
|Dcache: 32KB 8-way, 2x64B load ports, 1 store port|
|Buffers: 2-wide decode/rename/retire, 72 inflight uops/core out-of-order buffers, 72-entry ROB & rename buffers, up to 6-wide at execution, int(2x12) and FP(2x20) RS OoO, MEMRS(1x12) inorder with OoO completion, recycle buffer holds memory ops waiting for completion, int and mem RS hold source data while FP RS does not|
|VPU: 32SP and 16DP, X87, SSE & EMU support|
|NTB: non-transparent bridge to create PCIe coprocessor (processor commonality)|
|Performance monitoring reference manual for device driver developers (link)|
|Peak FLOPS: 3+ TeraFLOPS of double-precision peak theoretical performance per single-socket node (6+ TeraFLOPS of single-precision) [4]|
|Memory bandwidth: over 5x STREAM vs. DDR4 (over 400 GB/s) [5]|
|Single-threading: 3x single-thread performance compared to Knights Corner [6]|
|SPECint*_rate_base2006: at least ~0.6x perf and ~1x perf/watt of a 2-socket Intel® Xeon® processor E5-2697v3 (link pending) [7]|
|SPECfp*_rate_base2006: at least ~0.8x perf and ~1.2x perf/watt of a 2-socket Intel® Xeon® processor E5-2697v3 (link pending) [7]|
|Machine (deep) learning: 2-2.6x the single-node AlexNet training performance of a 2-socket Intel® Xeon® processor E5-2699v3 (see slides - click on SPCS008) [7]|
|Machine (deep) learning: 3-4 hours to train the OverFeat-FAST network on the 1.3M images of ImageNet-1k (see slides - click on SPCS008) [7]|
|Preproduction Intel® Xeon Phi™ processors are running in several supercomputing-class systems. Cray has a system currently running multiple customer applications in preparation for the supercomputer deployments at Los Alamos (Trinity system) and NERSC (Cori system). Systems are also installed at CEA (the French Alternative Energies and Atomic Energy Commission) by Atos and Sandia National Laboratories by Penguin Computing.|
|Knights Landing Developer Access Program (DAP): Pre-order a developer platform TODAY! http://dap.xeonphi.com|
|Early ship program -- contact your Intel representative to find out more|
|Intel Adams Pass board (1U half-width) is custom designed for Knights Landing (KNL) and will be available to system integrators for KNL launch; the board is OCP Open Rack 1.0 compliant, features 6 ch native DDR4 (1866/2133/2400MHz) and 36 lanes of integrated PCIe* Gen 3 I/O|
|Expecting over 50 system providers for the KNL host processor (numerous designs displayed at Supercomputing'15), in addition to many more PCIe*-card based solutions.|
|>100 Petaflops of committed customer deals to date|
|Live Knights Landing demos:
@ISC'15 (June 2015 in Frankfurt, Germany): 2-node "MODAL" cosmic simulation
@SC'15 (Nov 2015 in Austin, TX, USA): 16-node "HPC for Music" simulation and 8-node "MPAS-O" weather simulation|
|Cori Supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016|
|“Trinity” Supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late-2015 and 2016.|
|The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as a part of the CORAL* program, with a combined value of over $200 million. Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta will have greater than 8.5 petaFLOP/s and more than 2,500 nodes, featuring the Intel® Xeon Phi™ processor (Knights Landing), Cray* Aries* interconnect and Cray’s* XC* supercomputing platform. Scheduled for 2018, Aurora is the second and larger system with 180-450 petaFLOP/s and approximately 50,000 nodes, featuring the next-generation Intel® Xeon Phi™ processor (Knights Hill), 2nd generation Intel® Omni-Path fabric, Cray’s* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high bandwidth on-package memory.|
|Knights Hill is the codename for the 3rd generation of the Intel® Xeon Phi™ product family||
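The TLB row in the table above lists entry counts per page size. A rough way to read those numbers is as "reach": the amount of address space the TLB can map when every entry is in use. The following is a minimal sketch of that arithmetic, using only the entry counts from the table; the helper names `tlb_reach` and `human` are hypothetical and chosen for illustration.

```python
# Entry counts per page size, copied from the TLB row of the table above.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

UTLB_ENTRIES = {4 * KIB: 64}                               # 1st level uTLB
DTLB_ENTRIES = {4 * KIB: 256, 2 * MIB: 128, 1 * GIB: 16}   # 2nd level dTLB


def tlb_reach(entries_by_page_size):
    """Map each page size (bytes) to the bytes covered when all entries are used."""
    return {page: page * count for page, count in entries_by_page_size.items()}


def human(n_bytes):
    """Format a byte count that is an exact multiple of KiB/MiB/GiB."""
    for unit, size in (("GiB", GIB), ("MiB", MIB), ("KiB", KIB)):
        if n_bytes >= size:
            return f"{n_bytes // size} {unit}"
    return f"{n_bytes} B"


if __name__ == "__main__":
    for name, table in (("uTLB", UTLB_ENTRIES), ("dTLB", DTLB_ENTRIES)):
        for page, reach in tlb_reach(table).items():
            print(f"{name} {human(page)} pages: reach {human(reach)}")
```

Under this reading, the 2nd level dTLB covers 1 MiB with 4K pages, 256 MiB with 2M pages and 16 GiB with 1G pages, which suggests why large pages may matter for workloads with big working sets. Actual miss behavior depends on access patterns and is not modeled here.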
*Other names and brands may be claimed as the property of others.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual
1 Binary compatible with Intel® Xeon® Processors v3 (Haswell) with the exception of Intel® TSX (Transactional Synchronization Extensions)
2 Projected result based on internal Intel analysis comparison of 16GB of ultra high-bandwidth memory to 16GB of GDDR5 memory used in the Intel® Xeon Phi™ coprocessor 7120P.
3 Compared to the Intel® Atom™ core (based on Silvermont microarchitecture)
4 Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle.
5 Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth versus DDR4 memory with all channels populated.
6 Projected peak theoretical single-thread performance relative to 1st Generation Intel® Xeon Phi™ Coprocessor 7120P
7 See configuration details and disclaimers using provided hyperlink
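Footnote 4 states that the peak FLOPS figure follows from core count, clock frequency and floating point operations per cycle. The sketch below works that relationship backwards: given the core and VPU counts from the table, it estimates the clock frequency that a 3 TeraFLOPS double-precision peak would imply. Two things here are assumptions rather than disclosed facts: the reading of "32SP and 16DP" as FLOPs per cycle per VPU (8 double-precision AVX-512 lanes times 2 for fused multiply-add), and the function names, which are hypothetical. No clock frequency appears in the disclosures above, so the result is an implied value only.

```python
# Figures taken from the table above.
CORES = 72          # "up to 72 cores (36 tiles)"
VPUS_PER_CORE = 2   # "2 vector processing units (VPU) per core"

# ASSUMPTION: "32SP and 16DP" is interpreted as FLOPs per cycle per VPU
# (8 DP or 16 SP AVX-512 lanes, times 2 for fused multiply-add).
DP_FLOPS_PER_VPU_CYCLE = 16
SP_FLOPS_PER_VPU_CYCLE = 32


def flops_per_cycle(flops_per_vpu_cycle, cores=CORES, vpus=VPUS_PER_CORE):
    """Chip-wide peak floating point operations per clock cycle."""
    return cores * vpus * flops_per_vpu_cycle


def min_clock_ghz(target_tflops, flops_per_vpu_cycle):
    """Clock frequency (GHz) needed to reach target_tflops at peak issue rate."""
    hz = target_tflops * 1e12 / flops_per_cycle(flops_per_vpu_cycle)
    return hz / 1e9


if __name__ == "__main__":
    dp = flops_per_cycle(DP_FLOPS_PER_VPU_CYCLE)
    print(f"DP FLOPs per cycle, whole chip: {dp}")
    print(f"Clock implied by 3 DP TFLOPS: {min_clock_ghz(3, DP_FLOPS_PER_VPU_CYCLE):.2f} GHz")
    print(f"Clock implied by 6 SP TFLOPS: {min_clock_ghz(6, SP_FLOPS_PER_VPU_CYCLE):.2f} GHz")
```

With these inputs the chip would issue 2,304 double-precision FLOPs per cycle, so a 3 TeraFLOPS peak corresponds to roughly 1.3 GHz. The single-precision figure gives the same implied clock, which is consistent with the "6+ TeraFLOPS" entry being exactly twice the double-precision one.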