While there are many different programming models for the Intel® Xeon Phi™ coprocessor (code-named Knights Corner (KNC)), this paper lists the more prevalent KNC programming models and further discusses some of the necessary changes to port and optimize KNC models for the Intel® Xeon Phi™ processor x200 (code-named Knights Landing (KNL)) self-boot (SB) platform.
Instruction Set Compatibility
Virtually all applications running today on an Intel® Xeon® processor-based platform will run on a KNL SB platform without modification. However, recompiling your application for KNL is recommended to achieve the best performance.
KNL supports the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set architecture (ISA), a set of 512-bit vector extensions to the 256-bit Intel® Advanced Vector Extensions 2 (Intel® AVX2) SIMD instructions supported on current Intel Xeon processors. Because KNL also supports the earlier instruction sets, applications that currently run on Intel Xeon processors can run on KNL. However, performance may be lower than if the application exploited the KNL ISA. You should use the KNL AVX-512 ISA, which includes foundation instructions, conflict detection instructions (CDI), exponential and reciprocal instructions (ERI), and prefetch instructions (PFI). Note: KNC supported a different 512-bit ISA, so all KNC applications must be recompiled and ported to KNL.
The Intel® Advanced Vector Extensions instructions on previous Intel® Xeon® processor families and the Intel® Xeon Phi™ processor x200 (KNL).
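One way to confirm that a build actually targets the KNL AVX-512 subsets is to test the feature macros that GCC, Clang, and the Intel compiler predefine when the corresponding instructions are enabled. The following is a minimal sketch; the enum and the function names `knl_isa_compiled_in` and `knl_isa_report` are illustrative assumptions, not part of any Intel API.

```c
#include <stdio.h>

/* Bit flags for the four KNL AVX-512 subsets named in the text.
   These names are hypothetical and exist only for this sketch. */
enum {
    KNL_F  = 1 << 0,  /* foundation instructions */
    KNL_CD = 1 << 1,  /* conflict detection instructions */
    KNL_ER = 1 << 2,  /* exponential and reciprocal instructions */
    KNL_PF = 1 << 3   /* prefetch instructions */
};

/* Returns a bitmask of the AVX-512 subsets this file was compiled for.
   The answer is fixed at compile time by the compiler's target flags. */
int knl_isa_compiled_in(void)
{
    int mask = 0;
#ifdef __AVX512F__
    mask |= KNL_F;
#endif
#ifdef __AVX512CD__
    mask |= KNL_CD;
#endif
#ifdef __AVX512ER__
    mask |= KNL_ER;
#endif
#ifdef __AVX512PF__
    mask |= KNL_PF;
#endif
    return mask;
}

/* Prints one line per subset so a build log shows what was enabled. */
void knl_isa_report(FILE *out)
{
    int m = knl_isa_compiled_in();
    fprintf(out, "AVX-512F  : %s\n", (m & KNL_F)  ? "yes" : "no");
    fprintf(out, "AVX-512CD : %s\n", (m & KNL_CD) ? "yes" : "no");
    fprintf(out, "AVX-512ER : %s\n", (m & KNL_ER) ? "yes" : "no");
    fprintf(out, "AVX-512PF : %s\n", (m & KNL_PF) ? "yes" : "no");
}
```

With a KNL target option (for example `-xMIC-AVX512` with the Intel compiler or `-march=knl` with GCC versions that support it), all four lines would be expected to report "yes". A generic Intel Xeon build would typically report "no" for each, which signals a binary that runs on KNL but does not use its ISA.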
As mentioned, nearly everything that runs on an Intel® Xeon® E5-2600 v4 product family (code-named Broadwell-EP) platform will run on KNL. Even legacy binaries from several generations ago will run out-of-the-box with few exceptions. How well code will run mostly depends on how well optimized or efficient the workload is in terms of core scaling, vector scaling, and memory bandwidth.
Optimizations that improve core scaling or parallel efficiency will benefit the application on both Intel Xeon processors and KNL, but KNL to a much greater degree since it has many more cores and threads.
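As an illustration of core scaling, the sketch below spreads a reduction loop across threads with a standard OpenMP directive. The function name `sum_of_squares` is a hypothetical example, and OpenMP is assumed here only as one common threading model; the same source serves both Intel Xeon processors and KNL.

```c
#include <stddef.h>

/* Sum of squares with loop iterations divided among OpenMP threads.
   Each thread accumulates a private partial sum that OpenMP combines
   at the end. Without OpenMP enabled, the pragma is ignored and the
   loop runs serially with the same result. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i)
        sum += x[i] * x[i];
    return sum;
}
```

Building with OpenMP enabled (for example `-qopenmp` with the Intel compiler or `-fopenmp` with GCC) lets the runtime use however many threads the platform offers, so a loop that scales well gains more on KNL simply because more threads are available.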
Optimizations that improve vector scaling or SIMD efficiency will also benefit the application on both Intel Xeon processors and KNL. If you recompile for KNL AVX-512, KNL gains much more, because the code can then exploit KNL AVX-512 features such as masking and a larger number of wider registers.
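A conditional loop is a typical case where masking matters: with mask registers, a compiler can vectorize the loop body and apply the update only to the lanes where the condition holds. The sketch below is plain C with a standard `#pragma omp simd` hint; the function name `scaled_add_positive` is hypothetical, and whether a given compiler actually vectorizes the loop should be checked in its vectorization report.

```c
#include <stddef.h>

/* Adds scale * x[i] to y[i] only where x[i] is positive.
   'restrict' tells the compiler that x and y do not overlap, which
   removes a common obstacle to vectorization. The if-statement can be
   implemented with a per-lane mask instead of a branch. */
void scaled_add_positive(float *restrict y, const float *restrict x,
                         float scale, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > 0.0f)
            y[i] += scale * x[i];
    }
}
```

The same source compiles for Intel AVX2 or Intel AVX-512; only the target option changes, which is why recompiling is usually the first step before considering hand-written vector code.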
If your workload is sensitive to memory bandwidth, KNL’s MCDRAM (high-bandwidth memory) may offer high value, perhaps with little effort. If your workload’s total memory footprint is less than 16 GB, you can place the entire workload in MCDRAM and see much higher effective memory bandwidth (more than 4x that of DDR). If your required memory size is larger than 16 GB, you can use KNL’s cache mode, in which MCDRAM acts as a memory-side cache for DDR4 memory, or you can use the memkind library, now available on GitHub*, to place selected data in MCDRAM.
Integrated On-Package Memory Usage Models on the Intel® Xeon Phi™ Processor x200.
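The memkind library ships an `hbwmalloc` interface (`hbw_check_available`, `hbw_malloc`, `hbw_free`) for placing individual allocations in high-bandwidth memory. The sketch below wraps it so that the same code builds with or without memkind. The build switch `USE_HBW` and the helper names `fits_in_mcdram`, `bw_alloc`, and `bw_free` are assumptions made for this example, and the 16 GB capacity is treated as 16 GiB for simplicity.

```c
#include <stdlib.h>
#include <stddef.h>
#ifdef USE_HBW              /* hypothetical build switch for this sketch */
#include <hbwmalloc.h>      /* hbwmalloc interface from the memkind library */
#endif

/* MCDRAM capacity assumed from the text (16 GB, taken here as 16 GiB). */
#define MCDRAM_BYTES (16ULL << 30)

/* Nonzero if a working set of the given size could live entirely in MCDRAM. */
int fits_in_mcdram(unsigned long long bytes)
{
    return bytes <= MCDRAM_BYTES;
}

/* Allocates from high-bandwidth memory when memkind reports it available,
   otherwise from the regular heap. *in_hbw records which allocator was used
   so that bw_free can release the block correctly. */
void *bw_alloc(size_t bytes, int *in_hbw)
{
    *in_hbw = 0;
#ifdef USE_HBW
    if (hbw_check_available() == 0) {   /* 0 means HBW memory is present */
        void *p = hbw_malloc(bytes);
        if (p != NULL) {
            *in_hbw = 1;
            return p;
        }
    }
#endif
    return malloc(bytes);
}

/* Releases a block obtained from bw_alloc. */
void bw_free(void *p, int in_hbw)
{
#ifdef USE_HBW
    if (in_hbw) {
        hbw_free(p);
        return;
    }
#endif
    (void)in_hbw;
    free(p);
}
```

A reasonable approach is to reserve `bw_alloc` for the few arrays that dominate memory traffic and leave everything else on the regular heap. When the whole working set fits, or when MCDRAM is configured as a cache, no source change is needed at all.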
Two issues must be considered when migrating KNC applications to KNL self-boot:
- The implementation type
- The level of intrinsics and assembly code used
If you used the Intel® tools (compilers and performance libraries) for KNC and did not add assembly code or intrinsics, you only need to recompile for KNL. Some of the optimizations that were needed to get good performance on KNC are tolerated on KNL but may no longer be necessary. One example is data alignment. On KNC, an unaligned instruction operating on aligned data incurred a performance penalty, but on KNL there is no penalty for an unaligned instruction processing aligned data. This does not mean you must remove alignment code when migrating to KNL. It simply means that the alignment requirements for KNC were stringent, whereas those for KNL are much more flexible; keeping the existing alignment code costs little or nothing and does no harm.
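Alignment code carried over from KNC typically looks something like the sketch below, which allocates buffers on 64-byte boundaries (the width of one 512-bit vector) using the standard C11 `aligned_alloc`. The helper names `alloc_floats_aligned` and `is_vec_aligned` are hypothetical, and actual KNC code may have used a different allocator.

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

#define VEC_ALIGN 64   /* bytes in one 512-bit vector register */

/* Allocates room for n floats on a 64-byte boundary.
   aligned_alloc requires the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the buffer with free(). */
float *alloc_floats_aligned(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (bytes == 0)
        bytes = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, bytes);
}

/* Nonzero if the pointer sits on a 64-byte boundary. */
int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}
```

Code like this can stay in place on KNL unchanged. It is cheap, and aligned buffers remain a sensible default even where the hardware no longer demands them.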
If you wrote key portions of your application using assembly or KNC intrinsics, these will have to be rewritten for KNL. Since both are 512-bit SIMD with masking, most of the intrinsics porting should be easy. White papers on adapting KNC intrinsics to KNL are available at the Intel® Developer Zone.
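To give a sense of what the KNL side of such a port looks like, here is a minimal sketch of a masked operation written with Intel AVX-512 intrinsics, with a scalar fallback so that the file still builds when AVX-512 is not enabled. The function name `add_where_greater16` is hypothetical, and the sketch shows only the AVX-512 form; it does not reproduce any particular KNC original.

```c
#include <stddef.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

/* For each of 16 float lanes:
     out[i] = (a[i] > b[i]) ? a[i] + b[i] : a[i]               */
void add_where_greater16(float *out, const float *a, const float *b)
{
#ifdef __AVX512F__
    __m512 va = _mm512_loadu_ps(a);          /* unaligned loads are allowed */
    __m512 vb = _mm512_loadu_ps(b);
    /* One mask bit per lane, set where a > b. */
    __mmask16 k = _mm512_cmp_ps_mask(va, vb, _CMP_GT_OQ);
    /* Lanes with the mask bit set get a + b; the rest keep the value in va. */
    __m512 vr = _mm512_mask_add_ps(va, k, va, vb);
    _mm512_storeu_ps(out, vr);
#else
    /* Scalar fallback with the same semantics. */
    for (size_t i = 0; i < 16; ++i)
        out[i] = (a[i] > b[i]) ? a[i] + b[i] : a[i];
#endif
}
```

Keeping a scalar reference path next to each intrinsic routine, as above, makes it straightforward to verify a ported routine lane by lane.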
Working with KNC Implementation Methods
Most KNC applications were implemented using one of the following models:
- Native
- Symmetric
- Offload
Native is the simplest form, and a simple recompile for KNL SB will go a long way toward creating a KNL SB binary. Most applications that ran well on KNC will run quite well on KNL SB. With a symmetric model, some ranks run on an Intel Xeon processor and some on KNC; in this case, too, a simple recompile should be sufficient to get the application running, and in most cases running quite well, on KNL.
An offload usage model runs part of the workload on the host and part of it on the KNC coprocessor. This code ports easily to the KNL coprocessor, but this paper focuses on the self-boot platform. You can run on the self-boot platform either by using the best host version of the workload or by using the coprocessor version, which falls back to running on the host when it detects no coprocessor. If you made vectorization and threading optimizations for KNC, you will want to reuse them on KNL SB platforms.
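The host-fallback behavior can be illustrated with standard OpenMP target directives, under the assumption that the offload regions were written with OpenMP rather than with Intel compiler-specific offload pragmas. In the sketch below, the function name `scale_in_place` is hypothetical. With no offload device present, an OpenMP runtime executes the target region on the host, and a compiler without OpenMP enabled ignores the pragmas entirely.

```c
#include <stddef.h>

/* Multiplies every element of x by factor.
   The target region is a request to run on an attached device and to copy
   x there and back. On a self-boot system with no device, the region runs
   on the host, where the inner parallel loop still uses the host threads. */
void scale_in_place(double *x, size_t n, double factor)
{
    long i;
    #pragma omp target map(tofrom: x[0:n])
    #pragma omp parallel for
    for (i = 0; i < (long)n; ++i)
        x[i] *= factor;
}
```

Either way, the threading and vectorization work done inside the offloaded region is the part that carries over to the self-boot platform.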
KNC had some peculiar microarchitecture and compiler deficiencies that forced some developers to resort to intrinsics in their code (for example, for gather/scatter, prefetch, and alignment). KNL has made significant microarchitectural enhancements over KNC, so it is highly recommended that you recompile the original reference code and use it as your starting point for KNL.
You must make an independent assessment of whether you should revert to intrinsics or assembly on KNL, but in many cases where this was needed on KNC, the need has most likely been eliminated on KNL. The main reasons are the maturity and increased capabilities of the Intel compiler, the Intel AVX-512 ISA, and the microarchitecture improvements. The vectorization reports in the Intel compilers have been greatly improved to help you assess and improve vectorization, and the vectorization tool in Intel® Advisor XE provides interactive assistance in identifying and exploiting unrealized vectorization opportunities.
The KNL-based platform is transformative in its compatibility with legacy binaries, adherence to open industry-standard development tools and methodologies, and its ability to reveal more value from the most scalable applications. The more you improve the scalability of your software, the better performance you can achieve on the KNL SB-based platform. Please look for more information at the Intel Developer Zone.