The Intel® Many Integrated Core Architecture (Intel® MIC Architecture) provides a product family optimized to deliver performance for highly parallel applications or highly parallel kernels within applications. The Intel® Xeon Phi™ coprocessor is a member of this product family.
This chapter describes the characteristics of applications that are ideally suited for running on highly parallel and highly vectorized architectures, such as the Intel® Xeon Phi™ coprocessor.
Finding the Optimal Architecture for YOUR Application
Many applications have relied on processor speed bumps to carry them forward without underlying changes to coding methods and practices. The Intel® Xeon® processor and the Intel® Compilers have evolved over the years to deliver outstanding performance to a wide range of applications. So will your application run optimally on the Intel® MIC architecture? Will it be as easy as a recompilation? Not necessarily. Not all applications are ideal candidates for optimal performance on an Intel® Xeon Phi™ coprocessor. Some applications may obtain good performance on Intel® MIC architecture with little to no change to your code: a simple recompilation with a few additional or new compiler options in your application build process. Other applications may require minor additional rework, such as adding compiler pragmas or directives to the code along with compiler options designed for targeting Intel® MIC architecture. Some applications may require algorithmic or high-level rework to obtain optimal performance, such as changing arrays of structures (AoS) to structures of arrays (SoA), removing excessive pointer chasing, or aligning major data structures. And finally, you may find that some applications in their current form see ideal performance on a general-purpose processor such as a Genuine Intel® Xeon® processor, without a major overhaul of the application code.
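To make the AoS-to-SoA rework mentioned above concrete, here is a minimal sketch in C. The struct names and the sum-of-x loop are hypothetical illustrations, not taken from any particular application; the point is the memory layout each version presents to the compiler.

```c
#include <stddef.h>

#define N 1024

/* Array of structures (AoS): the x, y, z of one point are adjacent in
 * memory, so a loop that reads only x strides over unused y and z. */
struct PointAoS { float x, y, z; };

/* Structure of arrays (SoA): all x values are contiguous, so the
 * compiler can load them with unit-stride vector loads. */
struct PointsSoA { float x[N], y[N], z[N]; };

/* The same computation in both layouts: sum of all x coordinates. */
float sum_x_aos(const struct PointAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;            /* stride of sizeof(struct PointAoS) bytes */
    return s;
}

float sum_x_soa(const struct PointsSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];           /* unit stride: vectorizes cleanly */
    return s;
}
```

Both functions return the same result; the difference is that the SoA loop reads a dense, unit-stride stream, which is the pattern auto-vectorizers handle best.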
Before attempting to port applications to the Intel® Xeon Phi™ coprocessor, you should first consider your application's characteristics on the host CPU and its suitability to run efficiently on Intel® Xeon Phi™. Although most applications can be ported to run on the Intel® Xeon Phi™ coprocessor, getting most applications to run fast and efficiently requires that they already be highly parallel and highly vectorizable. The Intel Compilers can help you get the best performance from your application in its current state, but they are not a magic tool that can convert poorly performing CPU code into stellar-performing code on Intel® Xeon Phi™ without user assistance. This leads to the first question you should ask yourself:
"Is my application highly parallel and highly vectorized in its current state running on general-purpose CPU processor-based hosts?"
If the answer is either "No" or "I really don't know," the next questions should be:
"Am I willing to work on my application to make it more parallel and more vectorized BEFORE I attempt a port to Intel® MIC Architecture?"
"How much effort will it take to make this application run optimally on Intel® MIC architecture?"
Most applications will benefit from a preliminary reexamination, a performance analysis, and perhaps some tuning. Changes made in this process will benefit application performance on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. At the end of this analysis you may determine that your application is well suited to running on Intel® MIC Architecture. Then again, you may find that for this specific application, ideal performance is to be found on a general-purpose processor such as a Genuine Intel® Xeon® processor. In either case, Intel provides solutions for optimal application performance.
Preliminary Work: Finding the Optimal Target Platform for your Application
Your next step is to determine the amount of parallelization and vectorization exhibited by the application in its current form while running on a host CPU system. For an overview of the architectural considerations that distinguish Intel® Xeon® processors from Intel® Xeon Phi™ coprocessors, read this article and paper by James Reinders.
Is My Application Highly Parallel?
Hopefully this is a question to which you can answer "Yes" without hesitation. If your application uses some form of threaded parallelism, such as OpenMP*, Intel® Threading Building Blocks, Intel® Cilk™ Plus spawn/sync, or auto-parallelization from the Intel compilers, you should be seeing nearly linear scaling up to the maximum number of physical cores in your server (8 or more cores). If this is true, skip ahead to the next question below. Likewise, if your application uses MPI* and exhibits good scaling, skip ahead to the next question.
If your application is not parallelized or is not scaling well, STOP and work on your application before moving to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). You cannot be successful on a highly parallel architecture with an application that is NOT highly parallel. This is a good time to work on the code to improve its parallelization and scalability. Below are some resources to help.
Intel Parallel Universe Magazine provides a wealth of information on parallelization techniques
Intel® Developer Zone has numerous videos on parallelization techniques and technologies
Intel® Threading Building Blocks (Intel® TBB) is an ideal solution for C++ programmers comfortable working with template libraries
Intel® Cilk™ Plus is an extension to C and C++ that offers a quick, easy, and reliable way to improve the performance of programs on multicore processors.
OpenMP* is an open standard fully supported by both the Intel® C/C++ and Intel® Fortran compilers. Here is a Getting Started with OpenMP* guide from Intel, and here is the OpenMP* Specification website. Your compiler also comes with documentation on OpenMP* and provides sample code in your product's Sample/ directory.
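As a minimal sketch of the OpenMP* style of threaded parallelism described above, here is a hypothetical example (the function and array are illustrations, not from any particular application). When compiled with OpenMP enabled (for example, -qopenmp with the Intel compiler or -fopenmp with GCC), the loop iterations are divided among threads; without OpenMP, the pragma is ignored and the loop runs serially with the same result.

```c
#define N 1000000

/* Sum an array in parallel. The reduction(+:total) clause gives each
 * thread a private partial sum and combines them at the end of the
 * loop, avoiding a data race on the shared accumulator. */
double parallel_sum(const double *a, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```

A loop like this is the kind that should show near-linear scaling as you add physical cores, since each iteration is independent.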
Is My Application Vectorized?
Vectorization is a form of data parallelism (SIMD, Single Instruction Multiple Data) provided by Intel® Architecture platforms. Vectorization is enabled by default at optimization level -O2 and above. The compiler options -x[arch] and -ax[arch] control vectorization on Intel® Xeon® processors. To disable vectorization, use the -no-vec -no-simd compiler options: -no-vec disables all auto-vectorization, including vectorization of array notation statements, and -no-simd disables vectorization of loops that carry a SIMD pragma.
As a first step, on your CPU host server, compile your application with -xhost or an appropriate -x[arch] option, along with all the other options used to build your application, at -O2 or -O3. Run a serial (1 process, 1 thread) version of the application. Next, remove the -xhost or -x[arch] option and add -no-vec -no-simd. Once again, run your application with 1 process and 1 thread on your Intel® Xeon® host. Compare timings between the vectorized and non-vectorized cases. Did vectorization improve your application's performance? If so, and if the speedup was significant, that is a good indicator that your application is taking advantage of vectorization. If not, do not worry: this guide has an entire chapter on improving vectorization.
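The comparison above can be sketched with a simple kernel. The function below is a hypothetical example (a standard SAXPY-style loop, not from any particular application); the compile lines in the comment use the Intel compiler option names given in this section.

```c
/* A vectorization-friendly kernel: unit-stride accesses, no
 * loop-carried dependences. Build a timing harness around it twice
 * and compare wall-clock time:
 *
 *   vectorized:     icc -O2 -xhost            vec_test.c -o vec_on
 *   non-vectorized: icc -O2 -no-vec -no-simd  vec_test.c -o vec_off
 *
 * A large gap between ./vec_on and ./vec_off is a good sign that the
 * kernel benefits from vectorization. */
void saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Running the same binary with 1 process and 1 thread in both builds, as described above, isolates the effect of vectorization from the effect of threading.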
Another test is to add -xhost or -x[arch] to your build at -O2 or above, along with the -vec-report3 option. Does the report generated by -vec-report3 show critical loops vectorizing? That is another indicator that the compiler is able to auto-vectorize your code. Again, this guide contains much more advice and material to help you get better vectorization of your application.
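As an illustration of what the vectorization report distinguishes, the two hypothetical loops below show the contrast (the exact report wording varies by compiler version, so treat the messages in the comments as typical examples rather than guaranteed output):

```c
/* A loop like this, with independent unit-stride iterations, is
 * typically reported as vectorized, e.g. "LOOP WAS VECTORIZED". */
void scale(float *a, float s, int n) {
    for (int i = 0; i < n; ++i)
        a[i] *= s;
}

/* A loop with a loop-carried dependence is typically reported as not
 * vectorized, e.g. "loop was not vectorized: existence of vector
 * dependence", because each iteration reads the previous result. */
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```

If your critical loops look like the second pattern, the vectorization chapter of this guide covers restructuring techniques.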
If you are new to vectorization, below are links to a quick webinar and a toolkit to help you get familiar with the concepts:
Future Proof Your Application's Performance With Vectorization webinar video, along with the resources and examples presented in the webinar available in the Vectorization Toolkit.
This chapter described the characteristics of applications that are ideally suited for running on highly parallel and highly vectorized architectures, such as the Intel® Xeon Phi™ coprocessor. Applications ideally suited to Intel® MIC Architecture are highly parallel and highly vectorizable. If your application does not have BOTH of these characteristics, it may not run optimally on this architecture. In that case, you will have additional work to do on the application before moving to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture).
It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ coprocessors. The paths provided in this guide reflect the steps necessary to get best possible application performance.