Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ X100 Product Family coprocessors

By James R.,

Published:11/12/2012   Last Updated:11/12/2012

Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ X100 Product Family coprocessors (including language extensions for offloading to Intel® Xeon Phi™ coprocessors)


The programming models in use today for multicore processors are available for many-core coprocessors as well. Therefore, explaining how to program both Intel Xeon processors and Intel Xeon Phi coprocessors is best done by explaining the options for parallel programming. This paper provides the foundation for understanding how multicore processors and many-core coprocessors are best programmed using a unified programming approach that is abstracted, intuitive and effective. This approach is simple and natural because it fits applications of today easily and yields strong results. When combined with the common base of Intel® architecture instructions utilized by Intel® multicore processors and Intel® many-core coprocessors, the result is performance for highly parallel computing with substantially less difficulty than with other less intuitive approaches.

Programs that utilize multicore processors and many-core coprocessors have a wide variety of options to meet varying needs. These options fully utilize existing widely adopted solutions, such as C, C++, Fortran, OpenMP*, MPI and Intel® Threading Building Blocks (Intel® TBB), and are rapidly driving the development of additional emerging standards such as OpenCL* as well as new open entrants such as Intel® Cilk™ Plus.


Single core processors are a shrinking minority of all the processors in the world. Multicore processors, offering parallel computing, have displaced single core processors permanently. The future of computing is parallel computing, and the future of programming is parallel programming.

The methods to utilize multicore processors have evolved in recent years, offering more and better choices for programmers than ever. Nothing exemplifies this more than the rapid rise in popularity of Intel TBB or the industry interest and support behind OpenCL.

At the same time that multicore processors and programming methods are becoming common, Intel is introducing many-core processors that will participate in this evolution without sacrificing the benefits of Intel architecture. Additional capabilities that are new with many-core processors are addressed in a natural and intuitive manner. Intel® many-core processors allow use of the same tools, programming languages, programming models, execution models, memory models and behaviors as in Intel’s multicore processors.

This paper explains the programming methods available for multicore processors and many-core processors with a focus on widely adopted solutions and emerging standards.

Parallel Programming Today

Since the goal of using Intel architecture in both multicore processors and many-core coprocessors is intuitive and common programming methods, it is important to first review where parallel programming for multicore stands today and understand where it is headed. Because of their common Intel architecture foundations, this will also precisely define the basis for parallel programming for many-core processors.


Libraries provide an important abstract parallel programming method that needs to be considered before jumping into programming. Library implementations for algorithms including BLAS, video or audio encoders and decoders, Fast Fourier Transforms (FFT), solvers and sorters, are important to consider. Libraries such as the Intel® Math Kernel Library (Intel® MKL) already offer advanced implementations of many algorithms that are highly tuned to utilize Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), multicore processors and many-core coprocessors. A program can start to get these benefits by adding a single call to a routine in Intel MKL, which supports the industry standard Fortran and C interfaces to the Linear Algebra PACKage (LAPACK). Standards combined with Intel’s pursuit of high performance make libraries an easy first choice in parallel programming.

When libraries do not solve specific programming needs, developers turn to programming languages that have been in use for many years.

None of the most popular programming languages were designed for parallel programming. This has brought about many proposals for new programming languages as well as extensions for the pre-existing languages. In the end, these experiences have led to the emergence of a number of widely deployed solutions for parallel programming using C, C++ and Fortran.

The most widely used abstractions for parallel programming are OpenMP (primarily C and Fortran), Intel Threading Building Blocks (primarily C++) and MPI (C, C++ and Fortran). These support a diverse range of processors and operating systems, making them truly versatile and reliable choices for programming.

Additionally, the native threading methods of the operating system are directly available for programmers. These interfaces, including POSIX threads (pthreads) and Windows* threads, offer a low level interface for full control but without the benefits of high level programming abstractions. These interfaces are essentially assembly language programming for parallel computing. Programmers have all but completely moved to higher levels of abstraction and abandoned assembly language programming. Similarly, avoiding direct use of threading models has been a strong trend that has accelerated with the introduction of multicore processors. This shift to program in “tasks” and not “threads” is a fundamental and critical change in programming habits that is well supported by the abstract programming models.

Most deployed parallel programming today is either done with one of the three most popular abstractions for parallelism (OpenMP, MPI or Intel TBB), or done using the raw threading interfaces of the operating system.

These standards continue to evolve and new methods are proposed. Today, the principal technical drivers of these evolutions are highly data parallel hardware and advancing compiler technology. Both of these driving forces are motivated by a strong desire to program at higher levels of abstraction so as to increase programmer productivity, leading to faster time-to-money and reduced development and maintenance costs.

Most Composable Parallel Programming Models

For reasons explained in this paper, the most composable parallel programming methods are Intel TBB and Intel® Cilk™ Plus. They offer consistent advantages for effective abstract programming that yield performance and preserve programming investments. They provide recombinant components that can be selected and assembled for effective parallel programs. Even though they can be described and studied individually, they are best thought of as a collection of capabilities that are easily utilized both individually and together. This is incredibly important since it offers composability for mixing modules including libraries. Both have self-composability, which is not the case for threading and for OpenMP. Uses of OpenMP and OpenCL have limitations on composability, but are still preferable to the use of raw threads. One of the benefits of Intel architecture multicore processors and many-core coprocessors is the strong support available for all these methods, offering the solution that best fits your current and future programming needs.


OpenMP

In 1996, the OpenMP standard was proposed as a way for compilers to assist in the utilization of parallel hardware. Now, after more than a decade, every major compiler for C, C++ and Fortran supports OpenMP. OpenMP is especially well suited for the needs of Fortran programs as well as scientific programs written in C. Intel is a member of the OpenMP work group and a leading vendor of implementations of OpenMP and supporting tools. OpenMP is applicable to both multicore and many-core programming.

OpenMP dot product example, in Fortran:

!$omp parallel do reduction(+:adotb)
   do j = 1, n
      adotb = adotb + a(j) * b(j)
   end do
!$omp end parallel do


OpenMP summation (reduction) example, in C:

#pragma omp parallel for reduction(+: s)
for (int i = 0; i < n; i++)
   s += x[i];
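
As a minimal, self-contained sketch of the same reduction pattern, the C++ function below computes a dot product. The function name `dot` and the use of `std::vector` are choices made for this illustration and are not part of the examples above. If the compiler is not run with OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Dot product using an OpenMP reduction (illustrative sketch).
// Each thread accumulates a private partial sum; the partial sums are
// combined into `sum` when the loop finishes.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Without the reduction clause, every thread would update `sum` directly and the result would depend on timing.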


In the future, the OpenMP specification will expand to standardize the emerging controls for attached computing often called “offloading” or “accelerating.” Today, Intel offers non-standard extensions to OpenMP called “Language Extensions for Offload” (LEO). The OpenMP committee is reviewing LEO as well as a set of non-OpenMP offload directives for GPUs known as OpenACC, with an eye towards convergence to serve both Intel Xeon Phi coprocessors and GPUs.

Intel TBB

Intel introduced Intel® TBB in 2006 and the open source project for Intel TBB was started in 2007. By 2009, it had grown in popularity to exceed that of OpenMP in terms of number of developers using it (per research from Evans Data Corp., supported by subsequent research reports as well). Intel TBB is especially well suited for the needs of C++ programmers, and since OpenMP is designed to address the needs of C and Fortran developers there is virtually no competition between Intel TBB and OpenMP. It is worth noting that for C++ programmers, using OpenMP and Intel TBB in the same program is possible as well.

Parallel function invocation example, in C++, using Intel TBB:

parallel_for (0, n,
   [=](int i) {
      Foo(a[i]);   // Foo and a are placeholders for the per-element work
   });


The emergence of Intel TBB, which does not directly require nor leverage compiler technology, emphasized the value of programming to tasks and led the way for wide acceptance of using task-stealing systems. Compiler technology continues to evolve to help address parallel programming and led to the creation of the Intel Cilk™ Plus project. Increased use of compiler technology is better able to unlock the full potential of parallelism. Intel remains a leading participant and contributor in the Intel TBB open source project as well as a leading supplier of Intel TBB support and supporting tools. Intel TBB is applicable to multicore and many-core programming.


MPI

For programmers utilizing a cluster, in which processors are connected by the ability to pass messages but not always the ability to share memory, the Message Passing Interface (MPI) is the most common programming method. In a cluster, communication continues to use MPI, as it does today, regardless of whether a node has many-core processors or not.

Today’s MPI based programs move easily to Intel Xeon Phi coprocessor based systems because the Intel coprocessors support ranks that can talk to other coprocessor ranks and multicore (e.g., Intel Xeon® processors) ranks. An Intel Xeon Phi coprocessor, like a multicore processor, may create as many ranks as the programmer desires. Such ranks communicate with other ranks regardless of whether they are on multicore or many-core processors.

Because Intel Xeon Phi coprocessors are general-purpose, MPI jobs run on the coprocessors. This is very powerful because no algorithmic recoding or refactoring is required to get working results from an existing MPI program.  The general capabilities of the coprocessors combined with the power of MPI support on the Intel Xeon Phi coprocessors produce immediate results in a manner that is intuitive for MPI programmers.

The widely used Intel® MPI library offers both high performance and support for virtually all interconnects. The Intel MPI library supports both multicore and many-core systems, creating ranks on multicore processors and many-core coprocessors in a fashion that is familiar and consistent with MPI programming today.

MPI, on Intel Xeon Phi coprocessors, composes with other thread models (e.g., OpenMP, Intel TBB, Intel® Cilk™ Plus) as has become common on multicore processor based systems.

Intel is a leading vendor of MPI implementations and tools. MPI is applicable to multicore and many-core programming.

Parallel Programming Emerging Standards

For data parallel hardware, the emergence of support for certain extensions to C and C++ offers important options for developers and addresses programmer productivity.

Intel® Cilk™ Plus

Intel introduced Intel Cilk Plus in late 2010. Built on research from M.I.T. and product experiences by industry leader Cilk Arts, Intel implemented support for task stealing in compilers for Linux* and Windows. Intel has published full specifications for Intel Cilk Plus to help enable other implementations as well as optional usage of Intel runtime or construction of interchangeable runtimes via API compliance. Intel is actively working with other compilers to offer support in the future for more compilers. Intel is proud to be the leading supporter in industry of Intel Cilk Plus with products and tools.

Intel Cilk Plus provides three new keywords, special support for reduction operations, and data parallel extensions. The keyword cilk_spawn can be applied to a function call, as in x = cilk_spawn fib(n-1), to indicate that the function fib can execute concurrently with the subsequent code. The keyword cilk_sync indicates that execution has to wait until all spawns from the function have returned. The use of the function as a unit of spawn makes the code readable, relies on the baseline C/C++ language to define scoping rules of variables, and allows Intel Cilk Plus programs to be composable.

Parallel spawn in a recursive fibonacci computation, in C, using Intel Cilk Plus:

int fib (int n) {
   if (n < 2) return 1;
   else {
      int x, y;
      x = cilk_spawn fib(n-1);
      y = fib(n-2);
      cilk_sync;   /* wait for the spawned call before reading x */
      return x + y;
   }
}
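
For readers without a compiler that supports Intel Cilk Plus, roughly the same spawn-and-wait structure can be sketched with OpenMP tasks in C++. This is an analogue for illustration only, not the Intel Cilk Plus runtime: `#pragma omp task` plays the role of cilk_spawn and `#pragma omp taskwait` the role of cilk_sync. The wrapper name `parallel_fib` is hypothetical. When OpenMP is not enabled the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>

// Recursive computation with the same base case as the example above
// (fib(0) == fib(1) == 1).
int fib(int n) {
    if (n < 2) return 1;
    int x = 0;
    int y = 0;
    #pragma omp task shared(x)   // may run concurrently with the code below
    x = fib(n - 1);
    y = fib(n - 2);
    #pragma omp taskwait         // wait for the task before reading x
    return x + y;
}

// Hypothetical wrapper: creates the thread team once and lets a single
// thread start the recursion; the other threads pick up spawned tasks.
int parallel_fib(int n) {
    assert(n >= 0);
    int result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}
```

Spawning a task at every level is done here only to mirror the structure of the Cilk example; a practical version would switch to a serial call below some problem size.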


Cilk offers exceptionally intuitive and effective compiler support for C and C++ programmers. Cilk is very easy to learn and poised to be widely adopted. A regular “for” loop, without inter-loop dependencies, can be transformed into a parallel loop by simply changing the keyword “for” into “cilk_for.” This indicates to the compiler that there is no ordering among the iterations of the loop.

Parallel function invocation, in C, using Intel Cilk Plus:

cilk_for (int i=0; i<n; ++i)
   Foo(a[i]);   // Foo and a are placeholders for the per-element work


Cilk programmers still utilize Intel TBB for certain algorithms or features where new compiler keywords or optimizations are not needed, such as the thread aware memory allocator or a sort routine. Intel Cilk Plus is applicable to multicore and many-core programming.

C/C++ data parallel extensions

Debate about how to extend C (and C++) to directly offer data parallel extensions is on-going. Implementations, experiences and adoption are important steps toward standardization. Intel has implemented extensions for fundamental data parallelism as part of Intel Cilk Plus for Linux, Windows and Mac* OS X systems. Intel is actively working with other compilers to offer support in the future. An intuitive syntactic extension, similar to the array operations of Fortran 90, is provided as a key element of Intel Cilk Plus and allows simple operations on arrays. The C/C++ languages do not provide a way to express operations on arrays. A programmer has to write a loop and express the operation in terms of elements of the arrays, creating unnecessary explicit serial ordering. A better opportunity exists to write a[:] = b[:] + c[:]; to indicate the per element additions but without specifying unnecessary serial ordering. These simplified semantics free up a compiler to always generate vector code instead of generating non-optimal scalar code.
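
Array notation requires a compiler that implements Intel Cilk Plus. As a rough standard C++ analogue (an assumption made for illustration, not the Cilk Plus syntax), `std::valarray` also expresses a per element addition without a hand-written loop or an explicit iteration order. The function name `add_arrays` is hypothetical.

```cpp
#include <cassert>
#include <valarray>

// Per element addition expressed on whole arrays, in the spirit of
// a[:] = b[:] + c[:]. No loop and no element ordering is written out.
std::valarray<float> add_arrays(const std::valarray<float>& b,
                                const std::valarray<float>& c) {
    assert(b.size() == c.size());   // operator+ requires equal sizes
    std::valarray<float> a = b + c;
    return a;
}
```

Whether the compiler vectorizes this depends on the implementation; the point is only that the source no longer imposes a serial order on the element operations.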

An additional method to avoid unintended serialization allows a programmer to write a scalar function in standard C/C++ and declare it as a "SIMD enabled function" (occasionally this has previously been called by the less descriptive name “elemental function”). This will trigger the compiler to generate a short vector version of that function, which instead of operating on a single set of arguments to the function, will operate on short vectors of arguments by utilizing the vector registers and vector instructions. In common cases, where the control flow within the function does not depend on data values, the execution of the short vector function can yield a vector of results in roughly the same time it would take the regular function to produce a single result.

SIMD enabled function, in C++, using Intel Cilk Plus:

__declspec (vector) void saxpy(float a, float x, float &y)
{
   y += a * x;
}
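
A closely related facility exists in OpenMP 4.0 and later as `#pragma omp declare simd`. The sketch below uses that directive, rather than the `__declspec (vector)` spelling, so that it compiles with any C++ compiler (the pragmas are ignored when OpenMP is not enabled). The names `axpy` and `saxpy_all` are chosen for this illustration.

```cpp
#include <cassert>
#include <vector>

// Scalar function. The directive asks an OpenMP-aware compiler to also
// generate a short vector version that processes several argument sets
// per call.
#pragma omp declare simd
float axpy(float a, float x, float y) {
    return a * x + y;
}

// Caller: the simd loop gives the compiler permission to use the vector
// version of axpy across iterations.
void saxpy_all(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    #pragma omp simd
    for (long i = 0; i < n; ++i) {
        y[i] = axpy(a, x[i], y[i]);
    }
}
```

The function body has no data-dependent control flow, which is the common case described above where the vector version costs about the same as one scalar call.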


Intel is supporting these syntactic extensions for C and C++ with products and tools as well as discussions with other compiler vendors for wider support. C, C++ and data parallel extensions are applicable to multicore and many-core programming.


OpenCL

OpenCL was first proposed by Apple* and then moved to an industry standards body of which Intel is a participant and supporter. OpenCL offers a “close to the hardware” interface, offering some important abstraction and substantial control coupled with wide industry interest and commitment. OpenCL may require the most refactoring of any of the solutions covered in this whitepaper, specifically refactoring based on advanced knowledge of the underlying hardware. Results from refactoring work may be significant for multicore and many-core performance, and the resulting performance may or may not be possible without such refactoring. A goal of OpenCL is to make an investment in refactoring productive when it is undertaken. Solutions other than OpenCL may offer alternatives that avoid the need for such refactoring.

Simple per element multiplication using OpenCL:

kernel void
   dotprod( global const float *a,
            global const float *b,
            global float *c)
{
      int myid = get_global_id(0);   /* index of this work-item */
      c[myid] = a[myid] * b[myid];
}


Intel is a leading participant in the OpenCL standard efforts, and a vendor of solutions and related tools with early implementations available today. OpenCL is applicable to multicore, many-core and GPU programming although the code within an OpenCL program is usually separate or duplicated for each target. Intel currently ships OpenCL support for both Intel multi-core processors (using Intel SSE and Intel AVX instructions) and Intel® HD Graphics (integrated graphics available as part of many Third Generation Intel® Core™ processors).

Composability Using Multiple Models

Composability is an important concept. With multiple programming options to fit differing needs, it is essential that these methods not be mutually exclusive. The abstract programming methods discussed above can be mixed in a single application. By offering newer programming models that support composable programming, programmers are freed from subtle and unproductive limitations on the mixing and matching of programming methods in a single application.

The most composable methods are Intel TBB and Intel Cilk Plus (including the C/C++ data parallel extensions). Use of OpenMP and OpenCL has limitations on composability, but is still preferable to the use of raw threads.

Intel TBB and Intel Cilk Plus provide recombinant components that can be selected and assembled for effective parallel programs. This is incredibly important since it offers composability for mixing modules including libraries. Both have self-composability, which is not the case for threading and for OpenMP or OpenCL.

Harnessing Many-core

Combining the power of both multicore and many-core, and utilizing them together, offers enormous possibilities.

Intel Xeon Phi coprocessors are designed to offer power efficient processing for highly parallel work while remaining highly programmable. Platforms containing both multicore processors and Intel Xeon Phi coprocessors can be referred to as heterogeneous platforms. Such a heterogeneous platform offers the best of both worlds, multicore processors that are very flexible to handle general-purpose serial and parallel workloads as well as more specialized many-core processing capabilities for highly parallel workloads. A heterogeneous platform can be programmed as such and utilize a programming model to manage copying of data and transfer of control.

Applications are still built with a single source base. The versatility of  Intel architecture multicore processors and many-core coprocessors allows for programming that is both intuitive and effective.

Explicit vs. Implicit use of Many-core

Many-core processors may be used implicitly through the use of libraries, like Intel MKL, that are provisioned with code to detect and utilize many-core processors when present. Explicit controls for Intel libraries are available to the developer, but the simple approach of relying on a library to decide if and when to use the attached many-core coprocessors can be quite effective.

Additional programming opportunities are possible by explicit directions from the programmer in the source code. Writing an application to explicitly utilize many-core is done by writing a heterogeneous program. This program would consist of writing a parallel application and splitting the work between the multicore processors and many-core coprocessors.

Even with explicit control, Intel has designed the extensions to be flexible enough to work if no many-core processors are present and to also be ready for a converged future. These two benefits are incredibly important. First, a single source program can provide direction to offload to an Intel Xeon Phi coprocessor. However, at runtime, if the coprocessor is not present on the system being utilized, the use of Intel architecture on both the multicore processors and many-core coprocessors means that the code available for offloading to a coprocessor can be executed seamlessly on either type of processor.


The reality of today’s hardware is that a heterogeneous platform contains multiple distinct memory spaces, one (or more in a cluster) for the multicore processors and one for each many-core processor. The connection between multicore processors and many-core coprocessors can be a bottleneck that needs some consideration.

There are two approaches to utilizing such a heterogeneous platform. One approach treats the memory spaces as wholly distinct, and uses offload directives to move control and data to and from the multicore processors. Another approach simplifies data concerns by utilizing a software illusion of shared memory called MYO to allow sharing between multicore processors and many-core coprocessors that reside in a single system or a single node on a cluster. MYO is an acronym for “Mine Yours Ours” and refers to the software abstraction that shares memory within a system for determining current access and ownership privileges.

The first approach exposes completely that the multicore processors and many-core processors do not share memory. Compiler support of directives for this execution model is able to free the programmer from specifying the low level details of the system, while exposing the fundamental property that the target is heterogeneous and leaving the programmer to devote their time to solving harder problems.

Simple offload, in Fortran:

!dir$ offload target(MIC1)
!$omp parallel do
      do i=1,10
         A(i) = B(i) * C(i)
      end do
!$omp end parallel do


The compiler provides a pragma for offload (#pragma offload) that a programmer can use to indicate that the subsequent language construct may execute on the Intel Xeon Phi coprocessor. The pragma also offers clauses that allow the programmer to specify data items that would need to be copied between processor and coprocessor memories before the offloaded code executes. The clauses also allow the developer to specify data that should be copied back to multicore processor memory afterwards. The offload pragma is available for C, C++ and Fortran.

Simple offload, in C, with data transfer:

float *a, *b; float *c;
#pragma offload target(MIC1) \
   in(a, b : length(s)) \
   out(c : length(s) alloc_if(0))
for (i=0; i<s; i++) {
   c[i] = a[i] + b[i];
}


An alternate approach is a run time user mode library called MYO. MYO allows synchronization of data between the multicore processors and an Intel Xeon Phi coprocessor, and with compiler support enabling allocation of data at the same virtual addresses. The implication is that data pointers can be shared between the multicore and many-core memory spaces. Copying of pointer based data structures such as trees, linked lists, etc. is supported fully without the need for data marshaling. To use the MYO capability, the programmer will mark data items that are meant to be visible from both sides with the _Cilk_shared keyword, and use offloading to invoke work on the Intel Xeon Phi coprocessor. The statement x = _Offload func(y); means that the function func() is executed on the Intel Xeon Phi coprocessor, and the return value is assigned to the variable x. The function may read and modify shared data as part of its execution, and the modified values will be visible for code executing on all processors.

The offload approach is very explicit and fits some programs quite well. The MYO approach is more implicit in nature and has several advantages. MYO allows C++ classes to be copied without marshaling, which is not supported using offload pragmas. Importantly, MYO does not copy all shared variables upon synchronization. Instead, it only copies the values that have changed between two synchronization points.
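
The idea of copying only what changed between synchronization points can be illustrated with a small C++ sketch. This is a hypothetical model written for this explanation: it is not the MYO API and does not reflect how MYO tracks changes internally. The class `TrackedArray` and its members are invented names, and the "remote" copy simply stands in for the memory on the other side of the connection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; not the MYO interface.
class TrackedArray {
public:
    explicit TrackedArray(std::size_t n)
        : local_(n, 0), remote_(n, 0), dirty_(n, false) {}

    // Writes go to the local copy and mark the element as changed.
    void write(std::size_t i, int value) {
        assert(i < local_.size());
        local_[i] = value;
        dirty_[i] = true;
    }

    // At a synchronization point, transfer only the elements written
    // since the previous one. Returns how many elements were copied.
    std::size_t synchronize() {
        std::size_t copied = 0;
        for (std::size_t i = 0; i < local_.size(); ++i) {
            if (dirty_[i]) {
                remote_[i] = local_[i];
                dirty_[i] = false;
                ++copied;
            }
        }
        return copied;
    }

    // Value as seen from the other side after the last synchronization.
    int remote(std::size_t i) const {
        assert(i < remote_.size());
        return remote_[i];
    }

private:
    std::vector<int> local_;
    std::vector<int> remote_;
    std::vector<bool> dirty_;
};
```

For example, after two writes to an eight-element array, the first call to `synchronize()` copies two elements and a second call with no intervening writes copies none.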

These offload programming methods, while designed to allow control to direct work to many-core processors, are applicable to multicore and many-core programming so as to allow code to be highly portable and long lasting even as systems evolve. Source code will not need to differ for systems with and without many-core processors.

Additional Offload Capabilities

Both the keyword and pragma mechanisms perform the copying of the data triggered by the invocation of work on the Intel Xeon Phi coprocessor. Future directive options will allow initiation of data copying ahead of invoking computation in order to be able to schedule other work while data is being copied.

Since systems may be configured with multicore processors, and more than one Intel Xeon Phi coprocessor per node, additional language support will allow the programmer to choose between forcing offloading or allowing a run time determination to be made. This option offers the potential for more dynamic and optimal decisions depending on the environment.


By utilizing Intel architecture instructions on multicore and many-core, programming tools and models are best able to serve both. With insights and use of the right models, a single source base can be constructed that is well equipped to utilize multicore processor systems, heterogeneous systems and future converged systems in an intuitive and effective manner.  This can be accomplished with a single source base in familiar and current programming languages.

Tried and true solutions, including C, C++, Fortran, OpenMP, MPI and Intel TBB apply to these  Intel architecture multicore and many-core systems.

Emerging efforts including Intel Cilk Plus, offload extensions and OpenCL are strongly supported by Intel, and are poised for broader adoption and support in the future.

The path to standardization starts with strong products and published specifications, progresses to users (customers) and support by additional vendors. Viable standards will follow. With OpenCL, Intel Cilk Plus, and offload extensions, the product support and specifications exist and customer usage is well under way. It is reasonable to expect that wider support and standards refined based on user experiences will follow.


By utilizing Intel architecture and industry standard programming tools, multicore processors and many-core coprocessors offer parallel programming methods that can be applied across both. These methods can employ a single source code base using familiar tools, programming languages and previous source code investments. Current and emerging solutions allow applications to grow into a single code section that best utilizes multicore processors and many-core coprocessors together. The methods available to utilize multicore and many-core parallelism offer performance while preserving investments and offering intuitive programming methods.

Standards play an important role in programming methods. Intel has invested heavily to support and implement standard programming models and methods. In addition, Intel has been a leader in the evolution of standards to solve new challenges.

When programming for Intel Xeon Phi coprocessors, applications can get the power of the Intel Xeon Phi coprocessors in a maintainable and performant application that is highly portable and scales to future architectures while fully supporting multicore systems with the same code.

Clusters of multicore processors and many-core coprocessors, organized in nodes, will be able to take advantage of this very rich set of tools and programming models available for Intel architecture in an intuitive, maintainable and effective manner.

The ability to utilize existing developer tools and standards while offering performance and flexibility puts Intel multi-core and Intel many-core solutions in a class of their own.

About the Author

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including systolic array systems WARP and iWarp, the world's first TeraFLOP/sec supercomputer (ASCI Red), and the world’s first TeraFLOP/sec single-chip computing device known at the time as Knights Corner and now as the first Intel® Xeon Phi™ Coprocessor, as well as compilers and architecture work for multiple Intel® processors and parallel systems. James has been a leader in the emergence of Intel as a major provider of software development products, and serves as their chief software evangelist. James is the author of “Intel Threading Building Blocks” from O'Reilly Media. It has been translated to Japanese, Chinese and Korean. James is coauthor of “Structured Parallel Programming,” ©2012, from Morgan Kaufmann Publishing and "Intel® Xeon Phi™ Coprocessor High Performance Programming," ©2013, from Morgan Kaufmann Publishing. James has published numerous articles and contributed to several books. James received his B.S.E. in Electrical and Computing Engineering and M.S.E. in Computer Engineering from the University of Michigan.
