Porting Applications from Knights Corner to Knights Landing

This document discusses the changes that developers may need to make when porting an application already built for the Intel® Xeon Phi™ x100 coprocessor (Knights Corner, or KNC) to the Intel® Xeon Phi™ x200 processor (Knights Landing, or KNL). It covers the basic changes the developer will have to make and does not focus on optimization. It is assumed that developers are already familiar with KNC.

To take advantage of KNL, and the Intel® Many Integrated Core Architecture in general, it is very important to use effectively the large number of available threads, the vector units, and the memory bandwidth of an Intel® Xeon Phi™ product. Cluster applications also need to scale across the fabric. If an application is tuned for KNC, it is likely to deliver compelling performance on KNL too.

The KNL processor will be the first release of the Intel® Xeon Phi™ x200 product family. Although there are many similarities between the first and second generations of Intel® Xeon Phi™ products, there are significant differences too. This document highlights those differences.

Below is a summary of the hardware and software differences between the Intel® Xeon Phi™ Coprocessor x100 product and the Intel® Xeon Phi™ Processor x200 product:

                                   Intel® Xeon Phi™ Coprocessor x100    Intel® Xeon Phi™ Processor x200
Code Name                          Knights Corner (KNC)                 Knights Landing (KNL)
Process Technology                 22 nanometer                         14 nanometer
Number of Cores                    61 in-order Pentium® cores           72 out-of-order Atom™ cores
Processor/Coprocessor              Coprocessor                          Processor
Frequency                          1.2 GHz                              1.2+ GHz
Hardware Threads/Core              4                                    4
On-Package Memory                  Not available                        16+ GB high-bandwidth MCDRAM
Regular Memory                     16 GB GDDR5                          384 GB DDR4
512-bit SIMD Vector Registers      32                                   32
New ISA                            Intel® IMCI instruction set          Intel® AVX-512
Binary                             Unique                               Compatible with legacy Intel® Xeon® processors
Optional Integrated Fabric         No                                   Yes (KNL-F)
Intel® Manycore Platform
Software Stack (Intel® MPSS)       Yes                                  No

Running a workload on the KNL processor is like running it on an Intel® Xeon® host itself, not on a host-assisted accelerator card. If an application is already tuned for the Intel® Xeon Phi™ coprocessor, it is very likely to run well on KNL with some minor changes.

To use KNC, we need to install Intel® MPSS on the Xeon® host to communicate with the coprocessor. For the KNL processor, we can install RHEL*, SuSE*, or Windows* directly on the processor, and there is no need for MPSS. All standard Intel tools are supported on KNL, including the Intel® C/C++ and Fortran compilers, the Intel® MPI Library, Intel® MKL, Intel® Threading Building Blocks, Intel® VTune™ Amplifier, the new Intel® Advisor XE vectorization tool, etc.

1. Intel® Software Developer Emulator (Intel® SDE)

At the time this document was written, KNL hardware was not yet released. However, developers can use the Intel® Software Developer Emulator (Intel® SDE) to emulate how an application will run on future KNL hardware. The current SDE (version 7.15) supports the Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which will be available in KNL. Specifically, KNL supports the Intel® AVX-512 Foundation instructions, Intel® AVX-512 Conflict Detection Instructions (CDI), Intel® AVX-512 Exponential and Reciprocal Instructions (ERI), and Intel® AVX-512 Prefetch Instructions (PFI).

To install the SDE, developers can download the kit at https://software.intel.com/en-us/articles/intel-software-development-emulator. Download the SDE for Linux*, sde-external-7.15.0-2015-01-11.lin.tar.bz2, move it to the installation directory, and unpack the package:

# tar xvjf sde-external-xxx-lin.tar.bz2

 

A new sub-directory ./sde-external-xxx-lin is created. Set the path to the SDE:

# export PATH=<path_to_kit>/sde-external-xxx-lin:$PATH

 

The SDE includes a mix histogram tool, which generates an instruction mix histogram. To emulate the KNL platform, use the option “-knl”. To generate the instruction mix histogram by instruction form, use the options “-mix -iform”. By default, the top 20 basic blocks are printed in the output file. To output the top n basic blocks instead, use “-top_blocks n”.

For example, to run a binary program called application on the emulated KNL platform and report the top 50 basic blocks with their instruction forms:

# sde -knl -mix -iform 1 -top_blocks 50 -- ./application

 

This generates the instruction mix histogram report called sde-mix-out.txt.

For information on how to read the instruction mix histogram report, please refer to this white paper: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-code-named-knights-landing-application-readiness.

2. Porting Applications from KNC to KNL

This section provides suggestions for developers who are porting their KNC coprocessor applications to the KNL processor. Depending on the application, developers can refer to whichever of the following topics apply.

Using Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

While KNC supports the Intel® Initial Many Core Instructions (Intel® IMCI) instruction set, the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) are first implemented in KNL. AVX-512 is the successor of AVX2, and it extends the AVX SIMD instructions to 512-bit wide vectors. AVX-512 is compatible with older instruction sets such as SSE, AVX, and AVX2.

Developers can find more information about AVX-512 in the Intel® Architecture Instruction Set Extensions Programming Reference and the Intel® Intrinsics Guide.

Although SIMD instructions may be invoked via inline assembly code or compiler intrinsics, such code is platform dependent and unlikely to work across different platforms. A more portable way of using SIMD effectively is to use Intel® Cilk™ Plus and the compiler vectorization pragmas, as in the sketch below.
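
For example, in the following minimal sketch (the function name saxpy is illustrative), the same source can be vectorized for KNC, KNL, or an Intel® Xeon® processor by changing only the compile flag:

/* saxpy.c: one portable source; compile with -mmic for KNC or
   -xMIC-AVX512 for KNL. The Intel Cilk Plus vectorization pragma
   asks the compiler to vectorize the loop instead of hand-coding
   platform-specific intrinsics. */
void saxpy(int n, float a, float *x, float *y)
{
   int i;
   #pragma simd                     /* request vectorization of this loop */
   for (i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];       /* compiler emits 512-bit vector code */
}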

Some of the functionality provided by KNC intrinsics is also available in KNL; for example, swizzle and permute exist in both KNC and KNL. However, some KNC intrinsics functionality is not supported in KNL. Therefore, if your application uses KNC intrinsics, refer to the documents above to check whether those intrinsics are available for KNL.

Note that 64-byte data alignment is preferred for vector instructions on both KNC and KNL. Some Fused Multiply-Add (FMA) instructions are supported on both KNC and KNL (e.g., _mm512_fmadd_ps(v1, v2, v3) is available on both architectures).
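
As an illustration, here is a minimal sketch (the file name fma_sample.c is hypothetical) that applies _mm512_fmadd_ps to 64-byte-aligned data; it should build with -mmic for KNC or -xMIC-AVX512 for KNL:

#include <stdio.h>
#include <immintrin.h>

int main()
{
   __attribute__((aligned(64))) float v1[16], v2[16], v3[16], out[16];
   int i;

   for (i = 0; i < 16; i++) { v1[i] = (float)i; v2[i] = 2.0f; v3[i] = 1.0f; }

   __m512 a = _mm512_load_ps(v1);
   __m512 b = _mm512_load_ps(v2);
   __m512 c = _mm512_load_ps(v3);

   /* out[i] = v1[i]*v2[i] + v3[i], one fused multiply-add for 16 floats */
   __m512 r = _mm512_fmadd_ps(a, b, c);
   _mm512_store_ps(out, r);

   for (i = 0; i < 16; i++)
      printf("%5.1f ", out[i]);
   printf("\n");
   return 0;
}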

Compiling Code Using AVX-512 Intrinsics

Starting with Intel® Compiler 14.0, KNL code generation is supported. To instruct the compiler to target KNL features, including the AVX-512 instruction set and related optimizations, compile the code with the flag “-xMIC-AVX512” in Intel® Parallel Studio XE. For help on code generation, type the following at the command prompt:

# icc -help codegen

Among these options, scroll to MIC-AVX512 for information:

-x<code>  generate specialized code to run exclusively on processors
          indicated by <code> as described below

            MIC-AVX512
                    May generate Intel(R) Advanced Vector Extensions 512
                    (Intel(R) AVX-512) Foundation instructions, Intel(R)
                    AVX-512 Conflict Detection instructions, Intel(R) AVX-512
                    Exponential and Reciprocal instructions, Intel(R) AVX-512
                    Prefetch instructions for Intel(R) processors, and the
                    instructions enabled with CORE-AVX2. Optimizes for Intel(R)
                    processors that support Intel(R) AVX-512 instructions.

By default, Intel® compilers use option -O2, which enables auto-vectorization.
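
For instance, a simple loop like the following (a sketch; the names are illustrative) is auto-vectorized at -O2, and compiling with -xMIC-AVX512 lets the compiler use the 512-bit registers:

/* autovec.c: at -O2 the Intel compiler auto-vectorizes this loop;
   add -qopt-report to see the vectorization report in the .optrpt file */
#define N 4096
float a[N], b[N], c[N];

void add_arrays(void)
{
   int i;
   for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
}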

For an application previously compiled for an Intel® Xeon® processor, the binary should work on the KNL processor with some rare exceptions. We can take advantage of AVX-512 by recompiling the application as shown below:

First, source the environment variables as usual:

# source <path_to_install_dir>/compilervars.sh intel64

Recompile the application with option “-xMIC-AVX512” and generate the binary called application.knl:

# icc <application.c> -xMIC-AVX512 -o application.knl

To get the optimization report, compile with the “-qopt-report” flag:

# icc <application.c> -xMIC-AVX512 -qopt-report -o application.knl
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location

Appendix A shows a simple program using different types of shuffle, permute, and multiply intrinsics compiled with AVX-512. Using the SDE, we can emulate this application on KNL:

# icc -xMIC-AVX512 shuffle_sample.c -o shuffle_sample.knl
# sde -knl -- ./shuffle_sample.knl

Vector input1:
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15

Vector output resulting from shuffle data for pattern 'AAAA':
  0   0   0   0   4   4   4   4   8   8   8   8  12  12  12  12

Vector output resulting from shuffle data for pattern 'ABCD':
  3   2   1   0   7   6   5   4  11  10   9   8  15  14  13  12

Vector output resulting from permute data for pattern 'ABCD':
 12  13  14  15   8   9  10  11   4   5   6   7   0   1   2   3

Vector input2:
  0   1   2   3   0   0   0   0   1   1   1   1   2   2   2   2

Vector output resulting from multiplying input1 and input2:
  0   1   4   9   0   0   0   0   8   9  10  11  24  26  28  30

 

Offload Model is Not Needed for the KNL Processor

The KNL processor can be booted as a host processor: it supports standard operating systems such as SuSE* Linux, RHEL*, and Windows*. As such, there is no need to offload. If an application running on a Xeon® host offloads to KNC, then developers can disable the offload by recompiling the application with “-qoffload=none” to build a non-offload version (assuming that the application also performs the computation on the host):

# icc -xMIC-AVX512 -qoffload=none application.c

Note that #pragma offload is not used in a program written for the KNL processor, as the sketch below illustrates.
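
The contrast is sketched below (illustrative code, not taken from a real application): on KNC the loop would carry an offload pragma, while on KNL the same loop is plain host code.

#include <stdio.h>

#define N 1024

static int a[N], b[N];

int main(void)
{
   int i;

   for (i = 0; i < N; i++)
      a[i] = i;

   /* KNC: this loop would be preceded by
        #pragma offload target(mic) in(a) out(b)
      On the KNL processor the pragma is simply removed and the loop
      runs directly on the processor. */
   for (i = 0; i < N; i++)
      b[i] = 2 * a[i];

   printf("b[%d] = %d\n", N - 1, b[N - 1]);
   return 0;
}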

If a KNC application is compiled natively to run on KNC (using the flag “-mmic”), then it can be recompiled to run on the KNL processor without the flag “-mmic”:

# icc -xMIC-AVX512 application.c

To run a native KNC application, the micnativeloadex tool can be used. This tool detects all dependent libraries, transfers the native application and its dependencies to the coprocessor, and runs it there. To run an application on the KNL processor, we just launch the application directly on the processor; the micnativeloadex tool is no longer necessary.
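
For example, with illustrative binary names, the KNC invocation

# micnativeloadex ./application.mic

on KNL simply becomes

# ./application.knl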

Appendix B shows a program using the offload model on KNC. We need to recompile it without the offload option for the KNL processor:

# icc -xMIC-AVX512 -qoffload=none sample.c -o sample.knl

Note that the option “-no-offload” is deprecated and will be removed in a future release of the Intel compiler; use “-qno-offload” instead. Finally, use the SDE to emulate a KNL platform:

# sde -knl -mix -top_blocks 100 -iform 1 -- ./sample.knl

Number of Target devices installed: 0
Offload section is executed on Host (fallback mode)
Elements of array are set to 1, 1,..................., 1, 1

 

Prefetching

KNL has better hardware prefetching than KNC, and therefore less need for software prefetching (i.e., prefetching intrinsics inserted manually in the code, as in the sketch below). You may refer to the following document for information on different optimization techniques, including data prefetching: Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors.
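
For reference, manual software prefetching typically looks like the sketch below (the 16-iteration prefetch distance is illustrative); on KNL such hints can often be removed and the hardware prefetcher, or the compiler’s -qopt-prefetch option, relied on instead.

#include <xmmintrin.h>   /* _mm_prefetch */

void scale(int n, float *x, float s)
{
   int i;
   for (i = 0; i < n; i++)
   {
      /* manual hint: fetch the data 16 iterations ahead into L1 */
      _mm_prefetch((const char *)&x[i + 16], _MM_HINT_T0);
      x[i] *= s;
   }
}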

Memory Alignment

Similar to KNC data alignment, KNL data alignment is on 64-byte boundaries. The example below shows how to align a statically allocated array to 64 bytes:

float array[n] __attribute__((aligned(64)));

Or how to allocate 64-byte-aligned data on the heap:

float *array = (float *) _mm_malloc(n * sizeof(float), 64);
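
Putting the two together, here is a minimal sketch (assuming the Intel compiler, which provides _mm_malloc and __assume_aligned):

#include <stdio.h>
#include <immintrin.h>   /* _mm_malloc, _mm_free */

int main(void)
{
   int i, n = 1024;
   float *array = (float *) _mm_malloc(n * sizeof(float), 64);   /* 64-byte-aligned heap block */

   __assume_aligned(array, 64);   /* promise the alignment to the vectorizer */
   for (i = 0; i < n; i++)
      array[i] = 2.0f * i;

   printf("array[1] = %f\n", array[1]);
   _mm_free(array);
   return 0;
}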

Elemental Functions

Similar to KNC, elemental functions are supported on KNL, as in the sketch below.
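
A minimal sketch of an elemental (SIMD-enabled) function using the Intel Cilk Plus vector attribute (the function names are illustrative):

/* the compiler generates a vector variant of vmul that a vectorized
   loop can call once per SIMD lane */
__attribute__((vector)) float vmul(float a, float b)
{
   return a * b;
}

void apply(int n, float *a, float *b, float *c)
{
   int i;
   #pragma simd
   for (i = 0; i < n; i++)
      c[i] = vmul(a[i], b[i]);
}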

Virtual Shared Memory (MYO)

In KNC, the Virtual Shared Memory model, also called Mine Yours Ours (MYO), is used to share virtual memory between the Xeon® host and coprocessors. This method is useful for sharing complex objects such as C++ classes. In the KNL processor, however, this Virtual Shared Memory model is not needed. Therefore, the following keyword extensions are not applicable in a KNL processor environment: _Cilk_shared, _Cilk_offload, _Cilk_offload_to, _Offload_shared_malloc, _Offload_shared_aligned_malloc, etc.

COI and SCIF

COI and SCIF don’t exist for the KNL processor. However, COI and SCIF will be available on the KNL coprocessor, as they are on KNC today. On the KNL-F processor (i.e., the version with integrated fabric), COI and SCIF will allow communication over the fabric with other nodes on the network.

Intel® Math Kernel Library (Intel® MKL)

Intel® MKL is a computational math library that optimizes math function performance on Intel® Architecture (IA) platforms. There are three modes of operation with KNC: Automatic Offload (AO), Compiler Assisted Offload (CAO), and Native Execution. For the KNL processor, developers don’t need to set environment variables such as MKL_MIC_ENABLE, MIC_ENV_PREFIX, or OFFLOAD_REPORT.

MKL treats the KNL processor the same way it treats a Xeon® processor. To use MKL, the command line argument -mkl is still needed and the MKL header files must be included; the appropriate MKL code path is then dispatched automatically.

MKL 11.2 and its updates include pre-silicon KNL optimizations with limited support. These optimizations are dispatched only if the mkl_enable_instructions(AVX512_MIC) call is made. This restriction will be removed in MKL 11.3, which will be aligned with silicon availability.
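
For illustration, here is a small sketch (the file name dgemm_sample.c is hypothetical) that calls MKL’s cblas_dgemm; the same source builds for a Xeon® processor or for KNL:

#include <stdio.h>
#include <mkl.h>

int main(void)
{
   double A[16], B[16], C[16];
   int i, n = 4;

   for (i = 0; i < 16; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

   /* C = 1.0*A*B + 0.0*C; MKL dispatches the best code path for the CPU */
   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
               n, n, n, 1.0, A, n, B, n, 0.0, C, n);

   printf("C[0] = %.1f\n", C[0]);   /* 4 * (1.0*2.0) = 8.0 */
   return 0;
}

The sketch can be built and emulated like the earlier samples:

# icc -xMIC-AVX512 -mkl dgemm_sample.c -o dgemm_sample.knl
# sde -knl -- ./dgemm_sample.knl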

Task Parallelism in Shared Memory

Similar to KNC, OpenMP* and Intel Cilk Plus will be supported by KNL for task parallelism.

To use OpenMP pragmas, the command line argument -openmp (or the newer spelling -qopenmp) is needed and the header file omp.h must be included in the application code. However, since the application runs directly on the host (the KNL processor), all environment variables are set for the host and the MIC prefix is no longer needed. For example, on KNL self-boot systems developers don’t need to set MIC_ENV_PREFIX, MIC_OMP_NUM_THREADS, etc., but instead use KMP_AFFINITY, MKL_NUM_THREADS, OMP_NUM_THREADS, and so on, as shown below.
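
For example, a minimal OpenMP sketch (the file name hello_omp.c is hypothetical):

#include <stdio.h>
#include <omp.h>

int main(void)
{
   #pragma omp parallel
   {
      /* one thread reports the team size for the whole parallel region */
      #pragma omp single
      printf("Running with %d OpenMP threads\n", omp_get_num_threads());
   }
   return 0;
}

# icc -qopenmp hello_omp.c -o hello_omp
# OMP_NUM_THREADS=288 KMP_AFFINITY=scatter ./hello_omp

Here 288 is illustrative (72 cores × 4 hardware threads); adjust it to the actual part.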

The Intel Cilk Plus keywords _Cilk_for, _Cilk_spawn, and _Cilk_sync (but not _Cilk_offload and _Cilk_shared, as mentioned previously) remain available. To use the cilk_for/cilk_spawn/cilk_sync spellings of these keywords, include the header file cilk/cilk.h in the application code, as in the sketch below.
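
A minimal Intel Cilk Plus sketch (illustrative file name cilk_sample.c):

#include <stdio.h>
#include <cilk/cilk.h>

#define N 1024

int main(void)
{
   int a[N];

   /* cilk_for distributes iterations across the Cilk worker threads */
   cilk_for (int i = 0; i < N; i++)
      a[i] = i * i;

   printf("a[%d] = %d\n", N - 1, a[N - 1]);
   return 0;
}

# icc -std=c99 cilk_sample.c -o cilk_sample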

Task Parallelism in Distributed Memory

Similar to KNC, the Message Passing Interface (MPI) is used on KNL systems to support parallelism across multiple nodes.

However, there are some minor differences compared to KNC: offload mode is no longer applicable, and symmetric mode needs to be converted to run on the KNL processor only (the compilation flag “-mmic” is no longer needed, and developers don’t need to set the environment variable I_MPI_MIC). Running an MPI application on a KNL processor is like running it on a multi-core Xeon® host, but with many more cores. An MPI application can also run on a cluster whose nodes are a mix of multi-core Xeon® processors and KNL processors. Note that building applications for the KNL processor is slightly different from building them for a traditional Xeon® processor, as mentioned earlier. In the future, with the integrated fabric in KNL-F, communication among these nodes will be even better.

It is worth noting that the number of MPI ranks used on KNC had to be substantially fewer than the number of cores because of KNC’s limited memory, but this rule is relaxed for the KNL processor.

Appendix C shows MPI sample code that uses intrinsics to compute a multiplication table. The following commands show how to build the binary using AVX-512 and use the SDE to emulate KNL, starting 4 ranks that print out the multiplication table:

# source /opt/intel/impi/5.0.3.048/intel64/bin/mpivars.sh
# mpiicc -xMIC-AVX512 -qopt-report=3 -qopt-report-phase=vec mpi_vect_sample.c \
    -o mpi_vect_sample.knl
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location

# mpirun -n 4 sde -knl -mix -top_blocks 100 -iform 1 -- ./mpi_vect_sample.knl

Hello world: rank 0 of 4 running on knightscorner1
 0 x  2 =  0
 0 x  1 =  0
 0 x  3 =  0
 1 x  2 =  2
 2 x  2 =  4
 3 x  2 =  6
 4 x  2 =  8
 5 x  2 = 10
 6 x  2 = 12
 7 x  2 = 14
 8 x  2 = 16
 9 x  2 = 18
10 x  2 = 20
11 x  2 = 22
12 x  2 = 24
13 x  2 = 26
14 x  2 = 28
15 x  2 = 30
 1 x  1 =  1
 2 x  1 =  2
 3 x  1 =  3
 4 x  1 =  4
 5 x  1 =  5
 6 x  1 =  6
 7 x  1 =  7
 8 x  1 =  8
 9 x  1 =  9
10 x  1 = 10
11 x  1 = 11
12 x  1 = 12
13 x  1 = 13
14 x  1 = 14
15 x  1 = 15
 1 x  3 =  3
 2 x  3 =  6
 3 x  3 =  9
 4 x  3 = 12
 5 x  3 = 15
 6 x  3 = 18
 7 x  3 = 21
 8 x  3 = 24
 9 x  3 = 27
10 x  3 = 30
11 x  3 = 33
12 x  3 = 36
13 x  3 = 39
14 x  3 = 42
15 x  3 = 45
Hello world: rank 1 of 4 running on knightscorner1
Hello world: rank 2 of 4 running on knightscorner1
Hello world: rank 3 of 4 running on knightscorner1

3. Conclusion

Applications that are already tuned for KNC can run on the KNL processor with minor changes. To use KNL effectively, an application should make good use of VPU instructions on vector data, exhibit good locality of reference, and utilize the caches well in its core computations.

Optimization methods that benefit applications for KNC should also apply to the KNL processor, although some minor changes will probably be required.

 

Appendix A: KNL Intrinsics Sample Code

/* 
// Copyright 2003-2015 Intel Corporation. All Rights Reserved. 
//  
// The source code contained or described herein and all documents related  
// to the source code ("Material") are owned by Intel Corporation or its 
// suppliers or licensors.  Title to the Material remains with Intel Corporation 
// or its suppliers and licensors.  The Material is protected by worldwide 
// copyright and trade secret laws and treaty provisions.  No part of the 
// Material may be used, copied, reproduced, modified, published, uploaded, 
// posted, transmitted, distributed, or disclosed in any way without Intel's 
// prior express written permission. 
//  
// No license under any patent, copyright, trade secret or other intellectual 
// property right is granted to or conferred upon you by disclosure or delivery 
// of the Materials, either expressly, by implication, inducement, estoppel 
// or otherwise.  Any license under such intellectual property rights must 
// be express and approved by Intel in writing. 

 #****************************************************************************** 
 # Content: (version 0.1) 
 # shuffle_sample.c : Sample intrinsic code for Intel(R) Xeon Phi(TM) Processor x200  
 #       
 #*****************************************************************************/ 

#include <stdio.h>      /* printf */
#include <immintrin.h>  /* AVX-512 intrinsics */

int main()
{
   int i;
   _MM_PERM_ENUM p32;   /* selects the 32-bit shuffle pattern */
 
   __attribute__((aligned(64))) int input1[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
   __attribute__((aligned(64))) int output[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

   printf("Vector input1:\n");
   for (i=0; i<16; i++)
   {
      printf("%3d ", input1[i]);
   }
   printf("\n\n");

   __m512i vin1 = _mm512_loadu_si512(input1);
   __m512i vout;

   p32 = _MM_PERM_AAAA;   /* replicate element A within each 128-bit lane */
   vout = _mm512_shuffle_epi32(vin1, p32);
   _mm512_storeu_si512(output, vout);

   printf("Vector output resulting from shuffle data for pattern 'AAAA':\n");
   for (i=0; i<16; i++)
   {
      printf("%3d ", output[i]);
   }
   printf("\n\n");

   p32 = _MM_PERM_ABCD;   /* reverse the four elements within each 128-bit lane */
   vout = _mm512_shuffle_epi32(vin1, p32);
   _mm512_storeu_si512(output, vout);
   
   printf("Vector output resulting from shuffle data for pattern 'ABCD':\n");
   for (i=0; i<16; i++)
   {
      printf("%3d ", output[i]);
   }
   printf("\n\n");

   vout = _mm512_permute4f128_epi32(vin1, p32);   /* reorder the four 128-bit lanes */
   _mm512_storeu_si512(output, vout);
   
   printf("Vector output resulting from permute data for pattern 'ABCD':\n");
   for (i=0; i<16; i++)
   {
      printf("%3d ", output[i]);
   }
   printf("\n\n");

   int input2[16] = {0, 1, 2, 3, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2};
   __m512i vin2 = _mm512_loadu_si512(input2);

   printf("Vector input2:\n");
   for (i=0; i<16; i++)
   {
      printf("%3d ", input2[i]);
   }
   printf("\n\n");
   
   vout = _mm512_mullo_epi32(vin1, vin2);   /* element-wise 32-bit multiply */
   _mm512_storeu_si512(output, vout);
   
   printf("Vector output resulting from multiple input1 and input2:\n");
   for (i=0; i<16; i++)
   {
      printf("%3d ", output[i]);
   }
   printf("\n\n");


   return 0;
}

 

Appendix B: Convert Sample Offload Code

/*
// Copyright 2003-2015 Intel Corporation. All Rights Reserved.
//
// The source code contained or described herein and all documents related
// to the source code ("Material") are owned by Intel Corporation or its
// suppliers or licensors.  Title to the Material remains with Intel Corporation
// or its suppliers and licensors.  The Material is protected by worldwide
// copyright and trade secret laws and treaty provisions.  No part of the
// Material may be used, copied, reproduced, modified, published, uploaded,
// posted, transmitted, distributed, or disclosed in any way without Intel's
// prior express written permission.
//
// No license under any patent, copyright, trade secret or other intellectual
// property right is granted to or conferred upon you by disclosure or delivery
// of the Materials, either expressly, by implication, inducement, estoppel
// or otherwise.  Any license under such intellectual property rights must
// be express and approved by Intel in writing.

 #******************************************************************************
 # Content: (version 0.1)
 # sample.c : Sample code for Intel(R) Xeon Phi(TM) Processor x200
 #
 #*****************************************************************************/

#include <stdio.h>
#include <offload.h>

#define N 1024

__attribute__((target(mic))) void set();

#pragma offload_attribute(push, target(mic))
static int array[N];
#pragma offload_attribute(pop)

int main (int argc, char* argv[])
{
   int num = 0;
   int i;

   for (i=0; i<N; i++)
      array[i] = 0;

#ifdef __INTEL_OFFLOAD
   num = _Offload_number_of_devices();
#endif

   printf("Number of Target devices installed: %d\n\n",num);

   if (num < 1) {
      // Run in fallback-mode
      printf("Offload section is executed on Host (fallback mode)\n\n");
 
      for (i=0; i<N; i++)
         array[i] = 1;
   }
   else {
      printf("Offload section is executed on MIC (offload mode)\n\n");

      #pragma offload target(mic)
      set();
   }
   
   printf("Elements of array are set to %d, %d,..................., %d, %d\n", array[0], array[1], array[N-2], array[N-1]);
   return 0;
}

__attribute__((target(mic))) void set()
{
   int i;
   for (i=0; i<N; i++)
      array[i] = 2;
}

 

Appendix C: Sample MPI Code

/*
// Copyright 2003-2015 Intel Corporation. All Rights Reserved.
//
// The source code contained or described herein and all documents related
// to the source code ("Material") are owned by Intel Corporation or its
// suppliers or licensors.  Title to the Material remains with Intel Corporation
// or its suppliers and licensors.  The Material is protected by worldwide
// copyright and trade secret laws and treaty provisions.  No part of the
// Material may be used, copied, reproduced, modified, published, uploaded,
// posted, transmitted, distributed, or disclosed in any way without Intel's
// prior express written permission.
//
// No license under any patent, copyright, trade secret or other intellectual
// property right is granted to or conferred upon you by disclosure or delivery
// of the Materials, either expressly, by implication, inducement, estoppel
// or otherwise.  Any license under such intellectual property rights must
// be express and approved by Intel in writing.

 #******************************************************************************
 # Content: (version 0.1)
 # mpi_vect_sample.c : Sample MPI code for Intel(R) Xeon Phi(TM) Processor x200
 #
 #*****************************************************************************/

#include <stdio.h>
#include <immintrin.h>
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define ARRAY_SIZE (1024*1024)

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;
   
  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
  MPI_Get_processor_name (name, &namelen);
  
  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i < num_procs; i++)
      {
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
      }
    }
  else   
    {
      const int size = 16;
      int* V0 = (int*) _mm_malloc(size*sizeof(int), 64);
      int* V1 = (int*) _mm_malloc(size*sizeof(int), 64);
      int* V2 = (int*) _mm_malloc(size*sizeof(int), 64);

      for (i=0; i< size; i++)
      {
         V0[i] = i;
         V1[i] = id;
         V2[i] = 0;
      }

      /* multiply 16 ints at a time: V2[i] = V0[i] * V1[i] */
      for (i=0; i<size; i+=16)
      {
         __m512i r0 = _mm512_load_epi32(V0 + i);
         __m512i r1 = _mm512_load_epi32(V1 + i);
         r0 = _mm512_mullo_epi32(r0, r1);
         _mm512_store_epi32(V2 + i, r0);
      }

      for (i=0; i<size; i++)
         printf("%2d x %2d = %2d\n", V0[i], V1[i], V2[i]);

      _mm_free((int *) V0);
      _mm_free((int *) V1);
      _mm_free((int *) V2);
  
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }
 
  MPI_Finalize();
  
  return 0;
}

 

For more complete information about compiler optimizations, see our Optimization Notice.