Putting Your Data and Code in Order

This two-part series discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps shown can yield significant performance gains. 

Data Layout Optimization Using SIMD Programming

Financial services customers continually need to improve algorithm performance across a wide range of models. Single Instruction Multiple Data (SIMD) programming can speed up these workloads. 


Code Sample: Exploring MPI for Python* on Intel® Xeon Phi™ Processor

Learn how to write an MPI program in Python* and take advantage of Intel® multicore architectures using OpenMP threads and Intel® AVX-512 instructions.

Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors

MILC software represents a set of codes written by the MIMD Lattice Computation collaboration used to study quantum chromodynamics. This article provides instructions for code access, build and run directions for the “ks_imp_rhmc” application on Intel® Xeon® Gold and Intel® Xeon Phi™ processors for...

Using Intel® Math Kernel Library Compiler Assisted Offload in Intel® Xeon Phi™ Processor


Besides native execution, another usage model for the Intel® Math Kernel Library (Intel® MKL) on an Intel® Xeon Phi™ processor is compiler assisted offload (CAO). The CAO usage model allows users to offload Intel MKL...

Improving Performance of Math Functions with Intel® Math Kernel Library


Intel® Math Kernel Library (Intel® MKL) is a product that accelerates math processing routines to increase the performance of an application when running on systems equipped with Intel® processors. Intel MKL includes linear algebra...

Optimizing Computer Applications for Latency: Part 2: Tuning Applications

For applications such as high frequency trading (HFT), search engines, and telecommunications, it is essential to minimize latency. My previous article, Optimizing Computer Applications for Latency, looked at the architecture choices that...

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family

This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor.

Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple and can easily be implemented in any programming language. This paper shows that performance significantly improves when different optimization techniques...
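As a rough illustration of how simple the classic algorithm is, the sketch below multiplies two matrices stored as nested Python lists with the textbook triple loop. The function name `matmul` and the list-of-lists representation are assumptions made for this example; it is an unoptimized baseline, not the code or the optimization techniques from the paper.

```python
def matmul(a, b):
    """Classic O(n*m*k) matrix multiplication on nested lists (illustrative only)."""
    n, k, m = len(a), len(b), len(b[0])
    # Every row of `a` must have as many entries as `b` has rows.
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(m):      # columns of b
            for p in range(k):  # shared inner dimension
                c[i][j] += a[i][p] * b[p][j]
    return c


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this loop order the innermost loop walks down a column of `b`, which is the kind of non-contiguous access pattern that optimized implementations typically try to avoid.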

How to use the MPI-3 Shared Memory in Intel® Xeon Phi™ Processors

Code Sample included: Learn how to use the MPI-3 shared memory feature using the corresponding APIs on the Intel® Xeon Phi™ processor.

What to Do When Auto-Vectorization Fails?

This article completes an analysis of a problem erroneously reported on the Intel® Developer Zone forum: Vectorization failed because of unsigned integer? It provides a more detailed examination showing that the unsigned integer does not affect compiler vectorization, and describes what methodology to use when...

Resetting the lowest n set bits

A couple of years ago, the Bit Manipulation Instruction Set 1 (BMI1) introduced the instruction BLSR, which resets the lowest bit that is set. (The corresponding intrinsic _blsr_u32/64 wraps this instruction with some nice C/C++ function...
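The operation BLSR performs can be sketched in portable code as `x & (x - 1)`, which clears the lowest set bit of a non-negative integer. The helper names below are hypothetical, and the loop that clears the lowest n set bits is only the naive approach; the blurb is truncated, so this is not necessarily the method the article goes on to describe.

```python
def reset_lowest_set_bit(x: int) -> int:
    """Clear the lowest set bit of a non-negative integer: x & (x - 1)."""
    return x & (x - 1)


def reset_lowest_n_set_bits(x: int, n: int) -> int:
    """Naive sketch: clear the lowest set bit up to n times."""
    for _ in range(n):
        if x == 0:  # nothing left to clear
            break
        x = reset_lowest_set_bit(x)
    return x


one_cleared = reset_lowest_set_bit(0b101100)        # 0b101000
two_cleared = reset_lowest_n_set_bits(0b101100, 2)  # 0b100000
```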

Intel® Xeon Phi™ Processor 7200 Family Memory Management Optimizations

This paper examines software performance optimization for an implementation of a non-library version of DGEMM executing on the Intel® Xeon Phi™ processor (code-named Knights Landing, with acronym KNL) running the Linux* Operating System (OS). The...

Thread Parallelism in Cython*

Cython* is a superset of Python* that additionally supports C functions and C types on variable and class attributes. Cython generates C extension modules, which can be used by the main Python program using the import statement.

Improve Vectorization Performance with Intel® AVX-512

See how the new Intel® Advanced Vector Extensions 512CD and Intel AVX512F subsets (available in the Intel® Xeon Phi processor and in future Intel Xeon processors) let the compiler automatically generate vector code with no changes to the code.

Introduction to the Heterogeneous Streams Library

To efficiently utilize all available resources for a task-concurrent application on heterogeneous platforms, designers need to understand the memory architecture, the thread utilization on each platform, and the pipeline to offload the workload to different platforms. To relieve designers of the...


Fine-Tuning Vectorization and Memory Traffic on Intel® Xeon Phi™ Coprocessors: LU Decomposition of Small Matrices

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel® Xeon Phi™ coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hint, and pointer disambiguation.

Multithreaded Transposition of Square Matrices with Common Code for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is...
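For reference, a minimal single-threaded sketch of in-place transposition of a square matrix is shown below, using nested Python lists and a hypothetical function name `transpose_in_place`. It only illustrates the access pattern (each swap touches one element in row `i` and one in row `j`); it is not the multithreaded implementation the paper discusses.

```python
def transpose_in_place(m):
    """Transpose a square matrix (list of lists) in place and return it."""
    n = len(m)
    if any(len(row) != n for row in m):
        raise ValueError("matrix must be square")
    for i in range(n):
        # Only visit the upper triangle so each pair is swapped once.
        for j in range(i + 1, n):
            m[i][j], m[j][i] = m[j][i], m[i][j]
    return m


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transpose_in_place(grid)
```

The reads of `m[j][i]` stride across rows, which is the non-contiguous access the blurb refers to.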

Monte-Carlo simulation on Asian Options Pricing

This is an exercise in performance optimization on heterogeneous Intel architecture systems based on multi-core processors and manycore (MIC) coprocessors.

Direct N-body Simulation

Exercise in performance optimization on Intel Architecture, including Intel® Xeon Phi™ processors.

Caffe* Optimized for Intel® Architecture: Applying Modern Code Techniques

This paper demonstrates a special version of Caffe* — a deep learning framework originally developed by the Berkeley Vision and Learning Center (BVLC) — that is optimized for Intel® architecture.

Debug Intel® Transactional Synchronization Extensions

If printf or fprintf functions cause transaction aborts, use Intel® Processor Trace as a work-around.

Improve Application Performance on an Intel® Xeon Phi™ Processor

Learn techniques for vectorizing code, adding thread-level parallelism, and enabling memory optimization.

Code Sample: Allocate Memory Efficiently on an Intel® Xeon Phi™ Processor

How to efficiently use Multi-Channel DRAM (MCDRAM) and synchronous dynamic random-access memory.

Putting Your Data and Code in Order: Data and layout - Part 2

Published on February 5, 2016 By David M.

Apply the concepts of parallelism and distributed memory computing to your code to improve software performance. This paper expands on concepts discussed in Part 1 to consider parallelism, both vectorization (single instruction, multiple data, or SIMD) and shared memory parallelism (threading),...

Putting Your Data and Code in Order: Optimization and Memory – Part 1

Published on February 4, 2016 By David M.

This series of two articles discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps shown in these two articles can yield significant performance gains. These two articles are designed at an intermediate level. It is...

Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and one of the most popular community frameworks for image recognition. Caffe is often used as a benchmark together with AlexNet*, a neural network topology for image recognition, and ImageNet*, a...

Single-Producer/Single-Consumer Queue

Published on February 28, 2009, updated January 1, 2015 By Dmitry Vyukov

Unbounded single-producer/single-consumer queue. Internal non-reducible cache of nodes is used. Dequeue operation is always wait-free. Enqueue operation is wait-free in common case. No atomic RMW operations nor heavy memory fences are used.

Monte Carlo European Option Pricing with RNG Interface for Intel® Xeon Phi™ Coprocessor

Download available under the Intel Sample Source Code License Agreement.


Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. The name comes...

GROMACS for Intel® Xeon Phi™ Coprocessor


This code recipe describes how to get, build, and use the GROMACS* code with support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture.


GROMACS is a versatile package to...

Binomial Options Pricing Model Code for Intel® Xeon Phi™ Coprocessor


The Binomial Options Pricing Model (BOPM) is a generalized numerical method used to value options in the quantitative Financial Services industry. To be accurate, it is a lattice-based approach that uses a discrete-time model of the...

Improving Averaging Filter Performance Using Intel® Cilk™ Plus

Intel® Cilk™ Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords to implement task parallelism, and Array Notation, the simd pragma, and Elemental Functions to express data parallelism...