Code Samples

Putting Your Data and Code in Order

This two-part series discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps suggested can yield significant performance gains.

Optimizing Data Layout with SIMD Programming

Financial services customers continually demand higher performance from the financial algorithms used in their various models. Single instruction, multiple data (SIMD) programming can accelerate the computation of these workloads.
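As a rough sketch of the kind of SIMD-friendly loop this refers to (illustrative only, not code from the article; the function and data names are invented), a payoff computation over a flat structure-of-arrays batch can be written so the compiler vectorizes it:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Illustrative only: discounted call-option payoffs for a batch of terminal
    // prices stored as a flat array (structure of arrays). A simple countable
    // loop like this is what the compiler can turn into SIMD code, e.g. when
    // built with -qopenmp-simd (Intel) or -fopenmp-simd (GCC/Clang).
    void discounted_payoffs(const std::vector<double>& spot, double strike,
                            double discount, std::vector<double>& payoff) {
        const std::size_t n = spot.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            payoff[i] = discount * std::max(spot[i] - strike, 0.0);
    }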


Using Intel® Math Kernel Library Compiler Assisted Offload in Intel® Xeon Phi™ Processor

Introduction: Besides native execution, another usage model for the Intel® Math Kernel Library (Intel® MKL) on an Intel® Xeon Phi™ processor is compiler assisted offload (CAO). The CAO usage model allows users to offload Intel MKL functions...

Improving Performance of Math Functions with Intel® Math Kernel Library

Introduction: Intel® Math Kernel Library (Intel® MKL) is a product that accelerates math processing routines to increase the performance of an application running on systems equipped with Intel® processors. Intel MKL includes linear algebra,...
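As a small, hedged illustration of reaching that linear algebra functionality through the standard CBLAS interface that Intel MKL ships (assuming MKL is installed and linked, for example with the compiler's -mkl option; the matrix sizes are arbitrary):

    #include <mkl.h>      // CBLAS interface shipped with Intel MKL
    #include <vector>

    int main() {
        const int m = 2, n = 2, k = 2;
        std::vector<double> A = {1, 2, 3, 4};   // m x k, row-major
        std::vector<double> B = {5, 6, 7, 8};   // k x n, row-major
        std::vector<double> C(m * n, 0.0);      // m x n result

        // C = 1.0 * A * B + 0.0 * C
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A.data(), k, B.data(), n, 0.0, C.data(), n);
        return 0;
    }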

Optimizing Computer Applications for Latency: Part 2: Tuning Applications

For applications such as high frequency trading (HFT), search engines, and telecommunications, minimizing latency is essential. My previous article, Optimizing Computer Applications for Latency, looked at the architecture choices that...

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family

This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor.

Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple; it can easily be implemented in any programming language. This paper shows that performance significantly improves when different optimization techniques...
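For reference, the textbook kernel such papers start from, plus one simple optimization commonly measured (the i-k-j loop order, which makes the inner accesses contiguous), can be sketched as follows; this is an illustration, not the paper's code, and C is assumed to be zero-initialized:

    #include <cstddef>
    #include <vector>

    // Textbook triple loop: C += A * B for n x n row-major matrices.
    // The (i, k, j) order keeps the innermost accesses to B and C contiguous,
    // which is one of the straightforward cache-friendly improvements.
    void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k) {
                const double a = A[i * n + k];
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }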

How to use the MPI-3 Shared Memory in Intel® Xeon Phi™ Processors

Code sample included: Learn how to use the MPI-3 shared memory feature and its corresponding APIs on the Intel® Xeon Phi™ processor.
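A rough sketch of the MPI-3 calls involved (MPI_Comm_split_type, MPI_Win_allocate_shared, MPI_Win_shared_query), separate from the actual code sample; the buffer size and synchronization choices here are illustrative:

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // Group the ranks that share a node into one communicator.
        MPI_Comm shmcomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);

        int shmrank;
        MPI_Comm_rank(shmcomm, &shmrank);

        // Rank 0 of the node allocates the shared segment; others allocate 0 bytes.
        const MPI_Aint nbytes = (shmrank == 0) ? 1024 * sizeof(double) : 0;
        double* local = nullptr;
        MPI_Win win;
        MPI_Win_allocate_shared(nbytes, sizeof(double), MPI_INFO_NULL,
                                shmcomm, &local, &win);

        // Every rank queries rank 0's segment to obtain a directly usable pointer.
        MPI_Aint size;
        int disp_unit;
        double* shared = nullptr;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &shared);

        if (shmrank == 0) shared[0] = 42.0;   // plain load/store access
        MPI_Win_fence(0, win);                // simple synchronization point
        // ... all ranks on the node can now read shared[0] ...

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }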

What to Do When Auto-Vectorization Fails?

This article completes an analysis of a problem erroneously reported on the Intel® Developer Zone forum: Vectorization failed because of unsigned integer? It provides a more detailed examination showing that the unsigned integer is not what impacts compiler vectorization, and what methodology to use when...

Resetting the lowest n set bits

A couple of years ago, the Bit Manipulation Instruction Set 1 (BMI1) introduced the BLSR instruction, which resets the lowest set bit. (The corresponding intrinsic _blsr_u32/64 wraps this instruction in a convenient C/C++ function...
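For context, a naive fragment (not the article's approach, which goes further) showing how repeated applications of BLSR clear the lowest n set bits:

    #include <immintrin.h>   // _blsr_u32 (requires BMI1, e.g. compile with -mbmi)

    // Clears the lowest n bits that are set in x.
    // _blsr_u32(x) computes x & (x - 1), i.e. it resets the lowest set bit.
    unsigned reset_lowest_n_set_bits(unsigned x, unsigned n) {
        for (unsigned i = 0; i < n && x != 0; ++i)
            x = _blsr_u32(x);
        return x;
    }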

Intel® Xeon Phi™ Processor 7200 Family Memory Management Optimizations

This paper examines software performance optimization for a non-library implementation of DGEMM executing on the Intel® Xeon Phi™ processor (code-named Knights Landing, or KNL) running the Linux* operating system (OS). The...

Exploring MPI for Python* on Intel® Xeon Phi™ Processor

Learn how to write an MPI program in Python*, and take advantage of Intel® multicore architectures using OpenMP threads and Intel® AVX-512 instructions.

Thread Parallelism in Cython*

Cython* is a superset of Python* that additionally supports C functions and C types on variables and class attributes. Cython generates C extension modules, which can be used by the main Python program via the import statement.

Intel® Xeon Phi™ Processor Optimization Tutorial

In this tutorial, we demonstrate some possible ways to optimize an application to run on the Intel® Xeon Phi™ processor.

Improve Vectorization Performance with Intel® AVX-512

See how the new Intel® Advanced Vector Extensions 512 Conflict Detection (Intel® AVX-512CD) and Intel® AVX-512 Foundation (Intel® AVX-512F) instruction subsets (available in the Intel® Xeon Phi™ processor and in future Intel® Xeon® processors) let the compiler automatically generate vector code with no changes to the source code.

Introduction to the Heterogeneous Streams Library

To efficiently utilize all available resources for task-concurrency applications on heterogeneous platforms, designers need to understand the memory architecture, the thread utilization on each platform, and the pipeline for offloading workloads to the different platforms. To relieve designers of the...

LU-decomposition

Fine-Tuning Vectorization and Memory Traffic on Intel® Xeon Phi™ Coprocessors: LU Decomposition of Small Matrices

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel® Xeon Phi™ coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hint, and pointer disambiguation.
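Two of those techniques, aligned allocation plus an alignment hint and pointer disambiguation with restrict, can be sketched roughly as below; the function name and sizes are invented for the example:

    #include <immintrin.h>   // _mm_malloc / _mm_free
    #include <cstddef>

    // saxpy-style kernel. __restrict tells the compiler the arrays do not alias
    // (pointer disambiguation); the OpenMP 'aligned' clause passes the 64-byte
    // alignment hint so aligned vector loads/stores can be generated.
    void scale_add(float* __restrict y, const float* __restrict x,
                   float a, std::size_t n) {
        #pragma omp simd aligned(x, y : 64)
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    int main() {
        const std::size_t n = 1024;
        float* x = static_cast<float*>(_mm_malloc(n * sizeof(float), 64));
        float* y = static_cast<float*>(_mm_malloc(n * sizeof(float), 64));
        for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        scale_add(y, x, 0.5f, n);
        _mm_free(x);
        _mm_free(y);
        return 0;
    }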

Multithreaded Transposition of Square Matrices with Common Code for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is...
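The untuned, single-threaded core of the operation, for an n x n row-major matrix, is just a pairwise swap across the diagonal; a minimal sketch (not the article's common code) follows:

    #include <cstddef>
    #include <utility>

    // Naive in-place transpose of an n x n row-major matrix.
    // Each element above the diagonal is swapped with its mirror below it;
    // the non-contiguous column accesses are what make this bandwidth-hard.
    void transpose_in_place(double* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                std::swap(a[i * n + j], a[j * n + i]);
    }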

Monte-Carlo simulation on Asian Options Pricing

This is an exercise in performance optimization on heterogeneous Intel architecture systems based on multi-core processors and manycore (MIC) coprocessors.

Direct N-body Simulation

Exercise in performance optimization on Intel Architecture, including Intel® Xeon Phi™ processors.

Caffe* Optimized for Intel® Architecture: Applying Modern Code Techniques

This paper demonstrates a special version of Caffe* — a deep learning framework originally developed by the Berkeley Vision and Learning Center (BVLC) — that is optimized for Intel® architecture.

Debug Intel® Transactional Synchronization Extensions

If printf or fprintf functions cause transaction aborts, use Intel® Processor Trace as a work-around.

Improve Application Performance on an Intel® Xeon Phi™ Processor

Learn techniques for vectorizing code, adding thread-level parallelism, and enabling memory optimization.

Code Sample: Allocate Memory Efficiently on an Intel® Xeon Phi™ Processor

How to efficiently use Multi-Channel DRAM (MCDRAM) and synchronous dynamic random-access memory.
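One common way to place a buffer in MCDRAM from C/C++ is the memkind library's hbwmalloc interface; a minimal sketch, assuming memkind is installed and linked with -lmemkind, and separate from the linked code sample:

    #include <hbwmalloc.h>   // memkind's high-bandwidth-memory allocator
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const size_t n = 1 << 20;

        // Fall back to a regular DDR allocation if no MCDRAM is available.
        const bool hbw_ok = (hbw_check_available() == 0);
        double* buf = hbw_ok
            ? static_cast<double*>(hbw_malloc(n * sizeof(double)))
            : static_cast<double*>(std::malloc(n * sizeof(double)));

        if (!buf) { std::puts("allocation failed"); return 1; }
        for (size_t i = 0; i < n; ++i) buf[i] = 0.0;   // touch the pages

        if (hbw_ok) hbw_free(buf);
        else        std::free(buf);
        return 0;
    }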

Putting Your Data and Code in Order: Data and layout - Part 2

Last updated: February 5, 2016. Author: David M.

Apply the concepts of parallelism and distributed memory computing to your code to improve software performance. This paper expands on concepts discussed in Part 1 to consider parallelism: both vectorization (single instruction, multiple data, or SIMD) and shared memory parallelism (threading),...
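A minimal, illustrative way to combine those two forms of parallelism with OpenMP (not code from the article) is a combined parallel-for-simd loop:

    #include <cstddef>
    #include <vector>

    // Threads split the iteration space across cores ("parallel for") and each
    // thread's chunk is additionally vectorized ("simd"). Build with an OpenMP
    // flag such as -qopenmp (Intel) or -fopenmp (GCC/Clang).
    void vector_add(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c) {
        const std::size_t n = a.size();
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }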

Putting Your Data and Code in Order: Optimization and Memory – Part 1

Last updated: February 4, 2016. Author: David M.

This series of two articles discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps shown in these two articles can yield significant performance gains. The articles are aimed at an intermediate level. It is...

Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and one of the most popular community frameworks for image recognition. Caffe is often used as a benchmark together with AlexNet*, a neural network topology for image recognition, and ImageNet*, a...

Single-Producer/Single-Consumer Queue

Last updated: January 1, 2015. Author: Dmitry Vyukov

An unbounded single-producer/single-consumer queue. An internal, non-reducible cache of nodes is used. The dequeue operation is always wait-free; the enqueue operation is wait-free in the common case. No atomic RMW operations or heavy memory fences are used.

Monte Carlo European Option Pricing with RNG Interface for Intel® Xeon Phi™ Coprocessor

Download available under the Intel Sample Source Code License Agreement. Background: Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. The name comes from the...
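Independent of the downloadable sample and its RNG interface, the method itself can be sketched in a few lines: draw terminal prices from geometric Brownian motion and average the discounted payoffs. The parameter names below are the usual Black-Scholes symbols:

    #include <algorithm>
    #include <cmath>
    #include <random>

    // Plain Monte Carlo estimate of a European call under Black-Scholes dynamics.
    // S0: spot, K: strike, r: risk-free rate, sigma: volatility, T: maturity.
    double mc_european_call(double S0, double K, double r, double sigma,
                            double T, unsigned long paths) {
        std::mt19937_64 rng(12345);
        std::normal_distribution<double> normal(0.0, 1.0);

        const double drift = (r - 0.5 * sigma * sigma) * T;
        const double vol   = sigma * std::sqrt(T);

        double sum = 0.0;
        for (unsigned long i = 0; i < paths; ++i) {
            const double z  = normal(rng);
            const double ST = S0 * std::exp(drift + vol * z);   // terminal price
            sum += std::max(ST - K, 0.0);                        // call payoff
        }
        return std::exp(-r * T) * (sum / paths);                 // discounted mean
    }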

GROMACS for Intel® Xeon Phi™ Coprocessor

Purpose: This code recipe describes how to get, build, and use the GROMACS* code with support for the Intel® Xeon Phi™ coprocessor with Intel® Many Integrated Core (MIC) architecture. Introduction: GROMACS is a versatile package to perform...

Binomial Options Pricing Model Code for Intel® Xeon Phi™ Coprocessor

Introduction: The Binomial Options Pricing Model (BOPM) is a generalized numerical method used to value options in the quantitative financial services industry. More precisely, it is a lattice-based approach that uses a discrete-time model of the...
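A compact textbook Cox-Ross-Rubinstein version of such a lattice, for a European call, is sketched below; this is an illustration of the method only, not the coprocessor-optimized sample:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Cox-Ross-Rubinstein binomial lattice for a European call option.
    // S0: spot, K: strike, r: rate, sigma: volatility, T: maturity, steps: depth.
    double binomial_european_call(double S0, double K, double r, double sigma,
                                  double T, int steps) {
        const double dt   = T / steps;
        const double u    = std::exp(sigma * std::sqrt(dt));   // up factor
        const double d    = 1.0 / u;                            // down factor
        const double p    = (std::exp(r * dt) - d) / (u - d);   // risk-neutral prob.
        const double disc = std::exp(-r * dt);

        // Option values at maturity; node j corresponds to j up-moves.
        std::vector<double> v(steps + 1);
        for (int j = 0; j <= steps; ++j)
            v[j] = std::max(S0 * std::pow(u, j) * std::pow(d, steps - j) - K, 0.0);

        // Backward induction through the lattice, discounting at each step.
        for (int n = steps - 1; n >= 0; --n)
            for (int j = 0; j <= n; ++j)
                v[j] = disc * (p * v[j + 1] + (1.0 - p) * v[j]);

        return v[0];
    }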

Improving Averaging Filter Performance Using Intel® Cilk™ Plus

Intel® Cilk™ Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords to implement task parallelism, and Array Notation, the simd pragma, and Elemental Functions to express data parallelism. ...