Code Samples

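Putting Your Data and Code in Order

This two-part series discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps it recommends can yield significant performance gains.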

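Optimizing Data Layout with SIMD Programming

Financial services customers continually demand better performance from the financial algorithms used in a variety of models. Single instruction, multiple data (SIMD) programming can speed up these workloads.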
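As a rough illustration of what "data layout" means here, the sketch below contrasts an array-of-structures (AoS) layout with a structure-of-arrays (SoA) layout for a simple payoff sum. The type and function names (`OptionAoS`, `OptionsSoA`, `sum_payoff_aos`, `sum_payoff_soa`) are hypothetical and are not taken from the article; whether a compiler actually vectorizes either loop depends on the compiler and its flags.

```c
#include <stddef.h>

/* Array of structures (AoS): the fields of one option sit next to each
   other, so the prices of consecutive options are separated by the other
   fields (a non-unit stride). Hypothetical type for illustration. */
typedef struct {
    double price;
    double strike;
    double maturity;
} OptionAoS;

/* Structure of arrays (SoA): each field is stored contiguously, so a loop
   over one field reads unit-stride memory, which is usually easier for a
   compiler to turn into SIMD loads. Hypothetical type for illustration. */
typedef struct {
    double *price;
    double *strike;
    double *maturity;
} OptionsSoA;

/* Sum of max(price - strike, 0) over n options, AoS layout. */
double sum_payoff_aos(const OptionAoS *opts, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double diff = opts[i].price - opts[i].strike;
        total += diff > 0.0 ? diff : 0.0;
    }
    return total;
}

/* Same computation, SoA layout: both input streams are contiguous. */
double sum_payoff_soa(const OptionsSoA *opts, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double diff = opts->price[i] - opts->strike[i];
        total += diff > 0.0 ? diff : 0.0;
    }
    return total;
}
```

Both functions compute the same value; only the memory access pattern differs, which is the property a layout optimization targets.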

23 search results

Code Sample: Optimizing Binarized Neural Networks on Intel® Xeon® Scalable Processors

In the previous article, we discussed the performance and accuracy of Binarized Neural Networks (BNN). We also introduced a BNN coded from scratch in the Wolfram Language. The key component of this neural network is Matrix Multiplication.

Code Sample: Exploring MPI for Python* on Intel® Xeon Phi™ Processor

Updated: April 4, 2018. Author: Nguyen, Loc Q

Learn how to write an MPI program in Python*, and take advantage of Intel® multicore architectures using OpenMP threads and Intel® AVX512 instructions.

Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors

Updated: January 9, 2018. Author: Smahane Douyeb

MILC software represents a set of codes written by the MIMD Lattice Computation collaboration used to study quantum chromodynamics. This article provides instructions for code access, build and run directions for the “ks_imp_rhmc” application on Intel® Xeon® Gold and Intel® Xeon Phi™ processors for...

Using Intel® Math Kernel Library Compiler Assisted Offload in Intel® Xeon Phi™ Processor

Introduction

Besides native execution, another usage model for the Intel® Math Kernel Library (Intel® MKL) on an Intel® Xeon Phi™ processor is compiler assisted offload (CAO). The CAO usage model allows users to offload Intel MKL...

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family

This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor.

How to Use MPI-3 Shared Memory on Intel® Xeon Phi™ Processors

Learn how to use MPI-3 shared memory on Intel® Xeon Phi™ processors.

Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System

Published: April 14, 2017

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple and can be easily implemented in any programming language. This paper shows that performance significantly improves when different optimization techniques...
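For orientation, here is a minimal sketch of the classic triple-loop algorithm for square, row-major matrices, next to a variant with the two inner loops interchanged. Loop interchange is shown only as one commonly used optimization; the excerpt does not say which techniques the paper applies, and the function names `matmul_ijk` and `matmul_ikj` are hypothetical.

```c
#include <stddef.h>

/* C = A * B for n x n row-major matrices, textbook i-j-k order.
   The inner loop walks down a column of B, i.e. with stride n. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Same result with the j and k loops interchanged (i-k-j order).
   The inner loop now reads a row of B and updates a row of C with
   unit stride, which tends to be friendlier to caches and SIMD. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            c[i * n + j] = 0.0;
        }
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

Both functions perform the same number of multiply-add operations; they differ only in the order in which memory is touched.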

How to use the MPI-3 Shared Memory in Intel® Xeon Phi™ Processors

Code sample included: Learn how to use the MPI-3 shared memory feature, and its corresponding APIs, on the Intel® Xeon Phi™ processor.

Monte Carlo Simulation for Asian Options Pricing

Published: February 23, 2017. Author: Mike P.

This is an exercise in performance optimization on heterogeneous Intel architecture systems based on multi-core processors and manycore (MIC) coprocessors.

Resetting the lowest n set bits

A couple of years ago, the Bit Manipulation Instruction Set 1 (BMI1) introduced the instruction BLSR, which resets the lowest bit that is set. (The corresponding intrinsic _blsr_u32/64 wraps this instruction with some nice C/C++ function...
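The single-bit operation that BLSR performs can be written in portable C as `x & (x - 1)`. The sketch below uses that identity and simply repeats it n times; this loop is an illustrative baseline, not necessarily the method the article develops, and the function names `reset_lowest_set_bit` and `reset_lowest_n_set_bits` are hypothetical.

```c
#include <stdint.h>

/* Clears the lowest set bit of x, the same result BLSR produces.
   For x == 0 the unsigned subtraction wraps around and the result is 0. */
static inline uint32_t reset_lowest_set_bit(uint32_t x) {
    return x & (x - 1u);
}

/* Clears the lowest n set bits of x by repeating the single-bit step.
   Stops early once no set bits remain. */
uint32_t reset_lowest_n_set_bits(uint32_t x, unsigned n) {
    while (n > 0 && x != 0) {
        x = reset_lowest_set_bit(x);
        --n;
    }
    return x;
}
```

For example, with x = 0x2C (binary 101100), clearing the lowest two set bits leaves 0x20 (binary 100000).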

Intel® Xeon Phi™ Processor 7200 Family Memory Management Optimizations

Published: December 22, 2016

This paper examines software performance optimization for an implementation of a non-library version of DGEMM executing on the Intel® Xeon Phi™ processor (code-named Knights Landing, with acronym KNL) running the Linux* Operating System (OS). The...

Thread Parallelism in Cython*

Cython* is a superset of Python* that additionally supports C functions and C types on variable and class attributes. Cython generates C extension modules, which can be used by the main Python program using the import statement.

Intel® Xeon Phi™ Processor Optimization Tutorial

In this tutorial, we demonstrate some possible ways to optimize an application to run on the Intel® Xeon Phi™ processor.

Introduction to the Heterogeneous Streams Library

To efficiently utilize all available resources for a task-concurrent application on heterogeneous platforms, designers need to understand the memory architecture, the thread utilization on each platform, and the pipeline used to offload the workload to different platforms. To relieve designers of the...

Monte-Carlo simulation on Asian Options Pricing

Published: September 7, 2016. Author: Mike P.

This is an exercise in performance optimization on heterogeneous Intel architecture systems based on multi-core processors and manycore (MIC) coprocessors.

Direct N-body Simulation

Published: September 7, 2016. Author: Mike P.

Exercise in performance optimization on Intel Architecture, including Intel® Xeon Phi™ processors.

Debug Intel® Transactional Synchronization Extensions

If printf or fprintf functions cause transaction aborts, use Intel® Processor Trace as a work-around.

Improve Application Performance on an Intel® Xeon Phi™ Processor

Learn techniques for vectorizing code, adding thread-level parallelism, and enabling memory optimization.

Code Sample: Allocate Memory Efficiently on an Intel® Xeon Phi™ Processor

Published: March 10, 2016. Author: Mike P.

How to efficiently use Multi-Channel DRAM (MCDRAM) and synchronous dynamic random-access memory.

Putting Your Data and Code in Order: Optimization and Memory – Part 1

Published: February 4, 2016. Author: David M.

This series of two articles discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps shown in these two articles can yield significant performance gains. These two articles are designed at an intermediate level. It is...

Putting Your Data and Code in Order: Optimization and Memory – Part 1

Published: February 4, 2016. Author: David M.

This series of two articles discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps shown in these two articles can yield significant performance gains. These two articles are designed at an intermediate level. It is...

Single-Producer/Single-Consumer Queue

Updated: January 1, 2015. Author: Dmitry Vyukov

Unbounded single-producer/single-consumer queue. Internal non-reducible cache of nodes is used. Dequeue operation is always wait-free. Enqueue operation is wait-free in common case. No atomic RMW operations nor heavy memory fences are used.

Improving Averaging Filter Performance Using Intel® Cilk™ Plus

Published: July 25, 2013. Author: Anoop M.

Intel® Cilk™ Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords to implement task parallelism, and Array Notation, the simd pragma, and Elemental Functions to express data parallelism...