Common Best Known Methods for Parallel Performance, from Intel® Xeon® to Intel® Xeon Phi™ Processors

To improve the performance of applications and kernels we are constantly on the search for novel Best Known Methods or BKMs, but as our searches grow more esoteric, it is important to keep in mind the basics and how many performance improvements rely on them.  This article will describe some common BKMs for improving parallel performance and show their application over this spectrum of processor architectures.  The advice collected here should help you speed up your code, whether running on an Intel® Xeon Phi™ coprocessor or an Intel Xeon process

  • Developers
  • Professors
  • Students
  • C/C++
  • Fortran
  • Intermediate
  • Intel® C++ Composer XE
  • Intel® Fortran Composer XE
  • Intel® VTune™ Amplifier XE
  • Best Known Methods
  • parallel performance
  • Message Passing Interface
  • OpenMP*
  • Intel® Core™ Processors
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Threading
  • Vectorization
  • Fun with Intel® Transactional Synchronization Extensions

    By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out the Intel TSX overview page before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, which makes life easier for developers who write synchronization code for concurrent and parallel applications.

    Introduction to the Intel® Xeon Phi™ Coprocessor

    This tutorial introduces the basic hardware and software architecture of the Intel Xeon Phi coprocessor and describes its general features. It also gives a first look at the programming models that support High Performance Computing on hosts equipped with the coprocessor: offloading selected code to the coprocessor, Virtual Shared Memory, and parallel programming with OpenMP*, Intel Cilk Plus, and Intel Threading Building Blocks.

  • Developers
  • Intel® Cilk™ Plus
  • Intel® Threading Building Blocks
  • Intel® Xeon Phi™ Coprocessor
  • OpenMP*
  • Setting number_of_user_threads for Intel® Math Kernel Library FFTW3 wrappers

    Consider the case when you

    • Create an FFTW3 plan and use the plan for sequential DFT computation on each thread in your parallel region
    • Use Intel Math Kernel Library (Intel MKL) FFTW3 wrappers
    • Want the best performance

    Intel MKL FFTW3 wrappers are thread safe by default. However, to get the best performance in this case you should set one additional Intel MKL variable, number_of_user_threads, as described below.

    In C:

    #include "fftw3.h"
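A minimal sketch of the setting described above, assuming the `fftw3_mkl.h` header shipped with the Intel MKL FFTW3 wrappers and its global `fftw3_mkl` structure. The macro `SKETCH_USE_MKL`, the stand-in structure used when MKL is not available, and the helper `announce_user_threads` are hypothetical names introduced only for illustration; they are not part of Intel MKL.

```c
#include <assert.h>

#ifdef SKETCH_USE_MKL                 /* hypothetical macro: define when building against Intel MKL */
#include "fftw3.h"
#include "fftw3_mkl.h"                /* assumed to declare the global fftw3_mkl structure */
#else
/* Stand-in so the sketch compiles without Intel MKL installed.
   This is an assumption for illustration, not the real header. */
static struct { int number_of_user_threads; } fftw3_mkl = { 1 };
#endif

/* Record how many user threads will share one plan.
   Intended to be called before the plan is created.
   Returns the value now stored in the configuration structure. */
static int announce_user_threads(int nthreads)
{
    assert(nthreads > 0);             /* a thread count must be positive */
    fftw3_mkl.number_of_user_threads = nthreads;
    return fftw3_mkl.number_of_user_threads;
}
```

In a real program you would likely call the helper with the size of your parallel region (for example, the value returned by `omp_get_max_threads()` when using OpenMP*), then create the plan once and execute it from each thread.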

  • Intermediate
  • Intel® C++ Composer XE
  • Intel® Math Kernel Library
  • OpenMP*
  • FFT
  • FFTW
  • FFTW3
  • fourier transform
  • MKL FFTW3 wrappers
  • number_of_user_threads
  • Optimization
  • Parallel Computing
  • Threading
  • Slides from the talk on Parallel Computing at FISL14

    The talk "How to tame a 1 TFlop beast that fits in the palm of your hand" was presented on July 3, 2013, at FISL14, by Luciano Palma, Intel Community Manager for Servers and High Performance Computing.

    Besides introducing parallel programming concepts and discussing why implementing parallelism in software matters, Luciano presented the Intel Xeon Phi coprocessor, its impressive technical characteristics (up to 61 cores delivering up to 2 TFlops in single precision), and the architecture of this advanced coprocessor.

    Finding the right fit for your application on Intel® Xeon and Intel® Xeon Phi™ processors

    Not all applications are created equal.   Some are chomping at the bit to harvest as much parallelism as a target platform can provide.  Those may be good candidates for running on an Intel® Xeon Phi™ Coprocessor.  Other applications are scalar (not vectorized) and sequential (not threaded).  They won't even make full use of an Intel Xeon processor, much less an Intel Xeon Phi Coprocessor.  Before moving to a highly-parallel platform, the developer-tuner needs to expose enough parallelism to hit the limits of the Intel Xeon platform.  Once the demand for threads an
