To improve the performance of applications and kernels, we are constantly searching for novel Best Known Methods (BKMs). As our searches grow more esoteric, though, it is important to keep the basics in mind and to remember how many performance improvements rely on them. This article describes some common BKMs for improving parallel performance and shows their application over this spectrum of processor architectures. The advice collected here should help you speed up your code, whether running on an Intel® Xeon Phi™ coprocessor or an Intel Xeon process
By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out this page (http://www.intel.com/software/tsx) before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, which makes life easier for developers who write synchronization code for concurrent and parallel applications.
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
This tutorial introduces the basic hardware and software architecture of the Intel Xeon Phi coprocessor, describes its general features, and provides a first look at the programming models that support High Performance Computing on hosts equipped with the coprocessor. These include offloading selected code to the coprocessor, Virtual Shared Memory, and parallel programming using OpenMP*, Intel Cilk Plus, and Intel Threading Building Blocks.
The figures, tables, drawings, etc. used in our book can be downloaded from the book's website. We appreciate attribution, but there are no restrictions on use in educational material (presentations)!
Suggested attribution: (c) 2013 Jim Jeffers and James Reinders, used with permission.
Consider the case when you:
- Create an FFTW3 plan and use that plan for sequential DFT computation on each thread in your parallel region
- Use Intel Math Kernel Library (Intel MKL) FFTW3 wrappers
- Want the best performance
Intel MKL FFTW3 wrappers are thread safe by default. However, to get the best performance in this case you should set one additional Intel MKL variable: set the number_of_user_threads variable as described below.
The talk "Como domar uma fera de 1 TFlop que cabe na palma da sua mão" ("How to tame a 1 TFlop beast that fits in the palm of your hand") was presented on July 3, 2013, at FISL14 by Luciano Palma, Intel Community Manager for Servers and High Performance Computing.
In addition to introducing parallel programming concepts and discussing the importance of implementing parallelism in software, Luciano presented the Intel Xeon Phi coprocessor, its impressive technical characteristics (up to 61 cores delivering up to 2 TFlops in single precision), and the architecture of this advanced coprocessor.
Not all applications are created equal. Some are chomping at the bit to harvest as much parallelism as a target platform can provide. Those may be good candidates for running on an Intel® Xeon Phi™ Coprocessor. Other applications are scalar (not vectorized) and sequential (not threaded). They won't even make full use of an Intel Xeon processor, much less an Intel Xeon Phi Coprocessor. Before moving to a highly-parallel platform, the developer-tuner needs to expose enough parallelism to hit the limits of the Intel Xeon platform. Once the demand for threads an