# Threaded Programming Methodology with Parallel Studio

In this 3-hour module, participants will learn about the evolution of parallel processing architectures. After completing this module, a student should be able to describe how threading architectures relate to software development, rapidly estimate the effort required to thread time-consuming regions, and prototype the solution.

Topics covered include:

# Using Intel VTune to Identify Bottlenecks

This is the first Accelerate contest in which we have had such easy access to Intel tools like Intel Inspector and Intel VTune.
In the previous edition, Maxime and I suffered from that lack: we identified a bottleneck in our code too late, after connecting to the MTL in desktop mode over SSH to run VTune and examine the execution profile.

# Intel® MKL Poisson Library and Trigonometric Transform: Overview

Intel® MKL provides tools for solving Partial Differential Equations (PDE): the Trigonometric Transform interface and the Poisson Library routines.
• Apple OS X*
• Linux*
• Microsoft Windows* (XP, Vista, 7)
• Cluster Tools
• Intel® C++ Compiler
• Intel® Composer XE
• Intel® Fortran Compiler
• Intel® Parallel Composer
• Intel® Math Kernel Library
• Intel® Parallel Studio
• Intel® Parallel Studio XE
• Partial Differential Equations (PDE)
• Poisson Library
• Trigonometric Transform Routines

# Superscalar programming 101 (Matrix Multiply) Part 5 of 5

In Part 4 we saw the effects of the QuickThread Parallel Tag Team Transpose method of matrix multiplication performed on a dual Xeon 5570 system with two sockets and two L3 caches, each shared by four cores (8 threads), and with each processor having four L2 and four L1 caches, each shared by one core (2 threads). We find:

# Superscalar programming 101 (Matrix Multiply) Part 4 of 5

In the last installment (Part 3) we saw the effects of the QuickThread Parallel Tag Team method of matrix multiplication performed on two single-processor systems:

# Superscalar programming 101 (Matrix Multiply) Part 3 of 5

By Jim Dempsey

In the previous article (Part 2) we saw that, by reorganizing the loops and using a temporary array, we can observe a performance gain from SSE small-vector optimizations (the compiler does this), but a larger gain came from better cache utilization due to the layout change and array access order. The improvements pushed us into a memory bandwidth limitation, whereby the Serial method now outperforms the Parallel method.


# Superscalar Programming 101 (Matrix Multiply) Part 1 of 5

By Jim Dempsey