This blog contains additional content for the article "Advanced Vectorization" from Parallel Universe #12:
Intel System Studio not only provides a variety of signal-processing primitives via Intel® Integrated Performance Primitives (Intel® IPP) and Intel® Math Kernel Library (Intel® MKL), but also supports developing high-performance, low-latency custom code (Intel C++ Compiler with Intel Cilk Plus). Since Intel Cilk Plus is built into the compiler, it can be used where it demands an efficient threading...
Intel® Cilk™ Plus enabled parallelizing a chess puzzle solver with a few changes.
Continuing my previous post, I describe some of the challenges in implementing DotMix, a deterministic parallel random-number generator (DPRNG) for Intel® Cilk™ Plus.
Pedigrees are a new feature implemented in Intel Cilk Plus and currently available in Intel® Composer XE 2013. In this post, I explain what pedigrees are, how they work, and how you can use them in Cilk Plus. Pedigrees are a key component used in the implementation of DotMix, a contributed code for a deterministic parallel random-number generator (DPRNG) discussed in my previous post.
In this article, I discuss some common performance pitfalls in Cilk™ Plus programs that prevent users from seeing speedups in their code, and describe some techniques for avoiding these pitfalls.
This is the second article in a series about high-performance computing with the Intel Xeon Phi.
The Intel® Cilk™ Plus C/C++ language extensions support the expression of portable and efficient task-parallel and vector-parallel programs. Cilk Plus/LLVM is an implementation of these extensions in the Clang frontend for LLVM. In this article we explain one of the optimizations that we have implemented in Cilk Plus/LLVM: late initialization of frame descriptors. With this explanation, we provide a...
Apply the concepts of parallelism and distributed-memory computing to your code to improve software performance. This paper expands on concepts discussed in Part 1 to cover vectorization (single instruction, multiple data, or SIMD), shared-memory parallelism (threading), and distributed-memory computing.
This article focuses on the steps for improving software performance with vectorization. It includes examples of full applications, along with some simpler cases that illustrate the steps.