Intel® Cilk Plus Software Development Kit

Superscalar programming 101 (Matrix Multiply) Part 5 of 5

In part 4 we saw the effects of the QuickThread Parallel Tag Team Transpose method of matrix multiplication on a dual Xeon 5570 system: two sockets with two L3 caches, each shared by four cores (8 threads), and on each processor four L2 and four L1 caches, each shared by one core (2 threads).
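
The series builds on multiplying against a transposed copy of the second operand so that both matrices stream through memory with unit stride. The QuickThread Tag Team scheduling itself is not shown here; the following is only a minimal serial sketch of that transpose idea, with illustrative names.

    // Minimal sketch (not the QuickThread Tag Team implementation): transpose B
    // into Bt so that both operands are read with unit stride in the inner loop.
    #include <vector>
    #include <cstddef>

    void matmul_transposed(const std::vector<double>& A,   // N x N, row-major
                           const std::vector<double>& B,   // N x N, row-major
                           std::vector<double>& C,         // N x N, row-major
                           std::size_t N)
    {
        std::vector<double> Bt(N * N);                     // temporary transpose of B
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                Bt[j * N + i] = B[i * N + j];

        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < N; ++k)        // both rows stream sequentially
                    sum += A[i * N + k] * Bt[j * N + k];
                C[i * N + j] = sum;
            }
    }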

  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Server
  • Intermediate
  • Intel® C++ Compiler
  • Intel® Fortran Compiler
  • Intel® Parallel Composer
  • Intel® Parallel Studio
  • Intel® Parallel Studio XE
  • Intel® Cilk Plus Software Development Kit
  • Parallel Computing
  • Superscalar programming 101 (Matrix Multiply) Part 3 of 5

    By Jim Dempsey

    In the previous article (part 2) we saw that by reorganizing the loops and using a temporary array we gain some performance from the compiler's SSE small-vector optimizations, but a larger gain comes from better cache utilization due to the layout change and array access order. Those improvements push us into a memory bandwidth limitation, whereby the Serial method now outperforms the Parallel method derived from it.
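
    A minimal sketch of what such a loop reorganization can look like (not the article's exact code; names are illustrative): with the ikj order the inner loop updates a contiguous row of C from a contiguous row of B, which the compiler can auto-vectorize with SSE.

        // Sketch of the ikj loop order: the inner loop is unit-stride over C and B.
        #include <cstddef>

        void matmul_ikj(const double* A, const double* B, double* C, std::size_t N)
        {
            for (std::size_t i = 0; i < N; ++i) {
                for (std::size_t j = 0; j < N; ++j)
                    C[i * N + j] = 0.0;                   // clear the output row once
                for (std::size_t k = 0; k < N; ++k) {
                    const double a = A[i * N + k];        // scalar reused across the row
                    for (std::size_t j = 0; j < N; ++j)   // unit-stride, vectorizable
                        C[i * N + j] += a * B[k * N + j];
                }
            }
        }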

  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Server
  • Intermediate
  • Intel® C++ Compiler
  • Intel® Fortran Compiler
  • Intel® Parallel Composer
  • Intel® Parallel Studio
  • Intel® Parallel Studio XE
  • Intel® Cilk Plus Software Development Kit
  • Parallel Computing
  • Superscalar Programming 101 (Matrix Multiply) Part 1 of 5

    By Jim Dempsey

    The subject of this article is how to optimally tune a well-known algorithm. We will take this well-known (small) algorithm, look at a common approach to parallelizing it, then a better approach, and finally produce a fully cache-sensitized parallel implementation. The intention is to teach you a methodology for interpreting the statistics gathered during test runs and then using those interpretations to improve your parallel code.
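
    For context, the "common approach" to parallelizing matrix multiply is usually just distributing the rows of the result across workers. A minimal Cilk Plus sketch of that starting point follows (illustrative names; requires a Cilk Plus-capable compiler; it deliberately keeps the naive, poor-locality inner loop that the series goes on to improve).

        // Common row-parallel matrix multiply: one strand per row of C.
        #include <cilk/cilk.h>
        #include <cstddef>

        void matmul_parallel_rows(const double* A, const double* B, double* C,
                                  std::size_t N)
        {
            cilk_for (std::size_t i = 0; i < N; ++i) {
                for (std::size_t j = 0; j < N; ++j) {
                    double sum = 0.0;
                    for (std::size_t k = 0; k < N; ++k)
                        sum += A[i * N + k] * B[k * N + j];   // column walk: poor locality
                    C[i * N + j] = sum;
                }
            }
        }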

  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Server
  • Intermediate
  • Intel® C++ Compiler
  • Intel® Fortran Compiler
  • Intel® Parallel Composer
  • Intel® Parallel Studio
  • Intel® Parallel Studio XE
  • Intel® Cilk Plus Software Development Kit
  • Parallel Computing
  • Download Intel® Cilk++ SDK

    October 2013: This WhatIf project has been retired, but this page remains for historical/archival purposes.

    Warning! You are about to download an old, unsupported version of this software. For information about the current version of the compiler and matching runtime, please visit the Intel® Cilk™ Plus page. For tools supporting that compiler, please visit the SDK page.

  • Developers
  • Intel® Cilk Plus Software Development Kit
  • Parallel Computing
  • License Agreement

    Using Intel® C++ Composer XE 2011 for Linux to Thread Your Applications

    Tachyon is a ray-tracer application that renders objects described in data files. The Tachyon program is located in the product samples directory: <install-dir>/composerxe/Samples/<locale>/C++/tachyon_compiler.tar.gz. By default we use balls.dat as the input file; data files are stored in the directory tachyon/dat. Originally, Tachyon implemented its parallelism with explicit threads created by pthread_create(): one thread does the rendering, and the other performs the calculations.
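
    One kind of change an article like this walks through is replacing an explicit pthread_create()/pthread_join() pair with cilk_spawn/cilk_sync. The following is only a minimal sketch of that pattern; trace_scene() and draw_scene() are hypothetical stand-ins for Tachyon's calculation and rendering routines, and a Cilk Plus-capable compiler is required.

        // Sketch: express the two explicit threads as Cilk Plus strands.
        #include <cilk/cilk.h>

        void trace_scene();   // hypothetical stand-in: ray-tracing calculations
        void draw_scene();    // hypothetical stand-in: displays the rendered pixels

        void render_frame()
        {
            cilk_spawn trace_scene();   // calculations may run as a separate strand
            draw_scene();               // rendering continues on the current strand
            cilk_sync;                  // wait for the spawned work before returning
        }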

  • Linux*
  • C/C++
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Threading Building Blocks
  • Intel® Cilk Plus Software Development Kit
  • OpenMP*