Evaluation of Cilk++ as a platform for multi-core numeric computation

October 28, 2009 1:00 AM PDT


Originally posted by Ilya Mirman on www.cilk.com on Thu, Apr 16, 2009

(The following post is from the engineers at our partner Sitrus, an outsourced development firm that offers multicore-enablement services for C++ applications using Cilk++.)

To evaluate Cilk++, we decided to implement several algorithms from different areas of numeric computation and to compare their performance with C++ equivalents. Four functions were implemented:

  • The first two functions were taken from linear algebra. The first computes the determinant of a matrix with non-zero leading principal minors; the algorithm is based on LU decomposition of the input matrix. The second computes the conjugate transpose of a complex input matrix.
  • The third function is taken from signal processing. It computes the 2D convolution of two matrices by implementing the definition of 2D convolution directly.
  • The fourth function solves a system of ordinary differential equations: the well-known N-body problem. The system is integrated with the 4th-order Runge-Kutta method: given the equation $y' = f(t, y)$ and a step size $h$, the value at the next point is computed as $y_{n+1} = y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4)$, where $k_1 = f(t_n, y_n)$, $k_2 = f(t_n + \frac{h}{2}, y_n + \frac{h}{2} k_1)$, $k_3 = f(t_n + \frac{h}{2}, y_n + \frac{h}{2} k_2)$, $k_4 = f(t_n + h, y_n + h k_3)$.
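As a rough illustration of the first function, the determinant can be read off the diagonal of U after an LU-style elimination without pivoting, which is safe under the stated assumption of non-zero leading principal minors. This is a minimal sketch, not the code from the post: the name `det_lu` and the row-major `std::vector` layout are assumptions made here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): determinant of an
// n x n row-major matrix via Gaussian (LU-style) elimination without pivoting.
// Assumes all leading principal minors are non-zero, so no pivot is zero.
double det_lu(std::vector<double> a, std::size_t n) {
    assert(a.size() == n * n);
    double det = 1.0;
    for (std::size_t k = 0; k < n; ++k) {
        const double pivot = a[k * n + k];
        det *= pivot;  // det(A) is the product of the diagonal of U
        // Eliminate column k below the diagonal. The rows i do not depend
        // on each other, so this loop is a natural candidate for cilk_for.
        for (std::size_t i = k + 1; i < n; ++i) {
            const double factor = a[i * n + k] / pivot;
            for (std::size_t j = k; j < n; ++j) {
                a[i * n + j] -= factor * a[k * n + j];
            }
        }
    }
    return det;
}
```

The matrix is taken by value so the caller's data is left untouched while the copy is reduced in place.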
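For the fourth function, a single step of the classical 4th-order Runge-Kutta method for y' = f(t, y) can be sketched as follows, using the usual stage names k1 to k4. The `State` alias and the helpers `axpy` and `rk4_step` are hypothetical names introduced here; the authors' N-body code may be organized quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using State = std::vector<double>;  // assumed representation of y

// Returns y + s * k, element by element.
static State axpy(const State& y, double s, const State& k) {
    assert(y.size() == k.size());
    State out(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) out[i] = y[i] + s * k[i];
    return out;
}

// One classical RK4 step for y' = f(t, y) with step size h.
// F is any callable taking (double t, const State& y) and returning a State.
template <class F>
State rk4_step(F f, double t, const State& y, double h) {
    const State k1 = f(t, y);
    const State k2 = f(t + h / 2, axpy(y, h / 2, k1));
    const State k3 = f(t + h / 2, axpy(y, h / 2, k2));
    const State k4 = f(t + h, axpy(y, h, k3));
    State next(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        // Weighted average of the four slopes: (k1 + 2*k2 + 2*k3 + k4) / 6.
        next[i] = y[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]);
    }
    return next;
}
```

For an N-body system, `State` would hold the positions and velocities of all bodies and `f` would return their time derivatives. The force evaluation inside `f` is typically the expensive part and the most likely place to parallelize.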

The algorithms were first implemented in C++ and then converted into parallel form using Cilk++'s cilk_for, cilk_spawn and cilk_sync keywords. The conversion was straightforward (the source code is available here). The Cilkscreen race detector proved very useful for debugging the parallel versions of the programs.
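To give a feel for this kind of conversion, here is a minimal sketch of a 2D convolution written from the definition, with the outer loop marked as a cilk_for. It is not the code from the post. The `Matrix` struct, the `conv2d` name and the `USE_CILK` macro are assumptions made for illustration, and the macro fallback only exists so the sketch also builds serially with a standard C++ compiler. Cilk++-specific build details are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial stand-in so the sketch builds with a plain C++ compiler. When
// compiling with Cilk++, define the (hypothetical) USE_CILK macro so the
// real cilk_for keyword is used instead.
#ifndef USE_CILK
#define cilk_for for
#endif

struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;  // row-major storage
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Full 2D convolution straight from the definition:
// c(i, j) = sum over (p, q) of a(p, q) * b(i - p, j - q).
Matrix conv2d(const Matrix& a, const Matrix& b) {
    assert(a.rows > 0 && a.cols > 0 && b.rows > 0 && b.cols > 0);
    Matrix c(a.rows + b.rows - 1, a.cols + b.cols - 1);
    // Each output row is written by exactly one iteration, so the iterations
    // are independent and the outer loop can run in parallel without races.
    cilk_for (std::size_t i = 0; i < c.rows; ++i) {
        for (std::size_t j = 0; j < c.cols; ++j) {
            double sum = 0.0;
            for (std::size_t p = 0; p < a.rows; ++p) {
                for (std::size_t q = 0; q < a.cols; ++q) {
                    // Skip terms whose index into b falls outside b.
                    if (i < p || j < q) continue;
                    if (i - p >= b.rows || j - q >= b.cols) continue;
                    sum += a.at(p, q) * b.at(i - p, j - q);
                }
            }
            c.at(i, j) = sum;
        }
    }
    return c;
}
```

Accumulating into a local `sum` and writing each output element once keeps the loop body free of shared writes, which is the sort of property a race detector checks for.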

The tests were run on a dual-socket, quad-core AMD Opteron 2347 system with 20GB of RAM, running Linux 2.6.27.5-117.fc10.x86_64.

The test programs were built with the gcc 4.2.4 compiler (Cilk Arts build 7007).

Each test was run 10 times, and the execution time was measured using Cilk++'s example_get_time() routine. Presented here are the average execution times of the serial C++ programs and of the multicore-enabled programs on 1, 2, 4, and 8 cores.
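The measurement procedure amounts to timing repeated runs and averaging. Below is a minimal sketch of such a harness. It uses `std::chrono` as a stand-in for example_get_time(), whose exact interface is not shown in this post, and the name `average_ms` is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `runs` times and returns the mean wall-clock time per run in
// milliseconds. std::chrono stands in for example_get_time() here.
double average_ms(const std::function<void()>& work, int runs = 10) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / runs;
}
```

A monotonic clock such as `steady_clock` is used so that system clock adjustments during a run cannot distort the measurement.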

Comparing these results with the estimates produced by the Cilkscreen Parallel Performance Analyzer, we see that the measured performance values fall nicely in the middle of the estimated speedup range.
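For readers who want to relate their own timings to such estimates, the sketch below computes a measured speedup and the general upper bound given by the work and span laws of parallel computation. It does not reproduce the analyzer's own estimation formulas, which are not described in this post, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Measured speedup of a parallel run relative to the serial C++ baseline.
double speedup(double serial_time, double parallel_time) {
    assert(serial_time > 0.0 && parallel_time > 0.0);
    return serial_time / parallel_time;
}

// Upper bound on speedup with `cores` workers, from the work law
// (speedup <= cores) and the span law (speedup <= work / span).
double speedup_upper_bound(double work, double span, int cores) {
    assert(work > 0.0 && span > 0.0 && cores > 0);
    return std::min(static_cast<double>(cores), work / span);
}
```

Here `work` is the total execution time on one core and `span` is the length of the longest chain of dependent operations, both in the same time unit.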

Key take-aways

  • It was straightforward to multicore-enable several compute-intensive operations
  • The Cilkscreen race detector was helpful in finding and resolving race bugs
  • The Performance Analyzer's predictions were well in line with the measured results