Applying Threading Building Blocks on Debian Etch

By Kevin Farnham (107 posts) on January 9, 2008 at 10:16 am

My last post ended with me having used Debian's apt package manager to install the libtbb2 and tbb-examples packages on my Debian 4.0 (Etch) machine. When I attempted to make the examples, error messages were displayed, including missing *.h include file messages.

Had it not been so late at night, I would have continued my investigation, since I was pretty certain the problems could easily be resolved. This indeed was the case, as Alexey Kukanov pointed out in the comment he posted yesterday.

Installing the libtbb-dev package

As Alexey pointed out, to get the header/include files, I need to install the libtbb-dev package. As you can see, the libtbb-dev file list includes all the *.h files.

So, I installed the libtbb-dev package using

apt-get install libtbb-dev

then returned to the instructions listed in the README.Debian file that comes with the tbb-examples package, changing into the tbb-examples/examples directory and entering:

make

This time the examples were built and executed.

TBB benchmarks on an Intel Pentium Hyper-threaded system

My HP nx9600 laptop (my Debian Etch system) is not a multicore machine. It's got an Intel Pentium 4 processor with Hyper-Threading. Conversations with people on the #tbb IRC channel made me curious as to what type of performance I'd see when running multithreaded applications on a processor with hyperthreading.

Sure enough, the performance when TBB examples were run using two threads was often (but not always) better than the performance when the examples were run using a single thread. For example, the Convex Hull TBB example problem, which I extensively investigated in November 2007, returned these timings for 1 - 8 threads (unbuffered push_back version):

Threads Initialization
Time (sec)
Calculation
Time (sec)
1 0.84 3.02
2 0.62 2.41
3 0.58 2.60
4 0.60 2.45
5 0.55 2.48
6 0.62 2.59
7 0.57 2.45
8 0.56 2.44

Similarly, the tacheon examples showed improved performance with TBB applied:

Program CPU Time
(sec)
tacheon.serial 20.52
tacheon.tbb1d 14.74
tacheon.tbb 15.56

However, in some other cases the performance for more than one thread was significantly slower. For example, here are the results for 1 - 4 threads for the fibonacci example:

./fibonacci 500
Fibonacci numbers example. Generating 500 numbers..
Serial loop     - in 0.340486 msec
Serial matrix   - in 4.977169 msec
Serial vector   - in 12.881134 msec
Serial queue    - in 61.615816 msec
Threads number is 1
Shared serial (mutex)           - in 12.008520 msec
Shared serial (spin_mutex)      - in 5.413665 msec
Shared serial (queuing_mutex)   - in 11.099723 msec
Shared serial (Conc.HashTable)  - in 218.821127 msec
Parallel while+for/queue        - in 81.017869 msec
Parallel pipe/queue             - in 122.565921 msec
Parallel reduce                 - in 26.709551 msec
Parallel scan                   - in 10.159765 msec
Parallel tasks                  - in 27.199792 msec
Threads number is 2
Shared serial (mutex)           - in 42.876632 msec
Shared serial (spin_mutex)      - in 16.839141 msec
Shared serial (queuing_mutex)   - in 20.759846 msec
Shared serial (Conc.HashTable)  - in 256.510596 msec
Parallel while+for/queue        - in 78.621992 msec
Parallel pipe/queue             - in 139.240823 msec
Parallel reduce                 - in 23.885403 msec
Parallel scan                   - in 16.329680 msec
Parallel tasks                  - in 25.740075 msec
Threads number is 3
Shared serial (mutex)           - in 50.188160 msec
Shared serial (spin_mutex)      - in 14.955134 msec
Shared serial (queuing_mutex)   - in 21.668117 msec
Shared serial (Conc.HashTable)  - in 229.680181 msec
Parallel while+for/queue        - in 686.667691 msec
Parallel pipe/queue             - in 129.353032 msec
Parallel reduce                 - in 24.940150 msec
Parallel scan                   - in 18.163549 msec
Parallel tasks                  - in 26.354788 msec
Threads number is 4
Shared serial (mutex)           - in 44.824017 msec
Shared serial (spin_mutex)      - in 14.007360 msec
Shared serial (queuing_mutex)   - in 22.169669 msec
Shared serial (Conc.HashTable)  - in 273.406177 msec
Parallel while+for/queue        - in 78.259395 msec
Parallel pipe/queue             - in 134.116594 msec
Parallel reduce                 - in 25.003553 msec
Parallel scan                   - in 18.081328 msec
Parallel tasks                  - in 26.773415 msec
Fibonacci number #500 modulo 2^64 is 2171430676560690477

Coming up: TBB packages on Ubuntu

Now that TBB packages are available for Debian unstable (sid), the packages should become available in Ubuntu before long. People on the #tbb IRC channel jumped on this right away. Ubuntu Hardy Heron is currently scheduled for release in April 2008. However, the Debian import freeze for Ubuntu Hardy Heron occurred on December 13, 2007. Undeterred, our TBB friend Sadiq Jaffer submitted a tbb debian sync request to the Ubuntu team, requesting inclusion of the Debian TBB packages in Ubuntu Hardy Heron. So, we'll see what happens. Thanks, Sadiq!

More credit where it is due: Athena Capital Research

Before I close this series on TBB and Debian, I'd like to recognize the support that has been provided to the Debian TBB team by Athena Capital Research. Athena is funding, among other things, Roberto Sanchez's work on packaging TBB for Debian. A quick perusal of the Athena site shows us why Athena would be interested in a project like Threading Building Blocks: they focus on quantitative investing, "specializing in fully automated trading strategies".

This is an area that has been a core interest of mine in the past. In the 1980s and 1990s I spent thousands of hours inventing mathematical models for investing (see my final report on these efforts, if you're interested). My models were based on a quite simple input data set: end-of-day securities prices or index values.

The advent of cheap processing power means much more complex models can be applied in near real time to identify temporary anomalies in market pricing. Exploiting these anomalies can be one of the safer ways to make short-term profits in the market. But, to do so in a fast-changing market, you need a lot of processing power, and very fast computation (faster than your competitors, who will also be crunching the same input data using their own algorithms). The company whose algorithms are multithreaded will have a better chance to spot temporary pricing anomalies sooner, enabling them to execute the necessary trades before a competitor has a chance to react.

It's great to see companies like Athena Capital Research and Intel contributing to the growth of open source projects like Threading Building Blocks, Eclipse, and the Scythe Statistical Library.

Kevin Farnham, O'Reilly Media, TBB Open Source Community, Freenode IRC #tbb, TBB Mailing Lists

Download TBB

Categories: Open Source, Parallel Programming, Threading Building Blocks

Comments (5)

January 10, 2008 1:11 PM PST

Alexey Kukanov (Intel)
Alexey Kukanov (Intel)Total Points:
14,341
Status Points:
14,341
Black Belt
Please do not use the Fibonacci example fro any performance experiments or comparison. It was not designed to show any speedup; its goal is to be a TBB live test touching the majority of TBB features. For the open sourced TBB, that has little sense due to availability of the developers' test suite; thus it is there primarily for commercial customers.
January 11, 2008 12:19 PM PST

Kevin Farnham
Kevin Farnham
Alexey: thanks for letting us know about this.
January 14, 2008 9:25 AM PST


Igor Borski
Hi Kevin,
hope you've posted a bug on tbb-examples so it should depend on libtbb-dev (or at least reccommend it)
--
Regards, Igor
January 20, 2008 1:22 PM PST

Kevin Farnham
Kevin Farnham
Igor: this change has been made by the Debian TBB packaging team -- the next tbb-examples package release will recommend the libtbb-dev package to users.
January 24, 2008 2:08 AM PST


Mabeth Borres
Guys,
Don't use parallel_for (etc) for simple loops. If you really want to see some speedups, put in some complexity in the loop. Fibonacci and integer summations are simple loop implementations. Add some math calls into your summation loops (cos(), sin(), etc..). Then you will find that there is a speedup (serial vs. parallel). Also initialise more than one thread (optimum at 8 threads).

Trackbacks (1)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*