Worse performance with fibonacci test on Core 2 Quad Q9550 than on Core 2 Duo E6850

Hi,

I'm new to TBB, so I compiled the fibonacci application included in the TBB package ( examples/test_all ).
I compiled it with Visual Studio 2008 in Release mode, using TBB 2.2.

I then ran the application on a Core 2 Duo E6850 (with Windows 7) and on a Core 2 Quad Q9550 (with Windows XP).

The parameter passed to fibonacci was 50000.

For the serial tests, the results are:

Test          | Core 2 Duo E6850 | Core 2 Quad Q9550
Serial loop   | 5196 ms          | 5515 ms
Serial matrix | 92005 ms         | 99337 ms
Serial vector | 277422 ms        | 279433 ms
Serial queue  | 1032599 ms       | 1039019 ms

As you can see, the Core 2 Duo is faster than the Core 2 Quad. That seems logical, because the Core 2 Duo runs at 3 GHz and the Core 2 Quad at 2.83 GHz.

For the tests with 1 thread, the results are:

Test                 | Core 2 Duo E6850 | Core 2 Quad Q9550
mutex                | 88570 ms         | 82464 ms
spin_mutex           | 64184 ms         | 62796 ms
queuing_mutex        | 61986 ms         | 68640 ms
Conc.Hashtable       | 2383970 ms       | 2161845 ms
Parallel while + for | 871339 ms        | 828461 ms
Parallel pipe/queue  | 1638076 ms       | 1234360 ms
Parallel reduce      | 115842 ms        | 120197 ms
Parallel scan        | 118224 ms        | 121967 ms
Parallel tasks       | 222433 ms        | 228112 ms

Here, the Q9550 is generally faster than the E6850, but the E6850 is faster for Parallel reduce/scan/tasks. Why?

For the tests with 2 threads, the results are:

Test                 | Core 2 Duo E6850 | Core 2 Quad Q9550
mutex                | 180791 ms        | 17630195 ms
spin_mutex           | 78605 ms         | 102566 ms
queuing_mutex        | 164113 ms        | 216254 ms
Conc.Hashtable       | 1422683 ms       | 2066588 ms
Parallel while + for | 495898 ms        | 556969 ms
Parallel pipe/queue  | 924551 ms        | 1409204 ms
Parallel reduce      | 65100 ms         | 61047 ms
Parallel scan        | 116001 ms        | 120051 ms
Parallel tasks       | 113971 ms        | 113051 ms

Here, we are surprised by the poor performance of the Q9550, especially in the mutex test!
The Q9550's best result is in the Parallel reduce test. Why?

For the tests with 4 threads, the results are:

Test                 | Core 2 Duo E6850 | Core 2 Quad Q9550
mutex                | 171978 ms        | 13282170 ms
spin_mutex           | 73184 ms         | 134984 ms
queuing_mutex        | 578664 ms        | 733882 ms
Conc.Hashtable       | 1419792 ms       | 2134029 ms
Parallel while + for | 552766 ms        | 432449 ms
Parallel pipe/queue  | 1179087 ms       | 1837455 ms
Parallel reduce      | 64621 ms         | 32136 ms
Parallel scan        | 116889 ms        | 63675 ms
Parallel tasks       | 114261 ms        | 57079 ms

Here, we are again surprised by the poor performance of the Q9550, especially in the mutex test!
However, the Q9550 is faster in the Parallel reduce/scan/tasks tests, where the execution time is almost halved compared with the E6850. So here the result seems logical...
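
One way to read these tables is as speedup relative to the 1-thread run of the same test on the same machine. The following is only a small sketch of that arithmetic; the helper names `speedup`, `efficiency` and `print_report` are made up for illustration, and the timings are the ones posted in this thread.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of an N-thread run relative to the 1-thread run of the same test.
double speedup(double one_thread_ms, double n_thread_ms) {
    assert(n_thread_ms > 0.0);
    return one_thread_ms / n_thread_ms;
}

// Parallel efficiency: speedup divided by the number of threads used.
double efficiency(double one_thread_ms, double n_thread_ms, int threads) {
    assert(threads > 0);
    return speedup(one_thread_ms, n_thread_ms) / threads;
}

// Prints speedups for a few rows, using the 1-thread and 4-thread timings
// posted in this thread.
void print_report() {
    std::printf("Q9550 parallel reduce: %.2fx\n", speedup(120197, 32136));
    std::printf("E6850 parallel reduce: %.2fx\n", speedup(115842, 64621));
    std::printf("Q9550 mutex:           %.4fx\n", speedup(82464, 13282170));
}
```

With these numbers, Parallel reduce reaches roughly 3.7x on the four-core Q9550 and roughly 1.8x on the two-core E6850, while the mutex test on the Q9550 comes out far below 1x, i.e. a large slowdown rather than a speedup.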

So, does anyone have any idea what causes the poor performance observed?
Could you help me?

Thanks a lot

gawik

Hi Gawik,

The fibonacci example is not the best example of scalability :) Here is a quotation from the description in its index.html:

"The purpose of the example is to exercise every include file and class in
Threading Building Blocks. Most of the computations are deliberately silly and
not expected to show any speedup on multiprocessors."

--Vladimir

> Here, In general the Q9550 processor is better than E6850. But we can see that the E6850 is better for Parallel reduce/scan/tasks. Why ?

Because the E6850 has a higher clock frequency; and since the test is executed with only 2 threads, the 2 additional hardware threads of the Q9550 do not matter.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

> Here, we are surprised by the bad performance on the Q9550 processor and especially for the mutex test !!

Nope, we are not.
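
To illustrate why a heavily contended lock tends to get slower as threads are added, while a reduction scales, here is a minimal sketch. It is not the code of the fibonacci example and does not use TBB; it assumes a C++11 compiler and uses `std::thread` and `std::mutex` from the standard library, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Every increment takes the same lock, so the threads serialize on it and
// the lock's cache line bounces between cores.
std::uint64_t sum_with_shared_mutex(int num_threads, std::uint64_t iters_per_thread) {
    assert(num_threads > 0);
    std::mutex m;
    std::uint64_t total = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++total;
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}

// Each thread accumulates privately; the partial results are combined once
// at the end (the pattern behind a parallel reduction).
std::uint64_t sum_with_private_partials(int num_threads, std::uint64_t iters_per_thread) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, iters_per_thread] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) ++local;
            partial[t] = local;  // single write per thread, no lock needed
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

Both functions return the same sum. In the first one, adding threads adds contention on a single lock instead of adding useful parallel work, which is the kind of behavior the mutex rows show.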

The fibonacci test is just a quick way to check that the package works. It was never meant to be a serious performance test.

For performance-oriented tests, look at the other tests, particularly:

  • examples/parallel_for/
  • examples/parallel_reduce/
  • examples/task_group/

The ones I didn't mention in examples/ are also written for performance, but tend to lose steam because of memory bandwidth or I/O issues.

Another code that uses TBB and has been written for performance is my Seismic Duck, though I didn't try to make it scale past four cores, because that was enough to meet the frame rate requirements. I've started a series of blog posts on its implementation and the programming patterns used to get high performance. To use it as a benchmark:

  • Press F to toggle display of the frame rate.
  • Disable the default frame rate limit of 60 frames/sec with these steps:
    1. Select View->Speed
    2. Move the "Frame Rate Limit" slider to infinity.

The frame rate is sensitive to monitor resolution, so rates are comparable only when measured at the same display resolution.

In my experience, sub_string_finder_extended.cpp scales almost linearly with the number of cores. Make sure you have the same optimization settings on all platforms. Somehow the included Visual Studio solution gets messed up during conversion, and optimization isn't enabled by default.

Thank you very much for your answers. I will continue my investigations.