Problems with parallel_sort

Good day, colleagues!

I'm a newbie in concurrent programming and I've run into a problem with parallel_sort. I'm writing a small program that has to sort large binary files using a limited amount of memory.

In the first step, I read the file to be sorted, split it into chunks (for example, 10 MB each), and sort each chunk. The problem is that when I apply parallel_sort to a chunk, it runs more than 3 times slower than std::sort. Could you advise me on what I'm doing wrong? Thank you in advance.

The code is attached.
My machine has a Core i7 860; the compiler is Visual Studio 2010.

Attachment: main_tbb.cpp (7.23 KB)


Maybe my code was too tangled. I've written a new, simpler test that just creates a vector and a concurrent_vector (1M integers each) and sorts them with std::sort and tbb::parallel_sort respectively. The running times are 1500 and 8000 clock() ticks respectively, so std::sort is about 5 times faster.

What is the problem in my code?

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

#include "tbb/parallel_sort.h"
#include "tbb/concurrent_vector.h"

using std::vector;
using tbb::concurrent_vector;
using tbb::parallel_sort;

const int SIZE = 1000000;

// Fill the vector with `size` pseudo-random integers.
void Generate_Vector (int size, vector<int> * target) {
    target->resize(size);  // at() throws on an empty vector, so size it first
    for (int index = 0; index < size; ++index) {
        target->at(index) = rand();
    }
}

int main () {
    srand (300);
    vector<int> serial;
    Generate_Vector(SIZE, &serial);
    // Copy the same data so both sorts work on identical input.
    concurrent_vector<int> parallel (serial.begin(), serial.end());

    clock_t start, finish;

    start = clock();
    std::sort(serial.begin(), serial.end());
    finish = clock();

    std::cout << "std::sort time is " << finish - start << std::endl;

    start = clock();
    tbb::parallel_sort (parallel.begin(), parallel.end());
    finish = clock();

    std::cout << "parallel sort time is " << finish - start << std::endl;

    return 0;
}

Why are you using tbb::concurrent_vector?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:

At first I tried std::vector, but it was even slower, and the CPU load was only 20-40%, whereas with concurrent_vector it was 100%.

Important update: all the results above came from the Debug configuration. When I switched to Release and used std::vector, everything was fine: the times were 78 for std::sort and 26 for tbb::parallel_sort.
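Since the Debug configuration skewed the numbers this badly, a small timing helper that also reports the build type can save some confusion. The sketch below is an assumption-laden illustration: `time_sort_ms` and `is_debug_build` are made-up names, std::chrono requires a C++11 standard library (newer than the one shipped with Visual Studio 2010), and the `_DEBUG` check is MSVC-specific. The sorter is again a callable, so tbb::parallel_sort can be passed in the same way as std::sort.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper: runs `sorter` on a private copy of `data` and returns
// the elapsed wall-clock time in milliseconds. `sorted_ok` reports whether
// the result really is sorted, so a broken sorter is not timed by accident.
template <typename Sorter>
double time_sort_ms(std::vector<int> data, Sorter sorter, bool* sorted_ok) {
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    sorter(data.begin(), data.end());
    const std::chrono::steady_clock::time_point finish = std::chrono::steady_clock::now();
    *sorted_ok = std::is_sorted(data.begin(), data.end());
    return std::chrono::duration<double, std::milli>(finish - start).count();
}

// MSVC defines _DEBUG when compiling against the debug runtime (/MDd, /MTd).
// Other compilers do not set it, so there this always returns false.
bool is_debug_build() {
#ifdef _DEBUG
    return true;
#else
    return false;
#endif
}
```

Printing a warning when is_debug_build() returns true, before reporting any timings, makes it harder to compare Debug numbers by mistake. Because the data is passed by value, every sorter starts from the same unsorted input.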

Debug versions of the STL have a LOT of additional non-scalable checks. For example, an STL container can keep a mutex-protected sub-container of all the iterators pointing into it; since it is mutex-protected, it does not scale.
If you are using MSVC, try defining:
#define _SECURE_SCL 0
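A minimal sketch of where that define would go, assuming it is set in the source rather than in the project's preprocessor settings. The macro only has an effect if it precedes the first standard library header, and it should be the same in every translation unit that shares containers. `sorted_copy` is just a placeholder function for illustration. How this setting interacts with the other Debug-only iterator checks in a given MSVC version is not something this sketch verifies, so check the compiler documentation.

```cpp
// Must come before ANY standard library header in this translation unit,
// otherwise the headers have already been configured with the default value.
#define _SECURE_SCL 0

#include <algorithm>
#include <vector>

// Placeholder function: returns a sorted copy of the input.
std::vector<int> sorted_copy(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    return values;
}
```

Setting the macro once in the project-wide preprocessor definitions avoids the risk of one source file seeing a different value than another.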
