Help with Bottom-up Mergesort

I am trying to implement bottom-up merge sort using TBB, but the timing is not in the ballpark of the top-down approach. Kindly help me with this; I have attached the file.

Attachment: iter_merge.cpp (3.94 KB)
Best Reply

Just a few quick remarks… Please don't use C-style struct declarations (especially not in C++11 code!), and avoid void* with C-style casts: C++ programmers will want to look away from your code rather than analyse it (or maybe it's just me…). Bottom-up merging also seems bad for the cache: you really want the divide-and-conquer approach from the top-down implementation, with depth-first exploration of the merge tree, because bottom-up won't be able to match it for performance. You should probably start with a base case of sequences of length SERIAL_CUT_OFF or thereabouts, rather than only using it to decide what to do with even shorter sequences. Got to run...
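For reference, the top-down, depth-first shape described above can be sketched without TBB. This is only an illustration: it uses std::async where a TBB version would use tbb::parallel_invoke, and the element type, names, and SERIAL_CUT_OFF value are placeholders, not the poster's actual code.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Hypothetical cutoff; in practice, tune so one chunk fits in cache.
constexpr std::ptrdiff_t SERIAL_CUT_OFF = 1024;

// Top-down merge sort with a serial base case. The recursion explores
// the merge tree depth-first, so each cache-sized subtree is finished
// before moving on.
void topdown_sort(int* first, int* last) {
    if (last - first <= SERIAL_CUT_OFF) {
        std::stable_sort(first, last);          // base case: sort the whole chunk
        return;
    }
    int* mid = first + (last - first) / 2;
    // Run the left half concurrently; a TBB version would use
    // tbb::parallel_invoke here instead of std::async.
    auto left = std::async(std::launch::async, topdown_sort, first, mid);
    topdown_sort(mid, last);
    left.get();
    std::inplace_merge(first, mid, last);       // merge the two sorted halves
}
```

The point is structural: the cutoff only terminates the recursion, so the algorithm adapts itself to any input size above it.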

Thanks Raf Schietekat. I have modified the merge sort parallel loops as follows:

for (i = SORT_CUT_OFF; i <= (length / 2) + 1; i = i * 2) {
    tbb::parallel_for(i, length, i * 2, [=](int j) {
        // range [j - i, min(j + i, length)) is short enough to sort directly
        if (std::min(j + i, length) - (j - i) <= SORT_CUT_OFF) {
            std::stable_sort(s->xs + (j - i), s->xs + std::min(j + i, length));
        } else {
            parallel_iterative_merge(s->xs + (j - i), s->xs + j,
                                     s->xs + j, s->xs + std::min(j + i, length),
                                     s->zs);
        }
    });
}

This performs better than the previous version, but it still does not match the recursive version; I am expecting a similar speed-up.

 

Without numbers, it's difficult to judge. Obviously with this change you have fewer rounds of cache thrashing, but you'll probably see performance improve further with a larger SORT_CUT_OFF, possibly up to a point (I don't know off the top of my head how stable_sort behaves). How about some tuning experiments, to see how close you can get to your goal?

Jumping somewhat ahead, and possibly missing other issues, I would expect this to illustrate the essential problem with bottom-up vs. top-down: top-down is cache-oblivious in the sense that it works with any cache size and essentially tunes itself, while bottom-up requires a judicious choice of initial sorting length that comfortably fits inside the cache. Choose it too low and you need more merge rounds; choose it too high and you are at the mercy of how stable_sort is implemented.
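To make the cache argument concrete, here is a minimal serial bottom-up merge sort (a generic textbook sketch, not the attached code). Every iteration of the outer loop streams the entire array, so once 2*width elements no longer fit in cache, each pass re-fetches everything from memory; the top-down version instead finishes whole cache-sized subtrees before touching the rest.

```cpp
#include <algorithm>
#include <vector>

// Serial bottom-up merge sort, written to make the pass structure explicit.
void bottomup_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    // Each pass doubles the sorted run width and touches all n elements.
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            std::size_t mid = std::min(lo + width, n);
            std::size_t hi  = std::min(lo + 2 * width, n);
            std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
        }
    }
}
```

With log2(n / initial_run_length) full passes over the data, the initial run length directly controls how many of those passes miss the cache.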

I see that MERGE_CUT_OFF is twice SORT_CUT_OFF, but merging is far less complex than sorting, so if SORT_CUT_OFF were primarily a way to tune for parallel overhead rather than for cache size, MERGE_CUT_OFF should be more than twice SORT_CUT_OFF. As it is, of course, it's more difficult to say.

Thanks Raf, 

For 8388608 elements, with SORT_CUT_OFF = 2097152 and MERGE_CUT_OFF = 4194304: the recursive TBB implementation takes 0.915 seconds, while the iterative version takes 1.317 seconds on a single thread. I am still not confident about how to tune the initial sort length to the cache size. For a given N, a higher SORT_CUT_OFF means fewer rounds of sorting and gives me an improvement, but I am not sure how to find the right value for the cache. And when I increase N, performance drops sharply.

Cache size is a given, it's not something you can tune.

(Added 2014-01-09) Haswell, for example, has an L3 cache between 2 and 8 MB, so 2*SORT_CUT_OFF*sizeof(float) exceeds even the highest value, and you're at the mercy of stable_sort's implementation even for L3. Try making a graph of execution time vs. SORT_CUT_OFF in a region around the likely or known cache size of your system (which of course means it depends on the element size). I'm not quite sure what the effect is, with merge sort, of giving each hardware thread an indivisible portion of the input, but that's generally not a good choice with TBB: you should mostly strike a balance between parallel overhead (bigger parts are better) and parallel slack (parts shouldn't get too big), and not subvert TBB's inherent adaptability, so that may skew the optimum SORT_CUT_OFF somewhat. Let us know what you find!
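The suggested tuning experiment might be sketched like this (TBB-free, so it measures only the serial initial pass; the function name, cutoff range, and element type are made up for illustration):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Time one bottom-up "initial sort" pass: stable_sort each chunk of
// `cut` elements. The array is taken by value so the caller's data is
// untouched between runs. Returns elapsed milliseconds.
double time_initial_pass(std::vector<int> a, std::size_t cut) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t lo = 0; lo < a.size(); lo += cut)
        std::stable_sort(a.begin() + lo,
                         a.begin() + std::min(lo + cut, a.size()));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Sweep cutoffs around the suspected cache size (measured in elements,
// i.e. cache bytes / sizeof(int)) and print a table to graph.
void sweep(const std::vector<int>& data) {
    for (std::size_t cut = 1u << 12; cut <= 1u << 20; cut <<= 1)
        std::printf("SORT_CUT_OFF %8zu: %8.2f ms\n",
                    cut, time_initial_pass(data, cut));
}
```

A knee in the resulting curve should indicate where chunks stop fitting in a cache level.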
