I ran the quicksort code which comes with the TBB book. (It can be downloaded from http://softwarecommunity.intel.com/isn/downloads/examples.zip) In its current implementation, it uses parallel_for and a custom range with the splitting constructor doing the partioning. The function body itself is just a serial sort. Now, the partitioning often splits the range into unequal partitions. However, because of the nature of parallel_for and the splitting constructor, the thread that partitions the range consumes the left partition next and the right partition is pushed to the task queue to be stolen. Can a better decision be made by using explicit tasks? At any stage the following scenarios are possible:
1) Both the partitions fit in the cache
2) Only the smaller partition fits in the cache
3) Neither of them fits fully in the cache
If both of them fit in the cache, it would be better to execute the bigger one next and push the smaller one on the queue right? I'm not sure what a good strategy would be for the other two cases. Comments?