I'm experimenting with TBB to replace a home-brew parallelization framework used for algorithms whose input can be subdivided and processed in independent blocks, an obvious use case for parallel_for.
In the existing implementation, the input is subdivided into blocks for two reasons: parallelism and memory utilization. In most cases, we need temporary storage equal in size to the input in order to run the algorithm in question; dividing the input into blocks not only opens the door to parallelism, it also puts an upper bound on the amount of temporary storage needed.
I found it relatively straightforward to convert the existing code to TBB's parallel_for using a blocked_range and setting the grain size to our former block size. The thing that still has me fumbling is how to handle the temporary storage. Here's a breakdown of the alternatives and my analysis of each:
- Put the temporary storage on the stack in the body given to parallel_for. The amount of storage allocated will be bounded by actual parallelism (good!) but the amount allocated may be "too much" to prudently put on the stack (possibly bad!)
- Allocate heap storage from within the body given to parallel_for. The amount of temporary storage allocated will be bounded by actual parallelism (good!) but we'll pay the cost of repeatedly allocating and deallocating for each task (very bad!)
- Allocate heap storage from the copy-constructor for the body. This is no good because we'll immediately pay in both time and space for temporary storage proportional to problem size divided by grain size (the number of tasks created by parallel_for) rather than to actual parallelism.
The existing implementation uses threads directly, so it can create one temporary storage buffer per thread, once, at the start of the calculation. This has the desirable properties of bounding temporary storage requirements by actual parallelism (the number of threads) and amortizing the cost of allocation across the entire calculation.
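To make that concrete, here is a stripped-down sketch of the kind of scheme I mean. It is not our real code: process_block, run_blocked, the float element type and the round-robin assignment of blocks to threads are all placeholders I made up for illustration, and it uses std::thread just to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the real per-block algorithm, which needs scratch space
// as large as the block it works on. Here it just doubles each element,
// going through the scratch buffer on the way.
void process_block(const float* in, float* out, std::size_t n, float* scratch) {
    for (std::size_t i = 0; i < n; ++i) scratch[i] = in[i] * 2.0f;
    std::copy(scratch, scratch + n, out);
}

// One scratch buffer per thread, allocated once before any work starts.
// Blocks are handed out round-robin (a static assignment, purely for brevity).
void run_blocked(const std::vector<float>& in, std::vector<float>& out,
                 std::size_t block_size, unsigned num_threads) {
    assert(block_size > 0 && num_threads > 0 && in.size() == out.size());
    const std::size_t num_blocks = (in.size() + block_size - 1) / block_size;

    // Total temporary storage: num_threads * block_size elements,
    // independent of the input size.
    std::vector<std::vector<float>> scratch(num_threads,
                                            std::vector<float>(block_size));

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t b = t; b < num_blocks; b += num_threads) {
                const std::size_t begin = b * block_size;
                const std::size_t n = std::min(block_size, in.size() - begin);
                // Each thread reuses its own buffer for every block it handles.
                process_block(in.data() + begin, out.data() + begin, n,
                              scratch[t].data());
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

The point is only that the buffers are sized by block_size and counted by num_threads, and that they are allocated exactly once; that is the property I'd like to keep when moving to parallel_for.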
It seems like putting the temporary storage on the stack is the most natural option, and I suppose I could reduce the grain size to keep a lid on the amount used. I'd be concerned, however, that the resulting grain size might be too small and reduce efficiency on machines with few or no extra processors.
Does anyone have any comments or suggestions on the "best" way to solve this sort of problem?



