I've been trying to use parallel_scan to create the running minimum over a vector of floating-point numbers. running_min[i] records the smallest number seen in locations 0..i of the vector.
I'm doing the obvious:
- The vector is divided into subranges.
- The pre_scan version of the () operator computes the minimum value found in each subrange.
- The reverse_join method communicates the minimum found in subrange k over to subrange k+1, starting from k=0 (i hope). This tells subrange k+1 what the minimum value is in the entire vector preceding it.
- The final_scan version of the () operator computes the running minimum of each subrange using the minimum passed in by the reverse_join as the starting minimum.
This is not working. I run a serial version of the algorithm and then the parallel version, and their answers are different. For a ten-thousand element vector, the blocked_range partitions are [0,625), [625,1250), [1250,2500), [2500, 5000), and [5000, 10000), and the first difference always occurs at location running_min, right where the first two subranges meet. It appears the reverse_join is not passing the minimum from the previous subrange on to the next subrange.
I instrumented the code, but it was difficult to pick out the diagnostic messages with two physical threads running in parallel. However, I did notice that the final_scan was actually reporting completion before the pre_scan was, and reverse_join was never executing. In order to get a clearer picture, I created a task_scheduler_init(1) to limit it to a single thread. Then the diagnostics showed that only the final_scan was being called, and the pre_scan and reverse_join were never executed.
I'm using an AMD Athlon 64 X2 and Windows XP.
So I have a lot of problems. Here are some questions:
- Has anyone used parallel_scan and had it work correctly? (There is no example in the TBB tutorial, and the TBB reference has an incomplete example.)
- Is there a better way to instrument my code so I can see what is happening when multiple threads are operating?
- Why is the blocked range partitioning the vector that way? Shouldn't it cut it into more-or-less equal pieces?
- When parallel_scan calls the reverse_join() method, does it join subranges in order starting from the first subrange and proceeding to the last?
- Does running with a single thread change the behavior of parallel_scan? Shouldn't it still call pre_scan and reverse_join()?
Any help is appreciated.