I am moving forward with my first test application. I finally got it running; however, my test benchmark is still "too slow".
The current code runs in 150 [ms] (measured using arbb::scoped_timer, with an arbb::auto_closure created outside the timed scope), and I expected it to run at least three times faster (~40 [ms]), based on other "high efficiency" implementations.
My guess is that I am still using ArBB wrong.
My current test code looks like this:
void compute_cost_volume(image_data_t input_a, image_data_t input_b,
                         cost_volume_t &cost_volume)
{
    arbb::usize d;
    _for (d = 0, d < cost_volume.num_pages(), d += 1)
        cost_slice_2d_t cost_page = cost_volume.page(d);
        image_data_t shifted_b = arbb::shift_col(input_b, d);
        arbb::map(compute_cost)(input_a, shifted_b, cost_page);
        cost_volume = arbb::replace_page(cost_volume, d, cost_page);
    _end_for;
}
where compute_cost is a small per-element operation, image_data_t is a 2-dimensional arbb::dense container, and cost_volume_t is a 3-dimensional one (hence the page access).
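For reference, here is a plain scalar C++ version of what the loop is meant to compute. Two assumptions not taken from the ArBB code: shift_col shifts the columns of its input right by d (filling vacated columns with 0), and the per-pixel cost is the absolute difference, a stand-in for the real compute_cost.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for the cost-volume loop.
// Assumptions (not from the ArBB snippet): shift_col moves columns right
// by d, filling with 0; per-pixel cost is |a - b| (stand-in for compute_cost).
using Image  = std::vector<std::vector<float>>;  // [row][col]
using Volume = std::vector<Image>;               // [page][row][col]

static Image shift_col(const Image &in, std::size_t d) {
    Image out(in.size(),
              std::vector<float>(in.empty() ? 0 : in[0].size(), 0.0f));
    for (std::size_t r = 0; r < in.size(); ++r)
        for (std::size_t c = d; c < in[r].size(); ++c)
            out[r][c] = in[r][c - d];
    return out;
}

static void compute_cost_volume(const Image &a, const Image &b, Volume &cost) {
    for (std::size_t d = 0; d < cost.size(); ++d) {   // one page per shift d
        Image shifted_b = shift_col(b, d);
        for (std::size_t r = 0; r < a.size(); ++r)
            for (std::size_t c = 0; c < a[r].size(); ++c)
                cost[d][r][c] = std::fabs(a[r][c] - shifted_b[r][c]);
    }
}
```

Every page of the volume is written exactly once, from inputs that never change, which is what I rely on below when I argue the pages are independent.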
My guess is that the "replace_page trick" is a bad idea. So my questions are:
1) How can I measure/know what is killing my performance in this example? What is the suggested profiling protocol to follow (under Linux)?
2) I first tried to access cost_page by reference, and also to replace the _for loop with an arbb::map, but neither attempt compiled.
The access to each cost_volume page is fully parallel, so I would expect to be able to formulate it with some kind of map construct (or similar). What is the proper way of doing this?
In boost::multi_array it is possible to define all kinds of views and iterators over a given data volume; I could not find an equivalent in ArBB, so I guess a different strategy is needed.
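To illustrate what I mean by "fully parallel over pages": each page depends only on its own d, so a host-side formulation could dispatch one task per page. This sketch shows the dependency structure only, with plain std::thread, not ArBB's map; process_page is a hypothetical per-page function that would shift input_b by d and apply the per-pixel cost.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each page d is computed independently of every other page, so one
// task per page is valid. process_page is hypothetical; in the real
// code it would shift input_b by d and apply the per-pixel cost map.
static void for_each_page(std::size_t num_pages,
                          const std::function<void(std::size_t)> &process_page) {
    std::vector<std::thread> workers;
    workers.reserve(num_pages);
    for (std::size_t d = 0; d < num_pages; ++d)
        workers.emplace_back(process_page, d);  // no cross-page dependencies
    for (auto &w : workers)
        w.join();
}
```

If each task writes only to page d, there is no shared mutable state between tasks, which is exactly the property a map-style construct should be able to exploit.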
I hope this example sheds some light on the recurring "to loop or not to loop inside arbb::call code" discussion.
Thanks for the answers and the community support.