Hello, I've been writing a parallel tile-based software renderer in my spare time using TBB. Recently I've been profiling and optimizing my code.
I've been using concurrent_vector to store results and intermediate values (screen-space triangles plus set-up data) from parallel algorithms, and my profiles often show a significant amount of time being spent in concurrent_vector::push_back. Yes, I do call concurrent_vector::reserve in advance.
I get much better results using thread-local storage (TLS) of std::vectors. That makes sense, as there is little or no contention. The downside I see is that I lose the single contiguous block of memory of a single (lock-free) vector, but that isn't a big deal really, since I still have contiguous blocks of memory per thread.
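To make the per-thread approach concrete, here is a minimal sketch of the pattern I mean. It is not my actual renderer code: it uses plain std::thread with a static partition instead of TBB so that it compiles on its own, and SetupTriangle and build_triangles are made-up names for illustration. With TBB, the per-thread storage could be something like tbb::enumerable_thread_specific<std::vector<T>>.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a screen-space triangle plus its set-up data.
struct SetupTriangle {
    int source_index;
};

// Each worker fills its own std::vector, so push_back never contends.
// The per-thread buffers are then merged into one contiguous vector.
std::vector<SetupTriangle> build_triangles(std::size_t count, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::vector<SetupTriangle>> per_thread(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&per_thread, t, count, num_threads] {
            // Static partition: worker t handles indices [begin, end).
            const std::size_t begin = count * t / num_threads;
            const std::size_t end = count * (t + 1) / num_threads;
            // Fill a stack-local vector and publish it once at the end, so the
            // vector headers of different threads do not share cache lines
            // while they are being updated.
            std::vector<SetupTriangle> local;
            local.reserve(end - begin);
            for (std::size_t i = begin; i < end; ++i) {
                local.push_back(SetupTriangle{static_cast<int>(i)});
            }
            per_thread[t] = std::move(local);
        });
    }
    for (std::thread& w : workers) w.join();

    // Optional merge: one reserve, then one bulk copy per thread.
    std::size_t total = 0;
    for (const auto& v : per_thread) total += v.size();
    std::vector<SetupTriangle> merged;
    merged.reserve(total);
    for (const auto& v : per_thread) merged.insert(merged.end(), v.begin(), v.end());
    assert(merged.size() == count);
    return merged;
}
```

The merge step is only needed if a later stage really wants one contiguous array. Otherwise the next stage can iterate over the per-thread vectors directly and skip the copy.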
I was wondering, is this generally the preferred method of collecting results in parallel algorithms? Is there anything I may be missing here?