tbb::concurrent_vector: secrets of memory organization

This blog entry is intended to help you better understand how concurrent_vector works with memory and how to use it wisely so that your program runs faster or consumes less memory.


Concurrency comes with a price

Let’s open the Threading Building Blocks (TBB) Reference Manual at section 4.3, “Fragmentation”:

“Unlike a std::vector, a concurrent_vector never moves existing elements when it grows. The container allocates a series of contiguous arrays.”

It sheds some light on the main idea of the container: in order to grow, instead of allocating a bigger array and copying the old items into it, the container simply allocates an additional chunk of memory called a “segment”. This is what ensures the main concurrency-related properties of the container (see the reference, 4.3):

“- Multiple threads can grow the container and append new elements concurrently.
- Growing the container does not invalidate existing iterators or indices.”

But the price is the absence of a contiguous memory space like the one in std::vector, the need to keep pointers to the segments in a table, and an additional level of indirection when accessing the vector’s elements. Another consequence is that, for the sake of correctness and efficiency of concurrent growth, the container was designed never to shrink in size.


Segment sizing rule

To keep the size of the table finite and to optimize addressing of items, the number of items in a segment (which I will further call the segment size) must be a power of two and must equal the total size of all previously allocated segments. So, concurrent_vector doubles in capacity each time it grows, which is another difference from std::vector.

Thus, the maximal number of segments, and so the number of pointers to keep, cannot be bigger than the number of bits in the size_t type used for item indexing. Also, simple arithmetic and bitwise operations are enough to calculate a segment size or an item address.


Table of segments

Well, in order to store pointers for the maximal number of segments, we need a table of 32 pointers for a 32-bit program or 64 pointers for a 64-bit program, i.e. 128 and 512 bytes respectively. One may consider these numbers too costly for an instance of the container that stores data of a comparable size. It becomes notably costly if there is a structure with a large number of concurrent_vector instances where each instance is usually empty or holds just a couple of items.

To avoid this cost, the container keeps a small embedded segment table that enables a vector instance to grow up to some limit without additional memory overhead. Then, when the limit is hit, the instance switches to a full segment table that is allocated together with the next segment and filled with the pointers from the embedded table. This saves memory for container instances that are empty or hold only a few elements.
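Just to make the idea clearer, here is a conceptual sketch of such a two-level table; the field names and the number of embedded slots are made up for illustration and do not reflect TBB’s actual internal layout:

```cpp
#include <cstddef>

// Conceptual sketch only, not TBB's real data structure.
struct segment_table_sketch {
    static const std::size_t embedded_slots = 2;  // hypothetical: covers the first couple of segments
    void*  embedded[embedded_slots];               // small vectors never allocate a full table
    void** full;                                   // lazily allocated, sized for all possible segments

    // Growth beyond the embedded limit allocates 'full' (together with the next
    // segment), copies the embedded pointers into it, and uses it from then on.
    void** active() { return full ? full : embedded; }
};
```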


Fragmentation and memory blow-up

In TBB 2.1, we reduced the size of the first segment to 2 elements. This simplifies item addressing and so speeds up item access. Thus, the series of segment sizes is: 2, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and so on.
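For illustration (this is not TBB’s internal code), here is how an element index can be mapped to a segment and an offset for this series using only shifts and a subtraction:

```cpp
#include <cstddef>

// Illustration only: map an element index to a (segment, offset) pair for the
// segment size series 2, 2, 4, 8, 16, ... The position of the highest set bit
// of the index selects the segment; clearing that bit gives the offset.
inline void locate(std::size_t index, std::size_t& segment, std::size_t& offset) {
    if (index < 2) {                                 // indices 0 and 1 live in the very first segment
        segment = 0;
        offset = index;
        return;
    }
    std::size_t s = 0;
    for (std::size_t i = index; i > 1; i >>= 1)      // s = floor(log2(index))
        ++s;
    segment = s;
    offset = index - (std::size_t(1) << s);          // clear the highest set bit
}
```

A real implementation would use a hardware bit-scan instruction instead of the loop, but the idea is the same.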

But when segments are allocated separately, this scheme can lead to the following problems.

The tbb::cache_aligned_allocator, which is used as the default template allocator for items in concurrent_vector, allocates more memory than requested in order to align the returned address. It can waste a lot of memory when allocating segments of small sizes. To prevent this, consider specifying tbb_allocator explicitly instead.
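For example, a vector declared like this (a sketch; the typedef name is mine) keeps small segments from being padded up to cache-line boundaries:

```cpp
#include "tbb/concurrent_vector.h"
#include "tbb/tbb_allocator.h"

// The second template argument overrides the default cache_aligned_allocator.
// tbb_allocator does not pad allocations for cache-line alignment, so small
// segments waste less memory (at the cost of possible false sharing).
typedef tbb::concurrent_vector<int, tbb::tbb_allocator<int> > small_footprint_vector;
```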

Another problem is that the segments can be scattered across memory, which prevents effective caching and prefetching and thus hurts performance. Before TBB 2.1, this was addressed by a bigger first segment, but that required more sophisticated address calculations and so slowed down item access.


Joined allocation

In the new version, the library joins the first few segments into a single contiguous allocation made in just one request. As a result, the first few segment pointers address the same contiguous block of memory, only with different offsets according to the segment sizes.

But how can the optimal number of segments in the joined allocation be determined? There is no solution that fits everyone. Thus, this number is chosen to be just big enough to hold the number of items specified in the first call to a growth method.

But if the first call is push_back(), which adds only one item to the container, then the size of the first allocation equals the size of the first segment, i.e. no segments are joined. That means we are back to the fragmentation problems described above.
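To illustrate the difference (a sketch; the element count of 100 is arbitrary and the behavior is as described above for TBB 2.1):

```cpp
#include "tbb/concurrent_vector.h"

void first_growth_matters() {
    tbb::concurrent_vector<int> a, b;

    // First growth request asks for a single element: only the small first
    // segment is allocated, and later growth adds separately allocated segments.
    a.push_back(1);

    // First growth request asks for 100 elements at once: enough of the first
    // segments can be joined into one contiguous allocation to hold them.
    b.grow_by(100);   // appends 100 default-constructed elements
}
```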


Defragmentation

There are several ways to solve the problem.

First, you could specify the required size as an argument to the constructor. However, this leads to default construction of the specified number of items, so they cannot be initialized later with push_back().

The second way is to call the reserve() method before the first call to push_back(). It solves the problem by allocating enough space in a contiguous block, but you have to estimate the minimal number of items that will be pushed into the vector.

BTW, the above two ways are applicable to std::vector as well.
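Here is a minimal sketch of these two approaches; the count of 1000 is just an estimate you would supply yourself:

```cpp
#include "tbb/concurrent_vector.h"

void preallocate_examples() {
    // Way 1: pass the size to the constructor; 1000 items are default-constructed
    // up front, so they cannot be filled later with push_back().
    tbb::concurrent_vector<double> v1(1000);

    // Way 2: reserve capacity first, then append with push_back(); 1000 here is
    // only an estimate of the minimal number of items you expect to push.
    tbb::concurrent_vector<double> v2;
    v2.reserve(1000);
    for (int i = 0; i < 1000; ++i)
        v2.push_back(i * 0.5);
}
```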

And if your algorithm doesn’t allow such a reservation, consider the third way: call the compact() method before you start reading the data from the container. This call joins an optimal number of segments into a solid block, and it can also free unused segments allocated by reserve().
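For example (a sketch; note that later TBB releases renamed compact() to shrink_to_fit()):

```cpp
#include "tbb/concurrent_vector.h"

void read_phase(tbb::concurrent_vector<int>& v) {
    // Call compact() in a single-threaded moment, after the concurrent growth
    // phase is over and before the read-heavy phase begins. It joins the first
    // segments into a solid block and frees unused segments left by reserve().
    v.compact();

    long sum = 0;
    for (tbb::concurrent_vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        sum += *it;
    (void)sum;
}
```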

But be careful: none of these methods is thread-safe. And let me know if you have a case where you really need a thread-safe way to join the first segments.


Summary

To summarize, you have the following options to make your program use the concurrent_vector optimally:

- Specify an appropriate allocator in the template arguments;
- Identify the minimal number of items in the vector and pass it in via the constructor argument or a reserve() call;
- Call compact() if you have to iterate many times across the whole container.

See also another blog post about the size of concurrent_vector.

 


Comments

Please see the blog that discusses the difference between STL and TBB containers:
http://software.intel.com/en-us/blogs/2008/10/13/tbb-containers-vs-stl-functionality-rift/

Note also that concurrent_vector does not synchronize access to items the way concurrent_hash_map does. It leaves it to the user to decide whether such synchronization is necessary and how it can be implemented. So, it synchronizes concurrent growth requests only. And that is perfectly fine for use cases that just collect data in the vector for further processing, because more synchronization == more overhead.
But without data access synchronization, a concurrent erase() is not possible, because the lifetime of data items can't be synchronized the way it's done in concurrent_hash_map.
Instead, you could build an extended version of concurrent_vector on top of it: "merely" mark deleted items, synchronize access, and move the data. ;-)


Doesn't the absence of erase methods render this container useless in most cases? Wouldn't mutexes be more efficient and enable the addition of erase and insert methods? I do see a rather large problem with const functions, though. For example, "const_reference operator[]( size_type index ) const;" would require a lock if more mutexes were used, but since it is a const method it would not be able to lock a class mutex member. Iterators would also pose a problem, although locks could be released upon destruction. Is the const issue a fundamental barrier, or does someone know a way around this?