Intel® Threading Building Blocks Version 3.0 Update 4 showcases its first Community Preview feature: Concurrent Priority Queue

The Intel® Threading Building Blocks (Intel® TBB) 3.0 Update 4 release introduces a new concept: the Community Preview feature, which gives users of Intel TBB an early look at features targeted for future release and provides the Intel TBB development team with valuable feedback.

In addition to the usual rich palette of algorithms, containers, and other concurrency tools that Intel TBB provides, the version 3.0 U4 release features a new concurrent container, the concurrent priority queue.

A priority queue is a queue of elements, each of which has a priority associated with it. The queue is maintained in order of the priorities on the elements. For example, one might have a stream of incoming events which are immediately enqueued and later dequeued to be processed one-by-one. Some events are more important than others, so we would like those events to be handled earlier. Thus, we might assign a priority to each event in the form of an integer. We then use a priority queue to queue the events, and when we dequeue an element from the queue, it will be the element with the highest priority. This pattern turns up in many applications, but the concept has a variety of other uses.

The C++ Standard Template Library provides a serial std::priority_queue container. However, when we use this type of container in multi-threaded programming, it must be capable of handling the primary operations of a priority queue in a concurrent fashion. The common practice is to wrap std::priority_queue operations with coarse locks, and it is important to note that attempts at bettering the performance of a coarse-locked std::priority_queue often fall short.

Our goal in developing a concurrent priority queue for Intel TBB is to provide concurrent functionality with performance competitive to a coarse-locked std::priority_queue for low numbers of threads, and with better scalability for higher numbers of threads. We believe we have achieved this goal, but we would like Intel TBB users to try this Community Preview Feature and let us know how it works for them in their applications.

Using the Intel TBB Concurrent Priority Queue

The tbb::concurrent_priority_queue container is a template class that lets the user specify both the type of element to be queued and a comparison operator for comparing the priorities of elements. For example:

tbb::concurrent_priority_queue<int, std::less<int> > my_cpq;

The above declaration of my_cpq specifies a queue of integers sorted simply by the std::less<int> operator. Note that the comparison operator should take two arguments of the element type and has the following behavior: If compare(A,B) returns true, then B has higher priority, otherwise A has higher priority.

A more interesting example follows:


class point_t {
public:
double x, y;
point_t(double _x, double _y) : x(_x), y(_y) {}
double dist(const point_t& other);
// other declarations, methods, etc.
};

const point_t origin(0.0, 0.0);

class closest_point { // points closer to origin have higher priority
public:
bool operator()(const point_t& p1, const point_t& p2) const
{
return (p2.dist(origin)<p1.dist(origin));
}
};

tbb::concurrent_priority_queue<point_t, closest_point> my_point_cpq;

In this example, we have elements of type point_t, which have doubles representing 2D Cartesian coordinates, and a special operator for prioritizing points by closeness to the origin. This illustrates how the concurrent priority queue may be used with any element type that can be compared pair-wise in any way.

There are three concurrent (i.e. thread-safe) operations provided by concurrent_priority_queue. Two of these provide the primary functionality of the queue: the push operation takes an element and inserts it in the queue, and the try_pop operation removes the highest priority element from the queue, provided the queue is not empty. The third operation is a reserve that can be used to increase the space allocated for the queue when the expected capacity is known. Below is a simple multiple-producer single-consumer example that illustrates the use of the push and try_pop operations:

Globally:

concurrent_priority_queue<prio_event_t, prio_compare_t> my_event_q;

Multiple Producers:

prio_event_t ev;
while (1) {
// receive a prioritized event
ev = wait_for_event();
// insert event in CPQ
my_event_q.push(ev);
}

Single Consumer:

prio_event_t ev;
while (1) {
// get highest priority event
if (my_event_q.try_pop(ev)) {
// handle the event
handle_event(ev);
}
}

Note the use of try_pop above: it returns a bool value representing the success of the operation. If there is an element to remove from the queue, try_pop will return true and ev will contain that element, but if the queue is empty, try_pop will return false.

Though this example illustrates a single consumer that must process events serially in priority order, there is no such restriction on the usage of concurrent_priority_queue: multiple producers and multiple consumers are also possible.

In addition to these concurrent operations, concurrent_priority_queue also supports several accessors and modifiers that are not intended for concurrent use. For example, we provide accessor methods to get the size and potential capacity of the queue, as well as a test for emptiness. Unfortunately, these accessors may produce values that are out-of-date as soon as they are returned, particularly if another thread is concurrently modifying the queue via another operation. Additionally, the copy constructor and assignment operator can make invalid copies of the source queue if the source queue is concurrently modified by another thread. Other methods, such as clear, swap and shrink_to_fit are also unsafe when threads are operating on the queue concurrently.

Note that Community Preview features such as concurrent_priority_queue are not enabled by default. For a full description of concurrent_priority_queue along with instructions on how to enable it in your code, please see the Community Preview Features appendix in the Intel TBB Reference Manual.

Performance



 

The charts above show the results of performance tests that were executed on an SGI Altix UV-1000 with 32 8-core Nehalem EX packages (256 cores/512 thread contexts). These charts show throughput (number of push and try_pop operations per second) for varying busy-work times. Note that the smaller the busy-work, the more contention the concurrent push and try_pop operations will encounter with each other. As we increase the busy-work, the contention decreases, and the parallel runs should achieve better throughput relative to the serial throughput. However, as we see in each chart, tbb::concurrent_priority_queue (CPQ) scales better than coarse-locked std::priority_queue (STL) for all tested values of busy-work.

We hope you find this new feature useful and we look forward to hearing your feedback on the forum!

For more complete information about compiler optimizations, see our Optimization Notice.