Intel® Threading Building Blocks Version 3.0 Update 4 showcases its first Community Preview feature: Concurrent Priority Queue

The Intel® Threading Building Blocks (Intel® TBB) 3.0 Update 4 release introduces a new concept: the Community Preview feature, which gives users of Intel TBB an early look at features targeted for future release and provides the Intel TBB development team with valuable feedback.

In addition to the usual rich palette of algorithms, containers, and other concurrency tools that Intel TBB provides, the version 3.0 U4 release features a new concurrent container, the concurrent priority queue.

A priority queue is a queue of elements, each with an associated priority, and the queue is maintained in order of those priorities. For example, one might have a stream of incoming events that are enqueued immediately and dequeued later to be processed one by one. Some events are more important than others, so we would like those events to be handled earlier. We might therefore assign each event a priority in the form of an integer and queue the events in a priority queue; when we dequeue an element, it will be the element with the highest priority. This event-handling pattern turns up in many applications, and priority queues have a variety of other uses as well.

The C++ Standard Template Library provides a serial std::priority_queue container. In multi-threaded programs, however, the primary operations of a priority queue must be handled in a concurrent fashion. The common practice is to wrap each std::priority_queue operation in a coarse lock, and it is worth noting that attempts to improve on the performance of a coarse-locked std::priority_queue often fall short.
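The coarse-locked approach mentioned above can be sketched as a thin wrapper around std::priority_queue; the class and method names here are illustrative, not part of any library:

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    // A minimal coarse-locked priority queue: every operation takes one
    // global lock, so operations serialize under contention.
    template <typename T, typename Compare = std::less<T>>
    class locked_priority_queue {
    public:
        void push(const T& value) {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(value);
        }
        // Mirrors the try_pop convention: returns false on an empty queue,
        // otherwise removes the highest-priority element into `value`.
        bool try_pop(T& value) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.empty()) return false;
            value = queue_.top();
            queue_.pop();
            return true;
        }
    private:
        std::mutex mutex_;
        std::priority_queue<T, std::vector<T>, Compare> queue_;
    };

Every thread contends for the same mutex, which is why this design stops scaling as thread counts grow.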

Our goal in developing a concurrent priority queue for Intel TBB is to provide concurrent functionality with performance competitive to a coarse-locked std::priority_queue for low numbers of threads, and with better scalability for higher numbers of threads. We believe we have achieved this goal, but we would like Intel TBB users to try this Community Preview Feature and let us know how it works for them in their applications.

Using the Intel TBB Concurrent Priority Queue

The tbb::concurrent_priority_queue container is a template class that lets the user specify both the type of element to be queued and a comparison operator for comparing the priorities of elements. For example:

tbb::concurrent_priority_queue<int, std::less<int> > my_cpq;

The above declaration of my_cpq specifies a queue of integers sorted simply by the std::less<int> operator. Note that the comparison operator should take two arguments of the element type and have the following behavior: if compare(A,B) returns true, then B has the higher priority; otherwise A has the higher priority.
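The same convention governs the standard serial container, so it can be demonstrated without TBB. This sketch (the helper function is illustrative, not part of any library) shows that std::less<int> dequeues the largest value first, while std::greater<int> reverses the order:

    #include <functional>
    #include <initializer_list>
    #include <queue>
    #include <vector>

    // Returns the element std::priority_queue would dequeue first
    // under the given comparator.
    template <typename Compare>
    int first_out(std::initializer_list<int> values, Compare cmp) {
        std::priority_queue<int, std::vector<int>, Compare> pq(
            cmp, std::vector<int>(values));
        return pq.top();
    }

For example, first_out({2, 9, 4}, std::less<int>()) yields 9, the "highest-priority" element under that comparator.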

A more interesting example follows:


class point_t {
public:
    double x, y;
    point_t(double _x, double _y) : x(_x), y(_y) {}
    double dist(const point_t& other) const;
    // other declarations, methods, etc.
};

const point_t origin(0.0, 0.0);

class closest_point { // points closer to origin have higher priority
public:
    bool operator()(const point_t& p1, const point_t& p2) const
    {
        return (p2.dist(origin) < p1.dist(origin));
    }
};

tbb::concurrent_priority_queue<point_t, closest_point> my_point_cpq;

In this example, we have elements of type point_t, which have doubles representing 2D Cartesian coordinates, and a special operator for prioritizing points by closeness to the origin. This illustrates how the concurrent priority queue may be used with any element type that can be compared pair-wise in any way.
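To see the comparator in action without TBB, here is a self-contained version of the example that fills in a plausible dist implementation (Euclidean distance is an assumption; the original only declares the method) and exercises it through std::priority_queue, which uses the same comparator convention:

    #include <cmath>
    #include <queue>
    #include <vector>

    struct point_t {
        double x, y;
        point_t(double _x, double _y) : x(_x), y(_y) {}
        // One plausible implementation of the declared method:
        // Euclidean distance to another point.
        double dist(const point_t& other) const {
            return std::sqrt((x - other.x) * (x - other.x) +
                             (y - other.y) * (y - other.y));
        }
    };

    const point_t origin(0.0, 0.0);

    struct closest_point { // points closer to the origin have higher priority
        bool operator()(const point_t& p1, const point_t& p2) const {
            return p2.dist(origin) < p1.dist(origin);
        }
    };

    // Dequeues the highest-priority point, i.e. the one closest to the origin.
    point_t closest_of(const std::vector<point_t>& pts) {
        std::priority_queue<point_t, std::vector<point_t>, closest_point> pq(
            closest_point(), pts);
        return pq.top();
    }

Given the points (3,4), (1,1), and (0,2), closest_of returns (1,1), whose distance from the origin is the smallest.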

There are three concurrent (i.e. thread-safe) operations provided by concurrent_priority_queue. Two of these provide the primary functionality of the queue: the push operation takes an element and inserts it in the queue, and the try_pop operation removes the highest priority element from the queue, provided the queue is not empty. The third operation is a reserve that can be used to increase the space allocated for the queue when the expected capacity is known. Below is a simple multiple-producer single-consumer example that illustrates the use of the push and try_pop operations:

Globally:

concurrent_priority_queue<prio_event_t, prio_compare_t> my_event_q;

Multiple Producers:

prio_event_t ev;
while (1) {
    // receive a prioritized event
    ev = wait_for_event();
    // insert event in CPQ
    my_event_q.push(ev);
}

Single Consumer:

prio_event_t ev;
while (1) {
    // get highest priority event
    if (my_event_q.try_pop(ev)) {
        // handle the event
        handle_event(ev);
    }
}

Note the use of try_pop above: it returns a bool value representing the success of the operation. If there is an element to remove from the queue, try_pop will return true and ev will contain that element, but if the queue is empty, try_pop will return false.

Though this example illustrates a single consumer that must process events serially in priority order, there is no such restriction on the usage of concurrent_priority_queue: multiple producers and multiple consumers are also possible.

In addition to these concurrent operations, concurrent_priority_queue also supports several accessors and modifiers that are not intended for concurrent use. For example, we provide accessor methods to get the size and potential capacity of the queue, as well as a test for emptiness. These accessors may return values that are out-of-date as soon as they are returned if another thread is concurrently modifying the queue via another operation. Additionally, the copy constructor and assignment operator can make invalid copies of the source queue if it is concurrently modified by another thread. Other methods, such as clear, swap, and shrink_to_fit, are also unsafe when threads are operating on the queue concurrently.

Note that Community Preview features such as concurrent_priority_queue are not enabled by default. For a full description of concurrent_priority_queue along with instructions on how to enable it in your code, please see the Community Preview Features appendix in the Intel TBB Reference Manual.

Performance

[Charts: throughput (push and try_pop operations per second) of tbb::concurrent_priority_queue (CPQ) vs. coarse-locked std::priority_queue (STL) at varying busy-work times]
The charts above show the results of performance tests that were executed on an SGI Altix UV-1000 with 32 8-core Nehalem EX packages (256 cores/512 thread contexts). These charts show throughput (number of push and try_pop operations per second) for varying busy-work times. Note that the smaller the busy-work, the more contention the concurrent push and try_pop operations encounter with each other. As we increase the busy-work, the contention decreases, and the parallel runs should achieve better throughput relative to the serial throughput. In fact, as we see in each chart, tbb::concurrent_priority_queue (CPQ) scales better than coarse-locked std::priority_queue (STL) for all tested values of busy-work.

We hope you find this new feature useful and we look forward to hearing your feedback on the forum!


9 comments

Terry Wilmarth (Intel):

Hi ilnarb,
We currently do not have plans to create a bounded version, but it may be possible. If you are interested in that, please raise the issue for discussion on the forum.

Yes, try_pop will return false if the queue is empty and no pushes were performed recently by other threads. The calling thread may briefly spin-wait while another thread handles the try_pop or other simultaneous operations.

Ilnar:

Is it possible to create a bounded version of the priority queue? Like concurrent_bounded_queue, plus a priority feature?
As I see it, try_pop is non-blocking when the queue is empty, and there is a spin-wait in the while(1) {...} loop. Is my assumption correct?

Terry Wilmarth (Intel):

Thanks Dmitriy. Good points, and good suggestions, too. Yes, this CPQ implementation is general-purpose, allowing any type of priority. It is also sequentially consistent and linearizable with respect to the push and try_pop operations, which is the main reason why pushes aren't implemented as you suggest. It's a matter of some debate whether linearizable operations are important to users, so we'd like to hear feedback on that too.
We tried some of the other approaches you mentioned in earlier attempts. Hopefully I will have results in the form of a paper to share soon.

Dmitry Vyukov:

Hi Terry,

I've shared my thoughts on concurrent priority queues in general, and on the TBB queue in particular, here:
http://www.1024cores.net/home/lock-free-algorithms/queues/priority-queues

Terry Wilmarth (Intel):

Hi Dmitriy,
> Why didn't you test it on 512 threads?

There is an undiagnosed problem on the machine I used, and I was unable to get a full set of test results at 512 threads.

> What is the range and distribution of priorities used in the tests?

Elements to be pushed onto the queue were generated serially in advance, and rand() was used to generate priorities in the range 0-99. The elements themselves contain an integer priority and a padding array to cache line size.
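The benchmark element Terry describes is not shown; a hypothetical sketch, assuming a 64-byte cache line (the post does not state the line size), might look like this:

    #include <cstddef>

    // Hypothetical benchmark element: an integer priority padded out to a
    // full cache line so adjacent elements do not share a line.
    constexpr std::size_t kCacheLineSize = 64;  // assumption, not from the post

    struct prio_event_t {
        int priority;
        char padding[kCacheLineSize - sizeof(int)];
    };

    static_assert(sizeof(prio_event_t) == kCacheLineSize,
                  "element should occupy exactly one cache line");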

> How are the tests organized? In particular, what is the mean queue length maintained during the tests?

Each thread performed a loop where operations on the queue were performed until a time window had elapsed (in these runs, 30 seconds). In each iteration of the loop, a thread performed K operations, busywaiting for the specified time between each operation: first K/2 pushes, then K/2 pops. For these runs, K=20.
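The per-thread loop described above can be sketched in simplified, single-threaded form; this omits the busy-wait and the 30-second timed window, and the function name is illustrative:

    #include <cstddef>
    #include <cstdlib>
    #include <queue>

    // One iteration of the described benchmark loop: K/2 pushes followed by
    // K/2 pops, with priorities drawn from the range 0-99. K=20 as in the
    // reported runs. Returns the queue size after the iteration.
    inline std::size_t run_one_iteration(std::priority_queue<int>& q,
                                         int K = 20) {
        for (int i = 0; i < K / 2; ++i)
            q.push(std::rand() % 100);  // priority in 0-99
        for (int i = 0; i < K / 2; ++i) {
            int v = q.top();            // highest-priority element
            q.pop();
            (void)v;                    // busy-work on v would go here
        }
        return q.size();
    }

Since each iteration pushes and pops the same number of elements, the queue size is unchanged across an iteration, which is why the mean queue size stays between 0 and (K/2)*N.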

I did not collect data on mean queue size throughout the run, but given the value of K above, the mean queue size is fairly small, since it is between 0 and (K/2)*N, where N is the number of threads. I have preloaded the queue with a large number of elements in the past, but I don't have ready results for the same hardware.

I hope that helps!
Terry

Dmitry Vyukov:

Sorry for asking so many questions, but it's crucial for interpreting the results.

Dmitry Vyukov:

Ok.
Why didn't you test it on 512 threads?
What is the range and distribution of priorities used in the tests?
How are the tests organized? In particular, what is the mean queue length maintained during the tests?
TIA.

Dmitry Vyukov:

Where can I download it? On the official site I see only v3.0u3.
