Contribution: parallel_for_group

Contribution: parallel_for_group

Hello all,

I would like to introduce a contribution to TBB, available via the TBB Community Project as TBB Contributed Code. The code may be downloaded via svn checkout, find instructions here: http://code.google.com/p/tbbcommunity/source/checkout

Most of my postings have focused on components relating to the simulation project YetiSim, and I have also asked many other questions on this forum. As you may or may not know, I have been abstracting the core of YetiSim into TBB components so that YetiSim will become a simple application of new components, and so that the TBB community may benefit from an abstraction which is useful (hopefully) for more than just simulation.

Today I would like to introduce you to tcc::parallel_for_group (tcc means TBB Community Code). The concept behind parallel_for_group is that you have a composite type, this might be a collection of items, a robot, a machine, or a game board for Conways Game of Life (as in my simple example in the source tree). In concept you want to separate the composite type into multiple grouped parts, and an ungrouped type. The grouped parts and ungrouped parts might be of different types.

Processing may proceed in parallel on the grouped parts, and on the ungrouped part. By requirement, each group is able to be processed independently, and so can ungrouped parts. The algorithm template parallel_for_group will execute a function object with appropriate operator() overloads on ungrouped and grouped items at the same time.

There are many reasons to group items. Perhaps there are dependencies which must be considered, so some initial work is done to find dependencies which will be made up in parallel processing time. Perhaps we wish to make smaller copies of data which have just the information required for processing. For example perhaps we are performing matrix operations, and we divide the matrix into blocks with each block requiring local copies of the information passed to it.

The basic algorithm template is:

template 
typename SeparatorType::separation_type parallel_for_group(const CompositeType& composite, const SeparatorType& separator, const Body& body, const Partitioner& partitioner)

The separator type passed must support the following operation:

separation separate(const CompositeType&)

This function will take a CompositeType, and separate it into un/grouped items and return a struct which contains that information. The separator must publicly inherit from a provided separator class. There is no runtime cost of this class, because it only holds typedefs for later usage by the algorithm templates (I'm not a template guru, so maybe there is a cleaner way).

The result of the separation is fed into a body, which must support the following:

void operator()(SeparatorType::grouped_range_type& x)
void operator()(SeperatorType::ungrouped_range_type& x)

Notice that currently the SeparatorType is passed to parallel_for_group, this is temporary, because I would prefer to return a reference to the separation rather than a copy.

Also please note I have adjusted tbb::blocked_range and tbb::parallel_for to support non-const ranges. The reason for this, is so that users c
an pass un/grouped items which may be directly manipulated by the body. The user can enforce const-ness in the range by specifying that the un/grouped items are pointers to const.

Anyways, this is a basic introduction. Please have a look at the code, and share your thoughts. I have only provided an example here, and wanted to share the code for further input. There is more work to be done on the construct, and I will be implementing parallel_reduce_group in a similar manner.

I'm not the greastest person at communicating technically, so please ask questions and let me answer them. Check out the code, and enjoy!

AJ

6 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I haven't downloaded your implementation, so my questions come from examining what you've posted. I do have a few.

Are there only two "groups," the grouped and ungrouped collections? Or is CompositeType a hierarchical container, able to be unpealed recursively? How well do you expect this to scale with with growth in available HW threads?

If you're reordering elements to coalesce them into groups and the ungrouped group, does that require copying the elements around in memory, a potentially high cost serial operation?

What's a grouped_range_type and what does the Partitioner do in this environment? Are we talking linked lists here, or is the grouped_range a potentially splittable range?

Yes, there are only two "groups", ungrouped (a collection of UngroupedTypes) and grouped (a collection of GroupedTypes). Realistically I do not know how a person wants to unpeel the CompositeType so it is a template argument. The idea is that the user can specify types for the GroupedType and UngroupedType at runtime, depending upon how they want to separate it.

There is never any reordering of elements in the CompositeType at all. If the CompositeType is a vector for instance, and someone wants to group the numbers into primes, composites, even numbers, and odd numbers, the user would want to create their own struct or class which has an enum to specify what type of group it is. This struct would then become the GroupedType. So the grouped collection would then contain these user-defined types, which would internally contain pointers, references, or values. The user could specify their internal collection to be of type vector to accidental writes to the CompositeType (which is again a vector in this case).

I expect this to scale well, so long as there is an appropriate amount of work such that the separation is worth the effort. Imagine the separation as being a search for parallelizable work. Internally the un/grouped_range_type is a blocked_range from un/grouped.begin to un/grouped.end, which is passed to a parallel_for. Both blocked_range and parallel_for were modified to permit non-const access to the contents of un/grouped data (since the user can declare const pointers for un/grouped types anyways).

This construct is basically just a high-level wrapper for existing TBB components, and is intended to help users look for ways to separate parallelizable work from data when there are dependencies. A solution could of course be done with just a regular parallel_for.

I have attempted to minimize copying of data. I'm still considering memory issues with this regard. This is by no means a completed design.

Hope this answers some questions. By the way, I'm usually on #tbb so we could discuss this there as well if you are interested.

By the way, a good example usage for this Kevin Farnham considered was this:

Consider you have a system where there are areas which you wish to do precise calculations, and the rest of the area you wish to do an imprecise calculation. You could create groups which indicate the areas of interest, and the ungrouped items you would do a quick analysis of instead.

You can also see some motivation on my blog, which I have to update btw, here:

http://blogs.yetisim.org/2008/01/30/tbb-execution_graph-part-4-introduci...

I took a quick look at the blog you referred to. I'm not sure I fully grasp the nature of the dependencies described here, but I fear the constantly changing dependencygroupsthe diagrams illustratewill be difficult to implement without constantly trashing caches. Are the nodes described in this designpreserved at some constant vaddr? Then presumably the group associations are described in some other data structure, possibly an array of node pointers. That data structure must have to be rebuilt with each generation? So a processing element, given a group or the un-group, may be dealing with an arbitrary set of underlying nodes from generation to generation? It's hard to imagine how caches could be preserved/reused in such an organization. Are the nodes themselves a regularly varying population, or do they have some stability, generation to generation? To use a basketball analogy, this sounds like employing a zone defense, taking on any node that lands in my zone (dependency equivalence class); perhaps a man-to-man defense might be better for cache preservation?

I'm not convinced of the usage example cited above, either. The suggestion for grouping precise and imprecise calculations so that one set could be done more quickly, wouldn't that tend to exacerbate load imbalances? I think a better scheme would be to mix big and small computational tasks so that each processing element gets the same amount of work.

Yes, my blog does not convey the problem efficiently. I spent some time forming the exact problem and solution in my mind.

There are definitely some load issues to consider, as well as memory handling. In practice, there would be a single CompositeType held in memory and un/grouped collections would be formed from that CompositeType. In essence, the act of grouping data has an overhead of creating collections of pointers, or perhaps creating new class types for analysis.

This construct should have the same load balancing properties as TBB's parallel_for. In the implementation, two parallel_for instances are forked which go over the grouped and ungrouped items separately.

Here is some ASCII art to show what happens for the division of the composite type (I'm sure it's already clear) when separator.separate() is called. Next, I show a graph of how execution proceeds for parallel_for_group.

+---------------+
| CompositeType |
+---------------+
/
/
/
/Separator.separate()
+=======================================+
| +-------------+ +---------------+ |
| | GroupedType | | UngroupedType | |
| +-------------+ +---------------+ |
| |
| Separation Type |
+=======================================+

+--------------------+
| parallel_for_group |
+--------------------+
|
|/
+--------------------+
| Get separation |
+--------------------+
/
/
+-------+ +-------+
| Child | | Child |
+-------+ +-------+
/
/

+----------------------+ +------------------------+
| Create Grouped Range | | Create Ungrouped Range |
+----------------------+ +------------------------+
| |
+----------------------+ +----------------------+
| call parallel_for on | | call parallel_for on |
| grouped range | | ungrouped range |
+----------------------+ +----------------------+

From the above graph, you can see that execution degrades to a simple parallel_for call. The un/grouped range are just simple blocked_ranges with const-ness removed. Execution wise, parallel_for_group will have the same profile as parallel_for.

Now the cache is another issue. I completely agree the cache will not be used effectively in this example. I have thought on this a bit too, I'm not sure if it is possible to better use the cache with this type of construct, or how to better approach cache usage with this type of problem.

Realistically the separation function itself should use parallelism to improve speeds. Also, it might be possible to modify the source so that un/grouped collections get processed as they available rather than after they have been completed. This would improve throughput.

The intention of parallel_for_group is to provide another layer of abstraction with hopefully low-cost to users of TBB. I'll give one more trivial real-world example.

Consider I am doing subtotaling on a number of key/value pairs, and I want a single total for all key entries. This is a bad example because addition will not have a high cost, but let's consider it anyways. I could do a parallel_reduce operation to create struct's which hold partial sums, and then put these operations all back together at the end. That works just fine. Another approach would be to use parallel_for_group to break the collection into groups by key, and then total each of those independently. In this case the cost of finding groups for execution is higher cost than the operations themselves.

AJ

Leave a Comment

Please sign in to add a comment. Not a member? Join today