Hierarchically Tiled Arrays and Threading Building Blocks

A Hierarchically Tiled Array (HTA) "is a class designed to facilitate the writing of programs based on tiles in object-oriented languages. HTAs allow to exploit locality as well as to express parallelism with much less effort than other approaches." This description, from the HTA project's home page, instantly suggests that HTAs might be suitable for parallelization using Threading Building Blocks. The project team thought so too, and the entry submitted to the "Coding with Intel TBB" Contest by Basilio Fraguela was one of the contest winners.

Hierarchical data: parallelism and locality

If you download the htalib.smp package from the HTA Downloads page, your unpacked directory structure contains a doc directory with some interesting presentations and other reference materials. File hta.pdf is a presentation titled "Programming for Parallelism and Locality with Hierarchically Tiled Arrays." Slide 2 presents the problem the HTA team is trying to solve:

Problem Statement

    • To design & develop language extensions for parallelism

    • To design & develop language extensions for locality

    • To quantify the Performance loos due to abstraction

This sounds a lot like what TBB offers; but the problem is being approached with a somewhat different focus by the HTA team. The type of language extension they are working on is a data type. But it's a data type whose design meshes perfectly with TBB's parallelization strategies.

Consider how TBB looks at a loop that can be parallelized. My task stealing experiments (see my Intro, Demo, and Further Analysis posts) demonstrated the TBB scheduling concept of dividing tasks into blocks, then subdividing as necessary. Slides 32-37 of the OSCON TBB tutorial presentation (available in PDF format at http://softwarecommunity.intel.com/isn/downloads/OSCON2007-TBB-final.pdf) describe and illustrate TBB's data partitioning for a parallel_for() blocked_range:

TBB Data/Task Chunking

Compare this diagram with Slide 9 ("Recursive Tiling") from the "Programming for Parallelism and Locality with Hierarchically Tiled Arrays" presentation:

Pictorial View

We see that a Hierarchically Tiled Array is a data type that appears ideally-suited for processing using Threading Building Blocks.

The problem of recursive depth

One of the features of HTAs is that an individual element of an HTA can be another HTA. In other words, this is a recursive data type.

One of the problems that can be difficult to overcome with threading of complex software using traditional methods (i.e., raw threads) is the situation where the processing for some data elements takes a lot longer than the processing for other data elements. This can lead to a situation where multiple processors are idle, awaiting the completion of other processors that happened to be assigned tasks that have taken much longer to complete.

The recursive nature of HTAs could pose a problem if you try to process them using a simple top-level threading. What happens if one thread is assigned an HTA segment that has much greater recursive depth than others?

TBB provides a ready solution to this problem with its dynamically applied "task stealing." When a processor becomes idle, TBB looks for a data chunk that isn't currently being worked on, and "steals" that task assignment from the processor it had originally been assigned to.

TBB's strategy is to "work depth first; steal breadth first." That is, when a processor becomes idle, TBB looks for a large, unprocessed section of data that is far from the sections of data that are currently being worked on:

TBB Task Stealing Strategy

Clearly, this will automatically resolve the problem of processing inefficiency (due to idle processors) when some portions of a hierarchically tiled array have significantly greater recursive depth than other portions.

A quick look at the code

Going into the htalib/src directory and executing a grep tbb * shows us 36 instances of tbb::parallel_for and tbb::blocked_range. Indeed the nature of hierarchically tiled arrays suggests that the TBB parallel_for() construct may well be all that's necessary for efficiently processing data stored in HTA structures.

There are more than 24000 lines of HTA source code, so I can't go into the details of how TBB was integrated into htalib. Suffice to say, HTAs appear to be an ideal data type for parallel processing, and the structure of HTAs and TBB's methods of processing tasks in parallel appear to be a perfect match.

Kevin Farnham
O'Reilly Media
TBB Open Source Community

For more complete information about compiler optimizations, see our Optimization Notice.