Parallel Building Blocks - 'for' Loop Considerations

So you've used Intel Parallel Amplifier to discover that a sizable percentage of your processing time is spent inside a very hairy for loop. Task-parallel or data-parallel algorithm considerations aside, your first reaction is: how can I get this for loop, at the very least, parallelized across some or all of the cores? (Note that the answer varies with data size: for small workloads, the overhead of distributing work across all cores can outweigh the gain.) Each of the PBB models offers a different flavor of for loop, with its own similarities to and differences from the others.

So, I could easily just show you the syntax and usage for all of the 'for' loops available, but that wouldn't solve your problem. I am part of a customer-centric group of consulting engineers at Intel; we create and find solutions within the set of DPD tools to solve problems. The fact is that regardless of your build environment and development toolchain, we have at least one solution for your parallelism needs (often there are several, hence the need to explain). So before we go into syntax, usage, and what goes on behind the scenes, I'm going to take a step back and help you understand exactly what type of for loop you are dealing with and which build environments it plays well in.

The first part of the decision process is template library vs. language extension vs. template library + language extension. Some development groups don't like, or refuse to use, any templates. That's fine. Others want the purity of a language-extension-free build environment and will only use extensions once they are part of the standard. No problem. Then we find our favorite group: those that will use both. That really opens the floodgates of options.

Template Library but NO Language Extension

If you want a template library and will not use a language extension to parallelize the 'for' loop, you have these two choices.

Intel Threading Building Blocks (TBB) vs. Intel Array Building Blocks (ArBB)


TBB has a high-level syntax for task-parallel programming in which you manage tasks, not individual threads. This helps you take advantage of parallelism across cores. It does not vectorize implicitly, but you do have the following options to add parallelism inside each individual core:

- pragmas
- compiler flags
- API calls from another template library that will vectorize for you
- explicit vectorization (writing your own AVX/SSE code)
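To make the two levels concrete, here is a minimal sketch of the pattern: an outer loop split into chunks across cores, and an inner loop the compiler is asked to vectorize with a pragma. This is not TBB code. Plain std::thread stands in for the task scheduler so the sketch has no library dependency; with TBB, the outer split would be a tbb::parallel_for over a tbb::blocked_range. The function names, the one-chunk-per-worker split, and the choice of `#pragma omp simd` as the vectorization hint are all my assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Inner level (inside one core): the pragma tells the compiler the iterations
// are independent and may be vectorized. Compilers that do not recognize the
// pragma simply ignore it and run the loop as scalar code.
void scale_add_chunk(const float* a, const float* b, float* out,
                     std::size_t begin, std::size_t end, float k) {
    #pragma omp simd
    for (std::size_t i = begin; i < end; ++i) {
        out[i] = k * a[i] + b[i];
    }
}

// Outer level (across cores): split [0, n) into one contiguous chunk per
// worker thread. Hypothetical helper; a task library would do this split
// (and load balancing) for you.
void scale_add_parallel(const std::vector<float>& a, const std::vector<float>& b,
                        std::vector<float>& out, float k, unsigned workers) {
    const std::size_t n = out.size();
    workers = std::max(1u, workers);
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no work left for the remaining workers
        pool.emplace_back(scale_add_chunk, a.data(), b.data(), out.data(),
                          begin, end, k);
    }
    for (std::thread& t : pool) t.join();
}
```

A call such as scale_add_parallel(a, b, out, 3.0f, std::thread::hardware_concurrency()) uses every core the runtime reports. For small arrays, passing 1 as the worker count avoids the thread start-up overhead entirely.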


ArBB is the data-parallel analogue to TBB. It uses TBB for threading. It has a higher-level syntax than TBB that, if used properly in conjunction with 'map' (I will go into that in subsequent blogs), allows for both vectorization and threading without you having to manage tasks at all. This is well suited to algorithm scientists who do not want to manage the specifics of either tasks or vectorization.

Language Extension but NO Template Library

Intel Cilk™ Plus

The Cilk component of Cilk™ Plus has a keyword-style syntax that distributes the 'for' loop iterations across worker threads for you. There is no implicit vectorization just by using the keyword; it only parallelizes across cores. However, you do have additional options for vectorization, including the Array Notations component of Cilk™ Plus.

Vectorization options for the Cilk component if you need a language extension but NO template library:

- pragmas
- compiler flags
- explicit vectorization
- Array Notations component of Cilk™ Plus
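Here is a hedged sketch of how the two Cilk™ Plus components can combine: the cilk_for keyword spreads rows across cores, and an Array Notations statement vectorizes the work inside each row. I am assuming a compiler with Cilk™ Plus enabled, which I detect through the `__cilk` macro and the `<cilk/cilk.h>` header. On any other compiler the sketch falls back to plain serial loops so it still builds. The function name and the row-major layout are hypothetical.

```cpp
#include <cstddef>

#if defined(__cilk)
#include <cilk/cilk.h>   // assumed to provide the cilk_for keyword
#else
#define cilk_for for     // serial fallback: same loop, no parallelism
#endif

// Computes y = x * a + y over a rows-by-cols matrix stored row-major.
void saxpy_rows(const float* a, float* y,
                std::size_t rows, std::size_t cols, float x) {
    // Across cores: each row is independent, so rows can run in parallel.
    cilk_for (std::size_t r = 0; r < rows; ++r) {
        const float* a_row = a + r * cols;
        float* y_row = y + r * cols;
#if defined(__cilk)
        // Inside a core: the [start:length] section expresses the whole row
        // as one vector operation; the scalar x is broadcast.
        y_row[0:cols] = x * a_row[0:cols] + y_row[0:cols];
#else
        // Fallback without the language extension: an ordinary scalar loop.
        for (std::size_t c = 0; c < cols; ++c) {
            y_row[c] = x * a_row[c] + y_row[c];
        }
#endif
    }
}
```

The point of the split is that the keyword and the array section solve different problems: one decides which core gets which rows, the other decides how a single core processes a row.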

If you want a language extension and can also use template libraries, you have these options for vectorization:

- pragmas
- compiler flags
- API calls from another template library that will vectorize for you
- Array Notations component of Cilk™ Plus
- explicit vectorization (writing your own AVX/SSE code)

If template library vs. language extension vs. both is a non-issue for you, then the deciding factors are TIME and DEGREES of freedom. I will discuss those in the next blog post and video.



Noah Clemons (Intel)

I think I'm going to do just work days. It's ambitious enough as it is! Thank you for the comment. Here is blog 2:

Clay B.

This is cool. And an ambitious goal of 100 blogs and videos in 100 days. Does that include weekends and holidays?

This first example was spot on. I'm looking forward to the next 99.
