<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Arch Robison (Intel)</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/arch-robison/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Specification for Low Overhead Tool Annotations Released</title>
		<link>http://software.intel.com/en-us/blogs/2011/11/11/specification-for-low-overhead-tool-annotations-released/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/11/11/specification-for-low-overhead-tool-annotations-released/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 15:17:30 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/11/11/specification-for-low-overhead-tool-annotations-released/</guid>
		<description><![CDATA[Program analysis tools can be valuable for debugging program correctness and performance issues, even more so for multi-threaded programs.   Some of these tools need to know about certain events in the program. For example, race detection for Intel® Cilk™ Plus programs requires knowing precisely when spawn and sync events happen.  Similar events are necessary to analyze [...]]]></description>
			<content:encoded><![CDATA[<p>Program analysis tools can be valuable for debugging program correctness and performance issues, even more so for multi-threaded programs.   Some of these tools need to know about certain events in the program. For example, race detection for Intel® Cilk™ Plus programs requires knowing precisely when spawn and sync events happen.  Similar events are necessary to analyze Intel® TBB programs and OpenMP. </p>
<p>To facilitate interoperation of compilers and such tools, we have posted an <a href="http://software.intel.com/en-us/articles/intel-cilk-plus-specification/"><span style="color: #0000ff;">open specification</span></a> of a feature for communicating such events from the compiled code to the analysis tool.  The feature consists of two compiler intrinsics and an extension to the executable format.  We're already using the feature in:</p>
<ul>
<li>The Intel compiler implementation of Cilk Plus.</li>
<li><a href="http://software.intel.com/en-us/articles/intel-parallel-inspector/">Intel® Inspector</a></li>
<li><a href="http://software.intel.com/en-us/articles/intel-cilk-plus-software-development-kit/">Intel® Cilk™ screen and Intel® Cilk™ view</a>.</li>
</ul>
<p>We are, of course, also working on adding it to the <a href="http://gcc.gnu.org/svn/gcc/branches/cilkplus/README.cilk">Cilk Plus development branch of GCC</a>, so that code generated by that GCC branch will be analyzable by the aforementioned Intel tools.  See the link "Intrinsics for Low Overhead Tool Annotations" on the page <a href="http://software.intel.com/en-us/articles/intel-cilk-plus-specification/">Intel<sup>®</sup> Cilk™ Plus Specification</a> for the specification. </p>
<p>The feature is not limited to Cilk Plus -- it's a general mechanism for communicating events.  Better yet, it exists as a compiler intrinsic function, so that you can mark up your own code with your own events, and not just rely on ones created by the compiler.  Of course you’ll need an analysis tool that understands your events.  We’re working on a library to make it easy for users of <a href="http://www.pintool.org/"><span style="color: #0000ff;">Pin</span></a> to decode the events.</p>
<p>The feature avoids the overhead of the traditional approach of reporting an event with a subroutine call, which incurs run-time overhead even if the subroutine does nothing when the program is <em>not</em> being analyzed.  The feature creates a table in the executable of where the events are, and optionally a no-op instruction that can be overwritten by a tool like <span style="color: #0000ff;">Pin</span> if you need to instrument the event.  We've been careful to design the extension to the executable format so that only the compiler and analysis tool need to understand it.  Other parts of the tool chain such as the assembler and linker do not have to understand it, unless they attempt a level of code rewriting not normally seen in such tools.</p>
<p>The specification is intended to enable compilers to generate the annotations and program analysis tools to consume the annotations.   As an example, it shows how we use the annotation mechanism to mark Intel® Cilk™ Plus programs. </p>
<p>So if you are interested in program analysis tools, please take a look at the specification.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/11/11/specification-for-low-overhead-tool-annotations-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Rules for Array Sections in Intel(R) Cilk(TM) Plus</title>
		<link>http://software.intel.com/en-us/blogs/2011/07/26/new-rules-for-array-sections-in-intelr-cilktm-plus/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/07/26/new-rules-for-array-sections-in-intelr-cilktm-plus/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 17:54:19 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[array notation]]></category>
		<category><![CDATA[Cilk Plus]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/07/26/new-rules-for-array-sections-in-intelr-cilktm-plus/</guid>
		<description><![CDATA[Fans of Cilk Plus or language specifications may be interested in the revised specification of Intel® Cilk™ Plus posted at http://software.intel.com/file/37679/Intel_Cilk_plus_lang_spec_2.htm .   Clark Nelson did most of the work for turning the previous specification into something closer to standardese and illuminating ambiguities in the previous specification.  I'll mention two important changes that the new specification to [...]]]></description>
			<content:encoded><![CDATA[<p>Fans of Cilk Plus or language specifications may be interested in the revised specification of Intel® Cilk™ Plus posted at http://software.intel.com/file/37679/Intel_Cilk_plus_lang_spec_2.htm .   Clark Nelson did most of the work for turning the previous specification into something closer to standardese and illuminating ambiguities in the previous specification.  I'll mention two important changes that the new specification makes to improve the language extension.  One permits compilers to generate more efficient code.  The other resolves a fundamental conflict that array sections brought up.</p>
<h2>More Efficient Handling of Array Sections</h2>
<p>Suppose p and q are pointers.  Consider the following array-section assignment:</p>
<pre>p[0:n] = q[0:n]+1;
</pre>
<p>This statement sets p[i]=q[i]+1 for i in {0...n-1}.  But what happens if these two sequences overlap in memory?   In our original specification, we followed the practice of APL and Fortran 90, and said that the right side must be evaluated first before it is assigned to the left side.  This practice makes the example well defined regardless of whether there is overlap.</p>
<p>However, a key principle of C++ is "abstraction with minimal penalty".  The APL and Fortran 90 approach violates this principle, because it requires the compiler to generate a temporary array for the right side result whenever the compiler cannot <em>prove </em>that there is no overlap.  In many practical situations, the compiler cannot prove absence of overlap, even though the programmer knows there is no overlap.   Thus with the old specification, the programmer ended up paying a penalty for using array notation.</p>
<p>The revised specification makes the partial overlap case undefined.  The compiler is free to compile the example without generating a temporary array.  Indeed, under some circumstances, the array sections can conceivably be compiled into more efficient code than their Fortran counterparts.  Now it is a programmer error if p and q point to partially overlapping arrays.  However, if the overlap is <em>exact, </em>the code is well defined.  This is important for keeping the equivalence of common update idioms like "p[0:n]+=1" and "p[0:n] = p[0:n]+1". </p>
<p>The new specification makes the overlap rules for array sections consistent with the existing ones for structs/classes/unions.  If I write "*p = *q" and the pointers point to structs, the assignment is well defined only if p and q point to non-overlapping structures, or point to exactly the same structure.   That way, the compiler can avoid generating unnecessary temporary structures.  Now the rule and its benefits apply to array sections too.</p>
<p>The change will break existing code  that relied on the temporary arrays.  It's an inconvenience that in some cases might require revising existing code, but I believe it's justified by the greater long-term good for Array Notation.</p>
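<p>The difference between the two rules can be sketched in standard C++, without Array Notation.  This is only an illustration under my own assumptions: the function names are hypothetical, and the single-pass loop is just one code shape that the revised rule permits, not what any particular compiler emits.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Old rule (APL / Fortran 90 style): evaluate the whole right side into a
// temporary array, then assign the temporary to the left side.
void add_one_via_temporary(int* p, const int* q, std::size_t n) {
    std::vector<int> tmp(n);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = q[i] + 1;
    for (std::size_t i = 0; i < n; ++i) p[i] = tmp[i];
}

// One code shape the revised rule permits: a single pass, no temporary.
void add_one_in_place(int* p, const int* q, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] = q[i] + 1;
}

// Hypothetical driver: q is the start of a 5-element buffer of zeros and
// p starts `shift` elements later in the same buffer.
std::vector<int> run_overlap_case(bool use_temporary, std::size_t shift) {
    std::vector<int> a(5, 0);
    assert(shift < a.size());
    const std::size_t n = a.size() - shift;
    if (use_temporary) add_one_via_temporary(a.data() + shift, a.data(), n);
    else               add_one_in_place(a.data() + shift, a.data(), n);
    return a;
}
```

<p>With <tt>shift</tt> equal to 1 (partial overlap) the temporary version leaves the buffer as 0 1 1 1 1, while the single pass leaves 0 1 2 3 4, which is why such code is now a programmer error.  With <tt>shift</tt> equal to 0 (exact overlap) both versions agree.</p>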
<h2>Rank of "Pointer + Int" Resolved</h2>
<p>In Array Notation, "x op y" is normally performed elementwise if x or y is an array section.  For example, "z[0:n] = x[0:n]+y[0:n];" sets z[i] to x[i]+y[i] for i in {0...n-1}.   One-dimensional sections are said to have rank 1.  The rule extends to multidimensional arrays.  For instance, "a[0:m][0:n] = x[0:m][0:n] + y[0:m][0:n];" does elementwise addition and assignment across two dimensions.  The array sections are said to have rank 2.  In general, in "x op y", if both x and y have non-zero rank, their ranks must match, and the rank of the result is the same. </p>
<p>But there is an important exception for the subscript operator.  Suppose <em>x </em>and <em>i </em>are expressions with non-zero rank.  The rank of x[i] must be the sum of the ranks of x and i.  This is because expressions like x[0:m][0:n] are equivalent to (x[0:m])[0:n].  That is, subscripting is always a one-dimensional operation in C++, and multiple dimensions are "faked" by multiple subscript operations. </p>
<p>So far, this is plain stuff, and seems non-controversial.  The tricky issue is determining the rank of p+i when p is a pointer and i is an integer, both with non-zero rank.  Should it follow the normal "x op y" rule, or the array subscript rule?  Either way breaks one of two fundamental identities in C++:</p>
<ol>
<li>(p+i)+j == p+(i+j)</li>
<li>p[i] == *(p+i)</li>
</ol>
<p>Following the "x op y" rule preserves identity 1, but breaks identity 2.  To see this, suppose p, i, and j are expressions with rank 1.  Then both sides of identity 1 have rank 1 under the "x op y" rule.  But in identity 2, the sides have different rank: "p[i]" has rank 2 and "*(p+i)" has rank 1.</p>
<p>An alternative would be to treat "p+i" similarly to "p[i]", and say that the rank is the sum of the ranks of the operands.  That preserves identity 2.  But it breaks 1, because then "(p+i)+j" has rank 3, but "p+(i+j)" has rank 2.</p>
<p>So one of the identities must break.  The tie-breaker is consideration of templates and possible future extension to user-defined operators.  In those cases, we want a uniform rule for "x + y" that does not depend on the types of x and y.  So we broke identity 2, and kept the rank of "p+i" the same as for any other "p op i".  Subscripting is the only special operator with respect to rank.</p>
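<p>The rank arithmetic behind the two identities can be checked mechanically.  Here is a small sketch in plain C++ that computes only ranks, not values.  The function and struct names are my own, chosen for illustration, and do not come from the specification.</p>

```cpp
#include <cassert>

// Rank of "x op y" for an elementwise operator: non-zero ranks must match,
// and the result has that rank.
int rank_elementwise(int rx, int ry) {
    assert(rx == 0 || ry == 0 || rx == ry);
    return rx != 0 ? rx : ry;
}

// Rank of "x[i]": the sum of the ranks of x and i.
int rank_subscript(int rx, int ri) { return rx + ri; }

// Rank of "*e": the same as the rank of e.
int rank_deref(int re) { return re; }

struct IdentityRanks {
    int lhs1, rhs1;  // ranks of (p+i)+j and p+(i+j)
    int lhs2, rhs2;  // ranks of p[i] and *(p+i)
};

// Computes the ranks on both sides of each identity for rank-1 p, i, and j.
// pointer_plus_adds_ranks == false models the adopted rule (p+i is an
// ordinary "x op y"); true models the rejected alternative.
IdentityRanks identity_ranks(bool pointer_plus_adds_ranks) {
    const int p = 1, i = 1, j = 1;
    int (*ptr_plus)(int, int) =
        pointer_plus_adds_ranks ? rank_subscript : rank_elementwise;
    IdentityRanks r;
    r.lhs1 = ptr_plus(ptr_plus(p, i), j);          // (p+i)+j
    r.rhs1 = ptr_plus(p, rank_elementwise(i, j));  // p+(i+j); i+j is integer addition
    r.lhs2 = rank_subscript(p, i);                 // p[i]
    r.rhs2 = rank_deref(ptr_plus(p, i));           // *(p+i)
    return r;
}
```

<p>Under the adopted rule the ranks are 1 and 1 for identity 1 but 2 and 1 for identity 2.  Under the alternative they are 3 and 2 for identity 1, and 2 and 2 for identity 2.</p>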
<h2>Summary</h2>
<p>The revised specification resolves some important issues in Array Notation.  It should make it much clearer to both users and implementers, and enable "abstraction with minimal penalty" for Array Notation.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/07/26/new-rules-for-array-sections-in-intelr-cilktm-plus/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lambda + std::move + Array Notation</title>
		<link>http://software.intel.com/en-us/blogs/2011/05/05/lambda-stdmove-array-notation/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/05/05/lambda-stdmove-array-notation/#comments</comments>
		<pubDate>Thu, 05 May 2011 17:06:16 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/05/05/lambda-stdmove-array-notation/</guid>
		<description><![CDATA[Here is an interesting example of combining new language features. I needed to write a routine similar to std::copy, except that the routine needed to destroy its source sequence; i.e., revert it to raw memory.  I was amused that I could exploit both C++ 2011 and the Intel Array Notation extension to give the compiler [...]]]></description>
			<content:encoded><![CDATA[<p>Here is an interesting example of combining new language features.</p>
<p>I needed to write a routine similar to std::copy, except that the routine needed to destroy its source sequence; i.e., revert it to raw memory.  I was amused that I could exploit both C++ 2011 and the Intel Array Notation extension to give the compiler wider license to optimize the code.   Here is the code: </p>
<pre name="code" class="cpp">template&lt;typename T&gt;
T* destructive_move( T* first, T* last, T* output ) {
    size_t n = last-first;
    []( T&amp; in, T&amp; out ){
        out = std::move(in);
        in.~T();
    }( first[0:n], output[0:n] );
    return output+n;
}</pre>
<p>I used the Intel 12.0 compiler and Microsoft Visual Studio 2010 to compile it.  The code combines three new features:</p>
<ul>
<li>A <strong>C++11 lambda expression</strong>:  That's the code introduced by [], which creates a functor object.  The functor has two formal parameters, in and out.</li>
<li>A C++11 <strong>"move assignment"</strong>: It moves the contents of in to out, and gives license to operator= to mangle the value of in for sake of optimization.   This can be a significant performance advantage over a C++98 copy assignment  "out = in".  For example, if T is a std::vector&lt;U&gt; with <em>N</em> elements, a "copy assignment" of T takes O(<em>N</em>) time, but "move assignment" takes only O(1) time, because it can steal the underlying hunk of N elements instead of copying them.  Furthermore, often "move assignment" can be guaranteed to not throw an exception, because it steals resources instead of allocating more resources that might run out.</li>
<li>Intel®  Cilk Plus<strong> <a href="http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/">Array Notation</a></strong>: It maps the functor over first[0:n] and output[0:n].  Each of these represents a sequence of n elements, starting at addresses <code>first</code> and <code>output</code> respectively.  Array Notation is an Intel extension that avoids the need for an explicit loop, and gives the compiler license to vectorize the code.   The only drawback is that if an exception is thrown, it's not clear how many elements of first[0:n] were actually destroyed.  Though for the cases of interest to me, neither the "move assignment" nor explicit destructor ~T() can throw an exception. </li>
</ul>
<p>I could have written the routine as:</p>
<pre name="code" class="cpp">template&lt;typename T&gt;
T* destructive_move( T* first, T* last, T* output ) {
    size_t n = last-first;
    output[0:n] = std::move(first[0:n]);
    first[0:n].~T();
    return output+n;
}</pre>
<p>which is more concise, but has the drawback of making two sweeps over memory.</p>
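<p>For readers without a compiler that supports Array Notation, the same single-sweep routine can be sketched as an explicit loop in standard C++11.  This is a portable stand-in rather than the code I actually used.  The names <tt>destructive_move_loop</tt> and <tt>destructive_move_demo</tt> are made up for this illustration, and the loop form does not give the compiler the same license to vectorize.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <string>
#include <utility>
#include <vector>

// Single sweep: move-assign each element to the output, then destroy the source.
template<typename T>
T* destructive_move_loop(T* first, T* last, T* output) {
    for (; first != last; ++first, ++output) {
        *output = std::move(*first);  // output must already hold a constructed T
        first->~T();                  // the source slot reverts to raw memory
    }
    return output;
}

// Hypothetical driver: the source lives in raw storage so that its elements
// are destroyed exactly once, by destructive_move_loop.
std::vector<std::string> destructive_move_demo() {
    const std::size_t n = 3;
    std::allocator<std::string> alloc;
    std::string* src = alloc.allocate(n);  // raw, unconstructed memory
    const char* words[] = {"spawn", "sync", "steal"};
    for (std::size_t i = 0; i < n; ++i)
        ::new (static_cast<void*>(src + i)) std::string(words[i]);

    std::vector<std::string> dst(n);  // constructed, empty targets
    std::string* end = destructive_move_loop(src, src + n, dst.data());
    assert(end == dst.data() + n);

    alloc.deallocate(src, n);  // no destructor calls here: already destroyed
    return dst;
}
```

<p>After the call, the source range must be treated as raw memory, which is why the driver only deallocates it.</p>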
<p>More about Cilk Plus can be found <a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/05/05/lambda-stdmove-array-notation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Detecting Theft by Hyperobject Abuse</title>
		<link>http://software.intel.com/en-us/blogs/2010/11/22/detecting-theft-by-hyperobject-abuse/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/11/22/detecting-theft-by-hyperobject-abuse/#comments</comments>
		<pubDate>Mon, 22 Nov 2010 17:56:10 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Cilk]]></category>
		<category><![CDATA[Cilk Plus]]></category>
		<category><![CDATA[Holder]]></category>
		<category><![CDATA[Hyperobject]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/11/22/detecting-theft-by-hyperobject-abuse/</guid>
		<description><![CDATA[Holder hyperobjects can be abused to detect work stealing.]]></description>
			<content:encoded><![CDATA[<p>Intel® Cilk™ Plus employs work stealing, where threads steal work from other threads.  Though a good Intel Cilk Plus program should not depend on whether work is stolen or not, you might be curious about when it occurs in a program.  This blog shows how to satisfy that curiosity with a <em>holder</em> hyperobject, a generally useful abstraction that I'll abuse somewhat to detect stealing.  </p>
<p>Hyperobjects are Cilk's way of doing parallel reductions.  The best reference on them is the award winning paper "<a href="http://www.fftw.org/~athena/papers/hyper.pdf">Reducers and Other Cilk++ Hyperobjects</a>".   I'll summarize their proper use as background for their abuse. </p>
<p>Intel® Cilk™ Plus allows control flow to fork into multiple <em>strands </em>of execution.  A hyperobject is a special kind of object for which there are multiple views.  Any two concurrently executing strands get separate views that they can safely update without locking.  Here is a trivial example:</p>
<pre name="code" class="cpp">#include &lt;cilk/cilk.h&gt;
#include &lt;cilk/reducer_opadd.h&gt;

cilk::reducer_opadd&lt;int&gt; X;

void f() {
    X += 1;
}

int g() {
    cilk_spawn f();
    X += 2;
    cilk_sync;
    return X.get_value();
}</pre>
<p>Variable X is declared as a hyperobject for doing addition reduction over type int.    Here is what happens when function g() is called and there is an idle thread that successfully steals.</p>
<ol>
<li>The <tt>cilk_spawn</tt> causes control flow to fork into two strands of execution.  One strand calls f(), which executes <tt>X+=1</tt>.</li>
<li>The other strand is a <em>continuation</em> of the caller's execution.  The idle thread may steal the continuation and execute "X+=2".  If stealing does not occur, the strand executes after f() returns.</li>
<li>Execution waits at the <tt>cilk_sync</tt> until both strands complete. </li>
<li>The value of <tt>X</tt> is returned. </li>
</ol>
<p>As a practical matter, stealing is unlikely in this example because f() executes so quickly that the original thread will get to execution of the continuation before a thief can grab it.  But for exposition's sake, assume that <tt>+=</tt> is slow. </p>
<p>If X were an ordinary <tt>int</tt>, having two strands concurrently update <tt>X</tt> would be unsafe, because one of the updates might stomp on the other.  Declaring it as a hyperobject avoids the problem.  When the thief operates on <tt>X</tt>, it gets a fresh view, initialized to 0, the identity element for addition.   The cilk_sync causes the views to be merged into a single view.  The declaration of <tt>X</tt> implies merging by addition, so the net effect of calling g() is <tt>X+=3</tt>.  </p>
<p>If the continuation is <em>not </em>stolen, a fresh view is not created.  The <tt>X+=1</tt> happens first, followed by <tt>X+=2</tt>, both on the same view.  Thus the net effect of g() is still <tt>X+=3</tt>.</p>
<p>The reduction operation should be associative, or as in the case of floating-point addition, practically associative for given circumstances.  But it need not be commutative.   When two views merge, the reduction operation is always applied such that the left operand is the view for the spawned routine and the right operand is the view for the stolen continuation.  </p>
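<p>To see why the merge order matters, here is a serial model of g() in standard C++, using string concatenation as the reduction because it is associative but not commutative.  It is a sketch of the view bookkeeping only.  There is no real parallelism or Cilk runtime involved, and the function name is hypothetical.</p>

```cpp
#include <cassert>
#include <string>

// Models g() with a concatenation reducer: the spawned routine appends "f"
// and the continuation appends "c".
std::string run_with_views(bool continuation_stolen) {
    std::string view;  // the view the caller starts with
    if (!continuation_stolen) {
        view += "f";  // spawned routine runs first on the only view
        view += "c";  // continuation follows on the same view
        return view;
    }
    std::string fresh;      // the thief gets a fresh view
    assert(fresh.empty());  // initialized to the identity, the empty string
    view += "f";            // spawned routine updates the original view
    fresh += "c";           // stolen continuation updates the fresh view
    view = view + fresh;    // merge at the sync: left is the spawned
                            // routine's view, right is the continuation's
    return view;
}
```

<p>Both cases yield "fc", the same result as the serial program.  Merging with the operands swapped would yield "cf" in the stolen case.</p>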
<p>Now about detecting steals.  The idea is to detect when a view is fresh.  I'll use a global boolean hyperobject "Seen" for this purpose.   I'll use "Seen==false" to indicate that a view is fresh.  This convention simplifies the code by exploiting default initialization of new views, so I do not have to explicitly specify their value.  </p>
<p>The domain of the "reduction" is boolean values.  The reduction operation is "<em>x</em> op <em>y</em> → <em>x</em>".  It's associative but not commutative.  A hyperobject with this reduction operation is called a <em>holder</em>, because it holds the left value.   The right value is irrelevant because it corresponds to the view created by a thief, and inspected by that thief.    (Exercise for the mathematically inclined: do left and right identity values exist for this operation?  What are they?)</p>
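<p>The associativity and non-commutativity claims are easy to check exhaustively over the boolean domain.  Here is a minimal sketch in standard C++, with function names of my own choosing.  It deliberately leaves the exercise about identity values alone.</p>

```cpp
#include <cassert>

// The holder "reduction": x op y yields x, ignoring the right operand.
bool hold_left(bool x, bool /*y*/) { return x; }

// True if (x op y) op z == x op (y op z) for every boolean triple.
bool holder_is_associative() {
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            for (int z = 0; z < 2; ++z)
                if (hold_left(hold_left(x, y), z) != hold_left(x, hold_left(y, z)))
                    return false;
    return true;
}

// True if x op y == y op x for every boolean pair.
bool holder_is_commutative() {
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (hold_left(x, y) != hold_left(y, x))
                return false;
    return true;
}

// Checks both claims made in the text.
void holder_demo() {
    assert(holder_is_associative());
    assert(!holder_is_commutative());
}
```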
<p>Here is a complete example of using a holder to detect steals.  You can compile and run it with a compiler that supports Intel® Cilk™ Plus.    </p>
<pre name="code" class="cpp">#include &lt;cilk/cilk.h&gt;
#include &lt;cilk/reducer.h&gt;

template&lt;typename U&gt;
class Holder {
    struct Monoid : cilk::monoid_base&lt;U&gt; {
        static void reduce(U *left, U *right) {}
    };

    cilk::reducer&lt;Monoid&gt; impl;
public:
     inline U&amp; get_view() {
        return impl.view();
    }
};

Holder&lt;bool&gt; Seen;

#include &lt;cstdio&gt;

int main() {
    Seen.get_view() = true;
    cilk_for( int i=0; i&lt;100000000; ++i ) {
        bool&amp; x = Seen.get_view();
        if( !x ) {
            std::printf("Iteration %d was stolen\n",i);
            x = true;  // Must not forget this part.
        }
    }
    return 0;
}</pre>
<p> Here is an explanation of the program's parts:</p>
<ul>
<li>Template <tt>Holder&lt;U&gt;</tt> defines a holder for views of type U, using the template <tt>cilk::monoid_base</tt> defined in <tt>&lt;cilk/reducer.h&gt;</tt>.  </li>
<li>Template class <tt>cilk::reducer&lt;Monoid&gt;</tt> requires that signature <tt>Monoid::reduce</tt> compute "*left = *left <em>op</em> *right".  The implementation is trivial for the holder reduction operation  "<em>x</em> op <em>y</em> → <em>x</em>".</li>
<li>The views conceptually live in <tt>Holder::impl</tt>.  A strand invokes <tt>impl.view()</tt> to get a reference to its view.</li>
<li><tt>Seen</tt> is declared as a <tt>Holder&lt;bool&gt;</tt>.  The views of the bool live in <tt>Seen.impl</tt>.  The initial view is default initialized to false.</li>
<li>Function <tt>main</tt> marks the initial view as seen.</li>
<li>Function <tt>main</tt> executes a <tt>cilk_for</tt> loop, which parcels out chunks of iterations as work.</li>
<li>Each iteration inspects its view of <tt>Seen</tt>.  If the view is false, then it is a freshly created view, which indicates that the chunk was stolen.  The view is  marked as seen after the theft is reported.</li>
</ul>
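<p>If you do not have a Cilk Plus compiler handy, the logic can be modeled serially in standard C++.  The sketch below is a simplification under my own assumptions: the schedule of chunks and which of them are "stolen" is supplied by hand, every stolen chunk gets its own fresh view, and all other chunks share the initial view.  The real runtime decides chunking and stealing dynamically.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one chunk of cilk_for iterations.
struct Chunk {
    int begin;    // first iteration in the chunk
    int end;      // one past the last iteration
    bool stolen;  // whether we pretend a thief ran this chunk
};

// Returns the iterations at which the program above would report a steal.
std::vector<int> reported_steals(const std::vector<Chunk>& chunks) {
    std::vector<int> reports;
    bool initial_view = false;  // views are default initialized to false
    initial_view = true;        // main marks the initial view as seen
    for (std::size_t c = 0; c < chunks.size(); ++c) {
        assert(chunks[c].begin <= chunks[c].end);
        bool fresh_view = false;  // what a thief would get
        bool& x = chunks[c].stolen ? fresh_view : initial_view;
        for (int i = chunks[c].begin; i < chunks[c].end; ++i) {
            if (!x) {
                reports.push_back(i);  // stands in for the printf
                x = true;              // must not forget this part
            }
        }
    }
    return reports;
}

// Fixed example schedule: four chunks of four iterations, two of them stolen.
std::vector<int> steal_demo() {
    std::vector<Chunk> chunks = {
        {0, 4, false}, {4, 8, true}, {8, 12, false}, {12, 16, true}};
    return reported_steals(chunks);
}
```

<p>For the example schedule, only the first iteration of each stolen chunk is reported, namely iterations 4 and 12.</p>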
<p>The code abuses hyperobjects in the sense that its visible behavior depends on whether steals happen or not.  Not all uses of holders are abusive.  Consider a scratchpad variable that is used for temporary storage, but its final value does not matter.  Changing the variable to a holder enables a Cilk program to safely operate on it, without any locks, because each thread will get its own view as necessary.  Furthermore,  each view is operated on in the left-to-right order of the original program.  For some applications, that's a valuable property that enables maintaining complex state in the scratchpad, which is not necessarily a practical thing to do with thread-local storage.</p>
<p>Footnote: Intel and Cilk are trademarks of Intel Corporation in the U.S. and/or other countries.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/11/22/detecting-theft-by-hyperobject-abuse/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thread Parallelism Using Cilk Notation for C/C++</title>
		<link>http://software.intel.com/en-us/blogs/2010/09/13/thread-parallelism-using-cilk-notation-for-cc/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/09/13/thread-parallelism-using-cilk-notation-for-cc/#comments</comments>
		<pubDate>Tue, 14 Sep 2010 03:43:39 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Cilk]]></category>
		<category><![CDATA[Seismic Duck]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/09/13/thread-parallelism-using-cilk-notation-for-cc/</guid>
		<description><![CDATA[Getting top performance out of  modern processors requires both SIMD and thread  parallelism.   Intel® Cilk Plus is an easy way to express both.   My first blog covered the SIMD part.   This blog explains the thread part. Background As outlined in the first blog, fork-join and SIMD parallelism can be combined to solve a [...]]]></description>
			<content:encoded><![CDATA[<p>Getting top performance out of  modern processors requires both SIMD and thread  parallelism.   Intel® Cilk Plus is an easy way to express both.   My <a href="http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/">first blog</a> covered the SIMD part.   This blog explains the thread part.</p>
<h2>Background</h2>
<p>As outlined in the first blog, fork-join and SIMD parallelism can be combined to solve a computational problem:</p>
<ul>
<li><strong>fork-join</strong> parallelism can specify which subproblems can be solved in parallel by different hardware threads.</li>
<li><strong>SIMD</strong> parallelism can be applied to each subproblem, which helps the compiler exploit SIMD instructions.</li>
</ul>
<p>In fork-join parallelism, control flow forks into separate flows, and later the flows join back together.  Recursive fork-join can be particularly effective.  For example, 10 levels of two-way splits create 1024-way potential parallelism.  1024 separate hardware<em> threads</em> would be inefficient.  But systems like Cilk (and TBB)  are careful to turn potential parallelism into actual parallelism only when necessary.  Indeed with these systems you should specify as much fork-join parallelism as you can, and let the system determine when, where, and how much to exploit.</p>
<p>An important point is that the subproblems should fit in cache, otherwise multiple threads can easily run up against memory bandwidth limitations.  Sometimes it is hard to determine what a cache-sized subproblem is.  When in doubt, go as small as practical.  The subproblems are too small if the overhead of scheduling a chunk  becomes significant compared to the work for solving the subproblem serially.  Because Cilk Plus is tightly integrated into the compiler, and has highly structured parallelism, it tends to have lower chopping overhead than systems like TBB that are less structured and supplied as  libraries.  So you can often afford to chop into finer subproblems than you might with a system such as TBB.  [If you are interested in a deep comparison of Cilk vs. TBB, let me know, and I'll post some blogs on the subject.]</p>
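<p>The shape of recursive two-way splitting, and the role of the subproblem size, can be sketched serially in standard C++.  The names and the <tt>grain</tt> parameter are hypothetical.  The comments mark where the Cilk keywords would go, and the shared leaf counter is there only to count subproblems in this serial sketch (in a parallel version it would be a race).</p>

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sums v[lo..hi) by recursive two-way splitting down to chunks of at most
// `grain` elements, counting how many leaf subproblems were created.
long long sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi,
                    std::size_t grain, std::size_t& leaf_count) {
    assert(lo <= hi && hi <= v.size() && grain > 0);
    if (hi - lo <= grain) {
        ++leaf_count;  // a subproblem small enough to solve serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // In Cilk, the first call is where a cilk_spawn would go ...
    const long long left = sum_range(v, lo, mid, grain, leaf_count);
    const long long right = sum_range(v, mid, hi, grain, leaf_count);
    // ... and a cilk_sync would go here, before the results are combined.
    return left + right;
}

// Hypothetical helper: returns the number of leaves for n ones.
std::size_t count_leaves(std::size_t n, std::size_t grain) {
    std::vector<int> ones(n, 1);
    std::size_t leaves = 0;
    const long long total = sum_range(ones, 0, n, grain, leaves);
    assert(total == static_cast<long long>(n));
    return leaves;
}
```

<p>For 65536 elements and a grain of 64, the recursion goes 10 levels deep and produces 1024 leaf subproblems, matching the 1024-way potential parallelism mentioned above.</p>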
<p>A particularly effective form of chopping is exhibited by <a href="http://en.wikipedia.org/wiki/Cache-oblivious_algorithm"><em>cache oblivious algorithms</em></a>.   These optimize for all levels of cache  that might exist (including virtual memory!), oddly enough without knowing anything about the levels.  See the link for more information.</p>
<h2>Quick Introduction to Cilk Notation</h2>
<p>There are two ways to express fork-join parallelism in Cilk:</p>
<ul>
<li><tt>cilk_spawn</tt> / <tt>cilk_sync</tt></li>
<li><tt>cilk_for</tt></li>
</ul>
<p>For example, the statement <B><tt>cilk_spawn f();</tt></B> asynchronously calls a function <tt>f</tt>.  Here <em>asynchronous</em> means that the caller keeps going without waiting for the callee to return.   This is a neat way to express parallelism because most of what you learned about subroutine calls still applies.  The actual arguments are evaluated, bound to formal arguments, and the callee's body is invoked.    The only thing that changes is that the caller continues to execute after the callee is entered, if there is a hardware thread available to continue execution of the caller.</p>
<p>Eventually a caller must wait on its callees.  To do so, the caller invokes <tt>cilk_sync</tt>.  For example, the following code lets <tt>f(x)</tt>, <tt>g(y)</tt>, and <tt>h(z)</tt> run in parallel if there are sufficient hardware resources. It waits for them to complete before printing their results.</p>
<pre name="code" class="cpp">a = cilk_spawn f(x);
b = cilk_spawn g(y);
c = h(z);
cilk_sync;
cout &lt;&lt; a &lt;&lt; b &lt;&lt; c &lt;&lt; "\n";</pre>
<p>Here is a picture that shows the execution order.<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/09/flow2.png"><img class="aligncenter size-full wp-image-18620" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/09/flow2.png" alt="" width="457" height="193" /></a></p>
<p>The last call is not spawned as a matter of good Cilk style.  It could be spawned, but doing so  is considered poor style, because there is no other work to do between that call and the <tt>cilk_sync</tt>.  Doing nothing in parallel with doing something is pointless overhead.</p>
<p>The semantics of <tt>cilk_sync</tt> is that it waits for all <tt>cilk_spawn</tt> calls that were issued by the current <em>Cilk block</em>. A Cilk block is the body of a function, or the body of a <tt>cilk_for</tt> loop. Complex control flow is allowed between <tt>cilk_spawn</tt> and <tt>cilk_sync</tt>.  For example, this code:</p>
<pre name="code" class="cpp">for( list::iterator i=x.begin(); i!=x.end(); ++i )
    cilk_spawn f(*i);
cilk_sync;</pre>
<p>walks a linked list <tt>x</tt> sequentially and issues calls <tt>f(*i)</tt> that run concurrently.  Execution waits for all of them to complete at the <tt>cilk_sync</tt>.</p>
<p>There is always an implicit <tt>cilk_sync</tt> when a routine returns.  Hence a callee cannot accidentally leave "dangling tasks" running after it returns. This feature makes it much easier to reason about Cilk parallelism than it is to reason about raw threads.</p>
<p>The callee can be a function object (functor) too.  Since a C++1x lambda expression returns a function object, you can also spawn a block of work like this:</p>
<pre name="code" class="cpp">cilk_spawn [&amp;]{for( int i=0; i&lt;5; ++i ) cout&lt;&lt;"quack ";} ();
other_work(); // Do other work while quacking.
cilk_sync;</pre>
<p>In the example, the lambda expression [&amp;]{...} returns a function object.  The  cilk_spawn...() causes it to execute asynchronously.  Do not forget the trailing parentheses, or the compiler will balk.</p>
<p>The notation <tt>cilk_for</tt> is like a C/C++ <tt>for</tt>, except that iterations run in parallel if resources permit.   For example:</p>
<pre name="code" class="cpp">cilk_for( vector::iterator i=x.begin(); i!=x.end(); ++i )
    f(*i);</pre>
<p>allows, but does not mandate, each loop iteration to run in parallel.  A <tt>cilk_for</tt> loop is usually more efficient than a serial <tt>for</tt> loop wrapped around <tt>cilk_spawn</tt> requests.  The reason is that a <tt>cilk_for</tt> can distribute the work across processors in parallel, and much more efficiently, than the combination of <tt>for</tt> and <tt>cilk_spawn</tt>.  However, a <tt>cilk_for</tt> is more limited than a plain <tt>for</tt>.  The iteration variable must be of an integral type or a random access iterator, and the <tt>cilk_for</tt> must fit a certain pattern, which permits the number of iterations to be computed before the loop iterations commence.  See the documentation <a href="http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/index.htm#cref_cls/common/cilk_for.htm">here</a> for details on the limitations.</p>
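<p>One way to experiment with the notation on a compiler that lacks Cilk support is to define the keywords away, leaving the equivalent serial program.  The macros below are my own stand-ins for this sketch, not the real definitions from <tt>&lt;cilk/cilk.h&gt;</tt>, and <tt>square_into</tt> and the two wrapper functions are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Stand-in macros (an assumption for this sketch): they erase the keywords
// so the code compiles and runs serially on any C++ compiler.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

// Hypothetical work item: each call writes to its own output slot.
void square_into(int in, int* out) { *out = in * in; }

// A serial for loop that issues one spawn per list element.
std::vector<int> squares_of_list(const std::list<int>& x) {
    std::vector<int> out(x.size());
    std::size_t k = 0;
    for (std::list<int>::const_iterator i = x.begin(); i != x.end(); ++i, ++k)
        cilk_spawn square_into(*i, &out[k]);
    cilk_sync;  // wait for all spawned calls before reading out
    assert(k == x.size());  // every element was visited
    return out;
}

// The cilk_for form: the index is integral, so the trip count is known
// before the iterations commence.
std::vector<int> squares_of_vector(const std::vector<int>& x) {
    std::vector<int> out(x.size());
    cilk_for (std::size_t k = 0; k < x.size(); ++k)
        square_into(x[k], &out[k]);
    return out;
}
```

<p>Because each call writes to a distinct element of <tt>out</tt>, the two functions compute the same squares whether the calls run serially, as here, or in parallel.</p>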
<p>Another key feature of Intel(R) Cilk Plus is hyperobjects.  They deal with the "join" part of fork-join parallelism when there is data to be joined, such as in reductions.  Hyperobjects are a topic for another day.</p>
<h2>Example</h2>
<p>If you've read my other recent blogs, you know I'll use  <a href="http://home.comcast.net/~arch.robison/seismic_duck.html">Seismic Duck</a> as the example.  Two parts of Seismic Duck lend themselves to fork-join parallelism. The first is the three physical models that can be updated in parallel:</p>
<ul>
<li>wavefield propagation</li>
<li>seismogram</li>
<li>reservoir</li>
</ul>
<p>This parallelism can be expressed with <tt>cilk_spawn</tt> and <tt>cilk_sync</tt>.  Here is the code from the actual source:</p>
<pre name="code" class="cpp">WavefieldFunctor wf(subsurface, pausedRequest);
cilk_spawn wf();
SeismogramFunctor sf(seismogramClip, pausedRequest&amp;NimbleDraw);
cilk_spawn sf();
ReservoirFunctor rf(pausedRequest);
rf();
cilk_sync;</pre>
<p>Two of the functors turn out to be just wrappers around function calls.  They were wrapped because I was using TBB.  I really don't need the wrappers  in Cilk.  For example, <tt>SeismogramFunctor</tt> looks like:</p>
<pre name="code" class="cpp">class SeismogramFunctor {
    const NimbleRequest request;
    NimblePixMap seismogramClip;
public:
    SeismogramFunctor( const NimblePixMap&amp; seismogramClip_, NimbleRequest request_ ) :
        seismogramClip(seismogramClip_),
        request(request_)
     {}
     void operator()() const {
         SeismogramUpdateDraw( seismogramClip, request, TheColorFunc, IsAutoGainOn );
     }
};</pre>
<p>so instead of writing that class definition and the lines:</p>
<pre name="code" class="cpp">    SeismogramFunctor sf(seismogramClip, pausedRequest&amp;NimbleDraw);
    cilk_spawn sf();</pre>
<p>I could have discarded the wrapping and just written:</p>
<pre name="code" class="cpp">    cilk_spawn SeismogramUpdateDraw( seismogramClip, pausedRequest&amp;NimbleDraw, TheColorFunc, IsAutoGainOn );</pre>
<p>That's short and sweet.</p>
<p>The second part that can be profitably parallelized is wavefield propagation, by using geometric decomposition and the ghost cell pattern, as described in another <a href="http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism/">blog</a>.  There are two phases:</p>
<ul>
<li>Exchange boundary information between panels.</li>
<li>Update each panel independently.</li>
</ul>
<p>The second phase dominates execution time, so that's the part I'll parallelize.  The parallel code is:</p>
<pre name="code" class="cpp">cilk_for( int p=0; p&lt;NumPanel; ++p ) {
    if( request&amp;NimbleUpdate )
        WavefieldUpdatePanel( p );
    if( request&amp;NimbleDraw )
        WavefieldDrawPanel( p, map );
}</pre>
<p>All I had to do was change the keyword <tt>for</tt> to <tt>cilk_for</tt>.  As I mentioned earlier, to fully utilize the processor, we also need SIMD parallelism.  The panel updates are further divided into sequences of tile updates that improve cache reuse.  <a href="http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/">Part 1</a> explained how a tile update can be expressed concisely with Array Notation.</p>
<h2>Serial Semantics</h2>
<p>It's no accident that the parallel code is almost identical to the original serial code.  Cilk is essentially annotations on a serial program that say where parallelism is permitted.  In fact, your code can still compile with non-Cilk compilers, albeit without the parallelism. Just use the following defines to undo the annotations:</p>
<pre name="code" class="cpp">#define cilk_spawn
#define cilk_sync
#define cilk_for for</pre>
<p>Code compiled this way behaves exactly the way the parallel program behaves if limited to a single hardware thread.  I hope that eventually all C/C++ compilers recognize Cilk notation so that the <tt>#define</tt> trick will not be necessary and Cilk parallelism becomes ubiquitous.</p>
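<p>As a minimal sketch of this serial elision, the block below applies the three defines to a tiny program that an ordinary C++ compiler accepts.  The helper names (<tt>square_into</tt>, <tt>squares</tt>, <tt>spawn_and_sync</tt>) are hypothetical and exist only for illustration; they are not part of Seismic Duck or of Cilk Plus.</p>

```cpp
#include <cassert>
#include <vector>

// Serial elision: on a compiler without Cilk support, these defines turn
// the annotations into ordinary serial C++.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

// Hypothetical helper, used only for illustration.
static void square_into(int v, int& out) { out = v * v; }

// Squares each element of x. Iterations write to distinct elements, so a
// Cilk compiler would be free to run them in parallel.
std::vector<int> squares(const std::vector<int>& x) {
    std::vector<int> y(x.size());
    cilk_for (int i = 0; i < static_cast<int>(x.size()); ++i)
        square_into(x[i], y[i]);
    return y;
}

// Spawns one call, does other work, then waits for the spawned call.
int spawn_and_sync(int a, int b) {
    int ra = 0, rb = 0;
    cilk_spawn square_into(a, ra);  // with the defines: a plain call
    square_into(b, rb);             // the "other work"
    cilk_sync;                      // with the defines: an empty statement
    return ra + rb;
}
```

<p>With the defines in place, <tt>spawn_and_sync(3, 4)</tt> simply runs the two calls one after the other, which is one of the orderings the parallel version is allowed to choose.</p>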
<h2>Summary</h2>
<p>Cilk notation makes fork-join parallelism easy.  Array Notation makes SIMD parallelism easy.  Intel(R) Cilk Plus provides both.  Combined, they are a powerful pair of tools for squeezing high performance from modern microprocessors.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/09/13/thread-parallelism-using-cilk-notation-for-cc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIMD Parallelism using Array Notation</title>
		<link>http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/#comments</comments>
		<pubDate>Fri, 03 Sep 2010 14:56:43 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[array notation]]></category>
		<category><![CDATA[Cilk]]></category>
		<category><![CDATA[Cilk Plus]]></category>
		<category><![CDATA[Seismic Duck]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/</guid>
		<description><![CDATA[C++ programmers need not envy APL and Fortran90 anymore.]]></description>
			<content:encoded><![CDATA[<p>Are you a C or C++ programmer who has ever envied APL or Fortran 90's array expressions?   Read on.  If you don't know what array expressions are, then you really should read on, to find out what you should have envied.  In any case, the envy is over, because  Intel Parallel Composer 2011 brings array expressions to C and C++.</p>
<h2>Background</h2>
<p>A while back I wrote about the <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_8.pdf">Three Layer Cake</a> pattern for parallel programming.  The pattern is a way of organizing programs to fully exploit modern  multi-core chips.   Two of the layers are:</p>
<ul>
<li><strong>fork-join</strong>: harnesses multiple hardware threads.</li>
<li><strong>SIMD</strong>: harnesses SIMD instructions.</li>
</ul>
<p>The compiler in Intel Parallel Composer 2011 extends C++ to directly support these two layers.  The extensions are called Intel(R) Cilk Plus.  They are:</p>
<ul>
<li>Cilk notation for specifying <strong>fork-join</strong> parallelism.</li>
<li>Array notation for specifying <strong>SIMD</strong> parallelism.</li>
</ul>
<p>This blog introduces the array notation, with a <a href="http://home.comcast.net/~arch.robison/seismic_duck.html">Seismic Duck</a> kernel as the example.  I'll introduce Cilk notation in another blog.  The two notations are independent.  Indeed, the array notation is valuable with other threading packages too, such as Threading Building Blocks, or just for writing faster serial code.</p>
<h2>Quick Introduction to Array Notation</h2>
<p>The array notation extension is reminiscent of APL and Fortran-90 style array expressions.   The expression:</p>
<p style="padding-left: 30px">a[<em>index</em>:<em>count</em>]</p>
<p>denotes an <em>array section</em> starting at <em>index </em>with <em>count</em> elements.  Scalar operations can be used on conformable array sections in an intuitive manner.   Operations between scalars and array sections work too; scalars extend in the obvious way (like in APL or Fortran 90).  Examples:  </p>
<pre name="code" class="cpp">    z[i:n] = x[i:n];      // Copies x[i..i+n-1] to z[i..i+n-1].
    z[i:n] = 2*x[i+1:n];  // Sets z[i..i+n-1] to twice the corresponding elements in x[i+1..i+n].
    u[i:m][j:n] += 1;     // Increments elements of two-dimensional mxn array section with upper left corner [i][j].</pre>
<p>Section notation also permits expressions of the form <em>array</em>[<em>index:count:stride</em>], reductions, and shorthands that I will not go into here.   I'm presenting just enough to pique your interest.  To learn more about it, follow <a href="http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/hh_goto.htm#optaps/common/optaps_par_cean_prog.htm">this link</a> to the compiler documentation.   </p>
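<p>To make the meaning of a section concrete, here is a sketch in plain C++ of what a few section assignments do element by element.  The function names are hypothetical, and the loops only describe the result; real array notation needs compiler support and leaves the compiler free to compute the elements in any order.</p>

```cpp
#include <cassert>
#include <cstddef>

// Elementwise meaning of  z[i:n] = x[i:n];
void copy_section(float* z, const float* x, std::size_t i, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        z[i + k] = x[i + k];
}

// Elementwise meaning of  z[i:n] = 2*x[i+1:n];
void scale_shifted_section(float* z, const float* x, std::size_t i, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        z[i + k] = 2 * x[i + 1 + k];
}

// Elementwise meaning of  a[i:n:s] = v;  (n elements, s apart, scalar extended)
void fill_strided(float* a, std::size_t i, std::size_t n, std::size_t s, float v) {
    for (std::size_t k = 0; k < n; ++k)
        a[i + k * s] = v;
}
```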
<h2>Example</h2>
<p>I've described in other blogs how seismic wave propagation in Seismic Duck depends on <a href="http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/">updating a "tile"</a>, a small subarray that fits in cache.  Here is the scalar code that dominates execution time.  It updates a tile with uniform A and B coefficients:</p>
<pre name="code" class="cpp">    float a = 2*A[iFirst][jFirst];
    float b = B[iFirst][jFirst];
    for( int i=iFirst; i&lt;iLast; ++i ) {
        for( int j=jFirst; j&lt;jLast; ++j ) {
            Vx[i][j] += a*(U[i][j+1]-U[i][j]);
            Vy[i][j] += a*(U[i+1][j]-U[i][j]);
            U[i][j] += b*((Vx[i][j]-Vx[i][j-1])+(Vy[i][j]-Vy[i-1][j]));
        }
    }</pre>
<p>To improve the speed on compilers that did not automatically generate SIMD code from the scalar loops, I wrote the key loops with SSE intrinsics, so that calculations are done four wide instead of one at a time.  The resulting code looks like this:</p>
<pre name="code" class="cpp">    #define CAST(x) (*(__m128*)&amp;(x))        /* for aligned load or store */
    #define LOAD(x) _mm_loadu_ps(&amp;(x))      /* for unaligned load */
    #define ADD _mm_add_ps
    #define MUL _mm_mul_ps
    #define SUB _mm_sub_ps
    ...
    __m128 a = CAST(A[iFirst][jFirst]);
    a = ADD(a,a);
    __m128 b = CAST(B[iFirst][jFirst]);
    for( int i=iFirst; i&lt;iLast; ++i ) {
        for( int j=jFirst; j&lt;jLast; j+=4 ) {
            __m128 u = CAST(U[i][j]);
            CAST(Vx[i][j]) = ADD(CAST(Vx[i][j]),MUL(a,SUB(LOAD(U[i][j+1]),u)));
            CAST(Vy[i][j]) = ADD(CAST(Vy[i][j]),MUL(a,SUB(CAST(U[i+1][j]),u)));
            CAST(U[i][j]) = ADD(u,MUL(b,ADD(SUB(CAST(Vx[i][j]),LOAD(Vx[i][j-1])),SUB(CAST(Vy[i][j]),CAST(Vy[i-1][j])))));
        }
    }</pre>
<p>The downside of the change is obvious - it's hard to read. And this was a <em>simple</em> case because logic elsewhere guarantees that <tt>jLast-jFirst</tt> is a multiple of 4.  Otherwise, dealing with the extra iterations would have further obfuscated the code.</p>
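<p>For a sense of the extra bookkeeping when the trip count is not a multiple of 4, here is a sketch of the usual main-loop-plus-remainder structure.  It is written in plain scalar C++, with an inner four-element loop standing in for one SSE operation, and it uses a hypothetical one-row analogue of the Vx update rather than the real Seismic Duck arrays.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical one-row analogue of the Vx update:
//     vx[j] += a*(u[j+1]-u[j])   for j in [jFirst, jLast).
// The first loop handles groups of four elements, standing in for one
// 4-wide SSE step. The second loop mops up the 0 to 3 leftover iterations.
void update_row(float* vx, const float* u, float a,
                std::size_t jFirst, std::size_t jLast) {
    std::size_t j = jFirst;
    for (; j + 4 <= jLast; j += 4)
        for (std::size_t k = 0; k < 4; ++k)   // one 4-wide "vector" step
            vx[j + k] += a * (u[j + k + 1] - u[j + k]);
    for (; j < jLast; ++j)                    // scalar remainder
        vx[j] += a * (u[j + 1] - u[j]);
}
```

<p>With intrinsics, the remainder loop would have to be written separately in scalar code (or with masked or padded vectors), which is the obfuscation the multiple-of-4 guarantee avoids.</p>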
<p>For this particular example, explicit SSE intrinsics are not actually necessary with a compiler that automatically vectorizes (converts to SIMD instructions).  Indeed, recent compilers that I tried seem to be able to do so.  (Though one older compiler from 2008 did not.)   But I was careful to cater to the optimizer.  I declared the arrays Vx, Vy, and U as static file-scope arrays in the source code, not pointers.  That's not trendy OO programming, but it lets the compiler easily prove absence of aliasing, and thus absence of loop-carried dependences that could thwart vectorization.   It's not always practical to cater this way to the optimizer.  Furthermore, array notation has its own elegance.  So I'll use the kernel as a running example anyway. </p>
<p>The array notation in Intel(R) Cilk Plus lets me state my intent ("SIMD parallelism!") to the compiler more bluntly.  Below is an array notation version of the example:</p>
<pre name="code" class="cpp">   int i = iFirst;
   int j = jFirst;
   size_t m = iLast-iFirst;
   size_t n = jLast-jFirst;
   float a = 2*A[i][j];
   float b = B[i][j];
   Vx[i:m][j:n] += a*(U[i:m][j+1:n]-U[i:m][j:n]);
   Vy[i:m][j:n] += a*(U[i+1:m][j:n]-U[i:m][j:n]);
   U[i:m][j:n] += b*((Vx[i:m][j:n]-Vx[i:m][j-1:n])+(Vy[i:m][j:n]-Vy[i-1:m][j:n]));</pre>
<p>Compare the last three lines with the forall loops from which the code was derived in <a href="http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background/">another blog</a>:</p>
<pre name="code" class="cpp">    forall i, j {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
    }
    forall i, j {
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }</pre>
<p>The array notation has let me clearly convey the parallel nature of the updates.  I had to add the setup of i, j, m, n.  But that's an accident of history.  Elsewhere the code computes { iFirst, iLast, jFirst, jLast} from the equivalent of {i, j, m, n} because the former simplified writing the C++ for loops.  If I adapt the rest of the code to use array notation, then { iFirst, iLast, jFirst, jLast} will disappear and {i,j,m,n} will be set up in their place.</p>
<p>Of course I'm still depending on a clever optimizer to eliminate the temporary subarrays.  For example, the subexpression:</p>
<pre name="code" class="cpp">    a*(U[i:m][j+1:n]-U[i:m][j:n]);</pre>
<p>conceptually generates two temporary array sections, for the results of - and *.   In practice, the Intel compiler is good at eliminating those temporaries.   (It's had years of practice doing so for Fortran 90.)  But even if some temporary sections remained, the <em>parallelism</em> is still clear.  The compiler does not have to deduce parallelism from dependence analysis of serial <tt>for</tt> loops. </p>
<p>Concise notation is nice, but how about the performance?  When compiled by the Intel compiler and run on the Core-2 Quad system in my office, the array notation variant performed <em>faster</em> than my hand-coded SSE.   [Your mileage may vary.  See optimization disclaimer <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-2011-documentation/#studio">here</a>.]  I dug through the object code to figure out why.  It turns out that updating Vx, Vy, and U in separate loops does better than with a single loop.   I found out that the hand-coded SSE does as well if changed to use separate loops to update the three arrays.  Anyway, I'm happy that the array notation matches the best that I can do by hand for this example.</p>
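<p>Here is a sketch of the two loop structures being compared, in plain scalar C++ rather than SSE or array notation.  The grid size, the <tt>Grid</tt> struct, and the function names are hypothetical.  The sketch says nothing about speed; it only illustrates that, within a tile, splitting the single loop nest into one nest per array computes the same values, because each Vx and Vy update reads only U values that have not been updated yet.</p>

```cpp
#include <cassert>

// Hypothetical small grid, for illustration only.
const int N = 8;
struct Grid { float U[N][N], Vx[N][N], Vy[N][N]; };

// Fills a grid with small integer values so the arithmetic below is exact.
Grid make_grid() {
    Grid g;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            g.U[i][j]  = float((3*i + 5*j) % 7);
            g.Vx[i][j] = float((i + 2*j) % 5);
            g.Vy[i][j] = float((2*i + j) % 3);
        }
    return g;
}

// One loop nest updating Vx, Vy, and U together, as in the scalar code.
// Requires 1 <= iFirst, iLast <= N-1, and likewise for j.
void update_fused(Grid& g, float a, float b,
                  int iFirst, int iLast, int jFirst, int jLast) {
    for (int i = iFirst; i < iLast; ++i)
        for (int j = jFirst; j < jLast; ++j) {
            g.Vx[i][j] += a*(g.U[i][j+1]-g.U[i][j]);
            g.Vy[i][j] += a*(g.U[i+1][j]-g.U[i][j]);
            g.U[i][j] += b*((g.Vx[i][j]-g.Vx[i][j-1])+(g.Vy[i][j]-g.Vy[i-1][j]));
        }
}

// Three loop nests, one per array, mirroring the three array notation statements.
void update_split(Grid& g, float a, float b,
                  int iFirst, int iLast, int jFirst, int jLast) {
    for (int i = iFirst; i < iLast; ++i)
        for (int j = jFirst; j < jLast; ++j)
            g.Vx[i][j] += a*(g.U[i][j+1]-g.U[i][j]);
    for (int i = iFirst; i < iLast; ++i)
        for (int j = jFirst; j < jLast; ++j)
            g.Vy[i][j] += a*(g.U[i+1][j]-g.U[i][j]);
    for (int i = iFirst; i < iLast; ++i)
        for (int j = jFirst; j < jLast; ++j)
            g.U[i][j] += b*((g.Vx[i][j]-g.Vx[i][j-1])+(g.Vy[i][j]-g.Vy[i-1][j]));
}

// True if all three arrays match element for element.
bool same(const Grid& x, const Grid& y) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (x.U[i][j] != y.U[i][j] || x.Vx[i][j] != y.Vx[i][j] ||
                x.Vy[i][j] != y.Vy[i][j])
                return false;
    return true;
}
```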
<h2>Summary</h2>
<p>The array notation is a concise way to express SIMD parallelism.  I'm hoping it catches on with other compilers.  In <a href="http://software.intel.com/en-us/blogs/2010/09/13/thread-parallelism-using-cilk-notation-for-cc/">another blog</a> I'll introduce the application of Cilk fork-join parallelism to Seismic Duck.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Seismic Duck goes Open Source</title>
		<link>http://software.intel.com/en-us/blogs/2010/08/28/seismic-duck-goes-open-source/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/08/28/seismic-duck-goes-open-source/#comments</comments>
		<pubDate>Sun, 29 Aug 2010 03:42:29 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/08/28/seismic-duck-goes-open-source/</guid>
		<description><![CDATA[Seismic Duck ]]></description>
			<content:encoded><![CDATA[<p>Now you can read the source code for my <a href="http://home.comcast.net/~arch.robison/seismic_duck.html">Seismic Duck</a> game on <a href="https://sourceforge.net/projects/seismic-duck/">SourceForge</a>.  I open-sourced the code for several reasons:</p>
<ul>
<li>My <a href="http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background">blogs</a> on parallelizing it with SSE and TBB omit details of interest.  The blogs chiefly concern the seismic wave propagation code in <a href="https://seismic-duck.svn.sourceforge.net/svnroot/seismic-duck/trunk/Source/Wavefield.cpp">Source/Wavefield.cpp</a>.</li>
<li>Games about reflection seismology are not runaway best sellers.</li>
<li>It's limited to Windows.  I'd like to find volunteers to port it to other platforms.   Mac OS is of particular interest, since it is common in educational settings.   The OS-specific parts are about 600 lines of C++.</li>
</ul>
<p>The code is my own hobby, not Intel's.   As such, comments are  sparse.  I'll expand them as questions arise.  If you are interested in porting it to new platforms, please contact me and I'll help as best I can.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/08/28/seismic-duck-goes-open-source/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Patterns in Seismic Duck – Part 3 (Vectorization and Tiling)</title>
		<link>http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/#comments</comments>
		<pubDate>Sun, 08 Aug 2010 05:08:30 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/</guid>
		<description><![CDATA[How vectorization, tiling, and wide halo patterns speed up the Seismic Duck kernel.]]></description>
			<content:encoded><![CDATA[<p>This blog is part three of four about how I applied parallel patterns to <a href="http://arch.robison.home.comcast.net/seismic_duck.html">Seismic Duck</a>.  <a href="http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background">Part 1</a> covered background material and overall parallelization strategy for the game.  <a href="http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism">Part 2</a> covered threading.  This part covers vectorization, tiling, and wide halo. Part 4 covers bookkeeping details in the real code.</p>
<h2>Vectorization</h2>
<p>I'll describe vectorization first.  The kernel of interest is:</p>
<pre>        for( i=start[k]; i&lt;finish[k]; ++i )     // Loop over rows in chunk k
            for( j=1; j&lt;n; ++j ) {
                Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
                U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
            }</pre>
<p>A Core-2 core can generally operate on 4-element vectors just as fast as it can operate on scalars.  Thus  scalar code wastes 3/4 of the machine's capability.  The kernel is trivial to vectorize.  Let X[i:4] denote a four-element SSE vector, starting at subscript i.  In other words, X[i:4] is the quadruple (X[i], X[i+1], X[i+2], X[i+3]). To vectorize this kernel:</p>
<ul>
<li>Replace the scalar operations with SSE vector operations.</li>
<li>Adjust the stride of the j loop index to loop over SSE vectors instead of scalars.</li>
</ul>
<p>The vectorized code looks like this conceptually:</p>
<pre>        for( i=start[k]; i&lt;finish[k]; ++i )     // Loop over rows in chunk k
            for( j=1; j&lt;n; j+=4 ) {
                Vx[i][j:4] += (A[i][j+1:4]+A[i][j:4])*(U[i][j+1:4]-U[i][j:4]);
                Vy[i][j:4] += (A[i+1][j:4]+A[i][j:4])*(U[i+1][j:4]-U[i][j:4]);
                U[i][j:4] += B[i][j:4]*((Vx[i][j:4]-Vx[i][j-1:4]) + (Vy[i][j:4]-Vy[i-1][j:4]));
            }</pre>
<p>The vectorization is legal because afterwards, the updates of Vx and Vy still use the old values of U, and the updates of U still use the new values of Vx and Vy. Note that allowing multiple iterations to run concurrently using threading would not be safe, as explained in <a href="http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism">part 2</a>. The loop thus demonstrates the point that not all vectorizable loops can be multithreaded, at least not without further transformations such as the ghost cell pattern covered in part 2.</p>
<p>The beta version of the Intel 12.0 C++ compiler actually supports the [j:4] array syntax. It's really convenient to use. Alas I was developing Seismic Duck without that support, so I resorted to SSE intrinsics and macros. The macros look like this:</p>
<pre>        #define CAST(x) (*(__m128*)&amp;(x))    /* for aligned load or store */
        #define LOAD(x) _mm_loadu_ps(&amp;(x))    /* for unaligned load */
        #define ADD _mm_add_ps
        #define MUL _mm_mul_ps
        #define SUB _mm_sub_ps</pre>
<p>With these macros the loop body looks like:</p>
<pre>        __m128 u = CAST(U[i][j]);
        __m128 a = CAST(A[i][j]);
        CAST(Vx[i][j]) = ADD(CAST(Vx[i][j]),MUL(ADD(LOAD(A[i][j+1]),a),SUB(LOAD(U[i][j+1]),u)));
        CAST(Vy[i][j]) = ADD(CAST(Vy[i][j]),MUL(ADD(CAST(A[i+1][j]),a),SUB(CAST(U[i+1][j]),u)));
        CAST(U[i][j]) = ADD(u,MUL(CAST(B[i][j]),ADD(SUB(CAST(Vx[i][j]),LOAD(Vx[i][j-1])),SUB(CAST(Vy[i][j]),CAST(Vy[i-1][j])))));</pre>
<p>It's efficient ugly non-portable vector code.  I'm hoping that other compilers adopt the array syntax so I can write efficient beautiful portable vector code.</p>
<h2>Tiling</h2>
<p>Vectorization alone does not improve performance of the kernel.   It slams it harder against the "memory wall", the limit on memory bandwidth.  Reducing consumption of memory bandwidth becomes critical.  Parts 1 and 2 discussed the importance of the ratio C (Compute density), defined as the number of floating-point additions per memory reference.  Now I'll describe how to raise the value of C multifold with tiling.</p>
<p>The kernel as shown so far updates the wavefield for a single time step. Now it's time to bring in more context. The kernel occurs inside a loop that steps over time, conceptually like this:</p>
<pre>    for( t=0; t&lt;T; ++t ) {  // T = number of time steps per video frame (typically 3)
        replicate borders
        for each chunk k in parallel {
            for( i=start[k]; i&lt;finish[k]; ++i ) // Loop over rows in chunk k
                for( j=1; j&lt;n; j+=4 ) {
                    Vx[i][j:4] += (A[i][j+1:4]+A[i][j:4])*(U[i][j+1:4]-U[i][j:4]);
                    Vy[i][j:4] += (A[i+1][j:4]+A[i][j:4])*(U[i+1][j:4]-U[i][j:4]);
                    U[i][j:4] += B[i][j:4]*((Vx[i][j:4]-Vx[i][j-1:4]) + (Vy[i][j:4]-Vy[i-1][j:4]));
                }
        }
    }</pre>
<p>With the loop structure above, each iteration of the t loop has to reload Vx, Vy, and U from memory.  If the t loop could be changed from being the <em>outermost</em> loop to being the <em>innermost </em>loop, the reloads could be avoided, because each iteration would reuse values cached by the previous iteration. But naively permuting the loops delivers wrong answers, because updates would use incorrect values.</p>
<p>Ignore the "for each chunk" loop for now.  I'll come back to that in the next section.  For now, consider the other three loops. They are an abstract loop over points in space-time with coordinates (t,i,j).  The constraints are that point (t,i,j) must be updated:</p>
<ul>
<li>after update of point (t-1,i,j) (required to preserve dependencies across time)</li>
<li>after update of points (t,i-1,j) and (t,i,j-1) (required to update U correctly)</li>
<li>before update of points (t,i+1,j) and (t,i,j+1) (required to update Vx and Vy correctly)</li>
</ul>
<p>The tiling will update the points in groups. Each group is a rectangular subset of points, called a <em>tile</em>.  A tile is updated using two nested loops as shown in the vectorized code, but i and j range over the tile instead of the entire spatial grid. An additional outer loop sequences the tiles.</p>
<p>Picture the kernel as a problem in tiling a floor. The floor runs along the i and j axes.  The t axis is perpendicular to the floor.  The floor must be tiled T levels deep for each video frame.  T is typically 3 in Seismic Duck. Each tile is of size 1×7×112.  That is, each tile is:</p>
<ul>
<li>1 unit thick along the t axis,</li>
<li>7 units thick along the i axis, and</li>
<li>112 units along the j axis.</li>
</ul>
<p>Why such a long skinny tile and where did the 112 come from? The answer is that the tiles are actually square, when measured in <em>cache lines</em>. It's cache lines that count, because they are the quantum of information transferred between memory and the CPU. A cache line is typically 64 bytes on my intended target hardware, and I'm using single-precision floating-point.  So that's 16 floats per line.  The 7×7 dimension in cache lines works out to be about the right size.  I chose the size by a back of the envelope calculation (based on the size of the L1 cache) and then did some experiments.</p>
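<p>The arithmetic behind the 112 can be written out as a short sketch.  The constants come from the text (64-byte lines, single-precision floats, 7 cache lines on a side); the names and the <tt>tile_footprint_bytes</tt> helper are hypothetical, and how many arrays to count against the cache is left as a parameter.</p>

```cpp
#include <cassert>
#include <cstddef>

// Assumptions taken from the text: 64-byte cache lines, 4-byte floats,
// and a tile that is 7 cache lines on a side.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kFloatsPerLine  = kCacheLineBytes / sizeof(float);
constexpr std::size_t kTileLines      = 7;
constexpr std::size_t kTileRows       = kTileLines;                   // i extent: one line per row
constexpr std::size_t kTileCols       = kTileLines * kFloatsPerLine;  // j extent in floats

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
static_assert(kFloatsPerLine == 16, "16 floats per 64-byte line");
static_assert(kTileCols == 112, "7 lines of 16 floats each");

// Bytes touched by one tile across numArrays arrays of floats.
constexpr std::size_t tile_footprint_bytes(std::size_t numArrays) {
    return numArrays * kTileRows * kTileCols * sizeof(float);
}
```

<p>For example, <tt>tile_footprint_bytes(1)</tt> is 3136 bytes, which is the kind of number to compare against the L1 cache size in a back of the envelope estimate.</p>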
<p>Tiles can be cut into smaller tiles if necessary, particularly when working around the edges of the grid.  Part 4 will say more about this.  The previous constraints on the order of updating points apply similarly to tiles. A tile can be laid at level t only if:</p>
<ul>
<li>all the area under it is tiled to level t-1,</li>
<li>its north and west sides will not be exposed,</li>
<li>its south and east sides will be exposed.</li>
</ul>
<p>Another way to word the constraints is from the viewpoint of an ant crawling over the tiles:</p>
<ul>
<li>No overhanging tiles allowed.</li>
<li>If an ant crawls north or west, it must never go down.</li>
<li>If an ant crawls south or east, it must never go up.</li>
</ul>
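<p>The ant's rules can be checked mechanically.  Below is a sketch that represents a partial tiling as a height map, where each entry is the number of levels laid so far over that cell.  The representation and the function name are hypothetical, and the sketch assumes that north is decreasing i and west is decreasing j.  A height map cannot express an overhang, so the first rule holds by construction; the other two rules turn out to be the same inequality seen from opposite sides.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// height[i][j] = number of tile levels laid so far over cell (i,j).
// All rows are assumed to have the same length.
// Assumption: north is decreasing i and west is decreasing j.
using HeightMap = std::vector<std::vector<int>>;

// Returns true if an ant never goes down when crawling north or west.
// Equivalently, it never goes up when crawling south or east.
bool satisfies_ant_constraints(const HeightMap& h) {
    for (std::size_t i = 0; i < h.size(); ++i)
        for (std::size_t j = 0; j < h[i].size(); ++j) {
            // Crawling north from (i,j) to (i-1,j) must not go down.
            if (i > 0 && h[i-1][j] < h[i][j]) return false;
            // Crawling west from (i,j) to (i,j-1) must not go down.
            if (j > 0 && h[i][j-1] < h[i][j]) return false;
        }
    return true;
}
```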
<p>Below is a picture showing both legal and illegal partial tilings. The three levels of tiles are distinguished by different colors.<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/08/Tiling4.png"><img class="aligncenter size-full wp-image-17536" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/08/Tiling4.png" alt="" width="602" height="290" /></a></p>
<p>The red X marks tiles that violate one or more of the constraints.</p>
<p>There is one more thing to model: the cache hardware. Laying down a tile corresponds to performing a calculation on data that must be in cache.  Each tile is a 7x7 grid of cache lines.  Call a line hot if it is in cache, and cold if it is out of cache.  If a cache line is not touched, it cools until it becomes cold. A cache line can be reheated (brought back into cache), but that takes precious time. These effects add a constraint to the tiling exercise, because the computation of points in a tile depends on nearby points in space. When laying a tile, the area under and close to the new tile must be hot.</p>
<p>Laying a complete layer before doing the next layer is inefficient because each time a tile is laid, the area under it has grown cold.  It's far better to lay a tile on top of a hot tile, ideally the one laid immediately earlier.  A good pattern is to lay each tile almost on top of the previous tile, but shifted north and west one unit to meet the aforementioned "ant constraints", until the stack of offset tiles is T levels deep. Then lay another stack immediately east of this stack. Once a row of stacks is completed, do the next row.</p>
<p>The pattern looks tricky around the edges of the floor, but there is an easy way to deal with that. Imagine that we do not care if a tile goes beyond the floor boundary, and the overhang constraint does not apply beyond the floor. Then it's just a matter of starting the pattern beyond the floor, and cutting off portions of a tile that extend beyond the floor.</p>
<p>Below is an animation of the tiling sequence for a small 23x16 floor with 7x7 tiles.<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/08/tiling-animation1.gif"><img class="aligncenter size-full wp-image-17539" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/08/tiling-animation1.gif" alt="" width="232" height="162" /></a><br />
Note: I wrote the animation by hand using GIMP, frame by frame. I now greatly appreciate the effort artists put into hand-drawn animations!</p>
<p>With the typical depth of 3, the tiling pattern almost triples the compute density.  I say "almost" because I may not be getting perfect reuse around the edges.   But it's close enough.   Some programs sequence the stacks more elaborately, such as zig-zagging diagonally or recursively, in order to exploit multiple levels of cache. The simple pattern I used works well for Seismic Duck because the outer level cache (e.g. L2 cache on my system) is big enough to hold the cache lines that are reused between rows of stacks.</p>
<p>The final loop structure looks like:</p>
<pre>    for each video frame do {
        replicate chunk borders widely // See next section
        for each chunk k in parallel {
            for each tile z in chunk k, in tiling sequence order, do
                for each i in z do
                    for each j in z do {
                        Vx[i][j:4] += (A[i][j+1:4]+A[i][j:4])*(U[i][j+1:4]-U[i][j:4]);
                        Vy[i][j:4] += (A[i+1][j:4]+A[i][j:4])*(U[i+1][j:4]-U[i][j:4]);
                        U[i][j:4] += B[i][j:4]*((Vx[i][j:4]-Vx[i][j-1:4]) + (Vy[i][j:4]-Vy[i-1][j:4]));
                    }
        }
    }</pre>
<p>Part 4 will say more about implementing the tiling sequence.</p>
<h2>Wide Halo</h2>
<p>Note that I added the word "widely" to the replication step. Because the tiling sequence advances a chunk T timesteps instead of one, it is not enough to copy just the immediate border points. Instead, a T-point wide border has to be copied, and the space to be tiled is not a rectangular prism in (t,i,j) space, but instead is a frustum with a rectangular base in the (0,*,*) plane, and walls that slope 45 degrees. The top surface of the frustum is in the (T-1,*,*) plane.  The top surface is the chunk's contribution to the final computation. The bases overlap by a region T points wide. The overlapping portions of the frustums correspond to redundant calculations.  See the section titled "Wide Halo" of the <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_6.pdf">Ghost Cell Pattern</a> paper for more on this technique.</p>
<h2>Summary of Vectorization and Cache Optimizations</h2>
<p>Vectorization quadruples the calculations per instruction, making memory bandwidth the bottleneck. Tiling in space-time reduces consumption of memory bandwidth. Doing so requires the wide halo pattern.</p>
<p>The tiling involves some complex bookkeeping.  Boundary condition physics makes it even more complicated.  Part 4 will describe how I dealt with the bookkeeping without going mad.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Patterns in Seismic Duck – Part 2 (Thread Parallelism)</title>
		<link>http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism/#comments</comments>
		<pubDate>Sun, 09 May 2010 22:10:57 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism/</guid>
		<description><![CDATA[This blog is part two of four about how I applied parallel patterns to Seismic Duck, the most exciting game you will ever find about reflection seismology. I admit it’s probably the only game about reflection seismology. Part 1 covered background material and overall parallelization strategy for the game. This part covers threading. Part 3 [...]]]></description>
			<content:encoded><![CDATA[<p>This blog is part two of four about how I applied parallel patterns to <a href="http://home.comcast.net/~arch.robison/seismic_duck.html">Seismic Duck</a>, the most exciting game you will ever find about reflection seismology. I admit it’s probably the only game about reflection seismology. <a href="http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background">Part 1</a> covered background material and overall parallelization strategy for the game. This part covers threading. <a href="http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/">Part 3</a> covers vectorization and cache optimizations. Part 4 covers bookkeeping details in the real code.   The game is a bit peculiar, but the parallel patterns are generally useful things to know.</p>
<p>The big advantage of learning parallel patterns is twofold.  First, you learn a way to solve a class of parallel programming problems.  Second, you gain a concise way to describe the solution, because the pattern has a name.   As with a standard recipe, you can point someone unfamiliar with the pattern to a paper describing it, instead of having to explain it from scratch each time.</p>
<p><a href="http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background">Part 1</a> described the update calculations as arranged in the <a href="http://www.cs.uiuc.edu/homes/snir/PPP/patterns/oddeven.pdf">Odd-Even pattern</a>:</p>
<pre>    forall i, j {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
    }
    forall i, j {
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }</pre>
<p><a href="http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background">Part 1</a> described how this form is trivial to parallelize, but suffers from doing little computation per memory reference. It defined the ratio C (Compute density) to be the number of floating-point additions per memory reference. Using the Odd-Even pattern for the wave simulation yields C=10/11=.91; the hardware can deliver C=12. In other words, memory bandwidth limits the floating-point unit to about 1/13th of its capability. [Whence the 13th?  12/(10/11)=13.2]</p>
<p>Updating Vx, Vy, and U in a single sweep improves C. Here is a breakdown of the work per grid point for the single sweep method:</p>
<ul>
<li>Single sweep (update Vx, Vy, and U):
<ul>
<li>8 memory references (read A, B, U, Vx, Vy; write Vx, Vy, U).</li>
<li>10 floating-point adds</li>
<li>3 floating-point multiplications</li>
</ul>
</li>
</ul>
<p>Now C=10/8=1.25. That’s about a 37% improvement over the C=.91 value in the original code. Part 3 will show how to <em>triple</em> that value, to reach C=3.75 and even raise it higher. However, the improvement in C complicates parallelism. The single-sweep version fuses the original loops to look like this:</p>
<pre>    for( i=1; i&lt;m; ++i )
      for( j=1; j&lt;n; ++j ) {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
      }</pre>
<p>The fused code has a <em>sequential </em>loop nest, because it must preserve the following constraints:</p>
<ul>
<li>The update of Vx[i][j] must use the value of U[i][j+1] from the <em>previous</em> sweep.</li>
<li>The update of Vy[i][j] must use the value of U[i+1][j] from the <em>previous</em> sweep.</li>
<li>The update of U[i][j] must use the values of Vx[i][j-1] and Vy[i-1][j] from the <em>current </em>sweep.</li>
</ul>
<p>Treating the grids as having map coordinates, a grid point must be updated <em>after </em>the points north and west of it are updated, but <em>before </em>the points south and east of it are updated.</p>
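<p>The constraints above can be checked numerically. The following is a minimal sketch, not code from the game: the <tt>Fields</tt> struct, the test data, and the helper names are all hypothetical, and the loop bounds assume that only interior points 1..m-1 by 1..n-1 are updated. It runs the two-sweep form and the fused form on identical data and reports the largest difference.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<float>>;
struct Fields { Grid A, B, U, Vx, Vy; };  // hypothetical container for the five arrays

// Arbitrary but deterministic test data for an (m+1) x (n+1) grid.
Grid make_grid(int m, int n, int seed) {
    Grid g(m + 1, std::vector<float>(n + 1));
    for (int i = 0; i <= m; ++i)
        for (int j = 0; j <= n; ++j)
            g[i][j] = 0.01f * float((7 * i + 13 * j + 5 * seed) % 17);
    return g;
}

// Odd-Even form: update all of Vx and Vy, then all of U.
void two_sweeps(Fields& f, int m, int n) {
    for (int i = 1; i < m; ++i)
        for (int j = 1; j < n; ++j) {
            f.Vx[i][j] += (f.A[i][j + 1] + f.A[i][j]) * (f.U[i][j + 1] - f.U[i][j]);
            f.Vy[i][j] += (f.A[i + 1][j] + f.A[i][j]) * (f.U[i + 1][j] - f.U[i][j]);
        }
    for (int i = 1; i < m; ++i)
        for (int j = 1; j < n; ++j)
            f.U[i][j] += f.B[i][j] * ((f.Vx[i][j] - f.Vx[i][j - 1]) + (f.Vy[i][j] - f.Vy[i - 1][j]));
}

// Fused form: one sequential sweep, north to south and west to east.
void fused_sweep(Fields& f, int m, int n) {
    for (int i = 1; i < m; ++i)
        for (int j = 1; j < n; ++j) {
            f.Vx[i][j] += (f.A[i][j + 1] + f.A[i][j]) * (f.U[i][j + 1] - f.U[i][j]);
            f.Vy[i][j] += (f.A[i + 1][j] + f.A[i][j]) * (f.U[i + 1][j] - f.U[i][j]);
            f.U[i][j] += f.B[i][j] * ((f.Vx[i][j] - f.Vx[i][j - 1]) + (f.Vy[i][j] - f.Vy[i - 1][j]));
        }
}

// Largest absolute difference between two grids of the same shape.
float max_diff(const Grid& a, const Grid& b) {
    float worst = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            worst = std::max(worst, std::fabs(a[i][j] - b[i][j]));
    return worst;
}

// Run both forms for the given number of time steps and compare the results.
float fused_vs_two_sweeps(int m, int n, int steps) {
    Fields a{make_grid(m, n, 0), make_grid(m, n, 1), make_grid(m, n, 2),
             make_grid(m, n, 3), make_grid(m, n, 4)};
    Fields b = a;
    for (int s = 0; s < steps; ++s) {
        two_sweeps(a, m, n);
        fused_sweep(b, m, n);
    }
    return std::max({max_diff(a.U, b.U), max_diff(a.Vx, b.Vx), max_diff(a.Vy, b.Vy)});
}
```

<p>A call such as <tt>fused_vs_two_sweeps(12, 9, 3)</tt> should return a value at or very near zero, because the fused loop reads every operand in the same state as the two-sweep form does.</p>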
<p>One way to parallelize the loop nest is the <a href="http://www.cs.uiuc.edu/homes/snir/PPP/patterns/wavefront.pdf">Wavefront Pattern</a>. In that pattern, execution sweeps diagonally from the northwest to southeast corner. But that pattern has relatively high synchronization costs. Furthermore, in this context, it has poor cache locality because it would tend to schedule adjacent grid points on different processors. Some of these inefficiencies can be ameliorated by aggregating grid points into chunks. But nonetheless it is less attractive than the following alternative.</p>
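<p>For readers who have not seen a wavefront traversal, here is a small sketch of the visiting order only, under the same assumptions as the fused loop (interior points 1..m-1 by 1..n-1). It is sequential and is not how Seismic Duck is written; <tt>Fields</tt>, <tt>update_point</tt>, and the other names are hypothetical. Points with the same i+j lie on one diagonal and do not depend on each other, which is what would let a parallel loop process a diagonal.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<float>>;
struct Fields { Grid A, B, U, Vx, Vy; };  // hypothetical container for the five arrays

// Arbitrary but deterministic test data for an (m+1) x (n+1) grid.
Grid make_grid(int m, int n, int seed) {
    Grid g(m + 1, std::vector<float>(n + 1));
    for (int i = 0; i <= m; ++i)
        for (int j = 0; j <= n; ++j)
            g[i][j] = 0.01f * float((7 * i + 13 * j + 5 * seed) % 17);
    return g;
}

// Fused update of one grid point.
void update_point(Fields& f, int i, int j) {
    f.Vx[i][j] += (f.A[i][j + 1] + f.A[i][j]) * (f.U[i][j + 1] - f.U[i][j]);
    f.Vy[i][j] += (f.A[i + 1][j] + f.A[i][j]) * (f.U[i + 1][j] - f.U[i][j]);
    f.U[i][j] += f.B[i][j] * ((f.Vx[i][j] - f.Vx[i][j - 1]) + (f.Vy[i][j] - f.Vy[i - 1][j]));
}

// Reference order: row by row, as in the fused loop nest.
void row_major_sweep(Fields& f, int m, int n) {
    for (int i = 1; i < m; ++i)
        for (int j = 1; j < n; ++j)
            update_point(f, i, j);
}

// Wavefront order: diagonals i+j=d, from the northwest corner to the southeast.
void wavefront_sweep(Fields& f, int m, int n) {
    for (int d = 2; d <= (m - 1) + (n - 1); ++d) {
        int i_lo = std::max(1, d - (n - 1));
        int i_hi = std::min(m - 1, d - 1);
        // The points on one diagonal are independent, so this inner loop is
        // the candidate for parallel execution. It runs from the southwest end
        // to the northeast end here to show that the order does not matter.
        for (int i = i_hi; i >= i_lo; --i)
            update_point(f, i, d - i);
    }
}

// Largest absolute difference between two grids of the same shape.
float max_diff(const Grid& a, const Grid& b) {
    float worst = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            worst = std::max(worst, std::fabs(a[i][j] - b[i][j]));
    return worst;
}

// Compare one wavefront sweep against one row-major sweep on identical data.
float wavefront_error(int m, int n) {
    Fields a{make_grid(m, n, 0), make_grid(m, n, 1), make_grid(m, n, 2),
             make_grid(m, n, 3), make_grid(m, n, 4)};
    Fields b = a;
    row_major_sweep(a, m, n);
    wavefront_sweep(b, m, n);
    return std::max({max_diff(a.U, b.U), max_diff(a.Vx, b.Vx), max_diff(a.Vy, b.Vy)});
}
```

<p>Each diagonal is a synchronization point, and neighbors along a row fall on different diagonals, which is where the synchronization and locality costs mentioned above come from.</p>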
<p>The alternative is <a href="http://parlab.eecs.berkeley.edu/wiki/patterns/geometric_decomposition">geometric decomposition</a> and the <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_6.pdf">Ghost Cell Pattern</a>. Here is a picture of the geometric decomposition in Seismic Duck:<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/04/duck-decomp.png"><img class="aligncenter size-full wp-image-15812" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/04/duck-decomp.png" alt="" width="396" height="150" /></a><br />
The picture is approximately to scale. Part 3 explains the rationale for choosing such skinny chunks. (Hint: they are <em>not</em> skinny when you use the right measurement units.) Each chunk can be updated independently by a different thread, except around its border. To see the exception, consider the interaction of chunk 0 and chunk 1. Let i<sub>0</sub> be the index of the <em>last </em>row of chunk 0 and let i<sub>1</sub> be the index of the <em>first </em>row of chunk 1.</p>
<ol>
<li>The update of Vy[i<sub>0</sub>][j] must use the value of U[i<sub>1</sub>][j] from the <em>previous</em> sweep.</li>
<li>The update of U[i<sub>1</sub>][j] must use the value Vy[i<sub>0</sub>][j] from the <em>current </em>sweep.</li>
</ol>
<p>The ghost cell pattern enables the chunks to be updated in parallel. Each chunk becomes a separate grid with an extra row of grid points added above and below it. These extra rows replicate information from the neighboring chunks so that each chunk has a copy of the grid points just beyond it. The copy lets a thread working on chunk 0 get the value of U[i<sub>1</sub>][j] from the previous sweep even if the thread working on chunk 1 has already updated its own copy of U[i<sub>1</sub>][j]. Likewise the thread working on chunk 1 can update its copy of Vy[i<sub>0</sub>][j] without waiting for the thread working on chunk 0 to update it.</p>
<p>The update logic ends up looking like this:</p>
<pre>    // Replicate borders
    for( k=0; k&lt;number_of_chunks-1; ++k ) {
        copy bottom border of chunk k to top of chunk k+1
        copy top border of chunk k+1 to bottom of chunk k
    }
    // Update chunks in parallel
    forall k in 0..number_of_chunks-1 {
        for( i=start[k]; i&lt;finish[k]; ++i ) // Loop over rows in chunk k
            for( j=1; j&lt;n; ++j ) {
                Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
                Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
                U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
            }
    }</pre>
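<p>The outline above leaves the border bookkeeping implicit. Here is one way it could be fleshed out, as a sketch under stated assumptions rather than the game's actual code: it uses <tt>std::thread</tt> instead of TBB, copies every chunk into private storage on each step (a real implementation would copy only the border rows), and has each chunk recompute Vy on its top ghost row so that it holds the current-sweep value. <tt>Fields</tt>, <tt>ghost_cell_step</tt>, and the other names are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

using Grid = std::vector<std::vector<float>>;
struct Fields { Grid A, B, U, Vx, Vy; };  // hypothetical container for the five arrays

// Arbitrary but deterministic test data for an (m+1) x (n+1) grid.
Grid make_grid(int m, int n, int seed) {
    Grid g(m + 1, std::vector<float>(n + 1));
    for (int i = 0; i <= m; ++i)
        for (int j = 0; j <= n; ++j)
            g[i][j] = 0.01f * float((7 * i + 13 * j + 5 * seed) % 17);
    return g;
}

// Fused update of rows [first, last), columns [1, n).
void fused_rows(Fields& f, int first, int last, int n) {
    for (int i = first; i < last; ++i)
        for (int j = 1; j < n; ++j) {
            f.Vx[i][j] += (f.A[i][j + 1] + f.A[i][j]) * (f.U[i][j + 1] - f.U[i][j]);
            f.Vy[i][j] += (f.A[i + 1][j] + f.A[i][j]) * (f.U[i + 1][j] - f.U[i][j]);
            f.U[i][j] += f.B[i][j] * ((f.Vx[i][j] - f.Vx[i][j - 1]) + (f.Vy[i][j] - f.Vy[i - 1][j]));
        }
}

// Copy rows [lo, hi) of a grid.
Grid rows(const Grid& g, int lo, int hi) { return Grid(g.begin() + lo, g.begin() + hi); }

// One time step. Chunk k owns rows [bounds[k], bounds[k+1]); bounds runs from 1 to m.
void ghost_cell_step(Fields& f, const std::vector<int>& bounds, int n) {
    const int chunks = int(bounds.size()) - 1;
    // Replicate borders: private copy of each chunk plus one ghost row above and below.
    std::vector<Fields> local;
    for (int k = 0; k < chunks; ++k) {
        int lo = bounds[k] - 1, hi = bounds[k + 1] + 1;
        local.push_back({rows(f.A, lo, hi), rows(f.B, lo, hi), rows(f.U, lo, hi),
                         rows(f.Vx, lo, hi), rows(f.Vy, lo, hi)});
    }
    // Update chunks in parallel, one thread per chunk.
    std::vector<std::thread> pool;
    for (int k = 0; k < chunks; ++k)
        pool.emplace_back([&local, &bounds, k, n] {
            Fields& c = local[k];
            int h = bounds[k + 1] - bounds[k];  // rows owned by this chunk
            if (k > 0)  // top ghost row is owned by chunk k-1: redo its Vy update privately
                for (int j = 1; j < n; ++j)
                    c.Vy[0][j] += (c.A[1][j] + c.A[0][j]) * (c.U[1][j] - c.U[0][j]);
            fused_rows(c, 1, h + 1, n);
        });
    for (std::thread& t : pool) t.join();
    // Copy the owned rows (not the ghost rows) back to the shared grids.
    for (int k = 0; k < chunks; ++k)
        for (int i = bounds[k]; i < bounds[k + 1]; ++i) {
            int r = i - bounds[k] + 1;  // local row index, skipping the top ghost row
            f.U[i] = local[k].U[r];
            f.Vx[i] = local[k].Vx[r];
            f.Vy[i] = local[k].Vy[r];
        }
}

// Largest absolute difference between two grids of the same shape.
float max_diff(const Grid& a, const Grid& b) {
    float worst = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            worst = std::max(worst, std::fabs(a[i][j] - b[i][j]));
    return worst;
}

// Compare three-chunk ghost-cell updates against the sequential fused sweep.
float ghost_cell_error(int steps) {
    const int m = 13, n = 9;
    Fields ref{make_grid(m, n, 0), make_grid(m, n, 1), make_grid(m, n, 2),
               make_grid(m, n, 3), make_grid(m, n, 4)};
    Fields par = ref;
    for (int s = 0; s < steps; ++s) {
        fused_rows(ref, 1, m, n);
        ghost_cell_step(par, {1, 5, 9, 13}, n);
    }
    return std::max({max_diff(ref.U, par.U), max_diff(ref.Vx, par.Vx), max_diff(ref.Vy, par.Vy)});
}
```

<p>The redundant Vy update on the top ghost row is the price of independence: chunk k repeats a small amount of chunk k-1's work instead of waiting for it.</p>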
<p>For more insight into this pattern, see the <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_6.pdf">Ghost Cell Pattern</a> paper.  It’s a polished paper on an important pattern for grid-based simulations.</p>
<p>Part 3 of this series covers vectorization and the cache optimization patterns, which are critical because memory bandwidth, not computation, is the bottleneck.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Transitioning to TBB 3.0</title>
		<link>http://software.intel.com/en-us/blogs/2010/05/04/transitioning-to-tbb-30/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/05/04/transitioning-to-tbb-30/#comments</comments>
		<pubDate>Tue, 04 May 2010 15:29:26 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/05/04/transitioning-to-tbb-30/</guid>
		<description><![CDATA[Unless your code is using pointers to member functions of class <TT>tbb::task</TT>, your TBB 2.2 code should work just as well with TBB 3.0.]]></description>
			<content:encoded><![CDATA[<p>Moving your application from TBB 2.2 to TBB 3.0 should be painless.  We take backward compatibility seriously, and try to avoid breaking changes.  But occasionally we judge the benefit of a break to outweigh its costs.  For TBB 3.0, the breaking changes are few.  It’s unlikely you’ll run into trouble with them.</p>
<p><strong>Breaking Changes</strong></p>
<p>The breaking changes for TBB 3.0 are that the following methods of class <tt>task</tt> are changed to <em>static </em>methods:</p>
<ul>
<li>spawn</li>
<li>destroy</li>
<li>allocate_additional_child_of</li>
</ul>
<p>The changes are unlikely to break existing code, because these methods are almost always used in expressions such as “<em>x</em>.spawn(<em>y</em>)”, where <em>x </em>and <em>y </em>are expressions.  Though spawn is now a static method, the expressions <em>x </em>and <em>y </em>are still evaluated.  The only difference is that the value of expression <em>x </em>is no longer used by method spawn.</p>
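<p>The language rule at work here can be seen without TBB. In the sketch below, <tt>task_like</tt> is a hypothetical stand-in for <tt>tbb::task</tt> with a static <tt>spawn</tt>, and <tt>get_task</tt> is a hypothetical helper that counts how often it is evaluated.</p>

```cpp
#include <cassert>

// Hypothetical stand-in for tbb::task; only the static spawn matters here.
struct task_like {
    static int spawned;
    static void spawn(task_like&) { ++spawned; }
};
int task_like::spawned = 0;

int evaluations = 0;

// Returns its argument and records that the expression was evaluated.
task_like& get_task(task_like& t) {
    ++evaluations;
    return t;
}

// Calls spawn in the old "x.spawn(y)" style and reports how many of the
// two operand expressions were evaluated.
int count_evaluations() {
    task_like x, y;
    evaluations = 0;
    get_task(x).spawn(get_task(y));  // x is evaluated, then its value is ignored
    return evaluations;
}
```

<p>Both operand expressions are evaluated, so <tt>count_evaluations()</tt> returns 2 even though <tt>spawn</tt> never looks at <em>x</em>.</p>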
<p>A source-level break occurs in code that takes the address of one of the changed methods.  For example:</p>
<pre>     typedef void (task::*ptr)(task&amp;);</pre>
<pre>     ptr p = &amp;task::spawn;</pre>
<p>Such code will now elicit an error message from the compiler. The edit below shows how to fix it:</p>
<pre>     typedef void (<span style="text-decoration: line-through;">task::</span>*ptr)(task&amp;);</pre>
<pre>     ptr p = &amp;task::spawn;</pre>
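<p>The same distinction in a self-contained form, again with a hypothetical <tt>task_like</tt> class standing in for <tt>tbb::task</tt>: the address of a static member function has ordinary function-pointer type, while a non-static member still needs a pointer-to-member type.</p>

```cpp
#include <cassert>

// Hypothetical stand-in for tbb::task.
struct task_like {
    int count = 0;
    static void spawn(task_like& t) { ++t.count; }  // static, like spawn in TBB 3.0
    void touch() { ++count; }                       // non-static, for contrast
};

// A static member function is addressed with an ordinary function pointer.
typedef void (*static_ptr)(task_like&);
// A non-static member function needs the class qualification in the type.
typedef void (task_like::*member_ptr)();

// Calls each function once through its pointer and returns the final count.
int call_through_pointers() {
    task_like t;
    static_ptr p = &task_like::spawn;
    member_ptr q = &task_like::touch;
    p(t);      // no object expression on the left
    (t.*q)();  // object expression required
    return t.count;
}
```

<p>Writing <tt>void (task_like::*)(task_like&amp;)</tt> for <tt>p</tt> would be rejected by the compiler, which mirrors the error described above.</p>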
<p>Evolution motivated the change. The expression “<em>x</em>.spawn(<em>y</em>)” originally meant that task <em>x </em>spawns task<em> y</em>.  The task interface evolved in a way that made “<em>x</em>” unnecessary.  Starting from scratch, we would probably define “<em>y</em>.spawn()” to mean “spawn y”.  Introducing that form now while retaining “<em>x</em>.spawn(<em>y</em>)” would make TBB confusing to existing users, and we did not want to break code using the old form.  </p>
<p>Making spawn static seems like a good compromise.  Old code still works and new code does not need to supply the superfluous <em>x</em>. The only drawback is that code sometimes must supply a class qualification, as in “task::spawn(y)”.  Our examples in the Tutorial needed no change on this point – the call was always in a context where the task:: prefix was implied.   In another blog, I’ll explain how we made old <em>application binaries </em>still work with the new library, even though the signatures have changed.  </p>
<p>The only other theoretically breaking change is that the functor parameters to “parallel_invoke” and “parallel_for_each” are now passed by const reference instead of by value.  We actually made this change in TBB 2.2 Update 2, and did so to be compatible with the corresponding PPL implementations. </p>
<p><strong>Deprecations</strong></p>
<p>As a standards committee maven once said, deprecation is a “warning shot across the bow”.  It is a warning that a feature <em>might </em>disappear in the future.  So far nothing has disappeared in TBB, even 1.0 features.   I still recommend discontinuing use of deprecated features, because the features might disappear in the future.  A feature is usually deprecated because a better alternative exists. </p>
<p>The features newly deprecated in TBB 3.0 are:</p>
<ul>
<li><strong>task::recycle_to_reexecute</strong>.  A call “t-&gt;recycle_to_reexecute();” can be replaced with the sequence:
<pre>          t-&gt;set_ref_count(1);
          t-&gt;recycle_as_safe_continuation();</pre>
</li>
<li><strong>tbb::tbb_thread</strong>.  A use of tbb::tbb_thread can be replaced with TBB’s std::thread.</li>
</ul>
<p>We did not remove the functionality of tbb_thread.  We renamed it std::thread, as in the C++0x draft.  It still has the same interface as the old tbb_thread.  It does not exactly match the C++0x specification, because doing so requires C++0x language features.  But it’s close enough for typical use cases.</p>
<p>We originally chose the name tbb::tbb_thread in TBB 2.1 to avoid name conflicts with vendors’ implementations of class std::thread, particularly in code that is not using namespace qualifiers.  It seemed like the right decision at the time.  But in TBB 3.0 we introduced many more C++0x features (e.g. condition_variable), where the tbb_ prefix was getting irksome. </p>
<p>You get the TBB implementation of std::thread when you include tbb/compat/thread and TBB_IMPLEMENT_CPP0X is 1. The default value of TBB_IMPLEMENT_CPP0X is 0 on platforms that implement std::thread. You can override our default definition to flip our implementation on or off.</p>
<p><strong>Summary</strong></p>
<p>Unless your code is using pointers to member functions of class <tt>tbb::task</tt>, your TBB 2.2 code should work just as well with TBB 3.0.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/05/04/transitioning-to-tbb-30/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallelism Patterns in Seismic Duck - Part 1 (Background)</title>
		<link>http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 17:54:07 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background/</guid>
		<description><![CDATA[This blog is first of four about applying parallel patterns to a real program.  I'll point out some excellent parallel pattern documents on the Web and an unusual video game of mine.  This blog discusses background material and overall parallelization strategy.  Part 2 covers threading.  Part 3 covers vectorization and cache optimizations.  Part 4 covers [...]]]></description>
			<content:encoded><![CDATA[<p>This blog is first of four about applying parallel patterns to a real program.  I'll point out some excellent parallel pattern documents on the Web and an unusual video game of mine.  This blog discusses background material and overall parallelization strategy.  <a href="http://software.intel.com/en-us/blogs/2010/05/09/parallel-patterns-in-seismic-duck-part-2-thread-parallelism/">Part 2</a> covers threading.  <a href="http://software.intel.com/en-us/blogs/2010/08/07/parallel-patterns-in-seismic-duck-part-3-vectorization-and-tiling/">Part 3</a> covers vectorization and cache optimizations.  Part 4 covers bookkeeping details in the real code.</p>
<p>The game is <a href="http://home.comcast.net/~arch.robison/seismic_duck.html">Seismic Duck</a>.  I recently rewrote it for Windows platforms.  It’s a complete rewrite of something I wrote for Macs in the mid 1990s.  It's a freeware game about reflection seismology, which is imaging underground structures by sending sound waves into the ground and interpreting the echoes. The program teaches the topic by letting the user experiment interactively. It sounds technical, but even children "get it" after playing with it a while.  I’m fond of writing games based on peculiar subjects. For example, <a href="http://home.comcast.net/~arch.robison/frequon.html">Frequon Invaders</a> is a Fourier transform game.  Reflection seismology is a subject I worked on at Shell in the early 1990s.</p>
<p>This series of blogs is about how and why Seismic Duck was parallelized with Intel(R) TBB differently than a related demo in the TBB distribution. Seismic Duck runs three independent core computations:</p>
<ul>
<li>Seismic wave propagation and rendering.</li>
<li>Gas/oil/water flow through a reservoir and rendering.</li>
<li>Seismogram rendering.</li>
</ul>
<p>I run all three in parallel, using <tt>tbb::parallel_invoke</tt>, and vectorize the code where I can. At a high level, it follows the <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_8.pdf">Three Layer Cake pattern</a>. (I'm one of the authors of that paper.) However, that level of parallelism was not enough to get a good animation speed on my machine.  The bottleneck was the wave propagation simulation, which is the focus of these blogs.</p>
<p>Some background on the numerical simulation of waves is necessary.  I used the popular "staggered grid" finite-difference time-domain (<a href="http://en.wikipedia.org/wiki/Finite-difference_time-domain_method">FDTD</a>) method.  There are five 2D arrays, each representing a scalar field over a 2D grid.  Three of the arrays represent variables that step through time.  The other two represent rock properties.  The arrays are:</p>
<ul>
<li>B[i][j]: A constant array with vertex-centered values. It’s constant in the sense that it changes only when the subsurface geology changes.</li>
<li>A[i][j]: A constant array with vertex-centered values related to rock properties. The FDTD method actually requires the edge-centered values, but that would double the memory bandwidth for reading it, because for each interior grid point there is a horizontal edge and a vertical edge. As shown later, memory bandwidth, not computation, is the bottleneck. So to reduce memory references, edge-centered values are computed by averaging the two nearest vertex-centered values. Doing so saves memory references because there are twice as many edges as vertices. The vertices actually store ½ their physical value, so that each edge value can be computed as the sum of its two endpoint values.</li>
<li>U[i][j]: A variable array with vertex-centered values representing how much the rock is compressed at that point. It is variable in the sense that it changes for every time step.</li>
<li>Vx[i][j], Vy[i][j]: Two variable arrays that represent the x and y components of the point’s velocity. This should not be confused with the velocity of sound through the rock. These two grids are edge-centered. Vx[i][j] represents a physical value between grid points [i][j] and [i][j+1]. Vy[i][j] represents a physical value between grid points [i][j] and [i+1][j].</li>
</ul>
<p>The picture below shows one square of the grid and how the array elements relate to the scalar fields. Notice how Vx and Vy are edge centered.<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/04/grid.png"><img class="aligncenter size-medium wp-image-15378" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/04/grid-300x264.png" alt="" width="300" height="264" /></a></p>
<p>Arrays Vx and Vy are not only staggered halfway in space, they are staggered halfway in time.  At the beginning of a time step, Vx and Vy are a half time step behind U.  The simulation advances Vx and Vy a full step, so that they become a half time step ahead of U. Then the simulation advances U a full time step. The algorithm is sometimes called “leap frog” because of how the arrays advance over each other in time.</p>
<p>The advantage of staggering and leap frogging is that it delivers results accurate to 2nd order for the cost of a 1st order approach.  The update operations are beautifully simple:</p>
<pre>    forall i, j {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
    }
    forall i, j {
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }</pre>
<p>The TBB demo "seismic" and Seismic Duck use two very different approaches to parallelizing these updates.  I wrote the original versions of both.   (<a href="http://software.intel.com/en-us/blogs/author/anton-malakhov/">Anton Malakhov</a> has since made many improvements to my original TBB version.)  The reason for different approaches is different purposes:</p>
<ul>
<li>The TBB demo is supposed to show how to use a TBB feature ("parallel_for") and a basic parallel pattern. I kept it as simple as possible.</li>
<li>Seismic Duck is written for high performance, at the expense of increased complexity. It uses several parallel patterns. It also has more ambitious numerical modeling, notably the use of <a href="http://www-math.mit.edu/~stevenj/18.369/pml.pdf">perfectly matched layers</a> to reduce artificial reflections from the simulation boundaries. (See the URL for a description of this fascinating technique that involves using complex-valued materials that do not exist in the real universe, but can exist in software!)</li>
</ul>
<p>The pattern behind the TBB demo is "<a href="http://www.cs.illinois.edu/~snir/PPP/patterns/oddeven.pdf">Odd-Even Communication Group</a>" in time and a <a href="http://parlab.eecs.berkeley.edu/wiki/patterns/geometric_decomposition">geometric decomposition pattern</a> in space. The code flip-flops between updating the pair of arrays Vx and Vy, and updating array U. (Note to readers of that code: The variable names are S, T, and V instead of Vx, Vy, and U.) Each <tt>forall</tt> can be performed with a <tt>tbb::parallel_for</tt> loop. These patterns make a good introduction to parallel programming, and require minimal changes for parallelization.</p>
<p>The big drawback of the Odd-Even pattern is memory bandwidth. When Seismic Duck is played on a typical 21” widescreen monitor, each grid has dimensions 531x1520. (You do not see the entire grid in the game – there is a hidden 16 pixel thick border for the perfectly matched layers.) That works out to about 16 MByte for all 5 grids. High-end servers have caches that large, but I’m targeting current desktop machines, which have caches in the 1-4 MByte range. Fortunately, a few consecutive rows of each array do fit in cache.</p>
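<p>The 16 MByte figure can be reproduced with a short calculation. The sketch below assumes single-precision values (4 bytes each) and, for the rows-in-cache estimate, that 1520 is the length of a row; the function names are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for `grids` single-precision arrays of rows x cols points.
constexpr std::size_t footprint_bytes(std::size_t rows, std::size_t cols, std::size_t grids) {
    return rows * cols * grids * sizeof(float);
}

// Number of complete rows, taken across all grids, that fit in a cache of the given size.
constexpr std::size_t rows_in_cache(std::size_t cache_bytes, std::size_t cols, std::size_t grids) {
    return cache_bytes / (cols * grids * sizeof(float));
}
```

<p>With these assumptions, <tt>footprint_bytes(531, 1520, 5)</tt> is 16,142,400 bytes, and <tt>rows_in_cache(1 &lt;&lt; 20, 1520, 5)</tt> says that a cache of 2<sup>20</sup> bytes holds 34 rows of all five grids at once.</p>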
<p>Thus using the Odd-Even pattern, each grid point is loaded once from main memory per time step. Here is a breakdown of work per grid point for each sweep:</p>
<ul>
<li>First sweep (update Vx and Vy):
<ul>
<li>6 memory references (read A, U, Vx, Vy; write Vx, Vy).</li>
<li>6 floating-point adds</li>
<li>2 floating-point multiplications</li>
</ul>
</li>
<li>Second sweep (update U):
<ul>
<li>5 memory references (read B, Vx, Vy, U; write U)</li>
<li>4 floating-point adds</li>
<li>1 floating-point multiplication</li>
</ul>
</li>
</ul>
<p>The multiplications are insignificant because the hardware can overlap them with the additions. The key consideration is the (6+4) floating-point additions per (6+5) memory references. For brevity, from now on I'll call this ratio C (for Compute density). A ratio of C=(6+4)/(6+5)=10/11≈0.91 is a serious bottleneck. My Core-2 Duo system can deliver 8 floating-point additions per clock (two cores each using 4-wide SIMD). But the memory bandwidth limits my system to 2 single-precision memory references every 3 clocks. (I used the <a href="http://www.cs.virginia.edu/stream/">STREAM</a> benchmark to determine this.) So C for the hardware is 8/(2/3) = 12.</p>
<p>To summarize, C=12 for the hardware but C≈0.91 for the Odd-Even code. Thus the odd-even version delivers a small fraction of my machine’s theoretical peak floating-point performance. Phrased another way: forget the floating-point, it's the memory references that matter. Another approach was needed. Part 2 shows how I raised C to 1.25 and multi-threaded the wave simulation using TBB. Part 3 describes how I raised C much higher and vectorized the code.</p>
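<p>The arithmetic behind these ratios, written out as a small sketch. The function names are hypothetical; the numbers are the ones quoted above.</p>

```cpp
#include <cassert>
#include <cmath>

// Compute density C: floating-point additions per memory reference.
constexpr double compute_density(double adds, double memory_refs) {
    return adds / memory_refs;
}

// Odd-Even code: (6+4) additions per (6+5) memory references per grid point.
constexpr double odd_even_density() { return compute_density(6 + 4, 6 + 5); }

// Hardware: 8 additions per clock against 2 memory references every 3 clocks.
constexpr double hardware_density() { return compute_density(8.0, 2.0 / 3.0); }

// Fraction of peak floating-point throughput that the Odd-Even code can reach
// when memory bandwidth is the only limit.
constexpr double peak_fraction() { return odd_even_density() / hardware_density(); }
```

<p>Here <tt>peak_fraction()</tt> comes to about 0.076, or roughly 1/13 of the floating-point peak.</p>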
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/04/21/parallelism-patterns-in-seismic-duck-part-1-background/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Implementing task_group interface in TBB</title>
		<link>http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/#comments</comments>
		<pubDate>Wed, 02 Jul 2008 13:53:06 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/</guid>
		<description><![CDATA[The TBB class task was designed for high-performance implementations of the TBB templates.  Its efficiency, particularly its emphasis on continuation-passing style, comes at some price in convenience.  Rick Molloy of Microsoft has posted a description of a task_group interface that Microsoft is considering.  It's more convenient than the TBB interface, particularly when your compiler supports C++ [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.threadingbuildingblocks.org">TBB</a> class task was designed for high-performance implementations of the TBB templates.  Its efficiency, particularly its emphasis on continuation-passing style, comes at some price in convenience.  Rick Molloy of Microsoft has <a href="http://blogs.msdn.com/nativeconcurrency/">posted a description</a> of a <code>task_group</code> interface that Microsoft is considering.  It's more convenient than the TBB interface, particularly when your compiler supports C++ 200x lambda expressions (Section 5.1.1 of <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2606.pdf">N2606</a>).</p>
<p>I implemented a subset of <code>task_group</code> in TBB as a header tbb/task_group.h: 37 lines of C++ and 5 preprocessor lines.   It's a small subset.</p>
<ul>
<li>It does not support task_handle. </li>
<li>The <a href="http://software.intel.com/en-us/blogs/2008/06/11/exception-handling-and-cancellation-in-tbb-part-iv-using-context-objects/">exception/cancellation model</a> is still TBB's. </li>
<li><code>wait()</code> returns void, not <code>task_group_status</code>, since the blog does not detail <code>task_group_status</code>. </li>
</ul>
<p>But nonetheless, I think some TBB users will find this minimal form useful.  For example, it's enough of <code>task_group</code> to write the quicksort in Molloy's post.</p>
<p>The code for the header follows my signature.  I'd be interested to hear how useful it is.</p>
<p>- Arch</p>
<pre>#ifndef __TBB_task_group_H
#define __TBB_task_group_H

#include "tbb/task.h"

namespace tbb {

class task_group;

namespace internal {

// Suppress gratuitous warnings from icc 11.0 when lambda expressions are used in instances of function_task.
#pragma warning(disable: 588)

template&lt;typename Function&gt;
class function_task: public task {
    Function my_func;
    /*override*/ task* execute() {
        my_func();
        return NULL;
    }
public:
    function_task( Function&amp; f ) : my_func(f) {}
};

} // namespace internal

class task_group: internal::no_copy {
private:
    empty_task* root;
public:
    task_group() {
        root = new(task::allocate_root()) empty_task;
        root-&gt;set_ref_count(1);
    }
    ~task_group() {
        if( root-&gt;ref_count() )
            root-&gt;wait_for_all();
        root-&gt;destroy(*root);
    }
    template&lt;typename Function&gt;
    void run( Function f ) {
        task&amp; self = task::self();
        self.spawn(*new( self.allocate_additional_child_of( *root )) internal::function_task&lt;Function&gt;(f) );
    }
    void wait() {
        root-&gt;wait_for_all();
    }
};

} // namespace tbb

#endif /* __TBB_task_group_H */</pre>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/07/02/implementing-task_group-interface-in-tbb/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Tasks for Doing and Threads for Waiting</title>
		<link>http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/#comments</comments>
		<pubDate>Fri, 06 Jun 2008 03:35:27 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/</guid>
		<description><![CDATA[TBB started out as a task-based framework for parallel programming.  TBB 2.1 adds threads.  This note explains the new threading interface, when to use it, and when to use tasks instead. TBB tasks rely on non-preemptive cooperative scheduling based on work stealing, similar to Cilk. Once the TBB scheduler starts a task on a software [...]]]></description>
			<content:encoded><![CDATA[<p>TBB started out as a task-based framework for parallel programming.  TBB 2.1 adds threads.  This note explains the new threading interface, when to use it, and when to use tasks instead.</p>
<p>TBB tasks rely on non-preemptive cooperative scheduling based on work stealing, similar to Cilk. Once the TBB scheduler starts a task on a software thread, it does not switch to another task except at well-defined points (specifically, while waiting for its child tasks to complete). For compute-bound workloads, this style of scheduling has multiple benefits:</p>
<ul>
<li>Cooperative scheduling has low context switch overhead.</li>
<li>Work stealing has good cache locality and certain space guarantees (see Cilk papers)</li>
<li>Lastly, and most important, actual parallelism can be matched to available parallelism. In task-based programming, the programmer is expected to provide too much parallelism (“parallel slack” in Cilk parlance), and the scheduler will extract just enough parallelism to keep the machine humming, not swamped.</li>
</ul>
<p>But programs are not only about calculation.  Programs also often have to wait on external events.  For the sake of timely response, the wait needs to be done by a preemptively scheduled thread that can be scheduled when the event occurs. Of course interrupts or polling sometimes work. Interrupts are a bit of a combination of cooperative and preemptive scheduling.  The cooperative part is using an existing thread, the preemptive part is using it any time. That can be the best of both (preemptive and low overhead) or the worst of both (interrupt handlers are typically constrained on what they are allowed to do).  Polling usually scales poorly – composing two polling components requires composing their polling loops or using separate threads for the two polling loops.</p>
<p>Tasks can block, but doing so has two problems:</p>
<ul>
<li>The underlying thread sits idle until the task unblocks.</li>
<li>Tasks (and their threads) waiting on it also sit idle.  I.e., blocking propagates along a dependence chain.</li>
</ul>
<p>In an ideal world, the scheduler would fire up other threads to run tasks in the meantime.  To do this efficiently requires user-level scheduling support that is not (at least yet) available in all operating systems targeted by TBB. But even with that support, there is another issue.  K blocked threads consume K stacks. Most of these stacks might be quite small, but current calling conventions require that programmers specify a fixed stack size (or use a default one) that is typically much larger than necessary for the common case. Changing the calling convention to be more like Cilk’s would solve this problem, but calling conventions take a long time to reform.</p>
<p>So in addition to tasks, TBB 2.1 has a class tbb_thread, which is a thin wrapper around a platform’s native thread. The interface is as close to the C++ 200x std::thread as we could make it given the limitations of C++ 1998. In particular:</p>
<ul>
<li>Lack of variadic templates restricts us to a fixed limit on template arguments.</li>
<li>Lack of rvalue references implies slightly more overhead because copy-construction has to be used instead of move construction.</li>
<li>Time is measured in the existing TBB timing interface tick_count::interval_t instead of the templated time interface in C++ 200x. We had to draw the line somewhere on where to stop pulling in C++ 200x.</li>
</ul>
<p>We chose to call it tbb::tbb_thread and not tbb::thread to avoid name collisions when the ISO std::thread becomes available and a program liberally employs “using” directives.</p>
<p>Because tbb::tbb_thread is a thin wrapper around native threads, threads are heavier than tasks. They take longer to create and destroy.  They have associated stacks.  They are preemptively scheduled, so they guarantee concurrency, which is useful when you need it, but comes at the price of oversubscription if misused.  But they can block without impacting other threads or tasks. </p>
<p>So TBB 2.1 has two ways to get things done: tasks and threads. When designing a program, try to separate calculating work from waiting work. Use tasks for calculation and threads for waiting. When a thread needs to do calculations, it can do it with tasks. Avoid having a task block on an external event. Software components doing waiting should call on components doing calculation, not the other way around.</p>
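<p>Here is a minimal sketch of that separation. It is not TBB code: std::thread stands in for tbb::tbb_thread, a plain function stands in for calculation that would be expressed with tasks, and the names Mailbox, deliver, wait_then_calculate and run_demo are made up for illustration. The point is only the direction of the call: the waiting component blocks on the event and then calls the calculating component.</p>

```cpp
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Stand-in for calculating work. In a TBB program this would be
// expressed with tasks; here it is an ordinary function.
long calculate(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

// Hypothetical mailbox that an external event source fills in.
struct Mailbox {
    std::mutex mutex;
    std::condition_variable ready;
    bool has_data = false;
    std::vector<int> data;
};

// Waiting work: blocks until the event arrives, then calls on the
// calculating component (not the other way around).
long wait_then_calculate(Mailbox& box) {
    std::unique_lock<std::mutex> lock(box.mutex);
    box.ready.wait(lock, [&box] { return box.has_data; });
    return calculate(box.data);
}

// Simulates the arrival of an external event.
void deliver(Mailbox& box, std::vector<int> payload) {
    {
        std::lock_guard<std::mutex> lock(box.mutex);
        box.data = std::move(payload);
        box.has_data = true;
    }
    box.ready.notify_one();
}

// Runs the waiting work on its own thread and delivers one event to it.
long run_demo() {
    Mailbox box;
    long result = 0;
    std::thread waiter([&box, &result] { result = wait_then_calculate(box); });
    deliver(box, std::vector<int>{1, 2, 3, 4});
    waiter.join();
    return result;
}
```

<p>The blocked thread costs a stack while it waits, but it does not hold up any calculating work in the meantime.</p>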
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/06/05/tasks-for-doing-and-threads-for-waiting/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Abstracting Thread Local Storage</title>
		<link>http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/#comments</comments>
		<pubDate>Thu, 31 Jan 2008 19:25:35 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/</guid>
		<description><![CDATA[[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.] Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense if given one, you can easily build the other. For example, thread local storage can be [...]]]></description>
			<content:encoded><![CDATA[<p>[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.]</p>
<p>Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense that, given one, you can easily build the other. For example, thread local storage can be implemented as a one-to-one map from thread ids to pieces of storage. And vice versa, the address of a variable in thread-local storage can be used as a thread id.</p>
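<p>The "vice versa" direction can be sketched in a few lines, assuming a compiler with the C++11 thread_local keyword (not something TBB provides). The function names are hypothetical. Note the id is only unique among threads that are alive at the same time, since storage can be reused after a thread exits.</p>

```cpp
#include <thread>

// The address of a thread-local variable acts as a thread id:
// every live thread has its own copy at its own address.
const void* my_thread_id() {
    thread_local char anchor = 0;
    return &anchor;
}

// Checks that a second thread sees a different id while the first is
// still alive, and that a thread's own id is stable across calls.
bool ids_behave_like_thread_ids() {
    const void* main_id = my_thread_id();
    bool differs = false;
    std::thread other([&differs, main_id] {
        differs = (my_thread_id() != main_id);
    });
    other.join();
    return differs && my_thread_id() == main_id;
}
```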
<p>TBB by design has no thread id or thread-local storage. TBB is based on task-based parallelism, where the programmer breaks work up into tasks, and the task scheduler is free to map tasks to hardware threads. Furthermore, our OpenMP run-time group strongly recommended that we avoid explicit thread ids because of problems with nested parallelism and dealing with a dynamically growing or shrinking team of threads. For example, the nature of the OpenMP thread id interface implies that the number of threads in a thread team is fixed for the duration of the team.</p>
<p>However, thread local storage does have its uses. Don Webber <a href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30247002/30248017/ShowThread.aspx#30248017">posted</a> an excellent use case for thread local storage, which involves updating a sparse matrix. The problem involves doing many updates of the form *p += value, in parallel, where some updates might update the same location. Assuming that += is commutative and associative, one way to implement this is to have each thread sum its own updates privately, and then merge the sums. As Don notes, the alternative of locking *p on each update is prohibitively expensive. Using an atomic +=, even if available on current hardware, would likewise be prohibitively expensive, because cache lines would ping-pong severely.</p>
<p>I'd like to see TBB extended to provide the power of thread local storage without opening up a Pandora's box of raw thread ids. I think the solution needs to cleanly separate a high-level portion from a low-level portion, like we recently did for cache affinity. (Note: the type task::affinity_id might appear to have opened the box, but did not, because it is a <em>hint</em>, not a commandment.)</p>
<p>TBB's template parallel_reduce partially deals with the cited use case, because it is lazy about fork/joins. The user defines how to fork/join state information for the reduction. The template recursively decomposes the range and applies the user's fork/join operations. The laziness is that fork/join pairs are used only when task stealing occurs. For example, if there are P threads and N leaf subranges, it does not do the obvious N-1 fork operations, but instead does just enough to keep the threads busy. Specifically, it does a fork/join pair only when the two children would be processed by different threads.</p>
<p>However, parallel_reduce is not lazy enough. At the high level, the problem is that parallel_reduce cannot assume that the reduction operation is commutative. For a non-commutative reduction operation, the current implementation is close to optimal (maybe a factor of 2 off in the worst case) with respect to the number of fork/join pairs. If TBB added a reduction template that could assume a commutative reduction operation (e.g., parallel_unordered_reduce), then at most P-1 fork/join pairs would be necessary.</p>
<p>The good thing about using the hypothetical parallel_unordered_reduce instead of exposing thread local storage is that it keeps the abstraction at a high level. Explicitly using thread local storage would introduce irrelevant low-level details. For example, a typical implementation based on thread local storage can be sketched as:</p>
<blockquote>
<pre>forall updates (p,value)  

    do  *p += value // *p points to thread-local partial sum  

for each thread-local partial sum do  

    update global-sum+= thread-local partial sum</pre>
</blockquote>
<p>This level exposes issues such as "where are the thread-local partial sums that were generated in the first loop?" Since threads can come and go during execution of the first loop, iterating across ids of currently running threads is not enough. Some of the partial sums might outlive their threads, or some threads might come into existence after the partial sums were generated.  We'll need a container to hold the partial sums, and a means of iterating over the sums.</p>
<p>The interface for such a container seems straightforward. Define it as a sequence of T that has iterator capability. Add TBB-style ranges too, so that reductions over the container can be done in parallel. Add a special method "mine()" that returns a reference to the element that is owned by the invoking thread. If the element is not present, "mine()" would insert one and default-construct it. </p>
<p>Method mine() would be most likely implemented by hashing a thread id, so it's not going to be cheap, but probably inexpensive enough if the user hoists calls to it. </p>
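<p>To make the idea concrete, here is a rough sketch of such a container. It is only an illustration under simplifying assumptions, not a proposed TBB interface: the class name per_thread and the method combine are hypothetical, a mutex-protected std::unordered_map keyed by std::thread::id stands in for the hashing, and a sequential combine replaces the iterators and ranges described above.</p>

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical container: one value-initialized T per thread that asks.
template<typename T>
class per_thread {
    std::mutex mutex_;
    std::unordered_map<std::thread::id, T> slots_;
public:
    // Returns the invoking thread's element, inserting it on first use.
    // References into std::unordered_map survive later insertions.
    T& mine() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[std::this_thread::get_id()];
    }
    // Sequential merge of all elements, including those whose threads
    // have already exited. Call after the updating threads are done.
    template<typename U, typename Op>
    U combine(U init, Op op) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& kv : slots_) init = op(init, kv.second);
        return init;
    }
};

// Each thread sums privately, then the partial sums are merged.
// Assumes n_threads > 0.
long sum_in_parallel(const std::vector<int>& values, unsigned n_threads) {
    per_thread<long> partial;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&partial, &values, t, n_threads] {
            long& local = partial.mine();   // hoisted: one lookup per thread
            for (std::size_t i = t; i < values.size(); i += n_threads)
                local += values[i];
        });
    }
    for (auto& w : workers) w.join();
    return partial.combine(0L, [](long a, long b) { return a + b; });
}
```

<p>Hoisting the call to mine() out of the inner loop is what keeps the lock and hash lookup off the hot path.</p>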
<p>There is an interesting alternative that weakens guarantees, with the aim of expressing intent at a slightly higher level. It's somewhat like an object-oriented extension of a semaphore that combines the semaphore with the resource it is protecting. It would work as follows. The method "mine()" could be replaced by two methods "acquire()" and "release()" [possibly sugared with RAII] such that: </p>
<ul>
<li>"acquire()" would grant access to an instance of T that is not being accessed by any other thread</li>
<li>"release()" would release access</li>
</ul>
<p>This interface permits an implementation to limit the number of "thread local" copies of T to what is actually necessary for concurrency, not what is necessary for one-per-thread. If T is really big, this could be advantageous. There's perhaps an issue with cache affinity.  However, a sufficiently clever implementation could bias towards grabbing the instance of T that the thread most recently had before. Of course, a conforming implementation could just use thread-local storage for each copy of T.</p>
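<p>A rough sketch of the acquire/release variant follows. Everything here is an assumption for illustration: the names exclusive_pool, handle, instance_count and for_each are hypothetical, a single mutex guards the pool, and no attempt is made at the cache-affinity bias mentioned above.</p>

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical pool: hands out instances of T that no other thread holds,
// creating a new one only when every existing instance is in use.
template<typename T>
class exclusive_pool {
    std::mutex mutex_;
    std::vector<std::unique_ptr<T>> all_;  // every instance ever created
    std::vector<T*> idle_;                 // instances not currently held
public:
    T* acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) {
            all_.push_back(std::unique_ptr<T>(new T()));
            return all_.back().get();
        }
        T* item = idle_.back();
        idle_.pop_back();
        return item;
    }
    void release(T* item) {
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(item);
    }
    std::size_t instance_count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return all_.size();
    }
    // Visits every instance; call only when no instance is held.
    template<typename F>
    void for_each(F f) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& p : all_) f(*p);
    }
    // RAII sugar: acquires on construction, releases on destruction.
    class handle {
        exclusive_pool& pool_;
        T* item_;
    public:
        explicit handle(exclusive_pool& pool)
            : pool_(pool), item_(pool.acquire()) {}
        ~handle() { pool_.release(item_); }
        handle(const handle&) = delete;
        handle& operator=(const handle&) = delete;
        T& operator*() const { return *item_; }
    };
};
```

<p>With this sketch, a loop that acquires and releases one handle at a time only ever creates a single instance of T, however many threads have existed. The count grows only when handles are actually held concurrently.</p>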
<p>Comments?</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Linked Lists - Incompatible with Parallel Programming?</title>
		<link>http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/</link>
		<comments>http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/#comments</comments>
		<pubDate>Thu, 20 Dec 2007 23:08:55 +0000</pubDate>
		<dc:creator>Arch Robison (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/</guid>
		<description><![CDATA[I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access. The answer is that we'd add one if: We could figure out semantics that are useful for parallel programming and We could implement it reasonably efficiently on current hardware. I usually try to avoid [...]]]></description>
			<content:encoded><![CDATA[<p>I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access. The answer is that we'd add one if:</p>
<ul>
<li>We could figure out semantics that are useful for parallel programming <em>and</em></li>
<li>We could implement it reasonably efficiently on current hardware.</li>
</ul>
<p>I usually try to avoid linked lists even for <em>sequential </em>programming if I'm programming for performance. My reasons are:</p>
<ul>
<li>Linked lists are often misused in ways that create asymptotic slowness. Usually the culprit is searching the list. Yes, similar abuses can occur for other data structures too. But lists seem to attract this abuse.</li>
<li>Linked lists are unfriendly to cache. Adjacent items in a list tend to be scattered in memory. With cache misses costing on the order of 100x a cache hit, this can be a significant performance issue.</li>
</ul>
<p>For parallel programming, add another flaw:</p>
<ul>
<li>Traversing a linked list is inherently serial. [Theorists will point out that traversal <em>can </em>be done in parallel if you have a processor per node in the list. Feel free to buy that many Intel processors -- I own Intel stock.]</li>
</ul>
<p>Two traditional attractions of linked lists are:</p>
<ol>
<li>Linked lists are about the easiest dynamic data structure to write from scratch.</li>
<li>Prepending and appending take O(1) time.</li>
</ol>
<p>But modern polymorphic languages like C++ provide dynamic data structures like std::vector and std::deque. You don't have to write them from scratch. Prepending or appending to a deque also takes O(1) time. Appending to a vector takes O(1) amortized time. Amortized time is the time averaged over many append operations.</p>
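<p>One way to see the amortization is to count how often a vector actually reallocates while appending. The helper below is a small sketch (count_reallocations is a made-up name). It infers a reallocation from a change in capacity(). The exact count depends on the library's growth policy, but with geometric growth it rises only logarithmically with n.</p>

```cpp
#include <cstddef>
#include <vector>

// Counts how many times std::vector changes capacity while appending n items.
// Each change corresponds to one reallocation and copy of the contents.
std::size_t count_reallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

<p>Most appends touch only one element. The occasional expensive append that copies everything is rare enough that the average cost per append stays constant.</p>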
<p>Here's a speed test you might want to try. Construct a container, append n items, walk the container once, and destroy it. Here's the code:</p>
<pre>template&lt;typename Container&gt;
int Iota( int n ) {
    // Construct a container and append n items.
    Container container;
    for( int i=0; i&lt;n; ++i )
        container.push_back(i);
    // Walk the container once.
    int sum = 0;
    for( typename Container::const_iterator j=container.begin(); j!=container.end(); ++j )
        sum += *j;
    // The container is destroyed on return.
    return sum;
}</pre>
<p>I tried this fragment on a Linux box and found that std::deque was slightly faster than std::list when n&gt;=3, and std::vector was slightly faster when n&gt;=10. When n&gt;=100, std::deque was more than <em>10x </em>faster than std::list, and even std::vector was more than 3x faster than std::list. So for very short collections, std::list might pay off. But for big collections, it's second-rate.</p>
<p>Of course, linked lists do have some virtues, notably when concatenating lists, splicing lists, and inserting items in the middle. I use lists when I need to do that. But getting back to parallelism, which of those operations make any sense in parallel programming? Concurrent splicing and inserting seem awfully tricky to use correctly. For example, if I really need to insert in the middle of the list, it must be because there is something special about the insertion context. But if there are other threads inserting at the same time, how do I know the context will not be broken?</p>
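<p>If you want to repeat the experiment, a timing driver along the following lines should do. It is a sketch, not the harness I used: time_iota is a hypothetical helper, it assumes a C++11 compiler for std::chrono, and it repeats Iota from above so that the fragment stands alone. Keep n below 65536 so the int sum does not overflow. Your numbers will vary with compiler, allocator, and hardware.</p>

```cpp
#include <chrono>
#include <deque>
#include <list>
#include <vector>

// Same test as above: append n items, walk the container once, destroy it.
template<typename Container>
int Iota(int n) {
    Container container;
    for (int i = 0; i < n; ++i)
        container.push_back(i);
    int sum = 0;
    for (typename Container::const_iterator j = container.begin();
         j != container.end(); ++j)
        sum += *j;
    return sum;
}

// Average seconds per call of Iota<Container>(n), measured over `repeats` calls.
template<typename Container>
double time_iota(int n, int repeats) {
    typedef std::chrono::steady_clock clock;
    volatile int sink = 0;   // keeps the compiler from discarding the work
    clock::time_point start = clock::now();
    for (int r = 0; r < repeats; ++r)
        sink = Iota<Container>(n);
    std::chrono::duration<double> elapsed = clock::now() - start;
    (void)sink;
    return elapsed.count() / repeats;
}
```

<p>For example, compare time_iota&lt;std::list&lt;int&gt; &gt;(100, 100000) against the same call for std::deque&lt;int&gt; and std::vector&lt;int&gt;, using an optimized build.</p>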
<p>The two operations on lists that I think could be useful in parallel programming are:</p>
<ol>
<li>concatenating two lists, in constant time</li>
<li>splitting a list into two sublists, or at least viewing it as two sublists, in constant time</li>
</ol>
<p>For example, a parallel reduction could use "concatenate" as its reduction operation, and thus build a list of N items in O(N/P+log(P)) time. The log(P) term arises from a tree reduction at the end. The problem is the second operation. To keep a list from becoming a serial bottleneck, we need a way to traverse it in parallel. That probably means it is no longer a linked list, but some kind of (balanced?) tree structure.</p>
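<p>The concatenation half is easy to sketch with std::list, whose whole-list splice runs in constant time. The code below is an illustration only: build_by_reduction and concatenate are made-up names, and both phases run serially here. The iterations within each phase are independent, so in a real parallel reduction each would go to a different worker.</p>

```cpp
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Constant-time concatenation: moves all nodes of `right` onto the end
// of `left`, leaving `right` empty.
void concatenate(std::list<int>& left, std::list<int>& right) {
    left.splice(left.end(), right);
}

// Builds the list 0..n-1 from p partial lists. Assumes p >= 1.
std::list<int> build_by_reduction(int n, int p) {
    std::vector<std::list<int> > parts(p);
    // Phase 1: worker k appends its contiguous block, about n/p items each.
    for (int k = 0; k < p; ++k) {
        int lo = static_cast<int>(static_cast<long long>(n) * k / p);
        int hi = static_cast<int>(static_cast<long long>(n) * (k + 1) / p);
        for (int i = lo; i < hi; ++i)
            parts[k].push_back(i);
    }
    // Phase 2: tree reduction. Each round pairs up neighbours, so about
    // log2(p) rounds are needed, and order is preserved.
    for (int stride = 1; stride < p; stride *= 2)
        for (int k = 0; k + stride < p; k += 2 * stride)
            concatenate(parts[k], parts[k + stride]);
    return std::move(parts[0]);
}
```

<p>Nothing comparable exists for the split: finding the middle of a std::list means walking it, which is exactly the serial bottleneck described above.</p>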
<p>I've had a recurring thought that we should add this kind of list, one that supports concatenation and splitting in constant time. But we really need motivating use cases before implementing it. Suggestions for good use cases or demos are appreciated.</p>
<p>- Arch Robison</p>
<p>P.S. I thought about writing this blog as a political attack ad, but given the technical details, decided against it. If I had done it that way, it would have started:</p>
<blockquote><p>Mr. Linked List is running for office. He's popular everywhere. But here's what Mr. Linked List doesn't want <em>you</em> to know... .</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2007/12/20/linked-lists-incompatible-with-parallel-programming/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
	</channel>
</rss>

