<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; TBB</title>
	<atom:link href="http://software.intel.com/en-us/blogs/tag/tbb/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Deterministic Reduction: a new Community Preview Feature in Intel® Threading Building Blocks</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/#comments</comments>
		<pubDate>Fri, 11 May 2012 10:22:42 +0000</pubDate>
		<dc:creator>Alexei Katranov (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Server]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Computer Arithmetic]]></category>
		<category><![CDATA[deterministic calculations]]></category>
		<category><![CDATA[floating point]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[parallel_deterministic_reduce]]></category>
		<category><![CDATA[parallel_reduce]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/</guid>
		<description><![CDATA[Computer Arithmetic has a lot of peculiarities [1]. One of these pitfalls is associativity failure in floating point arithmetic. For example, the two sums of fractions calculations below will not produce the same result when using floats: In a sequential program, it is not a big problem since the calculation order is exactly specified so [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Computer arithmetic has many peculiarities <a title="What every computer scientist should know about floating-point arithmetic, David Goldberg, Xerox Palo Alto Research Center, Palo Alto, CA, 1991." href="http://dx.doi.org/10.1145/103162.103163">[1]</a>. One of these pitfalls is the failure of associativity in floating-point arithmetic. For example, the two calculations below sum the same fractions in a different order, and they do not produce the same result when using <code>float</code>s:</p>
<h4 style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/formula.png"><img class="size-large wp-image-47370 aligncenter" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/formula-1024x219.png" alt="The sum of fractions depends on the calculation order" width="461" height="99" align="middle" /></a></h4>
<p style="text-align: justify;">In a sequential program, it is not a big problem since the calculation order is exactly specified so the result is predictable and repeatable. The situation is not so clear in parallel programming.</p>
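<p style="text-align: justify;">As a minimal sketch of the effect, independent of Intel TBB, the two helper functions below add the same three <code>float</code> values with different grouping. The function names and the sample values mentioned afterwards are assumptions chosen for illustration; they are not taken from the formula above.</p>
<pre name="code" class="cpp:nocontrols">// Adds three floats as (a + b) + c.
float sum_left_to_right(float a, float b, float c) {
    float partial = a + b;   // rounded to float before the next addition
    return partial + c;
}

// Adds the same three floats as a + (b + c).
float sum_right_to_left(float a, float b, float c) {
    float partial = b + c;   // rounded to float before the next addition
    return a + partial;
}</pre>
<p style="text-align: justify;">On a platform that evaluates <code>float</code> in IEEE 754 single precision, calling both functions with (1e8f, -1e8f, 1.0f) gives 1 for the first grouping and 0 for the second, because -1e8f + 1.0f rounds back to -1e8f.</p>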
<p style="text-align: justify;">To make the example parallel, I used the parallel_reduce template function from Intel® Threading Building Blocks (Intel® TBB):</p>
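<pre name="code" class="cpp:nocontrols">// Requires &lt;vector&gt;, &lt;numeric&gt;, &lt;functional&gt;, &lt;iostream&gt; and the TBB headers
// tbb/parallel_reduce.h and tbb/blocked_range.h; N is the element count.
typedef tbb::blocked_range&lt;std::vector&lt;float&gt;::iterator&gt; range_t;
std::vector&lt;float&gt; arr( N, 1.0f/(float)N );
float sum = tbb::parallel_reduce( range_t( arr.begin(), arr.end() ), 0.0f,
    // Sum one subrange, starting from the partial sum passed in.
    []( const range_t&amp; r, float partial ) {
        return std::accumulate( r.begin(), r.end(), partial );
    },
    std::plus&lt;float&gt;() );  // combine the partial sums of two subranges
std::cout &lt;&lt; sum &lt;&lt; std::endl;</pre>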
<p style="text-align: justify;">As in the examples above, the code calculates the sum of N fractions, but it uses multiple processor cores if available. As is well known, different orders of calculation can produce different results. If we run it 10 times with N=1000, we get something like this:</p>
<blockquote><p>0.999991<br />
1<br />
0.999999<br />
0.999996<br />
0.999998<br />
0.999998<br />
0.999998<br />
1<br />
0.999997<br />
0.999998</p></blockquote>
<p style="text-align: justify;">Note that the result differs from run to run! Although the developer specifies the calculations, the order in which they are performed is no longer under the developer’s control when they run in parallel.</p>
<p style="text-align: justify;">On the other hand, it is not as bad as all that. Although the OS schedules the threads and so introduces non-determinism into the application, it is still possible to control the order of calculations. One of the new features of Intel TBB 4.0 is the parallel_deterministic_reduce template algorithm. The algorithm has the same interface as parallel_reduce except that it does not allow you to specify a partitioner. (For parallel_reduce it is possible to pass a partitioner as the last argument.) We will discuss why this restriction exists later. But for now, let’s replace parallel_reduce with parallel_deterministic_reduce and look at how the result changes:</p>
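<pre name="code" class="cpp:nocontrols">// Same setup as the parallel_reduce example; only the algorithm name changes.
typedef tbb::blocked_range&lt;std::vector&lt;float&gt;::iterator&gt; range_t;
std::vector&lt;float&gt; arr( N, 1.0f/(float)N );
float sum = tbb::parallel_deterministic_reduce( range_t( arr.begin(), arr.end() ), 0.0f,
    // Sum one subrange, starting from the partial sum passed in.
    []( const range_t&amp; r, float partial ) {
        return std::accumulate( r.begin(), r.end(), partial );
    },
    std::plus&lt;float&gt;() );  // combine the partial sums of two subranges
std::cout &lt;&lt; sum &lt;&lt; std::endl;</pre>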
<p>Again run it 10 times:</p>
<blockquote><p>1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1</p></blockquote>
<p style="text-align: justify;">The key point here is that the result is the same from run to run.</p>
<p style="text-align: justify;">The sources of non-determinism in parallel_reduce derive from partitioning and body splitting. Let’s consider each of these subjects:</p>
<ul style="text-align: justify;">
<li>Partitioning. The simple_partitioner determines exactly how many and which subranges are created. It splits the iteration range until each subrange is no larger than the given grain size. Thus the behavior depends only on the range size and the grain size specified by the developer. However, the other partitioners in Intel TBB are non-deterministic: to improve the performance of the algorithms, the range splitting they perform depends on run-time stealing events, which we cannot predict.</li>
</ul>
<ul style="text-align: justify;">
<li>Body splitting. For performance reasons parallel_reduce minimizes body copies: it splits the body only when consecutive subranges are processed by different threads. Thus body splitting, like “advanced” partitioning, also depends on non-deterministic task stealing.</li>
</ul>
<p style="text-align: justify;">The example shows that parallel_reduce does not give repeatable results for non-associative operations such as floating-point addition. parallel_deterministic_reduce was developed to make the result of such a reduction repeatable. From the discussion of partitioning above, it follows that only the simple_partitioner can be used for parallel_deterministic_reduce, so no alternative partitioner can be chosen. Consequently, parallel_deterministic_reduce always requires us to choose an appropriate grain size. In addition, smart body splitting has been disabled for the sake of deterministic behavior, so a new body is created for each subrange. This makes grain size selection even more important: on the one hand, a small grain size increases the number of body copies and the overall overhead; on the other hand, a large grain size may lead to load imbalance and underutilization. Fig. 1 shows the relative performance of parallel_deterministic_reduce (simple_partitioner with various grain sizes) in comparison with parallel_reduce (auto_partitioner with default grain size). With an appropriate grain size, parallel_deterministic_reduce performs as well as parallel_reduce, but a poorly chosen grain size may lead to significant performance degradation, as shown in Fig. 1 at the extremes of the grain size axis.</p>
<h4 style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/chart.png"><img class="aligncenter size-full wp-image-47423" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/chart.png" alt="Fig.1. Comparison of parallel_reduce (auto_partitioner) and parallel_deterministic_reduce (simple_partitioner) on Pi calculation example." width="640" height="383" /></a><br />
Fig.1. Comparison of parallel_reduce (auto_partitioner) and parallel_deterministic_reduce (simple_partitioner) on Pi calculation example.</h4>
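<p style="text-align: justify;">To get a feel for how the grain size drives the number of subranges, and therefore the number of bodies, the following sketch counts the leaves produced by repeated halving. The function <code>count_subranges</code> is a hypothetical helper, not part of Intel TBB, and it assumes that a range is split in half for as long as it is larger than the grain size.</p>
<pre name="code" class="cpp:nocontrols">#include &lt;cstddef&gt;

// Counts the leaf subranges produced when a range of `size` iterations is
// halved repeatedly until a piece is no longer larger than `grain`.
// This mirrors the splitting rule described in the text; it is not TBB code.
std::size_t count_subranges(std::size_t size, std::size_t grain) {
    if (grain == 0) grain = 1;        // guard: treat 0 as the smallest grain
    if (size &lt;= grain) return 1;      // indivisible: one subrange, one body
    std::size_t left = size / 2;      // the left half gets the smaller share
    return count_subranges(left, grain) + count_subranges(size - left, grain);
}</pre>
<p style="text-align: justify;">Under these assumptions, <code>count_subranges(1000, 100)</code> gives 16 subranges of 62 or 63 elements each, while <code>count_subranges(1000, 1)</code> gives 1000 single-element subranges, each of which would need its own body.</p>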
<p style="text-align: justify;">To demonstrate the split-join order behavior of parallel_deterministic_reduce, a small example is given with range [0, 20) and grain size = 5, similar to examples for parallel_reduce in the Intel TBB Reference manual:</p>
<h4 style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/tree.png"><img class="aligncenter size-full wp-image-47427" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/tree.png" alt="A tree of subranges" width="410" height="141" /></a><br />
A tree of subranges</h4>
<p style="text-align: justify;">For each right node a new body is created by the body split constructor. The slash marks (/) in the tree show where the body split is performed. Thus, for the current example the parallel_deterministic_reduce will always produce 4 subranges and 4 different bodies associated with them. Each of these subranges may be executed in parallel. When both children of a node finish, the corresponding bodies are merged: the right child body “added” to the left child body (in our examples via the <code>std::plus&lt;float&gt;()</code> binary function).</p>
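<p style="text-align: justify;">The split-join order described above can be emulated serially. The sketch below is only an illustration under stated assumptions: <code>tree_sum</code> is a hypothetical function, it works on indices rather than on a <code>blocked_range</code>, and it runs on a single thread. It shows why the result depends only on the range size and the grain size: every leaf starts from a fresh identity value, and every right result is added to its left sibling.</p>
<pre name="code" class="cpp:nocontrols">#include &lt;cstddef&gt;
#include &lt;numeric&gt;
#include &lt;vector&gt;

// Serial emulation of the split-join tree: leaves are summed from a fresh
// identity value (a "new body"), and each right result is added to its left
// sibling. The shape of the tree depends only on the range size and `grain`.
float tree_sum(const std::vector&lt;float&gt;&amp; v, std::size_t begin, std::size_t end,
               std::size_t grain) {
    if (end - begin &lt;= grain || end - begin &lt; 2) {
        return std::accumulate(v.begin() + begin, v.begin() + end, 0.0f);
    }
    std::size_t middle = begin + (end - begin) / 2;
    float left = tree_sum(v, begin, middle, grain);
    float right = tree_sum(v, middle, end, grain);
    return left + right;   // join: the right result is merged into the left
}</pre>
<p style="text-align: justify;">Calling <code>tree_sum(v, 0, 20, 5)</code> on a 20-element vector visits the four leaves [0, 5), [5, 10), [10, 15) and [15, 20). Calling it with a different grain size changes the tree, so for <code>float</code> data the sum may change as well.</p>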
<p style="text-align: justify;">To conclude, parallel_deterministic_reduce provides a deterministic number and deterministic sizes of subranges, and it defines exactly which pairs of subranges are merged. It’s important to note that a repeatable result obtained with parallel_deterministic_reduce may still be different from that obtained via serial execution. Moreover, the results may be different for various grain sizes, since range splitting depends on the grain size. Also, the algorithm is not intended to improve the accuracy of calculations. The exact result of 1 in the above example of fraction sum calculation was obtained by chance. For other examples the algorithm can cause a decrease in accuracy. Overall, parallel_deterministic_reduce is not a replacement for parallel_reduce but an alternative solution for those who need repeatability.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Logic Simulation with the Intel® TBB Flow Graph, Part 3: Putting together a simulation</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/#comments</comments>
		<pubDate>Sat, 05 May 2012 17:00:39 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/</guid>
		<description><![CDATA[In Part 2 of this blog, I described a four-bit adder circuit built from components discussed in Part 1. In this last installment, I’ll continue using Intel®TBB’s flow graph to put together some signal input and output devices, and then use those to make a small simulation featuring the four-bit adder from Part 2. Let’s [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/">Part 2</a> of this blog, I described a four-bit adder circuit built from components discussed in <a href="http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/">Part 1</a>.  In this last installment, I’ll continue using Intel®TBB’s <em>flow graph</em> to put together some signal input and output devices, and then use those to make a small simulation featuring the four-bit adder from Part 2.</p>
<p>Let’s look at two input devices here, the <em>toggle</em> and the <em>pulse</em> (or as I would have liked to have called them, the <em>switch</em> and the <em>clock</em>).  A toggle sends a signal of high or low, switching between the two states every time it is “toggled” or flipped.  A pulse continually alternates between the high and low states at a given interval.  The <code>toggle</code> class is implemented as follows:</p>
<p>
<pre>
<blockquote>class toggle {
    graph&#038; my_graph;
    signal_t state;
    overwrite_node < signal_t > toggle_node;
 public:
    toggle(graph&#038; g) : my_graph(g), state(undefined), toggle_node(g) {}
    toggle(const toggle&#038; src) : my_graph(src.my_graph), state(undefined),
                                toggle_node(src.my_graph) {}
    ~toggle() {}
    // Assignment ignored
    toggle&#038; operator=(const toggle&#038; src) { return *this; }
    sender < signal_t > &#038; get_out() { return toggle_node; }
    void flip() {
        if (state==high) state = low;
        else state = high;
        toggle_node.try_put(state);
    }
    void activate() {
        state = low;
        toggle_node.try_put(state);
    }
};</blockquote>
</pre>
<p>The toggle is represented internally by an <code>overwrite_node</code>, because it simply needs to keep track of one most-recent state. As an input device, it doesn’t receive output from any other items, so it has no explicit input ports, only actions (flip, activate) which can alter the output state.  The output port can of course be acquired via <code>get_out</code>, so that the toggle can be used to send signals into a circuit.</p>
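<p>To see the state logic in isolation, here is a minimal plain-C++ stand-in that leaves out the flow graph entirely. The <code>signal_t</code> enumeration is an assumption (the real definition comes from Part 1), and <code>toggle_state</code> is a hypothetical name used only for this sketch.</p>
<pre>
<blockquote>// Assumed signal type; the real definition comes from Part 1 of this series.
enum signal_t { low = 0, high = 1, undefined = 2 };

// Plain-C++ stand-in for the toggle's state logic, without the flow graph.
struct toggle_state {
    signal_t state;
    toggle_state() : state(undefined) {}
    // Start in the low state, as toggle::activate does.
    signal_t activate() { state = low; return state; }
    // Switch between high and low; an undefined state also becomes high.
    signal_t flip() {
        state = (state == high) ? low : high;
        return state;
    }
};</blockquote>
</pre>
<p>Note that flipping a toggle that was never activated moves it from undefined straight to high, which is why the simulation activates every toggle before flipping any of them.</p>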
<p>The <code>pulse</code> class is a little more interesting:</p>
<p>
<pre>
<blockquote>class pulse {
    class clock_body {
        size_t& ms;
        int& reps;
        signal_t val;
    public:
        clock_body(size_t&#038; _ms, int&#038; _reps) : ms(_ms), reps(_reps), val(low) {}
        bool operator()(signal_t&#038; out) {
            rt_sleep(ms);  // our own portable sleep function
            if (reps>0) --reps;
            if (val==low) val = high;
            else val = low;
            out = val;
            return reps>0 || reps == -1;
        }
    };
    graph&#038; my_graph;
    size_t ms, init_ms;
    int reps, init_reps;
    source_node < signal_t > clock_node;

public:
    pulse(graph&#038; g, size_t _ms=1000, int _reps=-1) :
        my_graph(g), ms(_ms), init_ms(_ms), reps(_reps), init_reps(_reps),
        clock_node(g, clock_body(ms, reps), false)
    {}
    pulse(const pulse&#038; src) :
        my_graph(src.my_graph), ms(src.init_ms), init_ms(src.init_ms),
        reps(src.init_reps), init_reps(src.init_reps),
        clock_node(src.my_graph, clock_body(ms, reps), false)
    {}
    ~pulse() {}
    pulse&#038; operator=(const pulse&#038; src) {
        ms = src.ms; init_ms = src.init_ms;
        reps = src.reps; init_reps = src.init_reps;
        return *this;
    }
    sender < signal_t > &#038; get_out() { return clock_node; }
    void activate() { clock_node.activate(); }
    void reset() { reps = init_reps; }
};</blockquote>
</pre>
<p>This class is based on the <code>source_node</code>.  It generates a signal, alternating between low and high, every <code>ms</code> milliseconds.  There is also an option to repeat the alternation a certain number of times and then stop, which is useful for designing simulations that use a clock but also terminate.  The <code>source_node</code> body sleeps for a duration before flipping the signal and sending it.  It doesn’t begin sending signals immediately, but requires activation.  In the case of a non-infinite clock (<code>reps</code> is set), once the pulse object has run for the given number of repetitions, it can be reset and reactivated to use it again.</p>
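<p>The repetition logic in <code>clock_body</code> can be replayed without sleeping or building a graph. The sketch below is an illustration only: <code>simulate_pulse</code> and <code>max_calls</code> are hypothetical names, the <code>signal_t</code> enumeration is assumed, and the sketch assumes that a value is forwarded only when the body returns <code>true</code>.</p>
<pre>
<blockquote>#include &lt;vector&gt;

enum signal_t { low = 0, high = 1, undefined = 2 };  // assumed, as in Part 1

// Replays the clock_body logic without sleeping and collects the values that
// would be forwarded, assuming a value is sent only when the body returns true.
// `max_calls` bounds the loop so that reps == -1 (run forever) terminates here.
std::vector&lt;signal_t&gt; simulate_pulse(int reps, int max_calls) {
    std::vector&lt;signal_t&gt; sent;
    signal_t val = low;
    for (int call = 0; call &lt; max_calls; ++call) {
        if (reps &gt; 0) --reps;
        val = (val == low) ? high : low;
        bool more = reps &gt; 0 || reps == -1;
        if (!more) break;           // body returned false: nothing is sent
        sent.push_back(val);
    }
    return sent;
}</blockquote>
</pre>
<p>Under that assumption, <code>simulate_pulse(3, 10)</code> returns two values (high, then low), because the third call decrements <code>reps</code> to 0 and returns <code>false</code>. With <code>reps</code> set to -1 the loop only stops at <code>max_calls</code>.</p>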
<p>Next, we discuss two output devices, the <em>LED</em> and the <em>digit</em>.  The LED is simply a tiny light that is on while the signal it is receiving is high, and off when the signal is low. For simple text display, the LED looks like this: (*) when it is on and ( ) when it is off. The digit device receives a four-bit input and displays a single hexadecimal digit.  For simulations, both devices have the option of continuously displaying their state as it changes, or a silent mode, which displays only when a <code>display</code> method is called.</p>
<pre>
<blockquote>
class led {
    class led_body {
        signal_t &state;
        string &label;
        bool report_changes;
        bool touched;
    public:
        led_body(signal_t &#038;s, string &#038;l, bool r) :
            state(s), label(l), report_changes(r), touched(false)
        {}
        continue_msg operator()(signal_t b) {
            if (!touched || b!=state) {
                state = b;
                if (state != undefined &#038;& report_changes) {
                    if (state) printf("%s: (*)\n", label.c_str());
                    else printf("%s: ( )\n", label.c_str());
                }
                touched = true;  // remember that a first value has been seen
            }
            return continue_msg();
        }
    };
    graph&#038; my_graph;
    string label;
    signal_t state;
    bool report_changes;
    function_node < signal_t, continue_msg > led_node;
 public:
    led(graph&#038; g, string l, bool rc=false) : my_graph(g), label(l), state(undefined),
        report_changes(rc), led_node(g, 1, led_body(state, label, report_changes))
    {}
    led(const led&#038; src) : my_graph(src.my_graph), label(src.label), state(undefined),
        report_changes(src.report_changes),
        led_node(src.my_graph, 1, led_body(state, label, report_changes))
    {}
    ~led() {}
    led&#038; operator=(const led&#038; src) {
        label = src.label; state = undefined; report_changes = src.report_changes;
        return *this;
    }
    receiver < signal_t > &#038; get_in() { return led_node; }
    void display() {
        if (state == high) printf("%s: (*)\n", label.c_str());
        else if (state == low) printf("%s: ( )\n", label.c_str());
        else printf("%s: (u)\n", label.c_str());
    }
};</blockquote>
</pre>
<p>The <code>led</code> class contains a simple <code>function_node</code> that has no meaningful output (we use a <code>continue_msg</code> to indicate this) and thus no successors.  Another way to implement this would be with an <code>overwrite_node</code>, but we would lose the <code>report_changes</code> functionality.  Similarly, the <code>digit</code> class also cannot have successors, but we reused the <code>gate</code> base class to implement it, since it has multiple bits of input and needs to update its state whenever one of the inputs changes.</p>
<pre>
<blockquote>
class digit : public gate < four_input > {
    using gate < four_input > ::my_graph;
    typedef gate < four_input > ::ports_type ports_type;
    typedef gate < four_input > ::input_port_t input_port_t;
    class digit_body {
        signal_t ports[4];
        unsigned int &state;
        string &label;
        bool&#038; report_changes;
    public:
        digit_body(unsigned int &#038;s, string &#038;l, bool&#038; r) : state(s), label(l), report_changes(r) {
            for (int i=0; i < 4; ++i) ports[i] = undefined;  // ports has four entries
        }
        void operator()(const input_port_t::output_type&#038; v, ports_type&#038; p) {
            unsigned int new_state = 0;
            if (v.indx == 0) ports[0] = std::get < 0 > (v.result);
            else if (v.indx == 1) ports[1] = std::get < 1 > (v.result);
            else if (v.indx == 2) ports[2] = std::get < 2 > (v.result);
            else if (v.indx == 3) ports[3] = std::get < 3 > (v.result);
            if (ports[0] == high) ++new_state;
            if (ports[1] == high) new_state += 2;
            if (ports[2] == high) new_state += 4;
            if (ports[3] == high) new_state += 8;
            if (state != new_state) {
                state = new_state;
                if (report_changes) {
                    printf("%s: %x\n", label.c_str(), state);
                }
            }
        }
    };
    string label;
    unsigned int state;
    bool report_changes;
 public:
    digit(graph&#038; g, string l, bool rc=false) :
        gate < four_input > (g, digit_body(state, label, report_changes)),
        label(l), state(0), report_changes(rc) {}
    digit(const digit&#038; src) :
        gate < four_input > (src.my_graph, digit_body(state, label, report_changes)),
        label(src.label), state(0), report_changes(src.report_changes) {}
    ~digit() {}
    digit&#038; operator=(const digit&#038; src) {
        label = src.label; state = 0; report_changes = src.report_changes;
        return *this;
    }
    void display() { printf("%s: %x\n", label.c_str(), state); }
};</blockquote>
</pre>
<p>Because <code>digit</code> inherits from <code>gate</code>, it reuses <code>gate</code>’s <code>get_in</code> methods to connect to the ports of a <code>digit</code> object.</p>
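<p>The decoding step inside <code>digit_body</code> can be tried on its own. The following sketch assumes the same <code>signal_t</code> enumeration as Part 1 and uses a hypothetical free function, <code>digit_value</code>, in place of the body’s member state.</p>
<pre>
<blockquote>enum signal_t { low = 0, high = 1, undefined = 2 };  // assumed, as in Part 1

// Converts four input signals (index 0 = least significant bit) into the
// value shown by the digit device; low and undefined inputs both count as 0.
unsigned int digit_value(const signal_t ports[4]) {
    unsigned int value = 0;
    for (int i = 0; i &lt; 4; ++i) {
        if (ports[i] == high) value += 1u &lt;&lt; i;
    }
    return value;
}</blockquote>
</pre>
<p>For example, inputs of high, low, high, high on ports 0 to 3 give 1 + 4 + 8 = 13, which the device prints as the hexadecimal digit d.</p>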
<p>Here’s some example code to test out the four-bit adder. First, create a graph:</p>
<p>
<pre>
<blockquote>graph g;</blockquote>
</pre>
<p>Then, create the four-bit adder, some toggles with which to set the inputs to the adder, and a digit and an LED to display the output:</p>
<p>
<pre>
<blockquote>four_bit_adder four_adder(g);
std::vector < toggle > A(4, toggle(g));
std::vector < toggle > B(4, toggle(g));
toggle CarryIN(g);
digit Sum(g, "SUM");
led CarryOUT(g, "CarryOUT");</blockquote>
</pre>
<p>Next, connect our toggles to the input ports of the adder, and connect the adder’s output ports to the display devices:</p>
<p>
<pre>
<blockquote>for (int i=0; i<4; ++i) {
    make_edge(A[i].get_out(), four_adder.get_A(i));
    make_edge(B[i].get_out(), four_adder.get_B(i));
    make_edge(four_adder.get_out(i), Sum.get_in(i));
}
make_edge(CarryIN.get_out(), four_adder.get_CI());
make_edge(four_adder.get_CO(), CarryOUT.get_in());</blockquote>
</pre>
<p>We are almost ready to go. Activate all the toggles in the low state so that everything starts at zero:</p>
<p>
<pre>
<blockquote>for (int i=0; i<4; ++i) {
    A[i].activate();
    B[i].activate();
}
CarryIN.activate();</blockquote>
</pre>
<p>Now I can start flipping toggles.  I’ve set digit and led to display only when requested by default, because I don’t want to see all the changes before this circuit reaches a steady state.  Let’s try 8+5:</p>
<p>
<pre>
<blockquote>A[3].flip();
B[0].flip();
B[2].flip();</blockquote>
</pre>
<p>Wait for the circuit to reach a steady state:</p>
<p>
<pre>
<blockquote>g.wait_for_all();</blockquote>
</pre>
<p>Now display the results:</p>
<p>
<pre>
<blockquote>Sum.display();
CarryOUT.display();</blockquote>
</pre>
<p>And here they are:</p>
<blockquote><p><strong>SUM: d<br />
CarryOUT: ( )</strong></p></blockquote>
<p>And with that, I’ll wrap up this blog by saying that the logic simulation example code is available as an example in Intel® TBB 4.0 Update 4, and that it has several other interesting features, like push button and constant signal input devices, NAND and NOR gates, and a D-latch circuit example.  Please let us know of other interesting use cases for the <code>or_node</code> and any other feedback you’d be willing to give.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Logic Simulation with the Intel® TBB Flow Graph, Part 2: Building bigger components</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/#comments</comments>
		<pubDate>Fri, 04 May 2012 17:00:05 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/</guid>
		<description><![CDATA[In Part 1, I described how to put together a basic logic gate using the Intel® Threading Building Blocks flow graph nodes or_node and multifunction_node. In this blog, I will assume the basic logic gates and_gate, or_gate and xor_gate exist, and use them to construct a four-bit adder. To begin with, I’ll first construct a [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/">Part 1</a>, I described how to put together a basic logic gate using the Intel® Threading Building Blocks flow graph nodes <code>or_node</code> and <code>multifunction_node</code>.  In this blog, I will assume the basic logic gates <code>and_gate</code>, <code>or_gate</code> and <code>xor_gate</code> exist, and use them to construct a four-bit adder.</p>
<p>To begin with, I’ll first construct a one-bit full adder as in Figure 2 below:</p>
<div id="attachment_47264" class="wp-caption aligncenter" style="width: 629px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig2.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig2-e1335816473799.png" alt="One-bit full adder" title="DLSfig2" width="619" height="308" class="size-full wp-image-47264" /></a><p class="wp-caption-text">Figure 2</p></div>
<p>The inputs are A, B, and a Carry-in bit; the outputs are the sum S and a Carry-out bit.  Here is the code for the <code>one_bit_adder</code> class:</p>
<pre>
<blockquote>
class one_bit_adder {
    broadcast_node < signal_t > A_port;
    broadcast_node < signal_t > B_port;
    broadcast_node < signal_t > CI_port;
    xor_gate < two_input > FirstXOR;
    xor_gate < two_input > SecondXOR;
    and_gate < two_input > FirstAND;
    and_gate < two_input > SecondAND;
    or_gate < two_input > FirstOR;
    graph&#038; my_graph;
    void make_connections() {
        make_edge(A_port, FirstXOR.get_in(0));
        make_edge(A_port, FirstAND.get_in(0));
        make_edge(B_port, FirstXOR.get_in(1));
        make_edge(B_port, FirstAND.get_in(1));
        make_edge(CI_port, SecondXOR.get_in(1));
        make_edge(CI_port, SecondAND.get_in(1));
        make_edge(FirstXOR.get_out(), SecondXOR.get_in(0));
        make_edge(FirstXOR.get_out(), SecondAND.get_in(0));
        make_edge(SecondAND.get_out(), FirstOR.get_in(0));
        make_edge(FirstAND.get_out(), FirstOR.get_in(1));
    }
public:
    one_bit_adder(graph&#038; g) :
        my_graph(g), A_port(g), B_port(g), CI_port(g), FirstXOR(g),
        SecondXOR(g), FirstAND(g), SecondAND(g), FirstOR(g)
    {
        make_connections();
    }
    one_bit_adder(const one_bit_adder&#038; src) :
        my_graph(src.my_graph), A_port(src.my_graph), B_port(src.my_graph),
        CI_port(src.my_graph), FirstXOR(src.my_graph), SecondXOR(src.my_graph),
        FirstAND(src.my_graph), SecondAND(src.my_graph), FirstOR(src.my_graph)
    {
        make_connections();
    }
    ~one_bit_adder() {}
    receiver < signal_t > &#038; get_A() { return A_port; }
    receiver < signal_t > &#038; get_B() { return B_port; }
    receiver < signal_t > &#038; get_CI() { return CI_port; }
    sender < signal_t > &#038; get_out() { return SecondXOR.get_out(); }
    sender < signal_t > &#038; get_CO() { return FirstOR.get_out(); }
};</blockquote>
</pre>
<p>This implementation is almost a straightforward translation of the gates and their connections into the flow graph format.  The one complication is the addition of the <code>broadcast_node</code>s for each of the input ports.  The reason for this is simply to enable connection to a single port from outside of the adder.  Since each of the inputs is connected to two gates inside of the <code>one_bit_adder</code> object, there is no single port associated with them automatically. Adding the <code>broadcast_node</code>s enables us to provide the methods <code>get_A</code>, <code>get_B</code> and <code>get_CI</code> that each return a single port capable of receiving data.  So, in looking at the diagram above, you can think of the three <code>broadcast_node</code>s as standing in for the black junction circles that the three inputs are connected to directly.</p>
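<p>As a cross-check of the wiring, the same gate network can be written with built-in boolean operators. This is a sketch rather than part of the simulation library: <code>adder_result</code> and <code>full_add</code> are hypothetical names, and each intermediate value is named after the gate it stands for.</p>
<pre>
<blockquote>// Result of a one-bit full adder.
struct adder_result {
    bool sum;
    bool carry_out;
};

// Same wiring as make_connections(), using built-in boolean operators.
adder_result full_add(bool a, bool b, bool carry_in) {
    bool first_xor  = a != b;                  // FirstXOR
    bool first_and  = a &amp;&amp; b;                  // FirstAND
    bool second_xor = first_xor != carry_in;   // SecondXOR, the sum S
    bool second_and = first_xor &amp;&amp; carry_in;   // SecondAND
    adder_result r;
    r.sum = second_xor;
    r.carry_out = second_and || first_and;     // FirstOR, the Carry-out
    return r;
}</blockquote>
</pre>
<p>For instance, <code>full_add(true, true, false)</code> produces a sum of 0 with a Carry-out of 1, and <code>full_add(true, true, true)</code> produces a sum of 1 with a Carry-out of 1.</p>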
<p>To make the <code>four_bit_adder</code> class, simply chain together a set of four <code>one_bit_adder</code>s and connect the Carry-out port of each adder to the Carry-in port of the next adder, as shown in Figure 3 below:</p>
<p>
<div id="attachment_47265" class="wp-caption aligncenter" style="width: 606px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig3.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig3-e1335816730363.png" alt="Four-bit adder" title="DLSfig3" width="596" height="475" class="size-full wp-image-47265" /></a><p class="wp-caption-text">Figure 3</p></div>
<p>This time, the class is even more straightforward to implement, because no <code>broadcast_node</code>s are needed; every input already has exactly one internal connection.</p>
<p>
<pre>
<blockquote>class four_bit_adder {
    graph&#038; my_graph;
    std::vector < one_bit_adder > four_adders;
    void make_connections() {
        make_edge(four_adders[0].get_CO(), four_adders[1].get_CI());
        make_edge(four_adders[1].get_CO(), four_adders[2].get_CI());
        make_edge(four_adders[2].get_CO(), four_adders[3].get_CI());
    }
 public:
    four_bit_adder(graph&#038; g) : my_graph(g), four_adders(4, one_bit_adder(g)) {
        make_connections();
    }
    four_bit_adder(const four_bit_adder&#038; src) :
        my_graph(src.my_graph), four_adders(4, one_bit_adder(src.my_graph))
    {
        make_connections();
    }
    ~four_bit_adder() {}
    receiver < signal_t > &#038; get_A(size_t bit) {
        return four_adders[bit].get_A();
    }
    receiver < signal_t > &#038; get_B(size_t bit) {
        return four_adders[bit].get_B();
    }
    receiver < signal_t > &#038; get_CI() {
        return four_adders[0].get_CI();
    }
    sender < signal_t > &#038; get_out(size_t bit) {
        return four_adders[bit].get_out();
    }
    sender < signal_t > &#038; get_CO() {
        return four_adders[3].get_CO();
    }
};</blockquote>
</pre>
<p>Here, the constructor makes a vector of exactly four adders, and connects the Carry-out ports to the Carry-in ports as appropriate.  The multi-bit inputs and outputs have port access methods that take a bit index as a parameter.  So for example, to get the input port for bit 2 of input B, you would use <code>get_B(2)</code>.</p>
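<p>To see what the chained carry computes, independent of the flow graph, here is a minimal sketch in plain C++.  It assumes the standard full-adder equations and treats bit 0 as the least significant bit; the names <code>full_add</code>, <code>ripple_add4</code> and <code>add4</code> are made up for this illustration and are not part of the simulation library.</p>

```cpp
#include <cassert>
#include <cstddef>

// Result of adding three one-bit values (hypothetical helper type).
struct bit_result { bool sum; bool carry_out; };

// One-bit full adder on plain bools; plays the role of one_bit_adder.
inline bit_result full_add(bool a, bool b, bool carry_in) {
    bit_result r;
    r.sum = (a != b) != carry_in;                      // a XOR b XOR carry_in
    r.carry_out = (a && b) || (carry_in && (a != b));  // carry generated or propagated
    return r;
}

// Four-bit ripple-carry add. The four sum bits are written to out;
// the return value is the final Carry-out.
inline bool ripple_add4(const bool a[4], const bool b[4], bool carry_in, bool out[4]) {
    bool carry = carry_in;
    for (std::size_t bit = 0; bit < 4; ++bit) {
        bit_result r = full_add(a[bit], b[bit], carry);
        out[bit] = r.sum;
        carry = r.carry_out;  // Carry-out of this bit feeds Carry-in of the next
    }
    return carry;
}

// Convenience wrapper: adds two 4-bit values and returns a 5-bit result
// whose bit 4 is the final Carry-out.
inline unsigned add4(unsigned a, unsigned b, bool carry_in = false) {
    assert(a < 16 && b < 16);
    bool abits[4], bbits[4], out[4];
    for (std::size_t bit = 0; bit < 4; ++bit) {
        abits[bit] = ((a >> bit) & 1u) != 0;
        bbits[bit] = ((b >> bit) & 1u) != 0;
    }
    bool carry = ripple_add4(abits, bbits, carry_in, out);
    unsigned result = carry ? 16u : 0u;
    for (std::size_t bit = 0; bit < 4; ++bit)
        if (out[bit]) result |= (1u << bit);
    return result;
}
```

<p>The loop in <code>ripple_add4</code> corresponds to the three <code>make_edge</code> calls in <code>make_connections</code>: each stage hands its carry to the next one.  In the flow graph version, that hand-off happens as a message between nodes rather than as a local variable.</p>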
<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/">Part 3</a>, I will present some interesting input and output devices to add to the logic simulation library, and with those, I’ll put together a small simulation that shows the <code>four_bit_adder</code> in action.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Logic Simulation with the Intel® TBB Flow Graph, Part 1: Using the or_node</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/#comments</comments>
		<pubDate>Thu, 03 May 2012 17:00:56 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[or_node]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/</guid>
		<description><![CDATA[In this multi-part blog, I’m going to show you how to put together a simple logic simulation program using the Intel® Threading Building Blocks flow graph feature. Please note that this example does NOT demonstrate a practical approach to digital logic simulation. The purpose of the example is to demonstrate the use of several types [...]]]></description>
			<content:encoded><![CDATA[<p>In this multi-part blog, I’m going to show you how to put together a simple logic simulation program using the Intel® Threading Building Blocks <em>flow graph</em> feature. Please note that this example does NOT demonstrate a practical approach to digital logic simulation.  The purpose of the example is to demonstrate the use of several types of flow graph nodes and how they can be composed to make more interesting components.  I’ll start by designing basic logic gates that are composed of flow graph nodes.</p>
<p>Consider an AND gate.  In its simplest form, it takes two inputs, and produces a single output.  The first thing that comes to mind to represent this is the flow graph <code>function_node</code>:  it could take a pair as input, and a body that computes the logical AND operation on the items in the pair, and puts out the result as its output.  That might work, but let’s think a little more about how such a gate might receive its two input signals: a <code>function_node</code> takes a single argument, so I’d have to group the two inputs together.  However, both inputs will be coming from different senders, and may not be available at the same time. Should I preface the <code>function_node</code> with a <code>join_node</code>? Possibly, but there’s still a limitation with a <code>join_node</code>: it gathers together the inputs and when it has received the full complement, it then sends them along as a tuple.  But this still isn’t exactly the behavior I want.  What I really want is when either of the inputs becomes available, the <code>function_node</code> should be told about it, because it will need to change its output value when any of its input values change. </p>
<p>Thus, the first decision about gates is this: Gates are responsive: when any input changes, the gate will check if its output needs to change. To simplify this a little, and make our flow graph have to do a little less work, I’ll make this second decision: Gates are lazy; a gate will send data to its output port only when that data differs from the previous value sent to that output port.  This will certainly reduce the number of tasks doing redundant work in the graph. </p>
<p>So, on the input side, something reports changes on any input port, and on the output side, something produces output, or not, depending on whether the output value has changed. Neither of these behaviors corresponds exactly to a <code>function_node</code>.  However, the new feature <code>multifunction_node</code> (formerly the <a href="http://software.intel.com/en-us/articles/intel-tbb-community-preview-features/">Community Preview feature</a> (CPF) <code>multioutput_function_node</code>) can certainly meet the output needs: it can optionally produce an output.  For the input, if the title of this blog hasn’t given it away already, my choice is the <code>or_node</code>.  The <code>or_node</code> will pass along any input it receives on any input port at any time, giving exactly the responsiveness I need.  The <code>or_node</code> is currently a CPF in Intel® TBB.</p>
<div id="attachment_47231" class="wp-caption aligncenter" style="width: 280px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig1.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig1-e1335811781181.png" alt="gate template" title="DLSfig1" width="270" height="97" class="size-full wp-image-47231" /></a><p class="wp-caption-text">Figure 1</p></div>
<p>Figure 1 illustrates this basic logic gate design.  Note that the <code>or_node</code> takes a variable number of inputs – no need to limit it to two – and the <code>multifunction_node</code> takes a body that produces either no output or one output. In general, the <code>multifunction_node</code> can produce zero or more outputs of varying types, but for the gate implementation, zero or one output will suffice.  Let’s take a look at the actual code for this, the <code>gate</code> template class.</p>
<p>First, I set up a type <code>signal_t</code> to represent the signal data being transferred.  Since I’m allowing the gates to fire only when the output state changes, it helps to have an additional <code>undefined</code> state for initialization.</p>
<p>
<pre>
<blockquote>typedef enum { low=0, high, undefined } signal_t;</blockquote>
</pre>
<p>Next, I define a few potential input configurations to gates.  I could go all out and add <code>eight_input</code> gates, but I couldn’t dredge up a use for them from the dark and rarely-visited corner of my brain where I keep the knowledge left over from a digital logic course so many years ago.</p>
<pre>
<blockquote>typedef tuple < signal_t > one_input;
typedef tuple < signal_t, signal_t > two_input;
typedef tuple < signal_t, signal_t, signal_t > three_input;
typedef tuple < signal_t, signal_t, signal_t, signal_t > four_input;
</blockquote>
</pre>
<p>Now I’m ready to set up the gate template.</p>
<pre>
<blockquote>template < typename GateInput >
class gate {
protected:
    typedef or_node < GateInput > input_port_t;
    typedef multifunction_node < typename input_port_t::output_type, tuple < signal_t > > gate_fn_t;
    typedef typename gate_fn_t::output_ports_type ports_type;
public:
    static const int N = std::tuple_size < GateInput > ::value;

    template < typename Body >
    gate(graph&#038; g, Body b) : my_graph(g), in_ports(g), gate_fn(g, 1, b) {
        make_edge(in_ports, gate_fn);
    }
    virtual ~gate() {}
    virtual gate&#038; operator=(const gate&#038; src) { return *this; }
    sender < signal_t > &#038; get_out() { return output_port < 0 > (gate_fn); }
    receiver < signal_t > &#038; get_in(size_t port) {
        return gate_helper < N > ::get_inport(in_ports, (int)port);
    }
protected:
    graph&#038; my_graph;
private:
    input_port_t in_ports;
    gate_fn_t gate_fn;
};</blockquote>
</pre>
<p>The class is templated by the input configuration, <code>GateInput</code>, so for example, I would pass in <code>two_input</code> if I wanted to make a gate with two inputs. Then I define two types. First, <code>input_port_t</code>, which is the type of the <code>or_node</code> that I’ll pass the input configuration to, as specified by <code>GateInput</code>. Second is <code>gate_fn_t</code>, which is the <code>multifunction_node</code> that takes the output from the <code>or_node</code>, performs the function of the gate, and outputs a single <code>signal_t</code> (or nothing).  These types are used to declare the actual graph nodes <code>in_ports</code> and <code>gate_fn</code>, in the private section of the class above.</p>
<p>The <code>gate</code> constructor initializes the two graph nodes, making them belong to a graph <code>g</code> that is passed in as a reference parameter.  Additionally, the constructor takes a function object <code>b</code> that performs the actual logical operation on the inputs to the gate, and determines what the new output will be.  So in the case of an AND gate, I would pass in a function object that computes a logical AND operation.  The constructor also completes this small component by connecting the two graph nodes with the <code>make_edge</code> function.</p>
<p>In order to connect this gate to other components, I’ve provided methods to access the input ports and the output port.  <code>get_in</code> takes a port number and returns a reference to an input port capable of receiving data, i.e. a <code>receiver<signal_t>&#038;</code> in the flow graph jargon.   It uses the <code>gate_helper<N>::get_inport</code> function shown below to extract the input port to the <code>or_node</code>.  <code>get_out</code> returns a reference to the output port of the <code>multifunction_node</code> which is capable of sending data, i.e. a <code>sender<signal_t>&#038;</code>.</p>
<p>
<pre>
<blockquote>template < int N >
struct gate_helper {
    template < typename TupleType >
    static inline receiver < signal_t > &#038; get_inport(or_node < TupleType > &#038; in_ports, int port) {
        if (N-1 == port) return input_port < N-1 > (in_ports);
        else return gate_helper < N-1 > ::get_inport(in_ports, port);
    }
};
template < >
struct gate_helper < 1 > {
    template < typename TupleType >
    static inline receiver < signal_t > &#038; get_inport(or_node < TupleType > &#038; in_ports, int port) {
        return input_port < 0 > (in_ports);
    }
};
</blockquote>
</pre>
<p>Now that I have a building block for creating a wide variety of logic gates, I’ll use it for designing an AND gate.  When creating the derived class <code>and_gate</code>, the main purpose is to define the functor that gets passed to the <code>gate_fn</code> object inside the <code>gate</code> base class.  <code>and_body</code> computes a logical AND operation over all the inputs to the gate, including undefined inputs, so the function is not completely trivial.</p>
<p>
<pre>
<blockquote>template < typename GateInput >
class and_gate : public gate < GateInput > {
    using gate < GateInput > ::N;
    using gate < GateInput > ::my_graph;
    typedef typename gate < GateInput > ::ports_type ports_type;
    typedef typename gate < GateInput > ::input_port_t input_port_t;
    class and_body {
        signal_t ports[N];
        signal_t state;
        bool touched;
    public:
        and_body() : state(undefined), touched(false) {
            for (int i=0; i < N; ++i) ports[i] = undefined;
        }
        void operator()(const typename input_port_t::output_type&#038; v, ports_type&#038; p) {
            ports[v.indx] = or_output_helper < N > ::get_or_output(v);
            signal_t new_state=high;
            size_t i=0;
            while (i < N) {
                if (ports[i] == low) {
                    // a single low input forces the output low
                    new_state = low;
                    break;
                }
                else if (ports[i] == undefined &#038;&#038; new_state != low)
                    new_state = undefined;
                ++i;
            }
            if (!touched || state != new_state) {
                state = new_state;
                std::get < 0 > (p).try_put(state);
                touched = true;
            }
        }
    };
 public:
    and_gate(graph&#038; g) : gate < GateInput > (g, and_body()) {}
    and_gate(const and_gate < GateInput > &#038; src) : gate < GateInput > (src.my_graph, and_body()) {}
    ~and_gate() {}
};</blockquote>
</pre>
<p>The <code>and_body</code> keeps track of the states of the gate’s input ports and output port.  These are all initially <code>undefined</code>.  The <code>operator()</code> for <code>and_body</code> receives the <code>or_node</code> output in parameter <code>v</code>, which indicates that data was received on one of the input ports.  The input port that received data is specified in <code>v.indx</code>.  Accessing the data from that port is a little more challenging, as the entire input tuple is passed in <code>v.result</code>.  I wrote a helper function <code>or_output_helper<N>::get_or_output</code> to select the <code>v.indx</code>-th port of the tuple <code>v.result</code>.  This value is used to update the locally stored state of the appropriate port, and then the new output state is calculated.  The new state is checked to see if it differs from the old state, and if so, the new state is sent out on the appropriate output port of the <code>multifunction_node</code> (which in this case, since there is only one output, is always port zero).  Note also that the very first time a gate receives data, i.e. when <code>touched</code> is false, the new state is sent out even if it is not different from the initial state.  This is useful when the gate is a part of a larger circuit.  It allows any initial settings on input ports to propagate through the graph and register at any possible output devices that might exist.</p>
<p>The helper function that extracts the <code>or_node</code> output is as follows:</p>
<p>
<pre>
<blockquote>template < int N >
struct or_output_helper {
    template < typename OrOutputType >
    static inline signal_t get_or_output(const OrOutputType&#038; out) {
        if (N-1 == out.indx) return std::get < N-1 > (out.result);
        else return or_output_helper < N-1 > ::get_or_output(out);
    }
};
template < >
struct or_output_helper < 1 > {
    template < typename OrOutputType >
    static inline signal_t get_or_output(const OrOutputType&#038; out) {
        return std::get < 0 > (out.result);
    }
};</blockquote>
</pre>
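<p>The recursion used by both helpers can be tried outside the flow graph.  The following is a small sketch, assuming a C++11 compiler with <code>std::tuple</code>; the names <code>tuple_select</code> and <code>select3</code> are invented for this illustration.  It shows how a run-time index is matched against the compile-time indices that <code>std::get</code> requires.</p>

```cpp
#include <cassert>
#include <tuple>

// Recursive case: compare the run-time index against N-1, otherwise
// fall through to the next smaller compile-time index.
template <int N>
struct tuple_select {
    template <typename Tuple>
    static int get(const Tuple& t, int index) {
        if (N - 1 == index) return std::get<N - 1>(t);
        return tuple_select<N - 1>::get(t, index);
    }
};

// Base case: only element 0 is left, so return it.
template <>
struct tuple_select<1> {
    template <typename Tuple>
    static int get(const Tuple& t, int) {
        return std::get<0>(t);
    }
};

// Example entry point for a three-element tuple of ints.
inline int select3(const std::tuple<int, int, int>& t, int index) {
    assert(index >= 0 && index < 3);
    return tuple_select<3>::get(t, index);
}
```

<p>Each level of the recursion is a separate instantiation, so the compiler generates a short chain of comparisons, one per tuple element.</p>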
<p>Given an AND gate, it’s easy to see how to make OR gates and any other sort of basic logic gate from the base class <code>gate</code>.</p>
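<p>As an illustration, the state calculation for an OR gate might look like the sketch below.  It assumes the dual of the AND rule above: any <code>high</code> input forces the output <code>high</code>, and otherwise an <code>undefined</code> input leaves the output <code>undefined</code>.  The function name <code>or_state</code> is hypothetical, and the sketch covers only the calculation, not the <code>or_body</code> functor around it.</p>

```cpp
#include <cassert>
#include <cstddef>

enum signal_t { low = 0, high, undefined };

// Three-valued OR over n stored port states (assumed semantics, see above).
inline signal_t or_state(const signal_t* ports, std::size_t n) {
    assert(ports != 0 && n > 0);
    signal_t new_state = low;
    for (std::size_t i = 0; i < n; ++i) {
        if (ports[i] == high) return high;                 // one high input decides the result
        if (ports[i] == undefined) new_state = undefined;  // unknown unless a high shows up
    }
    return new_state;
}
```

<p>Inside an <code>or_gate</code> class, this loop would take the place of the <code>while</code> loop in <code>and_body::operator()</code>; the handling of <code>touched</code> and the <code>try_put</code> call would stay the same.</p>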
<p>As the <code>or_node</code> is currently a Community Preview feature, it’s a good time to have a look at it and give us your feedback.</p>
<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/">Part 2</a> of this blog, I’ll show you how to put together a variety of basic logic gates to make a four-bit adder.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Aggregator: a new Community Preview Feature in Intel® Threading Building Blocks</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/#comments</comments>
		<pubDate>Wed, 02 May 2012 17:00:48 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Community preview feature]]></category>
		<category><![CDATA[Concurrency control]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/</guid>
		<description><![CDATA[Intel® Threading Building Blocks (Intel® TBB) 4.0 Update 4 introduces a new Community Preview feature, the aggregator. An internal version of the aggregator has been in use in Intel® TBB for some time, appearing in the flow graph and concurrent priority queue implementations. An aggregator is like a mutex in that it enforces mutually exclusive [...]]]></description>
			<content:encoded><![CDATA[<p>Intel® Threading Building Blocks (Intel® TBB) 4.0 Update 4 introduces a new <a href="http://software.intel.com/en-us/articles/intel-tbb-community-preview-features/?wapkw=community+preview+feature">Community Preview feature</a>, the <em>aggregator</em>.  An internal version of the aggregator has been in use in Intel® TBB for some time, appearing in the flow graph and concurrent priority queue implementations. An aggregator is like a mutex in that it enforces mutually exclusive access to a critical section of program code.  However, it can perform better than a mutex in many cases. It differs significantly from a mutex in how it works, and that can have deeper implications on how it performs and how it can be used.  It does its magic by aggregating the critical sections from multiple threads into a single critical section executed by a single thread, which can have a significant impact on cache performance. </p>
<p>There are two modes of use for this feature: basic mode and expert mode.  Basic mode is straightforward and not much more complex than using a mutex. Expert mode requires some understanding of how the aggregator works, and additional coding, but can enable additional performance improvements. In this blog, I will first illustrate how to use the aggregator in the basic mode.  Then I’ll give a brief overview of how the aggregator works, followed by an example of how to use the aggregator in the expert mode.  Finally, I’ll examine the performance of the aggregator and suggest approaches to help decide whether or not to use it.</p>
<p><strong>Side-by-side Comparison of Basic Aggregator Usage with Mutex Usage</strong></p>
<p>In this simple example, I’ll compare the usage of a mutex with an aggregator to lock <code>push</code> and <code>pop</code> operations on a serial priority queue object of type <code>std::priority_queue</code>.  This example uses C++11 features, such as lambdas, but one could use function objects instead.  Fair warning: I’m interspersing code snippets below, because this blog format doesn’t allow for side-by-side code comparison.  Please don’t try to use both a mutex and an aggregator to protect the same code.</p>
<p>First, declare the priority queue.  I'll use a simple integer priority queue here:</p>
<p>
<pre>
<blockquote>typedef int value_type;
typedef priority_queue < value_type, std::vector < value_type > , compare_type > pq_t;
pq_t my_pq;</blockquote>
</pre>
<p>Declare a mutex to protect <code>my_pq</code>:</p>
<p>
<pre>
<blockquote>spin_mutex my_mutex;</blockquote>
</pre>
<p>Alternatively, declare an aggregator to protect <code>my_pq</code>:</p>
<p>
<pre>
<blockquote>aggregator my_aggregator;</blockquote>
</pre>
<p>Declare an element to push/pop from queue:</p>
<p>
<pre>
<blockquote>value_type elem = 42;</blockquote>
</pre>
<p>Now, push an element on the queue using the mutex:</p>
<p>
<pre>
<blockquote>{
    tbb::spin_mutex::scoped_lock my_lock(my_mutex);
    my_pq.push(elem);
}</blockquote>
</pre>
<p>Or, push the element on the queue using the aggregator and a lambda expression:</p>
<p>
<pre>
<blockquote>my_aggregator.execute( [&#038;my_pq, &#038;elem](){
    my_pq.push(elem);
} );</blockquote>
</pre>
<p>Pop an element off the queue using the mutex:</p>
<p>
<pre>
<blockquote>bool result = false;
{
    tbb::spin_mutex::scoped_lock my_lock(my_mutex);
    if (!my_pq.empty()) {
        result = true;
        elem = my_pq.top();
        my_pq.pop();
    }
}</blockquote>
</pre>
<p>Pop an element off the queue using the aggregator:</p>
<p>
<pre>
<blockquote>bool result = false;
my_aggregator.execute( [&#038;my_pq, &#038;elem, &#038;result](){
    if (!my_pq.empty()) {
        result = true;
        elem = my_pq.top();
        my_pq.pop();
    }
} );</blockquote>
</pre>
<p><strong>How the Aggregator Works</strong></p>
<p>As we see above, the usage of the aggregator in basic mode is trivially different from using a mutex.  However, it is clearly working in a different way.  In order to execute a critical section, you pass it to an aggregator via the <code>execute</code> method.  When the <code>execute</code> method returns, the critical section has been executed, but how this happened is hidden inside the black box of the aggregator.  </p>
<p>Looking at the header file <code>aggregator.h</code> that defines the <code>aggregator</code>, these details become clear.  To use the aggregator in expert mode, you should have some familiarity with the header file, and I'll guide you through the most important features in the rest of this blog.</p>
<p>First note that aggregator inherits from a class <code>aggregator_ext</code> that takes a template parameter.  <code>Aggregator</code> instantiates that template parameter with a simple handler defined in the header, <code>handler_type = internal::basic_handler</code>.  We will discuss this more later.</p>
<p>The <code>execute</code> method of <code>aggregator</code> takes a function body as parameter, and encapsulates <code>body</code> in a <code>basic_operation</code> object, which inherits from <code>aggregator_operation</code>.  <code>Aggregator_operation</code>s are sent to the <code>aggregator_ext</code>’s <code>mailbox</code> where they may concurrently accumulate while they await execution.  One thread, the <em>active handler</em>, i.e. the first thread to place an <code>aggregator_operation</code> in the empty <code>mailbox</code>, will grab all the operations that have accumulated there, effectively emptying the <code>mailbox</code>.  It will then go through all the operations that it grabbed, and serially execute the function bodies stored in those objects.  The mechanism used to execute function bodies is specified by <code>aggregator_ext</code>’s template parameter, which in the default case is called <code>basic_handler</code>.</p>
<p>This <code>basic_handler</code> is straightforward in its functioning: it is passed the list of <code>aggregator_operation</code>s, and it loops through this list and handles each item.  It makes use of a few methods on <code>aggregator_operation</code> to do this properly: <code>next</code> is used to traverse to the next operation in the list, <code>start</code> prepares the operation to be handled, and <code>finish</code> is called after the operation is handled to inform the thread waiting on the execution of the operation that the operation is completed.  When all operations are handled, the active handler thread can leave the <code>aggregator</code>, since its own call to <code>execute</code> has been satisfied in the process.</p>
<p>The details of the synchronization that make this all possible can be found in <code>aggregator.h</code>.  We won’t explain them fully here, because we already have enough information to proceed to use the aggregator in expert mode.  It is enough to know that threads hand over critical sections to the aggregator, and one of these threads will execute all the operations serially on behalf of the other threads as a single critical section.</p>
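<p>To make the idea concrete, here is a deliberately simplified sketch of the same scheme in standard C++11.  It is not the code in <code>aggregator.h</code>: the class name <code>mini_aggregator</code>, its members, and the use of <code>std::function</code> and a busy flag are all assumptions made for this illustration.</p>

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

class mini_aggregator {
    struct operation {
        std::function<void()> body;
        operation* next;
        std::atomic<bool> done;
        explicit operation(std::function<void()> b)
            : body(b), next(nullptr), done(false) {}
    };
    std::atomic<operation*> mailbox;   // lock-free stack of pending operations
    std::atomic<bool> handler_busy;    // true while some thread runs a batch
public:
    mini_aggregator() : mailbox(nullptr), handler_busy(false) {}

    template <typename Body>
    void execute(Body b) {
        operation op(b);
        // Push this operation onto the mailbox.
        operation* old_head = mailbox.load();
        do {
            op.next = old_head;
        } while (!mailbox.compare_exchange_weak(old_head, &op));

        if (old_head == nullptr) {
            // The mailbox was empty, so this thread becomes the active handler.
            // First wait until any previous handler has finished its batch.
            while (handler_busy.exchange(true)) std::this_thread::yield();
            operation* list = mailbox.exchange(nullptr);  // grab the whole batch
            assert(list != nullptr);                      // it holds at least op
            while (list) {
                operation* next = list->next;  // read before signalling: the owner
                                               // may destroy its operation afterwards
                list->body();
                list->done.store(true);
                list = next;
            }
            handler_busy.store(false);
        } else {
            // Another thread will run this operation; wait until it has done so.
            while (!op.done.load()) std::this_thread::yield();
        }
    }
};
```

<p>Calling <code>execute</code> with a lambda, as in the basic-mode snippets above, returns only after the lambda has run, whether on the calling thread or on the active handler.  The sketch uses the default sequentially consistent atomics for simplicity and makes no attempt to match the performance of the real aggregator.</p>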
<p><strong>Using the Aggregator in Expert Mode</strong></p>
<p>I’ll use the same example as before, allowing threads to safely push and pop to a serial <code>std::priority_queue</code>.  The expert mode of aggregator allows the user to pass any sort of data in to the aggregator as an <code>aggregator_operation</code> via the <code>process</code> method (note the different method name – we were using <code>execute</code> in basic mode), along with an aggregating function object that is called by the active handler to perform the serial execution of operations.  In this case, I’ll pass data about a push or pop operation to the aggregator via <code>process</code>, and provide a custom function object to perform the operations.</p>
<p>First, create a class derived from <code>aggregator_operation</code> to hold the operation data.</p>
<p>
<pre>
<blockquote>class op_data : public aggregator_operation {
public:
    value_type* elem;
    bool success;
    bool is_push;
    op_data(value_type* e, bool push=false): elem(e), success(false), is_push(push) {}
};</blockquote>
</pre>
<p>Then, create a handler to pass in as the aggregator’s template parameter:</p>
<p>
<pre>
<blockquote>class my_handler_t {
    pq_t *pq;
public:
    my_handler_t() {}
    my_handler_t(pq_t *pq_) : pq(pq_) {}
    void operator()(aggregator_operation* op_list) {
        op_data* tmp;
        while (op_list) {
            tmp = (op_data*)op_list;
            op_list = op_list->next();
            tmp->start();
            // handle tmp here
            if (tmp->is_push) pq->push(*(tmp->elem));
            else {
                if (!pq->empty()) {
                    tmp->success = true;
                    *(tmp->elem) = pq->top();
                    pq->pop();
                }
            }
            // done handling tmp
            tmp->finish();
        }
    }
};</blockquote>
</pre>
<p>Now, to create an aggregator, use the <code>aggregator_ext</code> type name and pass this handler’s type in as the template parameter, and initialize the handler and pass it in as an argument to the constructor:</p>
<p>
<pre>
<blockquote>aggregator_ext < my_handler_t > my_aggregator(my_handler_t(&#038;my_pq));</blockquote>
</pre>
<p>To perform a push, simply create the <code>op_data</code> node with the push information and pass it to <code>process</code>:</p>
<p>
<pre>
<blockquote>op_data my_push_op(&#038;elem, true);
my_aggregator.process(&#038;my_push_op);</blockquote>
</pre>
<p>And to perform a pop:</p>
<p>
<pre>
<blockquote>bool result;
op_data my_pop_op(&#038;elem);
my_aggregator.process(&#038;my_pop_op);
result = my_pop_op.success;</blockquote>
</pre>
<p><strong>When to use Aggregator and why use Expert Mode?</strong></p>
<p>A good way to start is to compare the performance of your code using your current locking mechanism to a version of your code that uses an aggregator instead.  In practice, we (developers of TBB) have often found that a mutex is sufficient and outperforms aggregator when contention on the critical region is low. For higher contention, we often find that the use of the aggregator is justified.</p>
<p>The aggregator provides most of its performance improvements in hot cache execution of operations on a single thread. (Recall the <em>active handler</em>?)  Thus, the more concurrent contention on your critical region, the larger the aggregations will be that are assembled, and the greater the benefits of executing operations with a hot cache on a single thread.</p>
<p>If you do find that the basic aggregator improves your code’s performance, consider moving to the expert level.  To begin with, you can simply transform your code as I’ve shown in the expert example above.  This should result in better performance over the basic interface.  The reason for this is that, in the basic interface, the function object or lambda expression you wish to execute and all the references to data that you want that code to access are stored on the stack of the thread that originated the operation. Referring back to the basic example, this means that for each operation, we look up a different reference to the same priority queue.  But, in the expert example above, note that we store just a few data references in the <code>aggregator_operation</code>, and the code to execute the operation and references to the shared data (<code>my_pq</code>) are local to the aggregating functor and only need to be looked up once to handle all the operations in an aggregation.  This enhances the hot cache effect by reducing the quantity of non-local stack accesses.</p>
<p>The expert-level usage of aggregator shown above is quite straightforward.  However, you are free to handle operations in the aggregating handler in whatever manner you like.  Consider the aggregation of operations an opportunity to develop new and interesting serial algorithms.  This gives you a unique opportunity to make use of a kind of <em>lookahead</em> capability: you know the set of operations that you need to perform. For example, Intel® TBB’s <code>concurrent_priority_queue</code> handles the operations in two passes, performing some of them and postponing others, because some orderings of operations are more efficient than others.  The only rules for processing operations in the aggregating handler are that they should all be handled, and, in some cases, there should be some serial sequence of the operations that achieves the same result (i.e. sequential consistency).</p>
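<p>As a toy illustration of this freedom, the sketch below handles one batch of priority queue operations in two passes, performing all pushes before any pop.  It is not how <code>concurrent_priority_queue</code> is implemented; <code>batch_op</code> and <code>handle_batch</code> are hypothetical names, and the batch is modelled as a plain <code>std::vector</code> instead of a list of <code>aggregator_operation</code>s.  Because all operations in a batch are concurrent, "pushes first" is one valid serial ordering.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// One pending operation: a push of *elem, or a pop into *elem.
struct batch_op {
    int* elem;
    bool is_push;
    bool success;
};

// Handle a whole batch with lookahead: pass 1 performs every push,
// pass 2 performs every pop, so pops see all values pushed in this batch.
inline void handle_batch(std::priority_queue<int>& pq, std::vector<batch_op>& ops) {
    for (std::size_t i = 0; i < ops.size(); ++i) {
        assert(ops[i].elem != 0);
        if (ops[i].is_push) {
            pq.push(*ops[i].elem);
            ops[i].success = true;
        }
    }
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (ops[i].is_push) continue;
        ops[i].success = !pq.empty();
        if (ops[i].success) {
            *ops[i].elem = pq.top();
            pq.pop();
        }
    }
}
```

<p>With a batch holding one pop followed by pushes of 7 and 3 on an empty queue, the pop succeeds and receives 7, whereas handling the operations in arrival order would have made the pop fail.  Both outcomes are consistent with some serial order of the three operations.</p>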
<p>I’d like to hear about your experiences using aggregator, so if you get a chance, give it a try, and let me know how it went!  You can comment here, or better yet, start a discussion on the <a href="http://software.intel.com/en-us/forums/intel-threading-building-blocks/">Intel® TBB forum</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scalable Memory Pools: community preview feature</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 13:05:33 +0000</pubDate>
		<dc:creator>Anton Malakhov (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[memory pool]]></category>
		<category><![CDATA[scalable allocator]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[TBB 4.0]]></category>
		<category><![CDATA[tbbmalloc]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/</guid>
		<description><![CDATA[In TBB 4.0, we introduced new community preview feature (CPF) – the scalable memory pools. See the TBB Reference Manual (D.4) for formal and detailed description. In this blog, we will present them less formally and discuss what changes can be made. Motivation We had vague requests from customers to implement a memory pool (Wikipedia [...]]]></description>
			<content:encoded><![CDATA[<p>In TBB 4.0, we introduced a new community preview feature (<a title="About Community Preview Features" href="http://software.intel.com/en-us/articles/intel-tbb-community-preview-features/">CPF</a>): scalable memory pools. See the TBB <a href="http://threadingbuildingblocks.org/documentation.php">Reference Manual</a> (D.4) for a formal and detailed description. In this blog, we present them less formally and discuss what changes could be made.</p>
<h2>Motivation</h2>
<p style="text-align: justify;">We had vague requests from customers to implement a memory pool (Wikipedia calls it <a href="http://en.wikipedia.org/wiki/Region-based_memory_management">region</a>) or some of its properties in the TBB scalable memory allocator. We summarized these requests and general information on memory pools from the Internet and got the following compilation of major properties and abilities:</p>
<ul>
<li>Memory pools basically do the same job as standard memory allocators but additionally group memory objects under the umbrella of a specific pool instance, which enables:
<ul>
<li>fast deallocation of all the memory at once, either on pool destruction or for the sake of further reuse</li>
<li>less memory fragmentation and related synchronization between independent groups</li>
</ul>
</li>
<li>Memory pools allow more control over acquisition and release of memory resources, and may have user-specific sources of memory:
<ul>
<li>memory chunk/buffer of a fixed size</li>
<li>redirection to a specific memory provider, e.g. standard or custom implementation of malloc, big memory pages, memory tied to specific NUMA node, IPC shmem regions.</li>
</ul>
</li>
</ul>
<p style="text-align: justify;">To squeeze out more performance and to fight memory fragmentation, some specific implementations allocate objects of a fixed size only (so-called object pools, e.g. <a href="http://www.boost.org/doc/libs/1_48_0/libs/pool/doc/html/index.html">boost::pool</a>; Wikipedia calls this a <a title="Wiki" href="http://en.wikipedia.org/wiki/Memory_pool">memory pool</a>) or are unable to deallocate individual objects ("arena allocators"). In our implementation, we tried to provide more general functionality in a thread-safe and scalable way. For that purpose, the implementation of the memory pools is based on the TBB scalable memory allocator and so has similar speed and memory consumption properties. Later we may address more specific use cases, based on the feedback.</p>
<h2>Usage</h2>
<p style="text-align: justify;">Our memory pools API consists of two classes for thread-safe memory management: <em>tbb::fixed_pool</em> and <em>tbb::memory_pool</em>. The first is for the simple case in which an already allocated memory block is used for allocation of smaller objects. The second uses a user-specified memory provider to obtain big chunks of memory in which smaller objects reside. As opposed to fixed_pool, memory_pool is able to grow on demand and relinquish unused chunks back to the provider.</p>
<p>Both classes provide familiar methods for allocation and deallocation:</p>
<pre name="code" class="cpp:nogutter:nocontrols">void *ptr = my_pool.malloc( (size_t) 10 );  // allocate 10 bytes
ptr = my_pool.realloc( ptr, (size_t) 12 );  // extend the allocation to 12 bytes
my_pool.free( ptr );                        // deallocate it</pre>
<p>Additionally, there is a method which deallocates all the memory at once, i.e. it is a faster equivalent to a series of calls to my_pool.free() for each pointer obtained in this pool by previous calls to my_pool.malloc():</p>
<pre name="code" class="cpp:nogutter:nocontrols">my_pool.recycle();  // Frees all the memory in the pool for reuse</pre>
<p>Please note that it is not thread-safe to call this method concurrently with other methods on the same instance (similar to the clear() method of containers).<br />
We also provide an STL-compliant allocator class (almost compliant: it lacks a default constructor) to enable the use of pools inside STL containers:</p>
<pre name="code" class="cpp:nogutter:nocontrols">typedef tbb::memory_pool_allocator&lt;int&gt; pool_allocator_t;
std::list&lt;int, pool_allocator_t&gt; my_list( (pool_allocator_t( my_pool )) );</pre>
<p>Now, the only thing that holds us back from the first experiment with this new feature of TBB is the question – how to create the ‘my_pool’.  First, we need to enable this feature and include the header:</p>
<pre name="code" class="plain:nogutter:nocontrols">#define TBB_PREVIEW_MEMORY_POOL 1
#include "tbb/memory_pool.h"</pre>
<p>If you want to create a memory pool on top of your own memory block, pass its address and size in bytes to the constructor of the tbb::fixed_pool class, as in the following excerpt:</p>
<pre name="code" class="cpp:nogutter:nocontrols">char buffer[1024*1024];
// The casts below are just to show the types of arguments.
tbb::fixed_pool my_pool( (void*)buffer, (size_t)1024*1024*sizeof(char) );</pre>
<p style="text-align: justify;">The maximum amount of memory that can be allocated from the pool declared above is limited by the size of the buffer minus some space for control structures. To avoid this limitation, use the tbb::memory_pool template class, specifying a memory provider (discussed later) as its template argument:</p>
<pre name="code" class="cpp:nogutter:nocontrols">tbb::memory_pool&lt; std::allocator&lt;char&gt; &gt; my_pool(/*optionally: allocator instance*/);</pre>
<p style="text-align: justify;">You can specify any STL-compatible allocator as the memory provider (though this is subject to change). It will provide (big) memory chunks for my_pool when necessary. The destructor of the memory_pool class releases all the memory chunks back to the memory provider.</p>
<p>Let’s consolidate our knowledge in one artificial example:</p>
<pre name="code" class="cpp">// Link this with tbbmalloc library
#define TBB_PREVIEW_MEMORY_POOL 1
#include "tbb/memory_pool.h"
#include &lt;list&gt;
#include &lt;stdio.h&gt;

int main() {
    static char buf[1024*1024*4]; // buffer for interim data
    tbb::fixed_pool interim_pool(buf, sizeof(buf)); // pool for temporary objects
    tbb::memory_pool&lt; std::allocator&lt;char&gt; &gt; result_pool; // pool to store the results

    typedef tbb::memory_pool_allocator&lt;int&gt; result_allocator_t; // interface to STL containers
    std::list&lt;int, result_allocator_t&gt; result_list( (result_allocator_t( result_pool )) );

    for(int result = 0, i = 0; i &lt; 100; i++, result = 0) {
        for(int j = 0; j &lt; 1000000; j++) {
            int *p = (int*)interim_pool.malloc(4);
            if( p ) result++; // really dummy :)
        }
        // in real application, here can be some processing of allocated objects
        result_list.push_back(result); // no memory fragmentation here - separate pool
        interim_pool.recycle(); // free all the interim objects
        printf("%d\n", result); // should be the same number on each iteration
    }
} // all the memory is released back implicitly</pre>
<p style="text-align: justify;">The simple part is done, and I hope that you are interested enough to proceed with more complex questions, and tell us what you think about it.</p>
<p style="text-align: justify;">Someone may want to know whether it is possible to construct a pool in memory allocated from another pool. It is possible, but one should take care to destroy the inner pool before the outer pool is destroyed or recycle() is called on it. Do you know a good reason to enable such nesting?</p>
<h2>Memory provider interface</h2>
<p style="text-align: justify;">From an API designer perspective, the memory provider is the most questionable part of the scalable pools API. And since it is yet a community preview feature, you are welcome to influence its design. Curious readers might want to ask questions like the following:</p>
<ul>
<li>what are the requirements for the template argument?</li>
<li>why is std::allocator used as a memory provider?</li>
<li>why is “char” the type used with std::allocator in the examples above?</li>
</ul>
<p style="text-align: justify;">The template argument of tbb::memory_pool accepts a memory provider class that satisfies the minimal requirements of an STL-compatible allocator according to the latest C++11 standard: <strong>allocate</strong> and <strong>deallocate</strong> methods, and a <strong>value_type</strong> definition.</p>
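<p style="text-align: justify;">To make these requirements concrete, here is a minimal sketch of a provider that offers nothing beyond the three members listed above. The class name <code>malloc_provider</code> is hypothetical, and forwarding to <code>std::malloc</code> is only an assumption chosen to keep the example short:</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;cstdlib&gt;
#include &lt;new&gt;

// Hypothetical memory provider; the name is ours, not part of TBB.
class malloc_provider {
public:
    typedef char value_type;  // sizeof(value_type) == 1, so requests are counted in bytes

    // Returns a block of at least the requested size or throws std::bad_alloc.
    void *allocate(std::size_t bytes) {
        void *ptr = std::malloc(bytes);
        if (!ptr) throw std::bad_alloc();
        return ptr;
    }

    // The size is not needed by std::free, so it is ignored here.
    void deallocate(void *ptr, std::size_t /*bytes*/) {
        std::free(ptr);
    }
};</pre>
<p style="text-align: justify;">Assuming the three members above are really all the pool relies on, such a class would take the place of std::allocator&lt;char&gt; in the declarations shown earlier.</p>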
<p style="text-align: justify;">Using std::allocator and compatible classes is perhaps the most straightforward way to enable memory_pool anywhere. However, from an efficiency standpoint it probably does not make much sense, because such allocators are designed for rather small objects while a memory provider should operate with megabytes. For users who don’t care what the memory provider is, it might be better to offer a default one that maps to the system-default way of mapping memory.</p>
<p style="text-align: justify;">And finally, TBB memory pools don’t really need the type of allocation (i.e. <strong><em>char</em></strong> in the declaration of tbb::memory_pool&lt;std::allocator&lt;<strong>char</strong>&gt;&gt;), but rather need to know the granularity of requests to the memory provider. This is not only a specification of the argument types for allocate and deallocate; our implementation also uses this information to determine the size of memory requests to the memory provider. For example, consider big pages, which can be mapped only in chunks of megabytes:</p>
<pre name="code" class="cpp">// A custom memory provider for memory_pool
class big_pages {
public:
    typedef char value_type[2*1024*1024];
    void *allocate(size_t pages) {
        return mmap(0x0UL, pages*2*1024*1024, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    }
    // the pointer type requirement is also actually relaxed
    void deallocate(void *ptr, size_t pages) {
        munmap(ptr, pages*2*1024*1024);
    }
};
// usage:
tbb::memory_pool&lt;big_pages&gt; my_pool;</pre>
<h2>Some food for thought</h2>
<p>The way granularity is specified in line 4 of the above example is not straightforward and can be viewed as confusing. This is the price of an STL-compliant interface for the memory provider, and we are not sure whether it has more pros than cons:</p>
<ul>
<li>STL compatibility is supposed to allow reuse of widely implemented memory allocators.
<ul>
<li>On the other hand, these allocators are usually intended for small allocations, but a pool will need memory chunks of at least hundreds of kilobytes.</li>
</ul>
</li>
<li>In theory, it allows easy nesting of memory pools using our memory_pool_allocator class.
<ul>
<li>But we found that nesting of pools in some other implementations does not mean reusing the memory allocated by the parent pool, but rather a hierarchy of pool objects.</li>
<li>And such a nesting is not yet supported anyway</li>
</ul>
</li>
<li>It is easier to remember the requirements when they are based on a well-known standard interface</li>
<li>Granularity is a property of the memory provider and must be passed along with it</li>
</ul>
<p style="text-align: justify;">As an alternative interface, we are considering making the granularity explicit, but in a separate trait class that needs to be specialized only for memory providers whose allocation granularity is &gt; 1. It is even possible to keep STL compatibility using metaprogramming magic, e.g. by defining the granularity as sizeof(value_type) if value_type is defined.</p>
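<p style="text-align: justify;">A minimal sketch of how such a trait might look is given below. Nothing in it exists in TBB: the names <code>has_value_type</code> and <code>provider_granularity</code> and the three example providers are hypothetical. The trait defaults to a granularity of 1, falls back to sizeof(value_type) when the provider defines value_type, and can be specialized explicitly:</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;

// Detects whether T declares a nested value_type (sizeof-based SFINAE).
template &lt;typename T&gt;
class has_value_type {
    typedef char yes;
    typedef char (&amp;no)[2];
    template &lt;typename U&gt; static yes test(typename U::value_type*);
    template &lt;typename U&gt; static no test(...);
public:
    static const bool value = sizeof(test&lt;T&gt;(0)) == sizeof(yes);
};

// Hypothetical trait: the granularity is 1 unless stated otherwise.
template &lt;typename Provider, bool HasValueType = has_value_type&lt;Provider&gt;::value&gt;
struct provider_granularity {
    static const std::size_t value = 1;
};

// STL-compatible providers: derive the granularity from value_type.
template &lt;typename Provider&gt;
struct provider_granularity&lt;Provider, true&gt; {
    static const std::size_t value = sizeof(typename Provider::value_type);
};

// Example providers (declarations only, for illustration).
struct byte_provider {};                                             // no value_type
struct stl_like_provider { typedef char value_type[2*1024*1024]; };  // STL-style
struct huge_page_provider {};                                        // explicit trait below

// Explicit specialization for a provider with granularity &gt; 1.
template &lt;&gt;
struct provider_granularity&lt;huge_page_provider&gt; {
    static const std::size_t value = 2*1024*1024;
};</pre>
<p style="text-align: justify;">With these definitions, provider_granularity&lt;byte_provider&gt;::value is 1, while both of the other providers report 2*1024*1024.</p>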
<p style="text-align: justify;">Another question is how to introduce alignment in the interface of memory pools. Basically, it can be either aligned_malloc() and aligned_realloc(), or an optional argument for malloc() and realloc() methods.</p>
<p>Also, are the suggested class names good, or do we need to find better names (for instance, "fixed_region" and "dynamic_region" to align with terms of Wikipedia)?</p>
<h2>Feedback is very welcome</h2>
<p>We are very eager to hear what you think about the above and how it could be used in your projects.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My 5 Favorite New Intel® Software Development Product Features of 2011</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/#comments</comments>
		<pubDate>Fri, 16 Dec 2011 18:41:39 +0000</pubDate>
		<dc:creator>Shannon Cepeda (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Intel Cilk Plus]]></category>
		<category><![CDATA[Intel Cluster Studio XE]]></category>
		<category><![CDATA[Intel Software Development Products]]></category>
		<category><![CDATA[Intel VTune Amplifier XE]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[TBB 4.0]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/</guid>
		<description><![CDATA[It's been a big year for us in the Intel Developer Products Division. We released Intel® Cluster Studio XE and Intel® Parallel Studio XE Service Pack 1. We continued to plan and design our products to provide support for the compute continuum. And of course we worked to grow our community of developers. Throughout the [...]]]></description>
			<content:encoded><![CDATA[<p>It's been a big year for us in the Intel Developer Products Division. We released <a href="http://software.intel.com/en-us/articles/intel-cluster-studio-xe/">Intel® Cluster Studio XE</a> and <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-xe/">Intel® Parallel Studio XE Service Pack 1</a>. We continued to plan and design our products to provide support for the compute continuum. And of course we worked to grow our community of developers. Throughout the year there have been several new features and developments in some of my favorite products - below I list my personal top 5 and tell you why. This list is of course heavily biased by my particular area of expertise (performance) and is by no means a complete list of all the new products or features that went into Intel® Software Development products in 2011!  So, without further ado, my favorites:</p>
<p>5. <a href="http://software.intel.com/en-us/articles/intel-cilk-plus-open-source/">Intel® Cilk Plus open source port to GCC</a> - <a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">Intel® Cilk Plus</a> was announced in 2010, and an open source specification has been out since late 2010 as well. However this year we began, along with the open source community, to port Cilk Plus to GCC. Some of the first items ported were the parallelism keywords, which is significant to me because it makes our Cilk Plus parallelism model available to a greater audience.</p>
<p>4. <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe">Intel® VTune™ Amplifier XE</a> and <a href="http://software.intel.com/en-us/articles/intel-inspector-xe/">Intel® Inspector XE</a> MPI Support - In the new Cluster Studio XE product, VTune Amplifier XE and Inspector XE are now MPI-enabled. This is important because we are beginning to see more hybrid programming in the HPC and cluster world - which means the applications use a combination of MPI and another threading model (such as OpenMP, Cilk Plus, or <a href="http://software.intel.com/en-us/articles/intel-tbb/">Intel® Threading Building Blocks</a>). We have an existing product, <a href="http://software.intel.com/en-us/articles/intel-trace-analyzer/">Intel® Trace Analyzer and Collector</a>, that analyzes MPI efficiency for a cluster app, but analyzing performance of an individual process running on an MPI rank was more difficult. Now we make it easier to use VTune Amplifier XE or Inspector XE to analyze the threading model used within a rank, which helps us support more cluster customers. </p>
<p>3. <a href="http://drdobbs.com/tools/231900177">Intel® Threading Building Blocks Flow Graph</a> - I was introduced to flow graph this year, when I worked with my colleague Victoria Gromova to create some TBB labs for Intel Developer Forum. Victoria wanted to highlight flow graph as one of the new features of <a href="http://threadingbuildingblocks.org/">TBB 4.0</a>. Flow graph is a new construct that supports many more types of control algorithms, like dependency graphs, event-based models or reactive-based flows. In short, it opens up TBB to more customers while maintaining or improving the TBB performance we have come to expect. </p>
<p>2. <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-xe/#whatsnew">VTune Amplifier XE attach to running process on Linux*</a> - This is a great example of our development team responding to customer feedback. Being able to analyze a running process for a defined period of time (instead of launching it) has been requested by many of our clients. We first got this implemented on Windows*, then this September provided the feature for Linux* in <a href="http://softtalkblog.com/2011/09/13/intel-parallel-studio-xe-2011-service-pack-1-is-released/">Intel® Parallel Studio XE Service Pack 1</a>. I have already been visiting some users who requested this and it is great to be able to share that the feature they have been asking for is here!</p>
<p>1. <a href="http://software.intel.com/en-us/blogs/2011/06/27/what-weve-been-doing-to-make-performance-analysis-easier-on-intel-microarchitecture-codename-sandy-bridge/">VTune Amplifier XE interface for Intel® Microarchitecture Codename Sandy Bridge</a> - For readers of my blog this one should not be a surprise! I have created <a href="http://software.intel.com/en-us/articles/two-part-webinar-and-two-videos-posted-all-covering-sandy-bridge-performance-tuning/">quite a bit of training material </a>on these new Sandy Bridge features. We now provide an analysis type for Sandy Bridge that helps users easily identify the most common software performance issues at the microarchitectural level, and it includes pre-coded metrics, thresholds, and issue highlighting for usability. This is my favorite new feature because, even though I am not a developer, I got to help a little with making this interface by helping define some performance metrics and thresholds and validating them on workloads. It is very cool to see my contributions in the product.</p>
<p>There you have it! I hope you have a chance to try out some of our new product features now or in the coming year. Let us know your favorites, or your requests.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Parallel: Optimizing Web Performance with TBB</title>
		<link>http://software.intel.com/en-us/blogs/2011/11/16/open-parallel-optimizing-web-performance-with-tbb/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/11/16/open-parallel-optimizing-web-performance-with-tbb/#comments</comments>
		<pubDate>Wed, 16 Nov 2011 22:39:36 +0000</pubDate>
		<dc:creator>Nicolas Erdody</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Power Efficiency]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[HipHop]]></category>
		<category><![CDATA[Intel Software Partner Program]]></category>
		<category><![CDATA[James Reinders]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/11/16/open-parallel-optimizing-web-performance-with-tbb/</guid>
		<description><![CDATA[Open Parallel is a research and development company that focuses on parallel programming and multicore development. We are a bunch of highly skilled geeks from various backgrounds that work together on problems in parallel programming and software development for multicore and manycore platforms. At LinuxConf (LCA2010) James Reinders gave a talk about the Threading Building [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="www.openparallel.com">Open Parallel</a></strong> is a research and development company that focuses on parallel programming and multicore development. We are a bunch of highly skilled geeks from various backgrounds that work together on problems in parallel programming and software development for multicore and manycore platforms.</p>
<p>At LinuxConf (LCA2010) <strong>James Reinders</strong> gave a talk about the Threading Building Blocks (<a href="http://threadingbuildingblocks.org/">TBB</a>) library, a C++ threading library that sets out to make multicore programming more accessible to the average programmer. We took this idea on board and explored the possibilities of opening up this approach to an even wider audience, namely the audience of web application developers working in script languages.</p>
<p>Many websites require a non-trivial amount of per-request processing in the application layer, perhaps to retrieve, consolidate or otherwise manipulate data. Achieving better performance at this level improves response times and the overall user experience. Even when processing time at application level is not critical, parallelizing access to database and web service back-end layers can yield substantial improvements in perceived performance.</p>
<p>This drove our goal of adding TBB support into <strong>PHP</strong> and <strong>Perl</strong>, starting with <strong><a href="http://en.wikipedia.org/wiki/HipHop_%28software%29">HipHop</a></strong> as the PHP implementation of choice and later on adding Perl support to the game.</p>
<p>HipHop is a PHP to C++ cross compiler that was developed by Facebook to cut down on resource needs and speed up the execution times of their gigantic web infrastructure that was started on a classic PHP/MySQL stack and now has to scale to hundreds of millions of users. The HipHop project is a PHP implementation that is thread safe and already uses TBB for some memory management. We started by extending the existing support and added first only the new *parallel_for* function. Later, we added concurrent data structures and re-implemented our first approach.</p>
<p>What we have now is a robust implementation of *parallel_for* and *parallel_reduce* with the data structures needed to support them. What we learned on the way was both very enlightening and, at times, quite frustrating. We reached our aim of making TBB more widely accessible by getting the language extension into HipHop, but we also tried to get it into Zend PHP. This turned out to work only with a language compatibility module that does not provide the full glory we can offer on the HipHop platform. The reason for this is the architecture of the PHP interpreter.</p>
<p>Implementing threading into language interpreters turns out to be very hard. There are two dormant/failed approaches in Perl and every attempt in PHP has failed so far. The core developers on both sides are very much in doubt if it is a path worth going down at all. The problem is global locking and copying/sharing of data structures that are thread local. Our Perl implementation is a starting point that could influence not only the Perl community but other interpreter designers and interpreter developers as well.</p>
<p>In the Perl community we are trying to lobby for a const keyword that would lock a data structure and remove the need to copy it into every thread. The ability to make something immutable is missing in Perl and PHP and this makes the startup cost of any worker thread very expensive. For the Perl library we wrote a lazy clone module that would only clone a data structure if the worker thread really accesses it. That way we only penalize the worker thread for accessing data - we can possibly get around cloning structures at all if they are not accessed within this task.</p>
<p>In our work with the PHP HipHop compiler we also wrote a patch set for WordPress and enhanced WordPress with our new *parallel_for* language extension. This trial brought us instant success in reduced page load times. The patch set for WordPress only replaced some key *foreach* loops with *parallel_for* and was our first real success with the TBB library in PHP. Based on that success we started out to re-implement our initial approach and tidy up our patch set for HipHop to make it more accessible to others.</p>
<p>The Perl project worked towards a Perl module that can be used to get access to TBB functions directly. We also started out to implement the core memory structures and then built on top of those the *parallel_for* functionality. The module we have now is stable enough to demonstrate the gains we can get by using TBB in Perl.</p>
<p>To round the project off, we implemented two little tools as a real-world demo and as working code to look at. The demo is based around the HTML5 geo tag, which is present in modern browsers and can be read with a JavaScript API. In the HipHop version we use it to read the current Lat/Lon from the accessing browser and then parse the Twitter firehose to find tweets with embedded image URLs.</p>
<p>In the Perl demo we query Flickr and fetch a grid of 4x4 images, cache them locally and then render one big image out of scaled versions of the single images. The demos are running on <strong><a href="http://geopic.me">geopic.me</a></strong></p>
<p>To sum up our experience with TBB and script languages: we now know that threading interpreters brings its very own set of challenges, but by using TBB we were able to get further than others did on the same mission. The libraries we produced so far - which are open source and can be found on <strong>our <a href="https://github.com/openparallel/">github</a> account</strong> - will be further developed and maintained.</p>
<p>We will continue working on both platforms to expose the power of multicore CPUs to developers in an approachable way. Along the way we also produced a number of more detailed white papers covering various aspects of the project:</p>
<p>* <strong><a href="http://openparallel.com/2011/05/11/threading-perl-using-tbb-the-cpan-module-and-white-paper/">threads::tbb</a></strong><br />
* <strong><a href="http://openparallel.files.wordpress.com/2010/09/tbb-in-wordpress-oct-10.pdf">TBB in WordPress</a></strong><br />
* <strong><a href="http://openparallel.files.wordpress.com/2010/09/wordpress-on-hiphop-nov-10.pdf">WordPress on HipHop</a></strong></p>
<p>Get in touch if you are interested in these projects or have questions about the work we did. There is further information on our website <strong><a href="www.openparallel.com">OpenParallel.com</a></strong></p>
<p>Contact: <strong><a href="http://openparallel.com/contact-us/">Nicolas Erdody</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/11/16/open-parallel-optimizing-web-performance-with-tbb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to make a pipeline with an Intel® Threading Building Blocks flow graph</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/14/how-to-make-a-pipeline-with-an-intel-threading-building-blocks-flow-graph/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/14/how-to-make-a-pipeline-with-an-intel-threading-building-blocks-flow-graph/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 16:20:14 +0000</pubDate>
		<dc:creator>Michael Voss (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[flow_graph]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/14/how-to-make-a-pipeline-with-an-intel-threading-building-blocks-flow-graph/</guid>
		<description><![CDATA[The Intel® Threading Building Blocks ( Intel® TBB ) flow graph is fully supported in Intel TBB 4.0. If you are unfamiliar with the flow graph, you can read an introduction here. A question was recently submitted about an implementation of a pipeline using a flow graph.  That question made me realize that pointing out some of [...]]]></description>
			<content:encoded><![CDATA[<p>The Intel® Threading Building Blocks ( Intel® TBB ) flow graph is fully supported in Intel TBB 4.0. If you are unfamiliar with the flow graph, you can read an introduction <a href="http://software.intel.com/en-us/blogs/2011/09/08/the-intel-threading-building-blocks-flow-graph-is-now-fully-supported/">here</a>.</p>
<p><a href="http://software.intel.com/en-us/forums/showthread.php?t=85880&amp;o=a&amp;s=lr">A question was recently submitted about an implementation of a pipeline using a flow graph</a>.  That question made me realize that pointing out some of the subtle differences between an Intel TBB pipeline and an Intel TBB flow graph might prevent some confusion among users.</p>
<p>As a running example, I will convert the pipeline example that ships with the Intel TBB distribution, <code>examples/pipeline/square</code>. This example uses the three-stage pipeline shown in Figure 1.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/square_pipeline.png"><img class="aligncenter size-full wp-image-36214" title="square_pipeline" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/square_pipeline.png" alt="" width="441" height="75" /></a></p>
<p><strong>Figure 1: The three-stage pipeline used in examples/pipeline/square.</strong></p>
<p>In the square example, the <code>input</code> filter reads blocks of strings from an input file.  Each block is of a fixed size and when a complete block is read, it is sent to the next filter, <code>transform</code>. The second filter is a parallel filter that allocates an output buffer, converts each string in the input buffer into a <code>long</code>, squares the value, and then writes the result to the output buffer. It passes the output buffer to the final filter, <code>output</code>, which is a serial-in-order filter. The <code>output</code> filter processes buffers one at a time in the order that they were created by the <code>input</code> filter. It dumps the resulting squared numbers to an output file. Because the buffers are processed in-order, the values written to the output file match the ordering of their corresponding values in the input file.</p>
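<p>To make the work of the <code>transform</code> stage concrete, here is a simplified stand-in for it. The function name <code>square_block</code> and the use of standard containers are assumptions made for this sketch; the buffer types of the shipped example are not reproduced here:</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;cstdlib&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Hypothetical stand-in for the transform stage: converts each string in the
// input block to a long, squares it, and writes it to a new output buffer.
std::vector&lt;long&gt; square_block(const std::vector&lt;std::string&gt;&amp; block) {
    std::vector&lt;long&gt; output;
    output.reserve(block.size());
    for (std::size_t i = 0; i &lt; block.size(); ++i) {
        long value = std::strtol(block[i].c_str(), 0, 10);
        output.push_back(value * value);
    }
    return output;
}</pre>
<p>Because each call touches only its own input block and its own output buffer, several blocks can be transformed at the same time, which is why this stage can be a parallel filter.</p>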
<p>Figure 2 shows a straightforward conversion of this example to a flow graph. The input filter becomes a <code>source_node</code>, the second filter becomes a <code>function_node</code> with unlimited concurrency, and the final filter becomes a <code>function_node</code> with a concurrency of 1.   For the most part, the rectangles have become circles :)   But unfortunately, this implementation will both generate incorrect results and perform poorly.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/bad_square_graph.png"><img class="aligncenter size-full wp-image-36218" title="bad_square_graph" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/bad_square_graph.png" alt="" width="314" height="135" /></a></p>
<p><strong>Figure 2: A straightforward (but incorrect) translation to a flow graph.</strong></p>
<p>First, let's address the correctness issue. The final node in our flow graph has a concurrency of 1, and so is serial; however, it operates on the buffers in first-in-first-out order, that is, in the order in which they arrive from the parallel <code>transform</code> node. The pipeline version, on the other hand, used a serial-in-order filter, which allows it to re-establish the order in which the items were generated by the <code>input</code> filter. We can mimic this behavior by adding a <code>sequencer_node</code>, as shown in Figure 3.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/slow_square_graph.png"><img class="aligncenter size-full wp-image-36224" title="slow_square_graph" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/slow_square_graph.png" alt="" width="377" height="142" /></a></p>
<p><strong>Figure 3: A correct (but slow) translation to a flow graph.</strong></p>
<p>Adding a <code>sequencer_node</code> to our flow graph requires that we assign a sequence number to each buffer as it is allocated at the <code>input</code> filter. We must also write a function that returns that sequence number for a given buffer, and provide this function to the <code>sequencer_node</code>.</p>
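<p>To make the idea of in-order release concrete, here is a minimal sketch written with the standard library only. It is not the TBB implementation and not code from the square example: <code>Slice</code>, <code>sequence_of</code>, and <code>InOrderReleaser</code> are hypothetical names that I made up for illustration, and the model is single-threaded and assumes sequence numbers start at 0.</p>
<pre name="code" class="cpp">// Toy model of in-order release by sequence number (illustration only).
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a buffer tagged at the input stage.
struct Slice {
    std::size_t seq;      // sequence number assigned when the buffer is created
    std::string payload;  // stands in for the buffer contents
};

// The kind of function a sequencer needs: map an item to its sequence number.
inline std::size_t sequence_of(const Slice& s) { return s.seq; }

// Items may arrive in any order, but they are handed out strictly in
// sequence-number order, starting from 0.
class InOrderReleaser {
public:
    // Accept an item and return every item that is now ready, in order.
    std::vector<Slice> put(const Slice& s) {
        pending_[sequence_of(s)] = s;
        std::vector<Slice> ready;
        std::map<std::size_t, Slice>::iterator it = pending_.find(next_);
        while (it != pending_.end()) {
            ready.push_back(it->second);
            pending_.erase(it);
            ++next_;
            it = pending_.find(next_);
        }
        return ready;
    }

private:
    std::map<std::size_t, Slice> pending_;  // items waiting for their turn
    std::size_t next_ = 0;                  // next sequence number to release
};</pre>
<p>In this sketch, putting items 2 and 1 returns nothing, because item 0 has not arrived yet; putting item 0 then releases all three in order. That is the behavior the serial <code>output</code> node relies on.</p>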
<p>The flow graph shown in Figure 3 still has a problem. The <code>source_node input</code> is directly attached to a <code>function_node transform</code> that has unlimited concurrency. This means that <code>transform</code> will accept everything that <code>input</code> sends to it. The <code>input</code> node will happily keep allocating new buffers, flooding the flow graph with inputs. In the best case, this will bog down the system resulting in poor performance. In the worst case, this could cause the application to run out of memory and crash.</p>
<p>Such a situation is not possible in an Intel TBB pipeline because of its token-based scheduling mechanism. When a pipeline is run, the user provides a fixed number of tokens to the pipeline. There will be at most one item per token in the pipeline. In the square example, the pipeline is run with 4 tokens per core:</p>
<p><code>pipeline.run( nthreads*4 );</code></p>
<p>A flow graph does not have token-based scheduling. Because nodes in a flow graph can generate multiple outputs, join messages together, and split messages apart, it is difficult, if not impossible, to implement a token-based system that cannot deadlock. However, the flow graph does provide alternative methods for restricting memory use. One of these methods is the use of a <code>limiter_node</code>, as shown in Figure 4.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/square_graph.png"><img class="aligncenter size-full wp-image-36230" title="square_graph" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/square_graph.png" alt="" width="482" height="197" /></a></p>
<p><strong>Figure 4: A correct and efficient translation to a flow graph.</strong></p>
<p>In Figure 4, a <code>limiter_node</code> is inserted between the <code>input</code> filter and the <code>transform</code> filter. A <code>limiter_node</code> has a user-set threshold of messages that it will allow to pass through before rejecting additional messages. It also has a second input port that can be used to decrement its internal count of messages it has forwarded. If the <code>limiter_node</code> in Figure 4 is constructed with a threshold of <code>4*nthreads</code> it will cap memory use similarly to the pipeline implementation. The edge from the <code>output</code> filter back to the <code>limiter_node</code> will signal the node to allow additional inputs in from the <code>input</code> filter.</p>
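<p>The counting behavior described above can be sketched without TBB at all. The class below is a single-threaded toy model, not <code>tbb::flow::limiter_node</code> itself; <code>ThresholdGate</code> and its member names are hypothetical and chosen only for this illustration.</p>
<pre name="code" class="cpp">// Toy model of a threshold gate with a decrement signal (illustration only).
#include <cassert>
#include <cstddef>

class ThresholdGate {
public:
    explicit ThresholdGate(std::size_t threshold)
        : threshold_(threshold), in_flight_(0) {}

    // Returns true if the message may pass, false if it is rejected
    // because the threshold of outstanding messages has been reached.
    bool try_pass() {
        if (in_flight_ >= threshold_) return false;
        ++in_flight_;
        return true;
    }

    // Called when a downstream stage finishes with a message,
    // which makes room for one more message to pass.
    void decrement() {
        if (in_flight_ > 0) --in_flight_;
    }

    std::size_t in_flight() const { return in_flight_; }

private:
    std::size_t threshold_;  // maximum number of outstanding messages
    std::size_t in_flight_;  // messages passed but not yet signalled as done
};</pre>
<p>With a threshold of 2, the third call to <code>try_pass()</code> is rejected until <code>decrement()</code> is called once. In Figure 4, the edge from <code>output</code> back to the limiter plays the role of that <code>decrement()</code> call.</p>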
<p>Although it is not demonstrated in the square example, a pipeline can also have a parallel input filter.  In contrast, a <code>source_node</code> is always serial.  This restriction is again due to the lack of token-based scheduling in the flow graph.  The number of active copies of a pipeline’s input filter is bounded by the number of available tokens.   In a flow graph there would be no such bound, and no practical way to determine an appropriate number of <code>source_node</code> bodies to activate concurrently.    There are ways to mimic the behavior of a parallel input filter in a flow graph, but I’ll leave that discussion for a future blog post.</p>
<p>Below is the snippet of code from <code>examples/pipeline/square</code> that sets up the structure of the Intel TBB pipeline shown in Figure 1. Each of the three filters is created and then added to the pipeline. Finally the pipeline is run with <code>4*nthreads</code> tokens.</p>
<p><code>tbb::pipeline pipeline;</code></p>
<p><code>MyInputFilter input_filter( input_file );</code><br />
<code>pipeline.add_filter( input_filter );</code></p>
<p><code>MyTransformFilter transform_filter;</code><br />
<code>pipeline.add_filter( transform_filter );</code></p>
<p><code>MyOutputFilter output_filter( output_file );</code><br />
<code>pipeline.add_filter( output_filter );</code></p>
<p><code>pipeline.run( nthreads*4 );</code></p>
<p>Code that can be used to set up the structure of the Intel TBB flow graph shown in Figure 4 is shown below. Each of the nodes is created and then edges are added between them. Note that the <code>limiter_node</code> is passed <code>4*nthreads</code> as its threshold, and that a <code>sequencer_node</code> is placed before <code>output</code>. Finally, the <code>source_node</code> is activated and then <code>wait_for_all</code> is called on the flow graph to wait for it to complete.</p>
<p><code>tbb::flow::graph g;</code></p>
<p><code>tbb::flow::limiter_node&lt; TextSlice * &gt; limiter( g, nthreads*4 );</code><br />
<code>tbb::flow::sequencer_node&lt; TextSlice * &gt; sequencer(g, sequencer_body() );</code></p>
<p><code>tbb::flow::source_node&lt; TextSlice * &gt; input( g, MyInputFilter(input_file), false );</code><br />
<code>tbb::flow::function_node&lt; TextSlice *, TextSlice * &gt; transform( g, tbb::flow::unlimited, MyTransformFilter() );</code><br />
<code>tbb::flow::function_node&lt; TextSlice *, tbb::flow::continue_msg &gt; output( g, tbb::flow::serial, MyOutputFilter( output_file ) );</code></p>
<p><code>tbb::flow::make_edge( input, limiter );</code><br />
<code>tbb::flow::make_edge( limiter, transform );</code><br />
<code>tbb::flow::make_edge( transform, sequencer );</code><br />
<code>tbb::flow::make_edge( sequencer, output );</code><br />
<code>tbb::flow::make_edge( output, limiter.decrement );</code></p>
<p><code>input.activate();</code><br />
<code>g.wait_for_all();</code></p>
<p>The flow graph code is clearly more verbose than the pipeline version, but their performance will be comparable. As described <a href="http://software.intel.com/en-us/blogs/2011/09/08/the-intel-threading-building-blocks-flow-graph-is-now-fully-supported/">here</a>, applications that are structured like pipelines will be more easily expressed using the Intel TBB pipeline, but the Intel TBB flow graph has a more flexible API and therefore can express applications that are not amenable to a linear pipeline.</p>
<p>In conclusion, there are subtle but important differences between an Intel TBB pipeline and an Intel TBB flow graph. First, a pipeline supports serial-in-order filters, while the flow graph requires a <code>sequencer_node</code> to get similar behavior.  Second, and more importantly, a pipeline uses token-based scheduling while a flow graph does not. The flow graph API does include alternatives to token-based scheduling, such as the <code>limiter_node</code> discussed in this post.  In an earlier post, <a title="Permanent Link: A feature-detection example using the Intel® Threading Building Blocks flow graph" rel="bookmark" href="http://software.intel.com/en-us/blogs/2011/09/09/a-feature-detection-example-using-the-intel-threading-building-blocks-flow-graph/">A feature-detection example using the Intel® Threading Building Blocks flow graph</a>, I demonstrated another alternative for managing memory use, a reserving join node.  Finally, because of the lack of token-based scheduling, a <code>source_node</code> is always serial, and more complex methods must be used to mimic a pipeline’s parallel input filter.</p>
<p>To learn more about other features in Intel® Threading Building Blocks 4.0, visit <a href="http://www.threadingbuildingblocks.org">http://www.threadingbuildingblocks.org</a>. To learn more about the Intel® TBB flow graph, check out the other blog articles at <a href="http://software.intel.com/en-us/blogs/tag/flow_graph/">http://software.intel.com/en-us/blogs/tag/flow_graph/</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/14/how-to-make-a-pipeline-with-an-intel-threading-building-blocks-flow-graph/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Intel® TBB 4.0 features to simplify Dining Philosophers</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/13/using-intel-tbb-40-features-to-simplify-dining-philosophers/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/13/using-intel-tbb-40-features-to-simplify-dining-philosophers/#comments</comments>
		<pubDate>Tue, 13 Sep 2011 16:00:43 +0000</pubDate>
		<dc:creator>Christopher Huson (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow_graph]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/13/using-intel-tbb-40-features-to-simplify-dining-philosophers/</guid>
		<description><![CDATA[Intel recently released the 4.0 version of Intel® Threading Building Blocks (Intel® TBB), in which most of the flow::graph Community Preview features from 3.0 have been made standard features, and some new nodes have been added as Community Preview features. The time has come to revisit the Dining Philosophers program. One of the new Community [...]]]></description>
			<content:encoded><![CDATA[<p>Intel recently released the 4.0 version of Intel® Threading Building Blocks (Intel® TBB), in which most of the <code>flow::graph</code> Community Preview features from 3.0 have been made standard features, and some new nodes have been added as Community Preview features.  The time has come to revisit the Dining Philosophers program.</p>
<p>One of the new Community Preview nodes is the <code>multioutput_function_node</code>.  This node may be connected to one or more (currently up to 10) ports.  The ports are passed as a <code>std::tuple</code> to each execution of the node’s functor, and during each execution the functor may send items to any combination of output ports or to none at all.</p>
<p>This flexibility lets us change the design of the Dining Philosophers a bit.  <a href="http://software.intel.com/en-us/blogs/2011/01/10/using-the-intel-threading-building-blocks-graph-community-preview-feature-an-implementation-of-dining-philosophers/">In the current implementation</a>, each philosopher has a lot of state: it knows its left and right chopstick queues, its join, and the graph that contains it.  Because a <code>function_node</code> has only one output, and every result is sent to that output, each philosopher had to explicitly put the chopsticks back into the proper queues.  In the original version of Dining Philosophers, the join node for each philosopher was created and added to the graph after the philosopher finished thinking for the first time.</p>
<p>The new version of the philosopher is simpler.  The philosophers now only know about thinking, eating and putting chopsticks back.</p>
<p>The structure of the new graph is:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/dp_ver2.bmp"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/dp_ver2.bmp" alt="graph for Dining Philosophers in 4.0" title="dp_ver2" class="aligncenter size-full wp-image-36160" /></a></p>
<p>How it works:</p>
<ol>
<li>The philosopher thinks (the first <code>function_node</code>), then emits a <code>continue_msg</code>.</li>
<li>The reserving join checks if it has an item at each input.  If it does (the <code>continue_msg</code> plus the left and right chopstick), it assembles them into a <code>std::tuple</code> and forwards it to the eat() <code>multioutput_function_node</code>.</li>
<li>The <code>multioutput_function_node</code> eats, then puts its chopsticks back to their queues.  If the philosopher has not completed M (currently 10) rounds of eating and thinking, it sends a <code>continue_msg</code> to the think() <code>function_node</code>.  If the philosopher has completed M rounds, it does not emit the <code>continue_msg</code>, and no longer thinks or eats.</li>
</ol>
<p>The graph is constructed with N philosophers, and then a <code>continue_msg</code> is sent to each think() <code>function_node</code> to start the graph.  Here is the code for the main loop:</p>
<pre name="code" class="cpp">// create queues of (one) chopstick
chopstick_places_vector_type places(num_philosophers, tbb::flow::queue_node<chopstick>(g));
for ( int i = 0; i < num_philosophers; ++i ) {
    places[i].try_put(chopstick());
}

p_vector philosophers;
// must reserve the vector so no reallocation occurs (we're passing references to the vector elements)
philosophers.reserve(num_philosophers);
for ( int i = 0; i < num_philosophers; ++i )  philosophers.push_back( philosopher( names[i] ) );
    // push_back allowed because the philosopher objects can be assigned

// create think function_nodes
think_node_vector_type think_nodes;
think_nodes.reserve(num_philosophers);
for(int i=0; i < num_philosophers; ++i)
    think_nodes.push_back(new think_node_type(g, tbb::flow::unlimited, think_node_body(philosophers[i])));

// done queues (holds continue_msg)
thinking_done_vector done_vector(num_philosophers, thinking_done(g));

// create join nodes
join_node_vector_type j_vector(num_philosophers,join_node_type(g));

// attach chopstick buffers and think function_nodes to joins
for(int i = 0; i < num_philosophers; ++i) {
    think_nodes[i]->register_successor(done_vector[i]);
    done_vector[i].register_successor(tbb::flow::input_port<0>(j_vector[i]));
    places[i].register_successor(tbb::flow::input_port<1>(j_vector[i])); // left chopstick
    places[(i+1) % num_philosophers].register_successor(tbb::flow::input_port<2>(j_vector[i]));  // right chopstick
}
// create eat multioutput_function_nodes
eat_node_vector_type eat_nodes;
eat_nodes.reserve(num_philosophers);
for(int i = 0; i < num_philosophers; ++i) {
    eat_nodes.push_back( new eat_node_type(g, tbb::flow::unlimited, eat_node_body(philosophers[i])));
    // attach join to mofns
    j_vector[i].register_successor(*(eat_nodes[i]));
    // attach mofns to think function_nodes
    tbb::flow::output_port<0>(*(eat_nodes[i])).register_successor(*(think_nodes[i]));
    // attach mofns to chopstick queues
    tbb::flow::output_port<1>(*(eat_nodes[i])).register_successor(places[i]);
    tbb::flow::output_port<2>(*(eat_nodes[i])).register_successor(places[(i+1) % num_philosophers]);
}

// start all the philosophers thinking
for(int i = 0; i < num_philosophers; ++i) think_nodes[i]->try_put(tbb::flow::continue_msg());

g.wait_for_all();

for(int i = 0; i < num_philosophers; ++i) {
    delete think_nodes[i];
    delete eat_nodes[i];
}</pre>
<p>There are a couple of points to note:</p>
<ul>
<li>A <code>queue_node</code> (created as a vector of <code>queue_node</code>s) is needed between the think() <code>function_node</code> and the reserving join, because the <code>function_node</code> is not reservable (if no successor accepts its output, that output is dropped.)  The <code>queue_node</code> is reservable.  (The reserving join does not buffer at its input ports, because if the join does not accept an input it must be available for other nodes to take.)</li>
<li>Both the think() <code>function_node</code> vector (allocated at line 14) and the eat() <code>multioutput_function_node</code> vector (at line 33) must hold pointers to the nodes rather than the nodes themselves.  This is because <code>flow::graph</code> nodes are generally not assignable, and a push_back() to a vector involves an assignment.  The nodes that do not need to be constructed one by one (the queues at lines 2 and 20, and the join at line 23) can be created by the <code>std::vector</code> constructor, because it uses copy-construction of its elements.</li>
</ul>
<p>The <code>multioutput_function_node</code> is a vital addition to TBB <code>flow::graph</code>; I have found I use it in most of the graphs I have written since its debut.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/13/using-intel-tbb-40-features-to-simplify-dining-philosophers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Buffering Nodes in Graphs in Intel® Threading Building Blocks</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/12/using-buffering-nodes-in-graphs-in-intel-threading-building-blocks/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/12/using-buffering-nodes-in-graphs-in-intel-threading-building-blocks/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 15:59:53 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow_graph]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/12/using-buffering-nodes-in-graphs-in-intel-threading-building-blocks/</guid>
		<description><![CDATA[When using the new flow graph in Intel® Threading Building Blocks (Intel® TBB), we often encounter a situation where a sending node is ready to output a data item, but no receiving nodes are ready to receive that data item. Sending nodes have differing semantics in this regard, but they often just throw data away [...]]]></description>
			<content:encoded><![CDATA[<p>When using the new flow graph in Intel® Threading Building Blocks (Intel® TBB), we often encounter a situation where a sending node is ready to output a data item, but no receiving nodes are ready to receive that data item.  Sending nodes have differing semantics in this regard, but they often just throw data away when no receiver accepts it.  In some cases, we want to process every piece of data generated by a sender.  This requires that the sender have as successor a node that always accepts data items.  Further, that receiver must buffer the data items that it receives, and hand them out one at a time to receivers when those receivers are ready for them.</p>
<p>Intel® TBB provides four graph nodes that do unbounded buffering. They are:</p>
<ul>
<li><em>buffer_node</em>: this type of node simply buffers items, and hands them to receivers in no particular order.</li>
<li><em>queue_node</em>: this type of node buffers items, and hands them to receivers in FIFO order.</li>
<li><em>priority_queue_node</em>: this type of node buffers items that have a priority value associated with them, and hands them to receivers in highest-priority-first order.</li>
<li><em>sequencer_node</em>: this type of node buffers items that have a sequence number associated with them, and hands them to receivers in sequence number order;  while the next item in the sequence is unavailable, the <em>try_get</em> operation fails.</li>
</ul>
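<p>As a rough illustration of how the hand-out order differs between a FIFO buffer and a priority buffer, here is a small sketch that uses only standard-library containers. It is not TBB code: <code>drain_fifo</code> and <code>drain_by_priority</code> are hypothetical helper names, and the sketch assumes that the <code>int</code> value itself serves as the priority, with larger values meaning higher priority.</p>
<pre name="code" class="cpp">// Toy comparison of FIFO and highest-priority-first hand-out order.
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Drain a FIFO buffer: items come out in arrival order.
std::vector<int> drain_fifo(const std::vector<int>& arrivals) {
    std::queue<int> q;
    for (std::size_t i = 0; i < arrivals.size(); ++i) q.push(arrivals[i]);
    std::vector<int> out;
    while (!q.empty()) {
        out.push_back(q.front());
        q.pop();
    }
    return out;
}

// Drain a priority buffer: the largest value (highest priority) comes out first.
std::vector<int> drain_by_priority(const std::vector<int>& arrivals) {
    std::priority_queue<int> pq(arrivals.begin(), arrivals.end());
    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top());
        pq.pop();
    }
    return out;
}</pre>
<p>For arrivals 3, 1, 2, the FIFO version hands the items out as 3, 1, 2, while the priority version hands them out as 3, 2, 1. A plain <em>buffer_node</em> makes no ordering promise at all, so code that uses it should not depend on either order.</p>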
<p>Another possible default sending behavior of some nodes is that a data item is broadcast to multiple receivers, and if multiple receivers are available, this effectively duplicates the data item. We may require that every piece of data generated by a sending node be handled by at most one receiving node.  In this case, we can place a single buffering node after the sender to prevent the duplication of data items.</p>
<p><strong>Connecting buffering nodes with other graph nodes</strong></p>
<p>If we connect a buffering node as a successor to another graph node, the buffering node always accepts anything that is sent to it.  Thus, the underlying behavior is that the predecessor to the buffering node calls <em>try_put</em> on the buffering node whenever it has output to send.  In other words, the predecessor always <em>pushes</em> data to the buffering node.  The buffering node never goes into a <em>pull</em> state, that is, it never needs to call <em>try_get</em> on its predecessor.</p>
<p>On the output side of buffering nodes, the nature of the relationship between the buffering node and its successors can alternate between the <em>push</em> and <em>pull</em> states over time.  When successors to a buffering node are idle, they are registered as successors of the buffering node, so that when the buffer receives content, the buffering node pushes that content to a waiting successor with a <em>try_put</em> operation.  If all successors to a buffering node are busy, they may not be registered with the buffering node.  In this case, when a successor becomes free, it tries to <em>pull</em> content from the buffering node via a <em>try_get</em> operation.  Should this operation fail, the successor registers itself with the buffering node so that it can receive content later when it is available.  More detail on alternating push/pull behavior is available in Mike Voss’s blog post <a href="http://software.intel.com/en-us/blogs/2011/05/26/understanding-the-internals-of-tbbgraph-balancing-push-and-pull/">here</a>.</p>
<p>The behaviors described above are the implicit behavior of connections made with buffering nodes via the <em>make_edge</em> function.  Other nodes may also explicitly call <em>try_put</em> and <em>try_get</em> on buffering nodes to transfer content.  They can even make and sever temporary connections to the buffering nodes using the <em>register</em> and <em>remove</em> operations described in the reference manual.  When such connections are made, the buffering nodes assume their role as described above: as an always-accepting successor and as a changeable push/pull predecessor as determined by content availability.</p>
<p><strong>Making reservations on buffering nodes</strong></p>
<p>Each of these nodes allows reservation: a successor to a buffering node can reserve an item and decide whether or not to use it later.  The procedure for reservation from the receiver’s perspective is the same for all buffering nodes: a receiver calls <em>try_reserve</em> on its predecessor, and if no item is available, the reserve operation fails. If a reservable item is available, it gets a copy of the item to examine.  Later, it may release the reservation by calling <em>try_release</em>, in which case the item stays in the buffer and is made available to other receivers, or it may consume the reservation by calling <em>try_consume</em>, in which case the item is removed from the buffer.  From the buffering node’s perspective there are a few differences:</p>
<ul>
<li>On a reserve, the <em>buffer_node</em> sets aside an item (not necessarily the same item as if a <em>try_get</em> was called instead of a <em>try_reserve</em>), and continues processing other <em>try_get</em> operations from other receivers, handing out non-reserved items from the buffer.</li>
<li><em>queue_node</em>s and <em>priority_queue_node</em>s reserve their first or highest priority item, respectively.  Get operations fail while a reservation is held.</li>
<li><em>sequencer_node</em>s check if the item with the next sequence number is available and if so, it is reserved.  If not, the reserve operation fails.  Here also, get operations fail while a reservation is held.</li>
</ul>
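<p>The reservation rules for a FIFO buffer can be sketched as a tiny state machine. The class below is a single-threaded toy model with hypothetical names (<code>ReservableQueue</code> and its members, which merely borrow the operation names used above); it is not <em>queue_node</em> and it ignores concurrency entirely.</p>
<pre name="code" class="cpp">// Toy model of reserve/release/consume on a FIFO buffer (illustration only).
#include <cassert>
#include <deque>

class ReservableQueue {
public:
    ReservableQueue() : reserved_(false) {}

    void put(int v) { items_.push_back(v); }

    // Plain get: fails if the buffer is empty or a reservation is held.
    bool try_get(int& v) {
        if (reserved_ || items_.empty()) return false;
        v = items_.front();
        items_.pop_front();
        return true;
    }

    // Reserve the first item: the caller gets a copy, the item stays buffered.
    bool try_reserve(int& v) {
        if (reserved_ || items_.empty()) return false;
        v = items_.front();
        reserved_ = true;
        return true;
    }

    // Give the reserved item back; it becomes available to others again.
    bool try_release() {
        if (!reserved_) return false;
        reserved_ = false;
        return true;
    }

    // Commit to the reserved item; it is removed from the buffer.
    bool try_consume() {
        if (!reserved_) return false;
        items_.pop_front();
        reserved_ = false;
        return true;
    }

private:
    std::deque<int> items_;  // buffered items, front is the oldest
    bool reserved_;          // true while the front item is reserved
};</pre>
<p>In this model, a <code>try_get</code> issued while the front item is reserved fails, a release leaves the item in place for the next receiver, and a consume removes it. A model of <em>buffer_node</em> would differ in that gets for other, non-reserved items could still succeed during a reservation.</p>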
<p><strong>Example</strong></p>
<p>To illustrate the usage of buffering nodes, we provide a simple example that addresses the binpacking problem.  For this problem, we are given <em>N</em> objects with weights <em>v1,…,vN</em>.  The problem is to pack the objects into as few bins as possible, where each bin has capacity <em>V</em>.  This is an <em>NP</em>-hard problem, but there are many ways to achieve near-optimal solutions.</p>
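<p>For readers who have not met the problem before, here is a small sequential heuristic, sketched with the standard library only. It is not the algorithm used by the parallel example in this post; <code>pack_best_fit</code> is a hypothetical name, and the sketch assumes that every weight is at most <em>V</em>. It sorts the weights from largest to smallest and places each one in the fullest bin that still has room.</p>
<pre name="code" class="cpp">// Sequential best-fit-decreasing sketch for bin packing (illustration only).
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

typedef std::size_t value_type;          // the type of items we are packing
typedef std::vector<value_type> bin;     // a bin is a vector of value_types

// Returns the packed bins. Assumes every weight is <= V.
std::vector<bin> pack_best_fit(std::vector<value_type> weights, value_type V) {
    std::sort(weights.begin(), weights.end(), std::greater<value_type>());
    std::vector<bin> bins;
    std::vector<value_type> free_space;  // remaining capacity of each bin
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const value_type w = weights[i];
        std::size_t best = bins.size();  // bins.size() means "no bin found yet"
        for (std::size_t b = 0; b < bins.size(); ++b) {
            if (free_space[b] >= w &&
                (best == bins.size() || free_space[b] < free_space[best])) {
                best = b;                // tightest bin seen so far that fits w
            }
        }
        if (best == bins.size()) {       // nothing fits: open a new bin
            bins.push_back(bin());
            free_space.push_back(V);
        }
        bins[best].push_back(w);
        free_space[best] -= w;
    }
    return bins;
}</pre>
<p>For the weights 4, 8, 1, 4, 2, 1 and <em>V</em> = 10, this sketch fills two bins, which happens to be optimal for that input because the weights sum to 20. A heuristic like this gives no such guarantee in general.</p>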
<p>For our implementation, we used a <em>source_node</em> to generate the <em>N</em> objects, and these are placed in a value pool (a <em>queue_node</em>) where they wait until they can be processed.  The bulk of the algorithm is carried out by concurrent bin packers (<em>function_node</em>s).  Each bin packer is serial, and is attempting to pack its own bin using a best-fit greedy relaxation approach.  These nodes have three output modes: they may send out a completely or partially filled bin to a bin buffer (<em>buffer_node</em>), where the bin waits to be accounted for in a final phase, or they may reject the object that was received, returning it to the value pool, or they may output nothing, which implies that the object was added to the bin, but the bin is not yet ready.  Bins sent to the bin buffer are picked up by a final <em>function_node</em> that may output bin information and compile a summary of all the bins.  The graph for this binpacking algorithm thus looks like this:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/08/binpacker.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/08/binpacker.jpg" alt="" title="binpacker" width="635" height="347" class="aligncenter size-full wp-image-35225" /></a></p>
<p>The choice of a <em>queue_node</em> for the value pool was motivated by the fact that we did not want a bin packer to reject an object by putting it back into the pool, only to pick it up again immediately. Thus the FIFO ordering in <em>queue_node</em> would reduce the likelihood of this.  The bin buffer is simply a <em>buffer_node</em> since the order in which bins are collected in the final phase is irrelevant.</p>
<p>Below we show a code excerpt that sets up this graph. Note how the <em>make_edge</em> calls in the code below correspond to the solid black lines in the diagram.  The implicit connections are made by calls to <em>try_put</em> within the algorithm used by the <em>bin_packer</em> nodes (code not shown here). The graph nodes that are created below use certain function objects to accomplish their goals.  In particular, <em>the_source</em> uses an <em>item_generator</em>, <em>bin_packers</em> use <em>bin_fillers</em>, and <em>the_writer</em> uses <em>bin_printer</em>.  </p>
<p><code>typedef size_t value_type; // the type of items we are packing</code><br />
<code>typedef vector < value_type > bin; // a bin is a vector of value_types</code><br />
<code>// Bin_packers are function_nodes that receive value_type items; </code><br />
<code>// they implicitly send packed bins to the_bin_buffer, and return</code><br />
<code>// unused value_type items back to the_value_pool:</code><br />
<code>typedef function_node < value_type > bin_packer; </code><br />
<code>// the_value_pool is represented by a queue_node:</code><br />
<code>typedef queue_node < value_type > value_pool;</code><br />
<code>// the_bin_buffer is represented by a buffer_node:</code><br />
<code>typedef buffer_node < bin > bin_buffer;</code><br />
<code>// the_writer is represented by a function_node:</code><br />
<code>typedef function_node < bin, bin > bin_writer; </code><br />
<code>// the_source is represented by a source_node:</code><br />
<code>typedef source_node < value_type > value_source;</code><br />
<code>bin_packer **bins;   // the array of bin packers</code><br />
...<br />
<code>graph g; </code><br />
<code>value_source the_source(g, item_generator(), false);</code><br />
<code>value_pool the_value_pool(g);</code><br />
<code>make_edge(the_source, the_value_pool);</code><br />
<code>bin_buffer the_bin_buffer(g);</code><br />
<code>bins = new bin_packer*[num_bin_packers];</code><br />
<code>for (size_t i=0; i<num_bin_packers; ++i) {</code><br />
<code>    bins[i] = new bin_packer(g, 1, </code><br />
<code>        bin_filler(i, &#038;the_value_pool, &#038;the_bin_buffer));</code><br />
<code>    make_edge(the_value_pool, *(bins[i]));</code><br />
<code>}</code><br />
<code>bin_writer the_writer(g, 1, bin_printer());</code><br />
<code>make_edge(the_bin_buffer, the_writer);</code><br />
<code>the_source.activate();</code><br />
<code>g.wait_for_all();</code></p>
<p>The details of the implementation are available in the collection of Intel® TBB examples for flow graph usage that comes with Intel® TBB.  Visit <a href="http://threadingbuildingblocks.org">threadingbuildingblocks.org</a> to download.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/12/using-buffering-nodes-in-graphs-in-intel-threading-building-blocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Flow Graph Feature in TBB 4.0 with Michael Voss - Parallel Programming Talk #121</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/10/flow-graph-feature-in-tbb-40-with-michael-voss-parallel-programming-talk-121/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/10/flow-graph-feature-in-tbb-40-with-michael-voss-parallel-programming-talk-121/#comments</comments>
		<pubDate>Sun, 11 Sep 2011 00:38:22 +0000</pubDate>
		<dc:creator>Kathy Farrel (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Dr. Clay Breshears]]></category>
		<category><![CDATA[MIke Voss]]></category>
		<category><![CDATA[ParallelProgrammingTalk]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/10/flow-graph-feature-in-tbb-40-with-michael-voss-parallel-programming-talk-121/</guid>
		<description><![CDATA[It’s time for Parallel Programming Talk This is show  #121– Clay and I will be talking with Intel Software Architect Mike Voss But first the news: The News There is now an open source implementation of Cilk(TM) Plus based on gcc 4.7. Can you explain why this is good news?  Information on how to contribute is [...]]]></description>
			<content:encoded><![CDATA[<p><iframe src="http://blip.tv/play/g5FLgtHYbQA.html" width="640" height="360" frameborder="0" allowfullscreen></iframe><embed type="application/x-shockwave-flash" src="http://a.blip.tv/api.swf#g5FLgtHYbQA" style="display:none"></embed></p>
<p>It’s time for Parallel Programming Talk. This is show #121. Clay and I will be talking with Intel Software Architect Mike Voss.</p>
<p>But first the news:</p>
<p><strong>The News</strong></p>
<p>There is now an open source implementation of Cilk(TM) Plus based on gcc 4.7. Can you explain why this is good news?  Information on how to contribute is at: <a href="http://software.intel.com/en-us/articles/contribute-to-intel-cilk-plus/">http://software.intel.com/en-us/articles/contribute-to-intel-cilk-plus/</a> Source is at: <a href="http://software.intel.com/en-us/articles/download-intel-cilk-plus-source/">http://software.intel.com/en-us/articles/download-intel-cilk-plus-source/</a> I’ll also include links to a couple of hot blogs on the subject – both are drawing high site traffic!</p>
<ul>
<li> <a href="http://software.intel.com/en-us/blogs/2011/08/09/parallelism-as-a-first-class-citizen-in-c-and-c-the-time-has-come/">http://software.intel.com/en-us/blogs/2011/08/09/parallelism-as-a-first-class-citizen-in-c-and-c-the-time-has-come/</a> </li>
<li><a href="http://software.intel.com/en-us/blogs/2009/08/03/hello-lambdas-c-0x-a-quick-guide-to-lambdas-in-c/">http://software.intel.com/en-us/blogs/2009/08/03/hello-lambdas-c-0x-a-quick-guide-to-lambdas-in-c/</a></li>
</ul>
<p> The Intel Developer Forum is less than a month away– September 13-15 at the Moscone Center in San Francisco <a href="http://www.intel.com/idf/">http://www.intel.com/idf/</a>  Clay and I have some things planned – we’ll  both be interviewing folks – a few Parallel Programming Talk shows will be done and we will have them available for viewing shortly thereafter.  ISN is doing something new at IDF – we are sponsoring a lab, being run by our newest Community Black Belt Developer Noah Clemons.  </p>
<p>                         <strong>  Faces of Parallelism Open Lab: Parallel Models for Multi/Many Core</strong></p>
<ul>
<li> Join this open lab whenever you want. It will be open on Wednesday, September 14, from 1:05 to 5:15pm. All levels of experience are welcome. You will experience Intel® Software Programming Tools and the wide variety of programming models supported. This hands-on lab gives attendees the opportunity to see how the latest Intel® silicon features are unlocked via Intel’s optimized Software Tools product line. You will have the opportunity to provide feedback on your experience, by blog or video (we will provide everything needed), and those who choose to do so will be entered into a contest. Four prizes will be awarded. Don't miss this one.
<p><strong>Read more about the lab, contest, etc. Session ID: SFTL004</strong></p></li>
</ul>
<div><strong> Threading Challenge</strong> - <a href="http://software.intel.com/en-us/articles/intel-threading-challenge-2011-winners/">Intel® Threading Challenge 2011 Phase 1 Problem 2 Apprentice and Master Winners Announced.</a>  Master Level Winner - Akshay Singh, India, Apprentice Level - Rick LaMont, USA </div>
<div> </div>
<div><strong>OOPSLA</strong> (Now a part of SPLASH - <strong>Systems, Programming, Languages and Applications: Software for Humanity</strong>.) October 22-27 in Portland  - a panel discussion on “Multicore, Manycore, and Cloud Computing: Is a new programming language paradigm required?<strong>” </strong><a href="http://splashcon.org/2011/program/panels/229-multicore-manycore-and-cloud-computing-is-a-new-programming-language-paradigm-required">http://splashcon.org/2011/program/panels/229-multicore-manycore-and-cloud-computing-is-a-new-programming-language-paradigm-required</a> </div>
<div>
<p>The Intel Academic Community will soon be hosting new rounds of <a href="http://software.intel.com/en-us/articles/parallelism-content-awards/">microgrant funding to create parallel programming training material</a>. The first round will be focused on Data Structures. If you’re in academia and have some ideas about teaching parallelism, go to the IAC microgrant site for more information. I’ll have the URL in the show notes or you can find it from the academic community homepage.</p>
</div>
<p>If you have comments, questions, suggestions for guests or show topics, news to share that you think would be of interest, we’d love to hear from you. Clay, where can they send those ideas?</p>
<p><strong>K:</strong> Now for something we know you’ll really like – our guest.</p>
<p>Questions asked during the show:</p>
<ul>
<li>Welcome to Parallel Programming Talk – Michael. Before we get into the TBB Flow Graph discussion, could you tell us a little about yourself – background, what do you do at Intel</li>
<li><strong>C:</strong> In case our viewers don’t know, what is TBB and why are Flow Graphs important – how does this compare with the TBB Graph API? </li>
</ul>
<p>              M: What used to be called “the TBB graph API” is now the “TBB flow graph”.</p>
<ul>
<li>K: What is this used for?</li>
</ul>
<p>               M:The TBB flow graph can express acyclic dependency graphs, as well as acyclic and cyclic messaging graphs.</p>
<ul>
<li>C: How long has this feature existed? Who has used it? What kind of response have you received?</li>
</ul>
<p>              M: Has been a Community Preview feature since Intel® TBB 3.0 U5 (Dec 2010) and now is a full feature in TBB 4.0 (? September 2011 ?) Since Dec 2010, evaluated by customers across media, gaming,       financial services and technical computing</p>
<ul>
<li>K: What were customers doing before the flow graph?</li>
</ul>
<p>              M: They forced apps into the linear Intel® TBB pipeline, built their own abstractions over Intel® TBB tasks, or built their own thread-based graph libraries.</p>
<ul>
<li>C: What can flow graphs be used for? Do you have any examples you can share?</li>
</ul>
<p>              M:A flow graph is made of a graph object, nodes and edges.  The nodes may execute user code, buffer or direct messages.  The graph object is the parent of all of the tasks executing in the graph.  The edges make the connections between nodes explicit. </p>
<p>Watch Video to see the examples</p>
<p>             M: Some apps that fit a flow graph can already be expressed using the TBB pipeline or a graph of tasks.</p>
<ul>
<li>If an app can be fit into a pipeline
<ul>
<li>the flow graph version and pipeline version usually have similar performance</li>
<li>The pipeline version will tend to require less code, since the edges are implicit and there is no need for join or split nodes</li>
<li>If an app can be fit into a graph of tasks
<ul>
<li>the flow graph version and task graph version usually have similar performance</li>
<li>the flow graph version will be simpler to implement and require less code</li>
<li>Many apps are impractical to express as a pipeline or graph of tasks, but can be handled by the TBB flow graph</li>
</ul>
</li>
</ul>
</li>
<li>C:Can viewers try this out for themselves?</li>
<li>K: Michael, thanks for being our guest today – how can our community members  learn more?</li>
</ul>
<p>                    M: Lots of blogs will be available soon at <a href="http://software.intel.com/en-us/blogs/tag/flow_graph">http://software.intel.com/en-us/blogs/tag/flow_graph</a></p>
<p>If you have comments, questions, or suggestions for guests or show topics, send those ideas to <a href="mailto:parallelprogrammingtalk@intel.com">parallelprogrammingtalk@intel.com</a>.</p>
<p>Today’s show was posted September 9. There will not be a new show on September 27.  We will have a number of recordings and special episodes available soon after. Weekly Tuesday morning streamed shows will resume taping on September 20 with a release on the following Friday. Watch the calendar on the Parallel Programming Home page for the latest info on Parallel Programming Talk and additional Community Events.</p>
<p><strong>A wise old programmer once told me - a trouble parallelized is a trouble halved</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/10/flow-graph-feature-in-tbb-40-with-michael-voss-parallel-programming-talk-121/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Implementing a wave-front computation using the Intel® Threading Building Blocks flow graph</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/09/implementing-a-wave-front-computation-using-the-intel-threading-building-blocks-flow-graph/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/09/implementing-a-wave-front-computation-using-the-intel-threading-building-blocks-flow-graph/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 15:19:31 +0000</pubDate>
		<dc:creator>Michael Voss (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[flow_graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/09/implementing-a-wave-front-computation-using-the-intel-threading-building-blocks-flow-graph/</guid>
		<description><![CDATA[The Intel® Threading Building Blocks ( Intel® TBB )  flow graph is fully supported in Intel® TBB 4.0.  If you are unfamiliar with the flow graph, you can read an introduction here. One node type available for use with the flow graph is continue_node&#60;T&#62;.  This node type is designed for implementing dependency graphs, where nodes wait [...]]]></description>
			<content:encoded><![CDATA[<p>The Intel® Threading Building Blocks ( Intel® TBB )  flow graph is fully supported in Intel® TBB 4.0.  If you are unfamiliar with the flow graph, you can read an introduction <a href="http://software.intel.com/en-us/blogs/2011/09/08/the-intel-threading-building-blocks-flow-graph-is-now-fully-supported/">here</a>.</p>
<p>One node type available for use with the flow graph is <code>continue_node&lt;T&gt;</code>.  This node type is designed for implementing dependency graphs, where nodes wait for their predecessors to complete before beginning their own work.   A <code>continue_node</code> does not receive data messages from its predecessors, but instead counts the number of <code>continue_msg</code> signals that it receives.  Once it receives P messages, one for each predecessor, it executes its body which generates an output message of type <code>T</code>.  Often, the output type is also a <code>continue_msg</code> but it need not be. </p>
<p>Pictorially, we draw a <code>continue_node</code> as below:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/08/cnode.png"><img class="size-full wp-image-35883 aligncenter" title="cnode" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/08/cnode.png" alt="" width="120" height="93" /></a></p>
<p style="text-align: left;"> </p>
<p>This symbol tries to convey important properties about a <code>continue_node</code>.  The input arc has lines above it to indicate that it counts incoming messages.  The interior of the circle contains <code>f()</code> to indicate that the body is a functor that is passed no argument.</p>
<p>Figure 1 shows an approach to implementing a wave front computation using a set of <code>continue_node</code> objects.  In this example, each computation must wait for the computation above it and the computation to its left to complete before it can start executing.  Most nodes have two predecessors and therefore will not start executing until they receive two <code>continue_msg</code> messages.  Nodes on the top and left edges have only a single predecessor and therefore wait for only a single message to arrive.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/08/wave_picture.bmp"><img class="aligncenter size-full wp-image-35879" title="wave_picture" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/08/wave_picture.bmp" alt="" width="428" height="314" /></a></p>
<p><strong>Figure 1: Using an Intel® Threading Building Blocks flow graph to express a wave-front calculation</strong></p>
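<p>As a standalone aside, the counting rule behind Figure 1 can be modelled serially without Intel® TBB. The sketch below is only an illustration under stated assumptions, not how the library is implemented: the names <code>Cell</code> and <code>wavefront_order</code> are hypothetical, and a plain queue stands in for the task scheduler. Each cell waits for one signal per predecessor and becomes ready when its counter reaches zero.</p>

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical model (not part of Intel TBB): a cell is a (row, column) pair.
typedef std::pair<int, int> Cell;

// Returns one valid order in which the cells of a rows x cols wave-front grid
// become ready, using per-cell predecessor counts in the spirit of continue_node.
std::vector<Cell> wavefront_order(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    // remaining[i][j] = number of signals cell (i,j) still waits for:
    // one from the cell above (if any) and one from the cell to the left (if any).
    std::vector<std::vector<int> > remaining(rows, std::vector<int>(cols, 0));
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            remaining[i][j] = (i > 0) + (j > 0);

    std::vector<Cell> order;
    std::queue<Cell> ready;
    ready.push(Cell(0, 0));  // the only cell without predecessors
    while (!ready.empty()) {
        Cell c = ready.front();
        ready.pop();
        order.push_back(c);
        int i = c.first, j = c.second;
        // "Signal" the cell below and the cell to the right; a cell becomes
        // ready only when its last outstanding signal arrives.
        if (i + 1 < rows && --remaining[i + 1][j] == 0) ready.push(Cell(i + 1, j));
        if (j + 1 < cols && --remaining[i][j + 1] == 0) ready.push(Cell(i, j + 1));
    }
    return order;
}
```

<p>Calling <code>wavefront_order(2, 2)</code> visits the top-left cell first and the bottom-right cell last; the two cells in between could run in either order, which is the parallelism a flow graph can exploit.</p>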
<p>I'll now provide the complete code necessary to implement an example that performs such a computation using an Intel® TBB flow graph. In this example, the computation at each node will update a block of a 2-dimensional matrix. If the values held by an element's left and upper neighbors are equal, then the element's value will be set to 2 times that value. Otherwise, the element's value will be set to the maximum of the two values. The top-left element is initialized with a value of 1, and a missing neighbor on the top or left edge counts as 0. So for a 5x5 matrix, the results would be:</p>
<p><code>1 1 1 1 1</code><br />
<code>1 2 2 2 2</code><br />
<code>1 2 4 4 4</code><br />
<code>1 2 4 8 8</code><br />
<code>1 2 4 8 16</code></p>
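<p>As a quick sanity check, the update rule can be applied serially. The sketch below is a standalone reference, separate from the flow graph listing; the function name <code>wavefront_serial</code> is hypothetical and it uses <code>std::vector</code> rather than the raw arrays used later in this post.</p>

```cpp
#include <algorithm>  // std::max
#include <cassert>
#include <vector>

// Hypothetical serial reference (the name is not from the article): fills an
// n x n matrix row by row using the update rule described above.
std::vector<std::vector<double> > wavefront_serial(int n) {
    assert(n > 0);
    std::vector<std::vector<double> > v(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == 0 && j == 0) { v[i][j] = 1.0; continue; }
            // Missing neighbours on the top and left edges count as 0.
            double up = (i == 0) ? 0.0 : v[i - 1][j];
            double left = (j == 0) ? 0.0 : v[i][j - 1];
            v[i][j] = (up == left) ? 2 * up : std::max(up, left);
        }
    }
    return v;
}
```

<p>For <code>n = 5</code> this reproduces the table above, with 16 in the bottom-right corner. A row-by-row sweep is just one order that respects the dependencies; the flow graph version is free to run independent blocks concurrently.</p>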
<p>The code below includes the necessary headers, defines some parameters, and defines the function <code>calc</code> that performs the calculation on each matrix element. The variables <code>M</code> and <code>N</code> define the size of the matrix. The dimension of the blocks computed in each node is given by the block size, <code>B</code>, and the number of blocks in each dimension is computed and stored in <code>MB</code> and <code>NB</code>. The 2-D matrix, <code>value</code>, is where the results will be stored.</p>
<p><code>#include &lt;algorithm&gt; // for std::max</code><br />
<code>#include &lt;cstdio&gt;</code></p>
<p><code>#include "tbb/flow_graph.h" </code></p>
<p><code>using namespace tbb;</code><br />
<code>using namespace tbb::flow;</code></p>
<p><code>int M=1000, N=1000;</code><br />
<code>int B = 100;</code><br />
<code>int MB = (M/B) + (M%B&gt;0);</code><br />
<code>int NB = (N/B) + (N%B&gt;0);</code></p>
<p><code>double **value;</code></p>
<p><code>inline double calc( double v0, double v1 ) {</code><br />
<code>  if ( v0 == v1 )</code><br />
<code>    return 2*v0;</code><br />
<code>  else</code><br />
<code>    return std::max(v0,v1);</code><br />
<code>}</code></p>
<p>The code below builds the flow graph that will apply the function <code>calc</code> to the matrix blocks, while respecting the dependencies shown in Figure 1. The 2-D array, <code>node</code>, is used to hold pointers to the <code>continue_node</code> objects. In <code>BuildGraph</code>, the doubly-nested for loop allocates the <code>continue_node</code> objects. Each node is constructed with a reference to the graph object <code>g</code> and a lambda expression that applies <code>calc</code> to its corresponding block of elements. After each node is created, edges are made from it to its successors in the graph, setting up the required dependencies. Note that the loop indices move from the bottom right of Figure 1 to the top left, so each node's successors are allocated before it.</p>
<p><code>continue_node&lt;continue_msg&gt; ***node;</code></p>
<p><code>void BuildGraph( graph &amp;g ) {</code><br />
<code>  value[M-1][N-1] = 0;</code><br />
<code>  for( int i=MB; --i&gt;=0; )</code><br />
<code>    for( int j=NB; --j&gt;=0; ) {</code><br />
<code>      node[i][j] =</code><br />
<code>        new continue_node&lt;continue_msg&gt;( g,</code><br />
<code>                         [=]( const continue_msg&amp; ) {</code><br />
<code>                           int start_i = i*B;</code><br />
<code>                           int end_i = (i*B+B &gt; M) ? M : i*B+B;</code><br />
<code>                           int start_j = j*B;</code><br />
<code>                           int end_j = (j*B+B &gt; N) ? N : j*B+B;</code><br />
<code>                           for ( int ii = start_i; ii &lt; end_i; ++ii ) {</code><br />
<code>                             for ( int jj = start_j; jj &lt; end_j; ++jj ) {</code><br />
<code>                               double v0 = ii == 0 ? 0 : value[ii-1][jj];</code><br />
<code>                               double v1 = jj == 0 ? 0 : value[ii][jj-1];</code><br />
<code>                               value[ii][jj] = ii==0 &amp;&amp; jj==0 ? 1 : calc(v0,v1);</code><br />
<code>                              }</code><br />
<code>                           }</code><br />
<code>                         } );</code><br />
<code>      if ( i + 1 &lt; MB ) make_edge( *node[i][j], *node[i+1][j] );</code><br />
<code>      if ( j + 1 &lt; NB ) make_edge( *node[i][j], *node[i][j+1] );</code><br />
<code>    }</code><br />
<code>}</code></p>
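<p>The block arithmetic used above is easy to get wrong when the matrix size is not a multiple of the block size, so here is a small standalone sketch of it. The helper names <code>block_count</code> and <code>block_end</code> are hypothetical and do not appear in the listing; they restate the expressions used for <code>MB</code>, <code>NB</code>, <code>end_i</code> and <code>end_j</code>.</p>

```cpp
#include <algorithm>  // std::min
#include <cassert>

// Hypothetical helpers (names are not from the article).

// Number of blocks of size b needed to cover n elements, i.e. ceil(n / b).
int block_count(int n, int b) {
    assert(n >= 0 && b > 0);
    return n / b + (n % b > 0);
}

// One-past-the-end index of block k, clamped to n so the last block may be short.
int block_end(int k, int n, int b) {
    assert(k >= 0 && n >= 0 && b > 0);
    return std::min(n, k * b + b);
}
```

<p>With the values used in this post, <code>block_count(1000, 100)</code> is 10. If the matrix had 1050 rows instead, there would be 11 row blocks and the last one would cover only rows 1000 to 1049.</p>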
<p>The function <code>EvaluateGraph</code> executes the flow graph. It does this by putting a <code>continue_msg</code> to the top-left element, and then waiting for the activity in the graph to stop. When the call to <code>g.wait_for_all()</code> returns, all of the nodes have been evaluated and the final result produced.</p>
<p><code>double EvaluateGraph( graph &amp;g ) {</code><br />
<code>  node[0][0]-&gt;try_put(continue_msg());</code><br />
<code>  g.wait_for_all();</code><br />
<code>  return value[M-1][N-1];</code><br />
<code>}</code></p>
<p>Since we create a matrix of <code>continue_node</code> objects, we also have to delete them:</p>
<p><code>void CleanupGraph() {</code><br />
<code>  for( int i=0; i&lt;MB; ++i )</code><br />
<code>    for( int j=0; j&lt;NB; ++j )</code><br />
<code>     delete node[i][j];</code><br />
<code>}</code></p>
<p>Finally, the <code>main</code> function shown below invokes these functions to build, evaluate and clean up the flow graph.</p>
<p><code>int main(int argc, char *argv[]) {</code><br />
<code>  value = new double *[M];</code><br />
<code>  for ( int i = 0; i &lt; M; ++i ) value[i] = new double [N];</code></p>
<p><code>  node = new continue_node&lt;continue_msg&gt; **[MB];</code><br />
<code>  for ( int i = 0; i &lt; MB; ++i ) node[i] = new continue_node&lt;continue_msg&gt; *[NB];</code></p>
<p><code>  graph g;</code><br />
<code>  BuildGraph(g);</code><br />
<code>  double result = EvaluateGraph(g);</code><br />
<code>  CleanupGraph();</code><br />
<code>  printf("%g\n", result);</code></p>
<p><code>  for ( int i = 0; i &lt; M; ++i ) delete [] value[i];</code><br />
<code>  for ( int i = 0; i &lt; MB; ++i ) delete [] node[i];</code><br />
<code>  delete [] value;</code><br />
<code>  delete [] node;</code></p>
<p><code>  return 0;</code><br />
<code>}</code></p>
<p>I hope that this example demonstrates that a flow graph can be used to easily express a dependency graph. The basic steps are (1) create a set of <code>continue_node</code> objects, (2) connect these nodes together using calls to <code>make_edge</code>, and (3) start the execution by sending a <code>continue_msg</code> to any nodes that do not have predecessors.</p>
<p>If you are interested in learning more about the Intel® Threading Building Blocks ( Intel® TBB ) flow graph, please check out the other blog articles at <a href="http://software.intel.com/en-us/blogs/tag/flow_graph">http://software.intel.com/en-us/blogs/tag/flow_graph</a> or visit <a href="http://www.threadingbuildingblocks.org">www.threadingbuildingblocks.org</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/09/implementing-a-wave-front-computation-using-the-intel-threading-building-blocks-flow-graph/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A feature-detection example using the Intel® Threading Building Blocks flow graph</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/09/a-feature-detection-example-using-the-intel-threading-building-blocks-flow-graph/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/09/a-feature-detection-example-using-the-intel-threading-building-blocks-flow-graph/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 15:17:38 +0000</pubDate>
		<dc:creator>Michael Voss (Intel)</dc:creator>
				<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[flow_graph]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/09/a-feature-detection-example-using-the-intel-threading-building-blocks-flow-graph/</guid>
		<description><![CDATA[The Intel® Threading Building Blocks ( Intel® TBB )  flow graph is fully supported in Intel® TBB 4.0.  If you are unfamiliar with the flow graph, you can read an introduction here. Figure 1 below shows a flow graph that implements a simple feature detection application. A number of images will enter the graph and two [...]]]></description>
			<content:encoded><![CDATA[<p>The Intel® Threading Building Blocks ( Intel® TBB )  flow graph is fully supported in Intel® TBB 4.0.  If you are unfamiliar with the flow graph, you can read an introduction <a href="http://software.intel.com/en-us/blogs/2011/09/08/the-intel-threading-building-blocks-flow-graph-is-now-fully-supported/">here</a>.</p>
<p>Figure 1 below shows a flow graph that implements a simple feature detection application. A number of images will enter the graph and two alternative feature detection algorithms will be applied to each one. If either algorithm detects a feature of interest, the image will be stored for later inspection. In this article, I’ll describe each node used in this graph, and then provide and describe a complete working implementation.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/recog_picture.bmp"><img class="aligncenter size-full wp-image-35950" title="recog_picture" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/recog_picture.bmp" alt="" width="525" height="257" /></a></p>
<p><strong>Figure 1: The Intel® TBB flow graph for the feature-detection example.</strong></p>
<p>In the figure, there are four different types of nodes used to construct the application: a <code>source_node</code>, a <code>queue_node</code>, two <code>join_node</code>s, and several <code>function_node</code>s. Before I provide a sample implementation, I’ll give a brief overview of each node.</p>
<p>The first type of node is a <code>source_node</code>, which is shown pictorially using the symbol below. This type of node has no predecessors, and is used to generate messages that are injected into the graph. It executes a user functor (or lambda expression) to generate its output. The unfilled circle on its right side indicates that it buffers its output and that this buffer can be reserved. The <code>source_node</code> buffers a single item. When a buffer is reserved, a value is held for the caller until the caller either consumes or releases the value. A <code>source_node</code> will only invoke the user functor when there is nothing currently buffered in its single item output buffer.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/source_node.png"><img class="aligncenter size-full wp-image-35959" title="source_node" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/source_node.png" alt="" width="96" height="99" /></a></p>
<p>The second type of node is a <code>queue_node</code>, which is shown using the symbol below. A <code>queue_node</code> is an unbounded first-in first-out buffer. Like the <code>source_node</code>, its output is reservable.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/queue_node.png"><img class="aligncenter size-full wp-image-35960" title="queue_node" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/queue_node.png" alt="" width="117" height="87" /></a></p>
<p>The third type of node, of which there are two variants used in the example, is the <code>join_node</code>. A <code>join_node</code> has multiple input ports and generates a single output tuple that contains a value received at each port. A <code>join_node</code> can use different policies at its input ports: <code>queueing</code>, <code>reserving</code> or <code>tag_matching</code>. A <code>queueing join_node</code> greedily consumes all messages as they arrive and generates an output whenever it has at least one item in each input queue. A <code>reserving join_node</code> only attempts to generate a tuple when it can successfully reserve an item at each input port. If it cannot successfully reserve all inputs, it releases all of its reservations and will only try again when it receives a message from the port or ports it was previously unable to reserve. Lastly, a <code>tag_matching join_node</code> uses hash tables to buffer messages in its input ports. When it has received messages at each port that have matching keys, it creates an output tuple with these messages. Shown below are the symbols for the <code>reserving</code> and <code>tag_matching join_node</code>s used in Figure 1.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/reserving_join.png"><img class="aligncenter size-full wp-image-35961" title="reserving_join" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/reserving_join.png" alt="" width="96" height="94" /></a><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/tag_matching_join.png"><img class="aligncenter size-full wp-image-35962" title="tag_matching_join" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/tag_matching_join.png" alt="" width="103" height="102" /></a></p>
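<p>To make the tag-matching rule concrete, here is a single-threaded, two-port model of it. This is a sketch under stated assumptions, not the Intel® TBB implementation: the class <code>TagJoinSketch</code> and its <code>put0</code> and <code>put1</code> members are hypothetical names, there is no concurrency control, and the result is a <code>std::pair</code> rather than a tuple.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>

// Hypothetical, single-threaded model of a two-port tag-matching join.
// It is NOT the Intel TBB implementation; it only illustrates the pairing rule.
template <typename A, typename B>
class TagJoinSketch {
    std::unordered_map<std::size_t, A> port0_;  // messages waiting at port 0
    std::unordered_map<std::size_t, B> port1_;  // messages waiting at port 1

    // If both ports hold a message with this tag, remove both, write them to
    // out and return true; otherwise leave everything buffered.
    bool try_match(std::size_t tag, std::pair<A, B> &out) {
        auto i0 = port0_.find(tag);
        auto i1 = port1_.find(tag);
        if (i0 == port0_.end() || i1 == port1_.end()) return false;
        out = std::make_pair(i0->second, i1->second);
        port0_.erase(i0);
        port1_.erase(i1);
        return true;
    }

public:
    // Buffer a message at port 0 under its tag, then look for a partner.
    bool put0(std::size_t tag, const A &a, std::pair<A, B> &out) {
        port0_[tag] = a;
        return try_match(tag, out);
    }
    // Buffer a message at port 1 under its tag, then look for a partner.
    bool put1(std::size_t tag, const B &b, std::pair<A, B> &out) {
        port1_[tag] = b;
        return try_match(tag, out);
    }
};
```

<p>Messages may arrive at the two ports in any order. A pair is produced only when both ports have seen the same tag, which is what lets a graph route two independent results for the same item back together.</p>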
<p>The final node type used in this example is a <code>function_node</code>; it uses the symbol shown below. A <code>function_node</code> executes a user-provided functor or lambda expression on incoming messages, passing the return value to its successors. A <code>function_node</code> can be constructed with a limited or unlimited allowable concurrency level. A <code>function_node</code> with unlimited concurrency creates a task to apply its functor to each message as they arrive. If a <code>function_node</code> has limited concurrency, it will create tasks only up to its allowed concurrency level, buffering messages at its input as necessary so that they are not dropped.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/function_node.png"><img class="aligncenter size-full wp-image-35963" title="function_node" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/function_node.png" alt="" width="130" height="103" /></a></p>
<p>To save on space, I’m going to fake the image processing parts of this example. In particular, each image will simply be an array of characters. An image that contains the character ‘A’ has a feature recognizable by algorithm A, and an image that contains the character ‘B’ has a feature recognizable by algorithm B. So in this post, I will provide the complete code to construct and execute a flow graph that has the structure shown in Figure 1, but I’ll replace the actual computations with trivial ones.</p>
<p>Below is the declaration of <code>struct image</code>, as well as the trivial implementations that can be used as the bodies of the function nodes. The function <code>get_next_image</code> will be used by the <code>source_node</code> to generate images for processing. You might note that in <code>get_next_image</code>, every 11th image will have a feature detectable by algorithm A and every 13th image will contain a feature detectable by algorithm B. The function <code>preprocess_image</code> adds a simple offset to each character, and <code>detect_with_A</code> and <code>detect_with_B</code> do the trivial search for the characters 'A' and 'B', respectively.</p>
<p><code>#include &lt;cstring&gt;</code><br />
<code>#include &lt;cstdio&gt;</code></p>
<p><code>const int num_image_buffers = 100;</code><br />
<code>int image_size = 10000000;</code></p>
<p><code>struct image {</code><br />
<code>   const int N;</code><br />
<code>   char *data;</code><br />
<code>   image();</code><br />
<code>   image( int image_number, bool a, bool b );</code><br />
<code>};</code></p>
<p><code>image::image() : N(image_size) {</code><br />
<code>   data = new char[N];</code><br />
<code>}</code></p>
<p><code>image::image( int image_number, bool a, bool b ) : N(image_size) {</code><br />
<code>    data = new char[N];</code><br />
<code>    memset( data, '\0', N );</code><br />
<code>    data[0] = (char)image_number - 32;</code><br />
<code>    if ( a ) data[N-2] = 'A';</code><br />
<code>    if ( b ) data[N-1] = 'B';</code><br />
<code>}</code></p>
<p><code>int img_number = 0;</code><br />
<code>int num_images = 64;</code><br />
<code>const int a_frequency = 11;</code><br />
<code>const int b_frequency = 13;</code></p>
<p><code>image *get_next_image() {</code><br />
<code>    bool a = false, b = false;</code><br />
<code>    if ( img_number &lt; num_images ) {</code><br />
<code>        if ( img_number%a_frequency == 0 ) a = true;</code><br />
<code>        if ( img_number%b_frequency == 0 ) b = true;</code><br />
<code>        return new image( img_number++, a, b );</code><br />
<code>    } else {</code><br />
<code>       return NULL; // no more images to generate</code><br />
<code>    }</code><br />
<code>}</code></p>
<p><code>void preprocess_image( image *input_image, image *output_image ) {</code><br />
<code>    for ( int i = 0; i &lt; input_image-&gt;N; ++i ) {</code><br />
<code>        output_image-&gt;data[i] = input_image-&gt;data[i] + 32;</code><br />
<code>    }</code><br />
<code>}</code></p>
<p><code>bool detect_with_A( image *input_image ) {</code><br />
<code>    for ( int i = 0; i &lt; input_image-&gt;N; ++i ) {</code><br />
<code>        if ( input_image-&gt;data[i] == 'a' )</code><br />
<code>            return true;</code><br />
<code>    }</code><br />
<code>    return false;</code><br />
<code>}</code></p>
<p><code>bool detect_with_B( image *input_image ) {</code><br />
<code>    for ( int i = 0; i &lt; input_image-&gt;N; ++i ) {</code><br />
<code>        if ( input_image-&gt;data[i] == 'b' )</code><br />
<code>            return true;</code><br />
<code>    }</code><br />
<code>    return false;</code><br />
<code>}</code></p>
<p><code>void output_image( image *input_image, bool found_a, bool found_b ) {</code><br />
<code>    bool a = false, b = false;</code><br />
<code>    int a_i = -1, b_i = -1;</code><br />
<code>    for ( int i = 0; i &lt; input_image-&gt;N; ++i ) {</code><br />
<code>        if ( input_image-&gt;data[i] == 'a' ) { a = true; a_i = i; }</code><br />
<code>        if ( input_image-&gt;data[i] == 'b' ) { b = true; b_i = i; }</code><br />
<code>    }</code><br />
<code>    printf("Detected feature (a,b)=(%d,%d)=(%d,%d) at (%d,%d) for image %p:%d\n",</code><br />
<code>           a, b, found_a, found_b, a_i, b_i, (void *)input_image, input_image-&gt;data[0]);</code><br />
<code>}</code></p>
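<p>Two details of the listing above are easy to miss, so here is a standalone aside that is not part of the cut-and-paste listing. First, the images are generated with uppercase 'A' and 'B', but the detectors look for lowercase 'a' and 'b': assuming an ASCII character set, the +32 offset applied by <code>preprocess_image</code> maps one to the other. Second, with 64 images only a handful carry a feature. The helper names below are hypothetical.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the article): image numbers in [0, count)
// that receive a feature when one is planted every `frequency` images.
std::vector<int> images_with_feature(int count, int frequency) {
    assert(count >= 0 && frequency > 0);
    std::vector<int> hits;
    for (int n = 0; n < count; ++n)
        if (n % frequency == 0) hits.push_back(n);
    return hits;
}

// Mirrors the +32 offset that preprocess_image applies to each character.
// Assumes ASCII, where 'A' + 32 == 'a' and 'B' + 32 == 'b'.
char preprocess_char(char c) {
    return static_cast<char>(c + 32);
}
```

<p>With <code>num_images = 64</code>, <code>images_with_feature(64, 11)</code> yields 0, 11, 22, 33, 44 and 55, and <code>images_with_feature(64, 13)</code> yields 0, 13, 26, 39 and 52, so image 0 is the only one carrying both features. The same offset explains why the constructor stores <code>image_number - 32</code> in <code>data[0]</code>: after preprocessing, that byte holds the image number again.</p>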
<p>The code to implement the flow graph itself is shown in function <code>main</code> below. I will interject text in the middle of the listing of <code>main</code> to describe the use of the flow graph components. If you want to build this example, you can just cut and paste the code snippets above and below linearly into a single file.</p>
<p><code>int num_graph_buffers = 8;</code></p>
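<p><code>#include &lt;tuple&gt;   // std::tuple, std::get</code><br />
<code>#include &lt;utility&gt; // std::pair</code><br />
<code>#include "tbb/flow_graph.h"</code></p>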
<p><code>using namespace tbb;</code><br />
<code>using namespace tbb::flow;</code></p>
<p><code>int main() {</code></p>
<p>First, a <code>graph g</code> is created. All of the nodes will belong to this single graph. A few typedefs are provided to make it easier to refer to the outputs of the join nodes:</p>
<p><code>    graph g;</code></p>
<p><code>    typedef std::tuple&lt; image *, image * &gt; resource_tuple;</code><br />
<code>    typedef std::pair&lt; image *, bool &gt; detection_pair;</code><br />
<code>    typedef std::tuple&lt; detection_pair, detection_pair &gt; detection_tuple;</code></p>
<p>Next, the <code>queue_node</code> that holds the image buffers is created, along with the two join nodes. Again, note that the <code>resource_join</code> is using the <code>reserving</code> policy, while <code>detection_join</code> uses the <code>tag_matching</code> policy. To use <code>tag_matching</code>, the user must provide functors that can extract the tag from the item; these appear as the additional arguments to the constructor. Here the tag is the image pointer itself, cast to <code>size_t</code>.</p>
<p><code>    queue_node&lt; image * &gt; buffers( g );</code><br />
<code>    join_node&lt; resource_tuple, reserving &gt; resource_join( g );</code><br />
<code>    join_node&lt; detection_tuple, tag_matching &gt; detection_join( g,</code><br />
<code>            [](const detection_pair &amp;p) -&gt; size_t { return (size_t)p.first; },</code><br />
<code>            [](const detection_pair &amp;p) -&gt; size_t { return (size_t)p.first; }  );</code></p>
<p>Next, the nodes that execute the user’s code are created, including the <code>source_node</code> and the four <code>function_node</code>s. The user’s code is passed to each node as a C++ lambda expression (a function object could also be used). For the most part, each lambda expression is a bit of wrapper code that calls the functions described earlier, obtaining inputs and creating outputs as necessary. The <code>make_edge</code> calls wire the nodes together as shown in Figure 1.</p>
<p><code>    source_node&lt; image * &gt; src( g,</code><br />
<code>                                []( image* &amp;next_image ) -&gt; bool {</code><br />
<code>                                    next_image = get_next_image();</code><br />
<code>                                    return next_image != NULL;</code><br />
<code>                                }</code><br />
<code>                              );</code><br />
<code>    make_edge(src, input_port&lt;0&gt;(resource_join) );</code><br />
<code>    make_edge(buffers, input_port&lt;1&gt;(resource_join) );</code></p>
<p><code>    function_node&lt; resource_tuple, image * &gt;</code><br />
<code>        preprocess_function( g, unlimited,</code><br />
<code>                             []( const resource_tuple &amp;in ) -&gt; image * {</code><br />
<code>                                 image *input_image = std::get&lt;0&gt;(in);</code><br />
<code>                                 image *output_image = std::get&lt;1&gt;(in);</code><br />
<code>                                 preprocess_image( input_image, output_image );</code><br />
<code>                                 delete input_image;</code><br />
<code>                                 return output_image;</code><br />
<code>                             }</code><br />
<code>                           );</code></p>
<p><code>    make_edge(resource_join, preprocess_function );</code></p>
<p><code>    function_node&lt; image *, detection_pair &gt;</code><br />
<code>        detect_A( g, unlimited,</code><br />
<code>                 []( image *input_image ) -&gt; detection_pair {</code><br />
<code>                    bool r = detect_with_A( input_image );</code><br />
<code>                    return std::make_pair( input_image, r );</code><br />
<code>                 }</code><br />
<code>               );</code></p>
<p><code>    function_node&lt; image *, detection_pair &gt;</code><br />
<code>        detect_B( g, unlimited,</code><br />
<code>                 []( image *input_image ) -&gt; detection_pair {</code><br />
<code>                    bool r = detect_with_B( input_image );</code><br />
<code>                    return std::make_pair( input_image, r );</code><br />
<code>                 }</code><br />
<code>               );</code></p>
<p><code>    make_edge(preprocess_function, detect_A );</code><br />
<code>    make_edge(detect_A, input_port&lt;0&gt;(detection_join) );</code><br />
<code>    make_edge(preprocess_function, detect_B );</code><br />
<code>    make_edge(detect_B, input_port&lt;1&gt;(detection_join) );</code></p>
<p><code>    function_node&lt; detection_tuple, image * &gt;</code><br />
<code>        decide( g, serial,</code><br />
<code>                 []( const detection_tuple &amp;t ) -&gt; image * {</code><br />
<code>                     const detection_pair &amp;a = std::get&lt;0&gt;(t);</code><br />
<code>                     const detection_pair &amp;b = std::get&lt;1&gt;(t);</code><br />
<code>                     image *img = a.first;</code><br />
<code>                     if ( a.second || b.second ) {</code><br />
<code>                         output_image( img, a.second, b.second );</code><br />
<code>                     }</code><br />
<code>                     return img;</code><br />
<code>                 }</code><br />
<code>               );</code></p>
<p><code>    make_edge(detection_join, decide);</code><br />
<code>    make_edge(decide, buffers);</code></p>
<p>Because of the reserving join node at the front of the graph, the graph will remain idle until there are image buffers available in the <code>buffers</code> queue. The for-loop below allocates the buffers and puts them into the queue. After the loop, the call to <code>g.wait_for_all()</code> blocks until all of the images have been processed and the graph is idle again.</p>
<p><code>    // Put image buffers into the buffer queue</code><br />
<code>    for ( int i = 0; i &lt; num_graph_buffers; ++i ) {</code><br />
<code>        image *img = new image;</code><br />
<code>        buffers.try_put( img );</code><br />
<code>    }</code><br />
<code>    g.wait_for_all();</code></p>
<p>When the graph is idle, all of the buffers will again be in the <code>buffers</code> queue. The <code>queue_node</code> therefore needs to be drained and the buffers deallocated:</p>
<p><code>    for ( int i = 0; i &lt; num_graph_buffers; ++i ) {</code><br />
<code>        image *img = NULL;</code><br />
<code>        if ( !buffers.try_get(img) )</code><br />
<code>            printf("ERROR: lost a buffer\n");</code><br />
<code>        else</code><br />
<code>            delete img;</code><br />
<code>    }</code><br />
<code>    return 0;</code><br />
<code>}</code></p>
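<p>To make the role of the two tag functors passed to <code>detection_join</code> more concrete, here is a minimal sketch in plain C++ that models how results could be paired by tag. It is only a toy model of the pairing behavior described above, not how Intel® TBB implements <code>join_node</code>. The names <code>toy_tag_join</code>, <code>tag_of</code> and <code>matched_pair</code>, as well as the one-field <code>image</code> struct, are invented for this illustration, and the sketch assumes that every image in flight lives in its own buffer so that the buffer address is a unique tag.</p>

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the article's image type.
struct image { int id; };

typedef std::pair<image *, bool> detection_pair;
typedef std::pair<detection_pair, detection_pair> matched_pair;

// Same idea as the functors given to detection_join: the address of the
// image buffer serves as the tag.
inline std::size_t tag_of(const detection_pair &p) {
    return reinterpret_cast<std::size_t>(p.first);
}

// Toy model of a two-port tag-matching join (an assumption for illustration,
// not the TBB implementation). An item waits on its port until an item with
// the same tag arrives on the other port; then the two are emitted together.
class toy_tag_join {
    typedef std::unordered_map<std::size_t, detection_pair> pending_map;
    pending_map pending_a_;             // port 0 items waiting for a partner
    pending_map pending_b_;             // port 1 items waiting for a partner
    std::vector<matched_pair> matched_; // completed (A, B) pairs, in match order

public:
    // Port 0: a result from detector A.
    void put_a(const detection_pair &p) {
        const std::size_t tag = tag_of(p);
        pending_map::iterator it = pending_b_.find(tag);
        if (it == pending_b_.end()) {
            pending_a_[tag] = p;        // partner has not arrived yet
            return;
        }
        matched_.push_back(matched_pair(p, it->second));
        pending_b_.erase(it);
    }

    // Port 1: a result from detector B.
    void put_b(const detection_pair &p) {
        const std::size_t tag = tag_of(p);
        pending_map::iterator it = pending_a_.find(tag);
        if (it == pending_a_.end()) {
            pending_b_[tag] = p;        // partner has not arrived yet
            return;
        }
        matched_.push_back(matched_pair(it->second, p));
        pending_a_.erase(it);
    }

    const std::vector<matched_pair> &matched() const { return matched_; }
};
```

<p>If the results for two images are fed into <code>put_a</code> in one order and into <code>put_b</code> in the opposite order, each image's A result is still paired with its own B result, because pairing goes by tag rather than by arrival order. That is the property the <code>decide</code> node relies on when <code>detect_A</code> and <code>detect_B</code> finish at different times.</p>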
<p>I hope that this feature-detection example demonstrates how a reasonably complex flow graph that passes messages between nodes can be implemented. To learn more about the new features in Intel® Threading Building Blocks 4.0, visit <a href="http://www.threadingbuildingblocks.org">http://www.threadingbuildingblocks.org</a>. To learn more about the Intel® TBB flow graph, check out the other blog articles at <a href="http://software.intel.com/en-us/blogs/tag/flow_graph/">http://software.intel.com/en-us/blogs/tag/flow_graph/</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/09/a-feature-detection-example-using-the-intel-threading-building-blocks-flow-graph/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Do you have a face for parallelism?</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/08/do-you-have-a-face-for-parallelism/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/08/do-you-have-a-face-for-parallelism/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 23:19:28 +0000</pubDate>
		<dc:creator>Clay Breshears (Intel)</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Array Building Blocks]]></category>
		<category><![CDATA[Cilk Plus]]></category>
		<category><![CDATA[idf]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/08/do-you-have-a-face-for-parallelism/</guid>
		<description><![CDATA[It's never nice to tell someone that they have "a face for radio." But, if you're going to be at IDF 2011, you should attend the "Faces of Parallelism" lab to show off your parallel face.]]></description>
			<content:encoded><![CDATA[
<p>Are you going to the <a href="http://www.intel.com/idf/">Intel Developer Forum</a> next week (13-15 SEP 2011)? If so, you should check out the "Faces of Parallelism" lab that is being held on Wednesday afternoon. This event will be a self-paced lab on parallel programming with the <a href="http://software.intel.com/en-us/articles/intel-parallel-building-blocks/">Intel® Parallel Building Blocks</a>. If you have some experience with one of the programming libraries, this will give you a chance to experience features of one of the others. For example, you may know how to code with Intel® Threading Building Blocks, so you can try Intel® Array Building Blocks or Intel® Cilk Plus.</p>
<p>(<em>OK, OK. Stifle that yawn. You've been there and done that. I get it.</em>)</p>
<p>No need to worry about getting out of your depth, either, since each lab will have three levels based on programmer experience. You will be able to participate even if you've never touched a thread in your code before.</p>
<p>If you're an expert in one or more of these models, there is still something new we can offer you. Once you've gone through the PBB labs, there will be an opportunity to do a fourth programming lab, and you will just have to attend to see what it covers. What I can tell you is that you will watch the latest Intel® silicon features be unlocked via Intel’s optimized Software Tools.</p>
<p>(<em>Do I have your attention, now? Good, 'cause I'm about to turn it up to "</em>11<em>"</em>).</p>
<p>That's not the best part, though. The title "Faces of Parallelism" isn't just a catchy phrase. There is a reason for the choice. After you've spent some time working through some of the interesting exercises, we want to know what you think about the programming model you've just used and how you think that technology would be useful to you. To preserve your comments and ideas, we will have a video crew standing by to tape your testimonial. (And show your face as a "face of parallelism".)</p>
<p>After the event is over, a squad of experts will be reviewing the comments submitted by participants and choosing the best submission in each of the four programming areas. Those with the best submissions will be awarded a great prize!</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/Faces-of-Parallelism-Clay-small.jpg"><img class="alignright size-full wp-image-36117" title="Faces of Parallelism" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/Faces-of-Parallelism-Clay-small.jpg" alt="" width="166" height="166" /></a>So, if you've got a face (like one of these clowns pictured here) and an interest in parallel programming at any level of experience, this promises to be a fun and informative event. I hope you'll want to join me and the other proctors (each an expert in the programming models being featured) to try something new that can be added to your parallel programming repertoire.</p>
<p>When you get your IDF material, look for the "Faces of Parallelism" lab on Wednesday afternoon. It is scheduled for 4 hours, but you can drop by anytime you have 20-30 minutes free; there's no strict start time after the event gets underway. Join us and you, too, could be one of the next faces of parallelism.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/08/do-you-have-a-face-for-parallelism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

