<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Software Tools</title>
	<atom:link href="http://software.intel.com/en-us/blogs/category/software-tools/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Doctor Fortran in &quot;I Can C Clearly Now, Part I&quot;</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/11/doctor-fortran-in-i-can-c-clearly-now-part-i/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/11/doctor-fortran-in-i-can-c-clearly-now-part-i/#comments</comments>
		<pubDate>Fri, 11 May 2012 20:38:25 +0000</pubDate>
		<dc:creator>Steve Lionel (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C interoperability]]></category>
		<category><![CDATA[Fortran]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/11/doctor-fortran-in-i-can-c-clearly-now-part-i/</guid>
		<description><![CDATA[Spend any time in the comp.lang.fortran newsgroup, or other places where programming languages are discussed, and you’ll soon see a new “Which is better, Fortran or C?” thread show up. These never fail to produce heated comments from people who should know better. My answer is that neither is “better” – each has its strengths [...]]]></description>
			<content:encoded><![CDATA[<p>Spend any time in the comp.lang.fortran newsgroup, or other places where programming languages are discussed, and you’ll soon see a new “Which is better, Fortran or C?” thread show up. These never fail to produce heated comments from people who should know better. My answer is that neither is “better” – each has its strengths and weaknesses.</p>
<p>For decades, smart programmers have used both in their applications, using C where it made sense and Fortran where that made sense. This was made easier by vendor-specific extensions to Fortran that dealt with things such as case-sensitive names and pass-by-value. Extensions such as %VAL and LOC have become so ingrained into the Fortran culture that many are astonished to find that they are non-standard.</p>
<p>Fortran 2003 added a whole class of features for “C interoperability” to the standard, finally enabling mixed-language programming in a reasonably portable manner. I am not aware of any other major programming language standard that has extended a hand in this manner. While many Fortran programmers have warmly embraced the new features, there’s still a lot of confusion about them, and I thought it was time to try to explain the new landscape.</p>
<p>This is a big topic, so I am going to split it up across several posts.</p>
<h2>Interoper-what?</h2>
<p>First, some definitions.  The Fortran standard talks about interoperability with a “companion C processor”. (In Fortran-speak, “processor” means something that understands and runs code written in the language. For the most part, you can substitute “compiler”, but keep in mind that the compiler operates in an OS and CPU environment that may affect its behavior.) Each Fortran implementation is free to choose which C is its “companion”.  For Intel Fortran, that is Microsoft Visual C++ on Windows, and gcc on Linux and OS X. What about Intel C++?  That is also compatible with Visual C++ on Windows and gcc on Linux and OS X, so Intel Fortran will also interoperate with Intel C++.</p>
<p>Note that the standard says “companion C processor”, not “companion C++ processor”.  In particular, the standard references the C99 standard, or ISO/IEC 9899:1999 to be specific. The companion processors may also build C++ code, but standard interoperability assumes C. You can use C++, but must stick to what is compatible with C when interoperating with Fortran.</p>
<p>What is meant by “interoperability” here? F2008 says it thusly: “Fortran provides a means of referencing procedures that are defined by means of the C programming language or procedures that can be described by C prototypes…, even if they are not actually defined by means of C. Conversely, there is a means of specifying that a procedure defined by a Fortran subprogram can be referenced by a function defined by means of C. In addition, there is a means of defining global variables that are associated with C variables whose names have external linkage.”  To this, I will add that there are also means to declare Fortran variables, data structures and enumerations that correspond to similar declarations in C.</p>
<p>Fortran provides four major “tools” for enabling interoperability with C.  These are:</p>
<ul>
<li>Restrictions on which Fortran types are considered interoperable</li>
<li>The BIND(C) <em>language-binding-spec</em></li>
<li>The ISO_C_BINDING intrinsic module</li>
<li>The VALUE attribute</li>
</ul>
<p>I frequently see people refer to all of the interoperability tools as “ISO_C_BINDING”, but this is not correct; one can use the interoperability features without using the module.</p>
<h2>Interoperable data types</h2>
<p>The core concept of interoperability is that something should work the same way in Fortran as it does in C. While Fortran and C each support many of the same basic data types, not everything translates cleanly.</p>
<p>One difference is that Fortran has the concept of “kinds”, whereas C considers these somewhat distinct types. For example, consider the Fortran INTEGER type.  C has numerous integer types, from <strong>short int</strong> to <strong>long long int</strong>, and some specialty types such as <strong>intptr_t</strong>. These may or may not have corresponding kinds in Fortran.  For each of the C integer types which might be interoperable, ISO_C_BINDING declares a named constant (PARAMETER) giving the kind number for the implementation’s equivalent INTEGER kind.</p>
<p>For example, there’s the simple C <strong>int</strong> type.  This corresponds to INTEGER(C_INT), where C_INT is defined in ISO_C_BINDING.  In Intel Fortran, the value is always 4, as a C int corresponds with Fortran INTEGER(4), but some other Fortran may use different kind numbers.  Using the named constant ensures portability.</p>
<p>More interesting is the C <strong>intptr_t</strong> type.  This is an integer that is large enough to hold a pointer (address). In Intel Fortran, this would be INTEGER(4) when building a 32-bit application and INTEGER(8) for a 64-bit application.  Intel Fortran provides different copies of ISO_C_BINDING for various platforms so you always get the right one.</p>
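<p>To see the C side of this, here is a minimal C++ sketch (not Fortran, and not part of any compiler's documentation) that inspects the widths involved. The helper names <code>byte_width</code> and <code>pointer_round_trips</code> are made up for this illustration, and the comment relating byte width to Intel Fortran kind numbers is only an inference from the INTEGER(4)/INTEGER(8) examples above.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte width of a C type. In the Intel Fortran examples above the kind
// number happens to equal this width (INTEGER(4), INTEGER(8)); other
// Fortran implementations may number their kinds differently.
template <typename T>
constexpr std::size_t byte_width() { return sizeof(T); }

// intptr_t has to be wide enough to hold an object pointer.
static_assert(byte_width<std::intptr_t>() >= byte_width<void*>(),
              "intptr_t is too small to hold a pointer");

// Round-trip an address through intptr_t, which is what C code relies on
// when it hands an address to Fortran as INTEGER(C_INTPTR_T).
inline bool pointer_round_trips(int* p) {
    assert(p != nullptr);
    std::intptr_t as_integer = reinterpret_cast<std::intptr_t>(p);
    return reinterpret_cast<int*>(as_integer) == p;
}
```

<p>On a typical 64-bit build <code>byte_width&lt;std::intptr_t&gt;()</code> is 8 and on a typical 32-bit build it is 4, which is why a single hard-coded kind number would not be portable.</p>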
<p>Note that Fortran has no unsigned integer types, so there are no constants for C’s unsigned types. Such types are <strong>not</strong> interoperable.</p>
<p>You might wonder what happens if there is a “kind” of C type not supported by the Fortran implementation. The answer is that the named constant for that type is defined as -1, so you’ll get a compile-time error if you try to use it. We’ll see a use of this shortly.</p>
<p>Similarly, there are constants defined for REAL, COMPLEX, LOGICAL and CHARACTER.   For REAL, the standard offers the possibility of a C <strong>long double</strong> type. This is implemented in different ways by various C compilers on various platforms supported by Intel Fortran.  In gcc on 32-bit Linux, <strong>long double</strong> is an 80-bit floating type, as supported by the X87 instruction set.  Intel Fortran doesn’t support this, so there, C_LONG_DOUBLE is -1. gcc on OS X, however, defines it as a 128-bit type that is the same as Intel Fortran’s REAL(16), so C_LONG_DOUBLE is 16 there.  And on 64-bit Linux, or on Windows, long double is treated the same as double, so C_LONG_DOUBLE is 8.  As long as you use the constants for kind values and the corresponding types in C, you’ll match.</p>
<p>LOGICAL and CHARACTER need special treatment when it comes to interoperability.  The Fortran standard says that LOGICAL corresponds to C’s <strong>_Bool</strong> type, and defines a single kind value C_BOOL, which is 1 in Intel Fortran. But Intel Fortran, by default, tests LOGICALs for true/false differently than C does.  Where C uses zero for false and not-zero for true, Intel Fortran defaults to treating even values as false and odd values as true. If you are going to use LOGICAL types to interoperate with C, be sure to specify the -fpscomp logicals (/fpscomp:logicals) option, which changes the interpretation to be C-like.  This is included if you use -standard-semantics (/standard-semantics) – I recommend using this option any time you use Fortran 2003 (or later) features.</p>
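<p>The difference between the two truth tests can be modelled in a few lines of C++. This is only a sketch of the rule as described above (odd/even versus zero/non-zero); the function names are invented for the illustration and say nothing about how either compiler implements the test internally.</p>

```cpp
#include <cassert>

// C-style test: zero is false, anything else is true.
constexpr bool c_is_true(int bits) { return bits != 0; }

// Default Intel Fortran test as described in the text: odd values are
// true and even values are false, so only the low-order bit matters.
constexpr bool intel_default_is_true(int bits) { return (bits & 1) != 0; }

// True when both conventions interpret the same bit pattern identically.
constexpr bool conventions_agree(int bits) {
    return c_is_true(bits) == intel_default_is_true(bits);
}

// Spot check: 0 and 1 are safe, but 2 is "true" in C and "false" under
// the odd/even rule.
inline void check_logical_conventions() {
    assert(conventions_agree(0));
    assert(conventions_agree(1));
    assert(!conventions_agree(2));
}
```

<p>Any even, non-zero value produced by C code is where the two sides would disagree, which is the case the C-like interpretation option is meant to remove.</p>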
<p>Now we come to CHARACTER. C does not have character strings, at least not in the way Fortran does.  Really.  It has arrays of single characters, so this is how you must represent things in Fortran.  There is a kind value defined, C_CHAR, corresponding to the C <strong>char</strong> type. But only length 1 character variables are interoperable.  I’ll talk more about that when I come to procedure arguments, but just know that it is not as dire a situation as you might think.</p>
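<p>For readers less familiar with the C side, this small C++ sketch shows what “arrays of single characters” means in practice. The names <code>greeting</code> and <code>c_length</code> are placeholders for this illustration only.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A C "string" is an array of single chars ending in a NUL terminator,
// so three visible characters occupy four array elements.
static const char greeting[] = "abc";
static_assert(sizeof(greeting) == 4, "3 characters plus the terminating NUL");

// Length as C sees it: elements up to, but not including, the NUL.
inline std::size_t c_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}
```

<p>Each element of such an array is one <strong>char</strong>, which is the unit that corresponds to a length 1 character of kind C_CHAR.</p>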
<p>Derived types can also be interoperable, and that will be discussed next time when I talk about BIND(C).</p>
<p>There are other restrictions on interoperable variables. Scalar variables are interoperable only if their type parameters (kind and length) are interoperable (see above), they are not coarrays, they do not have the POINTER or ALLOCATABLE attribute (this may change in the future, I’ll talk about that in another post), and, if they are of character type, their length is neither assumed nor defined by a non-constant expression. (Wait, I thought you said only length 1 was interoperable!  Patience, grasshopper…)</p>
<p>Arrays are interoperable if the base type meets the scalar variable requirements above, if it is explicit shape or assumed-size, and is not zero-sized. Furthermore, assumed-size arrays are interoperable only with C arrays that have no size specified. There are some additional rules on rank, in particular, C arrays with rank greater than 1 are not interoperable because they are “arrays of arrays”.</p>
<h2>To be continued…</h2>
<p>The next post will be dedicated to BIND(C), in all its manifestations.  “C” you then!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/11/doctor-fortran-in-i-can-c-clearly-now-part-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Deterministic Reduction: a new Community Preview Feature in Intel® Threading Building Blocks</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/#comments</comments>
		<pubDate>Fri, 11 May 2012 10:22:42 +0000</pubDate>
		<dc:creator>Alexei Katranov (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Server]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Computer Arithmetic]]></category>
		<category><![CDATA[deterministic calculations]]></category>
		<category><![CDATA[floating point]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[parallel_deterministic_reduce]]></category>
		<category><![CDATA[parallel_reduce]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/</guid>
		<description><![CDATA[Computer Arithmetic has a lot of peculiarities [1]. One of these pitfalls is associativity failure in floating point arithmetic. For example, the two sums of fractions calculations below will not produce the same result when using floats: In a sequential program, it is not a big problem since the calculation order is exactly specified so [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Computer Arithmetic has a lot of peculiarities <a title="What every computer scientist should know about floating-point arithmetic, David Goldberg, Xerox Palo Alto Research Center, Palo Alto, CA, 1991." href="http://dx.doi.org/10.1145/103162.103163">[1]</a>. One of these pitfalls is associativity failure in floating point arithmetic. For example, the two sums of fractions calculations below will not produce the same result when using <code>float</code>s:</p>
<h4 style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/formula.png"><img class="size-large wp-image-47370 aligncenter" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/formula-1024x219.png" alt="The sum of fractions depends on the calculation order" width="461" height="99" align="middle" /></a></h4>
<p style="text-align: justify;">In a sequential program, it is not a big problem since the calculation order is exactly specified so the result is predictable and repeatable. The situation is not so clear in parallel programming.</p>
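<p style="text-align: justify;">As a small self-contained illustration of associativity failure, the sketch below adds the same three <code>float</code>s in two different orders. The values and the function names are chosen for this sketch and are not the fractions from the formula above; the stated results assume IEEE 754 single-precision arithmetic without value-changing optimizations such as fast-math.</p>

```cpp
#include <cassert>

// (a + b) + c, with the intermediate stored in a named float so that it
// is rounded to single precision.
inline float sum_left(float a, float b, float c) {
    float ab = a + b;
    return ab + c;
}

// a + (b + c): same operands, different grouping.
inline float sum_right(float a, float b, float c) {
    float bc = b + c;
    return a + bc;
}

// Near 1e8 adjacent floats are 8 apart, so adding 1.0f to -1e8f is lost
// to rounding, while adding it to an exact 0.0f is not.
inline void demo_non_associativity() {
    assert(sum_left(1e8f, -1e8f, 1.0f) == 1.0f);   // (1e8 - 1e8) + 1
    assert(sum_right(1e8f, -1e8f, 1.0f) == 0.0f);  // 1e8 + (-1e8 + 1)
}
```

<p style="text-align: justify;">Both functions are mathematically the same sum, yet one returns 1 and the other 0.</p>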
<p style="text-align: justify;">To make the example parallel, I used the parallel_reduce template function from Intel® Threading Building Blocks (Intel® TBB):</p>
<pre name="code" class="cpp:nocontrols">typedef std::vector&lt;float&gt;::iterator iter_t;
std::vector&lt;float&gt; arr( N, 1.0f/(float)N );
float sum = tbb::parallel_reduce( tbb::blocked_range&lt;iter_t&gt;( arr.begin(), arr.end() ), 0.0f,
    []( const tbb::blocked_range&lt;iter_t&gt;&amp; r, float sum ) {
        return std::accumulate( r.begin(), r.end(), sum );
    },
    std::plus&lt;float&gt;() );
std::cout &lt;&lt; sum &lt;&lt; std::endl;</pre>
<p style="text-align: justify;">As in the examples above, the code calculates the sum of N fractions, but it uses multiple processor cores if available. As is well known, different orders of calculation can produce different results. If we run it 10 times with N=1000, we get something like this:</p>
<blockquote><p>0.999991<br />
1<br />
0.999999<br />
0.999996<br />
0.999998<br />
0.999998<br />
0.999998<br />
1<br />
0.999997<br />
0.999998</p></blockquote>
<p style="text-align: justify;">Note that the result differs from run to run! Although the developer specifies the calculation, the order in which it is carried out is no longer under the developer’s control once it runs in parallel.</p>
<p style="text-align: justify;">On the other hand, it is not as bad as all that. Although the OS schedules the threads and so introduces nondeterminism into the application, it is still possible to manage the order of calculations. One of the new features of Intel TBB 4.0 is the parallel_deterministic_reduce template algorithm. The algorithm has the same interface as parallel_reduce except that it does not allow you to specify a partitioner. (For parallel_reduce it is possible to pass a partitioner as the last argument.) We will discuss why this restriction exists later. But for now, let’s replace parallel_reduce with parallel_deterministic_reduce and look at how the result changes:</p>
<pre name="code" class="cpp:nocontrols">typedef std::vector&lt;float&gt;::iterator iter_t;
std::vector&lt;float&gt; arr( N, 1.0f/(float)N );
float sum = tbb::parallel_deterministic_reduce( tbb::blocked_range&lt;iter_t&gt;( arr.begin(), arr.end() ), 0.0f,
    []( const tbb::blocked_range&lt;iter_t&gt;&amp; r, float sum ) {
        return std::accumulate( r.begin(), r.end(), sum );
    },
    std::plus&lt;float&gt;() );
std::cout &lt;&lt; sum &lt;&lt; std::endl;</pre>
<p>Again run it 10 times:</p>
<blockquote><p>1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1</p></blockquote>
<p style="text-align: justify;">The key point here is that the result is the same from run to run.</p>
<p style="text-align: justify;">The sources of non-determinism in parallel_reduce derive from partitioning and body splitting. Let’s consider each of these subjects:</p>
<ul style="text-align: justify;">
<li>Partitioning. The simple_partitioner determines exactly how many and which subranges are created. It splits the iteration range until each subrange is smaller than a given grain size. Thus the behavior only depends on the range size and grain size specified by the developer. However, other types of partitioning in Intel TBB are non-deterministic: to improve performance of the algorithms, range splitting provided by these partitioners depends on run-time stealing events, which we cannot predict.</li>
</ul>
<ul style="text-align: justify;">
<li>Body splitting. For performance reasons parallel_reduce minimizes body copies: it splits the body only when consecutive subranges are processed by different threads. Thus body splitting, like “advanced” partitioning, also depends on non-deterministic task stealing.</li>
</ul>
<p style="text-align: justify;">The example shows that parallel_reduce is not suitable for non-associative operations such as floating point arithmetic when repeatable results are required. parallel_deterministic_reduce was developed to achieve a repeatable result from a reduction with non-associative operations. From the partitioning considerations above, it follows that only the simple_partitioner can be used for parallel_deterministic_reduce, so no alternative partitioner can be chosen. Consequently, parallel_deterministic_reduce always challenges us with choosing an appropriate grain size. In addition, smart body splitting has been disabled for the sake of deterministic behavior, so a new body is created for each subrange. This complicates grain size selection even more: on the one hand, a small grain size increases the number of body copies and the overall overhead; on the other hand, a big grain size may lead to load imbalance and underutilization. Fig. 1 shows the relative performance of parallel_deterministic_reduce (simple_partitioner with various grain sizes) in comparison with parallel_reduce (auto_partitioner with default grain size). With an appropriate grain size, parallel_deterministic_reduce performs as well as parallel_reduce, but an incorrectly chosen grain size may lead to significant performance degradation, as shown in Fig. 1 at the extremes of the grain size axis.</p>
<h4 style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/chart.png"><img class="aligncenter size-full wp-image-47423" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/chart.png" alt="Fig.1. Comparison of parallel_reduce (auto_partitioner) and parallel_deterministic_reduce (simple_partitioner) on Pi calculation example." width="640" height="383" /></a><br />
Fig.1. Comparison of parallel_reduce (auto_partitioner) and parallel_deterministic_reduce (simple_partitioner) on Pi calculation example.</h4>
<p style="text-align: justify;">To demonstrate the split-join order behavior of parallel_deterministic_reduce, a small example is given with range [0, 20) and grain size = 5, similar to examples for parallel_reduce in the Intel TBB Reference manual:</p>
<h4 style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/tree.png"><img class="aligncenter size-full wp-image-47427" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/05/tree.png" alt="A tree of subranges" width="410" height="141" /></a><br />
A tree of subranges</h4>
<p style="text-align: justify;">For each right node a new body is created by the body split constructor. The slash marks (/) in the tree show where the body split is performed. Thus, for the current example, parallel_deterministic_reduce will always produce 4 subranges and 4 different bodies associated with them. Each of these subranges may be executed in parallel. When both children of a node finish, the corresponding bodies are merged: the right child body is “added” to the left child body (in our examples via the <code>std::plus&lt;float&gt;()</code> binary function).</p>
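<p style="text-align: justify;">The splitting rule can be sketched without TBB. The code below is a sequential model, not the library’s implementation: it assumes a range is halved at its midpoint while it is larger than the grain size, which reproduces the four subranges of the tree for [0, 20) with grain size 5. The names <code>index_range</code>, <code>collect_leaves</code> and <code>leaf_ranges</code> are invented for this sketch.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second).
using index_range = std::pair<std::size_t, std::size_t>;

// Halve [begin, end) while it is larger than the grain size and append
// the resulting leaves in left-to-right order.
inline void collect_leaves(std::size_t begin, std::size_t end, std::size_t grain,
                           std::vector<index_range>& leaves) {
    assert(grain > 0 && begin <= end);
    if (end - begin <= grain) {              // small enough: this is a leaf
        leaves.push_back(index_range(begin, end));
        return;
    }
    std::size_t middle = begin + (end - begin) / 2;
    collect_leaves(begin, middle, grain, leaves);  // left child
    collect_leaves(middle, end, grain, leaves);    // right child (gets a new body)
}

// Convenience wrapper returning all leaves of the split tree.
inline std::vector<index_range> leaf_ranges(std::size_t begin, std::size_t end,
                                            std::size_t grain) {
    std::vector<index_range> leaves;
    collect_leaves(begin, end, grain, leaves);
    return leaves;
}
```

<p style="text-align: justify;">Calling <code>leaf_ranges(0, 20, 5)</code> yields [0, 5), [5, 10), [10, 15) and [15, 20). Because the result depends only on the range and the grain size, it is the same on every run regardless of how many threads are available.</p>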
<p style="text-align: justify;">To conclude, parallel_deterministic_reduce provides a deterministic number and deterministic sizes of subranges, and it exactly defines which pairs of subranges are merged. It’s important to note that a repeatable result obtained with parallel_deterministic_reduce may still be different from that obtained via serial execution. Moreover, the results may be different for various grain sizes, since range splitting depends on the grain size. Also, the algorithm is not intended to improve the accuracy of calculations. The exact result of 1 in the above example of fraction sum calculation was obtained by chance. For other examples the algorithm can cause a decrease in accuracy. Overall, parallel_deterministic_reduce is not a replacement for parallel_reduce but an alternative solution for those who need repeatability.</p>
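<p style="text-align: justify;">The dependence on grain size can be demonstrated with a sequential, TBB-free model of a fixed split/join tree. This is a sketch under the same midpoint-splitting assumption as before, with each leaf summed from 0.0f and the right result added to the left; <code>tree_sum</code> is a hypothetical helper, and the stated results assume IEEE 754 single-precision arithmetic.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum values[begin, end) with a fixed split/join tree: halve the range
// while it is larger than the grain size, sum each leaf sequentially
// starting from 0.0f, then add the right child's result to the left's.
inline float tree_sum(const std::vector<float>& values, std::size_t begin,
                      std::size_t end, std::size_t grain) {
    assert(grain > 0 && begin <= end && end <= values.size());
    if (end - begin <= grain) {
        return std::accumulate(
            values.begin() + static_cast<std::ptrdiff_t>(begin),
            values.begin() + static_cast<std::ptrdiff_t>(end), 0.0f);
    }
    std::size_t middle = begin + (end - begin) / 2;
    float left = tree_sum(values, begin, middle, grain);
    float right = tree_sum(values, middle, end, grain);
    return left + right;   // fixed join order: left + right
}
```

<p style="text-align: justify;">For the four values 1e8f, 1.0f, -1e8f, 1.0f, a grain size of 4 gives one leaf and a result of 1, whereas a grain size of 2 gives two leaves and a result of 0. Each grain size is repeatable on its own, but neither is more “correct” than the other (the exact sum is 2).</p>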
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/11/deterministic-reduction-a-new-community-preview-feature-in-intel-threading-building-blocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Logic Simulation with the Intel® TBB Flow Graph, Part 3: Putting together a simulation</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/#comments</comments>
		<pubDate>Sat, 05 May 2012 17:00:39 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/</guid>
		<description><![CDATA[In Part 2 of this blog, I described a four-bit adder circuit built from components discussed in Part 1. In this last installment, I’ll continue using Intel®TBB’s flow graph to put together some signal input and output devices, and then use those to make a small simulation featuring the four-bit adder from Part 2. Let’s [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/">Part 2</a> of this blog, I described a four-bit adder circuit built from components discussed in <a href="http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/">Part 1</a>.  In this last installment, I’ll continue using Intel®TBB’s <em>flow graph</em> to put together some signal input and output devices, and then use those to make a small simulation featuring the four-bit adder from Part 2.</p>
<p>Let’s look at two input devices here, the <em>toggle</em> and the <em>pulse</em> (or as I would have liked to have called them, the <em>switch</em> and the <em>clock</em>).  A toggle sends a signal of high or low, switching between the two states every time it is “toggled” or flipped.  A pulse continually alternates between the high and low states at a given interval.  The <code>toggle</code> class is implemented as follows:</p>
<p>
<pre>
<blockquote>class toggle {
    graph&#038; my_graph;
    signal_t state;
    overwrite_node < signal_t > toggle_node;
 public:
    toggle(graph&#038; g) : my_graph(g), state(undefined), toggle_node(g) {}
    toggle(const toggle&#038; src) : my_graph(src.my_graph), state(undefined),
                                toggle_node(src.my_graph) {}
    ~toggle() {}
    // Assignment ignored
    toggle&#038; operator=(const toggle&#038; src) { return *this; }
    sender < signal_t > &#038; get_out() { return toggle_node; }
    void flip() {
        if (state==high) state = low;
        else state = high;
        toggle_node.try_put(state);
    }
    void activate() {
        state = low;
        toggle_node.try_put(state);
    }
};</blockquote>
</pre>
<p>The toggle is represented internally by an <code>overwrite_node</code>, because it simply needs to keep track of one most-recent state. As an input device, it doesn’t receive input from any other components, so it has no explicit input ports, only actions (flip, activate) which can alter the output state.  The output port can of course be acquired via <code>get_out</code>, so that the toggle can be used to send signals into a circuit.</p>
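<p>The state transitions on their own can be tried out without the flow graph. The sketch below is a TBB-free stand-in, not the class above: <code>toggle_model</code> is a made-up name, the <code>signal_t</code> enum is an assumed definition, and where the real class calls <code>try_put</code> this model simply returns the new state.</p>

```cpp
#include <cassert>

// Assumed three-valued signal type for this sketch.
enum signal_t { low = 0, high, undefined };

// Models only the toggle's state changes; no graph or node is involved.
class toggle_model {
    signal_t state;
public:
    toggle_model() : state(undefined) {}
    signal_t activate() { state = low; return state; }
    signal_t flip() {
        state = (state == high) ? low : high;  // low or undefined becomes high
        return state;
    }
    signal_t current() const { return state; }
};

// Walk through a short sequence of actions.
inline void demo_toggle() {
    toggle_model t;
    assert(t.current() == undefined);  // nothing sent yet
    assert(t.activate() == low);       // start from a known state
    assert(t.flip() == high);
    assert(t.flip() == low);
}
```

<p>Note that flipping a toggle that was never activated takes it from undefined straight to high, which follows from the <code>if (state==high)</code> test in <code>flip</code>.</p>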
<p>The <code>pulse</code> class is a little more interesting:</p>
<p>
<pre>
<blockquote>class pulse {
    class clock_body {
        size_t& ms;
        int& reps;
        signal_t val;
    public:
        clock_body(size_t&#038; _ms, int&#038; _reps) : ms(_ms), reps(_reps), val(low) {}
        bool operator()(signal_t&#038; out) {
            rt_sleep(ms);  // our own portable sleep function
            if (reps>0) --reps;
            if (val==low) val = high;
            else val = low;
            out = val;
            return reps>0 || reps == -1;
        }
    };
    graph&#038; my_graph;
    size_t ms, init_ms;
    int reps, init_reps;
    source_node < signal_t > clock_node;

public:
    pulse(graph&#038; g, size_t _ms=1000, int _reps=-1) :
        my_graph(g), ms(_ms), init_ms(_ms), reps(_reps), init_reps(_reps),
        clock_node(g, clock_body(ms, reps), false)
    {}
    pulse(const pulse&#038; src) :
        my_graph(src.my_graph), ms(src.init_ms), init_ms(src.init_ms),
        reps(src.init_reps), init_reps(src.init_reps),
        clock_node(src.my_graph, clock_body(ms, reps), false)
    {}
    ~pulse() {}
    pulse&#038; operator=(const pulse&#038; src) {
        ms = src.ms; init_ms = src.init_ms;
        reps = src.reps; init_reps = src.init_reps;
        return *this;
    }
    sender < signal_t > &#038; get_out() { return clock_node; }
    void activate() { clock_node.activate(); }
    void reset() { reps = init_reps; }
};</blockquote>
</pre>
<p>This class is based on the <code>source_node</code>.  It generates a signal, alternating between low and high, every <code>ms</code> milliseconds.  There is also an option to repeat the alternation a certain number of times and then stop, which is useful for designing simulations that use a clock but also terminate.  The <code>source_node</code> body sleeps for a duration before flipping the signal and sending it.  It doesn’t begin sending signals immediately, but requires activation.  In the case of a non-infinite clock (<code>reps</code> is set), once the pulse object has run for the given number of repetitions, it can be reset and reactivated to use it again.</p>
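<p>To see which signals a finite clock produces, the body’s logic can be run without sleeping and without TBB. The sketch below copies the decision logic of <code>clock_body</code> into a hypothetical <code>clock_model</code> and drives it with a loop that assumes source-node-style semantics: the body is called repeatedly, a value is forwarded only when the body returns true, and the first false return ends the stream. The <code>signal_t</code> enum and <code>run_clock</code> are assumptions made for the illustration.</p>

```cpp
#include <cassert>
#include <vector>

// Assumed three-valued signal type for this sketch.
enum signal_t { low = 0, high, undefined };

// Same decision logic as clock_body::operator(), minus the sleep.
struct clock_model {
    int reps;       // remaining repetitions, or -1 to run forever
    signal_t val;
    explicit clock_model(int r) : reps(r), val(low) {}
    bool operator()(signal_t& out) {
        if (reps > 0) --reps;
        val = (val == low) ? high : low;
        out = val;
        return reps > 0 || reps == -1;
    }
};

// Call the body until it returns false, keeping only the values from
// calls that returned true. max_calls bounds the loop for reps == -1.
inline std::vector<signal_t> run_clock(int reps, int max_calls) {
    assert(max_calls >= 0);
    clock_model body(reps);
    std::vector<signal_t> emitted;
    for (int call = 0; call < max_calls; ++call) {
        signal_t out = undefined;
        if (!body(out)) break;
        emitted.push_back(out);
    }
    return emitted;
}
```

<p>Under that assumption, <code>run_clock(3, 10)</code> forwards high then low, and the third call returns false, so the final flip is not forwarded. With <code>reps</code> of -1 the body never returns false and the sequence alternates until the call limit is reached.</p>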
<p>Next, we discuss two output devices, the <em>LED</em> and the <em>digit</em>.  The LED is simply a tiny light that is on while the signal it is receiving is high, and off when the signal is low. For simple text display, the LED looks like this: (*) when it is on and ( ) when it is off. The digit device receives a four-bit input and displays a single hexadecimal digit.  For simulations, both devices have the option of continuously displaying their state as it changes, or a silent mode, which displays only when a <code>display</code> method is called.</p>
<pre>
<blockquote>
class led {
    class led_body {
        signal_t &state;
        string &label;
        bool report_changes;
        bool touched;
    public:
        led_body(signal_t &#038;s, string &#038;l, bool r) :
            state(s), label(l), report_changes(r), touched(false)
        {}
        continue_msg operator()(signal_t b) {
            if (!touched || b!=state) {
                state = b;
                if (state != undefined &#038;& report_changes) {
                    if (state) printf("%s: (*)\n", label.c_str());
                    else printf("%s: ( )\n", label.c_str());
                }
                touched = true;  // remember that a first signal has been seen
            }
            return continue_msg();
        }
    };
    graph&#038; my_graph;
    string label;
    signal_t state;
    bool report_changes;
    function_node < signal_t, continue_msg > led_node;
 public:
    led(graph&#038; g, string l, bool rc=false) : my_graph(g), label(l), state(undefined),
        report_changes(rc), led_node(g, 1, led_body(state, label, report_changes))
    {}
    led(const led&#038; src) : my_graph(src.my_graph), label(src.label), state(undefined),
        report_changes(src.report_changes),
        led_node(src.my_graph, 1, led_body(state, label, report_changes))
    {}
    ~led() {}
    led&#038; operator=(const led&#038; src) {
        label = src.label; state = undefined; report_changes = src.report_changes;
        return *this;
    }
    receiver < signal_t > &#038; get_in() { return led_node; }
    void display() {
        if (state == high) printf("%s: (*)\n", label.c_str());
        else if (state == low) printf("%s: ( )\n", label.c_str());
        else printf("%s: (u)\n", label.c_str());
    }
};</blockquote>
</pre>
<p>The <code>led</code> class contains a simple <code>function_node</code> that has no meaningful output (we use a <code>continue_msg</code> to indicate this) and thus no successors.  Another way to implement this would be with an <code>overwrite_node</code>, but we would lose the <code>report_changes</code> functionality.  Similarly, the <code>digit</code> class also cannot have successors, but we reused the <code>gate</code> base class to implement it, since it has multiple bits of input and needs to update its state whenever one of the inputs changes.</p>
<pre>
<blockquote>
class digit : public gate < four_input > {
    using gate < four_input > ::my_graph;
    typedef gate < four_input > ::ports_type ports_type;
    typedef gate < four_input > ::input_port_t input_port_t;
    class digit_body {
        signal_t ports[4];
        unsigned int &state;
        string &label;
        bool&#038; report_changes;
    public:
        digit_body(unsigned int &#038;s, string &#038;l, bool&#038; r) : state(s), label(l), report_changes(r) {
            for (int i=0; i < 4; ++i) ports[i] = undefined;
        }
        void operator()(const input_port_t::output_type&#038; v, ports_type&#038; p) {
            unsigned int new_state = 0;
            if (v.indx == 0) ports[0] = std::get < 0 > (v.result);
            else if (v.indx == 1) ports[1] = std::get < 1 > (v.result);
            else if (v.indx == 2) ports[2] = std::get < 2 > (v.result);
            else if (v.indx == 3) ports[3] = std::get < 3 > (v.result);
            if (ports[0] == high) ++new_state;
            if (ports[1] == high) new_state += 2;
            if (ports[2] == high) new_state += 4;
            if (ports[3] == high) new_state += 8;
            if (state != new_state) {
                state = new_state;
                if (report_changes) {
                    printf("%s: %x\n", label.c_str(), state);
                }
            }
        }
    };
    string label;
    unsigned int state;
    bool report_changes;
 public:
    digit(graph&#038; g, string l, bool rc=false) :
        gate < four_input > (g, digit_body(state, label, report_changes)),
        label(l), state(0), report_changes(rc) {}
    digit(const digit&#038; src) :
        gate < four_input > (src.my_graph, digit_body(state, label, report_changes)),
        label(src.label), state(0), report_changes(src.report_changes) {}
    ~digit() {}
    digit&#038; operator=(const digit&#038; src) {
        label = src.label; state = 0; report_changes = src.report_changes;
        return *this;
    }
    void display() { printf("%s: %x\n", label.c_str(), state); }
};</blockquote>
</pre>
<p>Because <code>digit</code> inherits from <code>gate</code>, it reuses <code>gate</code>’s <code>get_in</code> methods to connect to the ports of a <code>digit</code> object.</p>
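<p>The bit weighting inside <code>digit_body</code> can be tried on its own, outside the flow graph. The sketch below assumes only the <code>signal_t</code> enum from this series; <code>pack_bits</code> is a hypothetical helper name used for illustration and is not part of the example code.</p>

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the signal_t enum used throughout this series.
enum signal_t { low = 0, high, undefined };

// Hypothetical helper (not part of the example code): packs four port
// states into one value, treating port i as bit i. Any port that is not
// high, including an undefined one, contributes 0.
inline unsigned int pack_bits(const signal_t (&ports)[4]) {
    unsigned int value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        if (ports[i] == high) value |= (1u << i);
    }
    return value;
}
```

<p>With ports 0, 2 and 3 high, <code>pack_bits</code> returns 13, which <code>printf("%x")</code> shows as <code>d</code>.</p>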
<p>Here’s some example code to test out the four-bit adder. First, create a graph:</p>
<p>
<pre>
<blockquote>graph g;</blockquote>
</pre>
<p>Then, create the four-bit adder, some toggles with which to set the inputs to the adder, and a digit and an LED to display the output:</p>
<p>
<pre>
<blockquote>four_bit_adder four_adder(g);
std::vector < toggle > A(4, toggle(g));
std::vector < toggle > B(4, toggle(g));
toggle CarryIN(g);
digit Sum(g, "SUM");
led CarryOUT(g, "CarryOUT");</blockquote>
</pre>
<p>Next, connect our toggles to the input ports of the adder, and connect the adder’s output ports to the display devices:</p>
<p>
<pre>
<blockquote>for (int i=0; i<4; ++i) {
    make_edge(A[i].get_out(), four_adder.get_A(i));
    make_edge(B[i].get_out(), four_adder.get_B(i));
    make_edge(four_adder.get_out(i), Sum.get_in(i));
}
make_edge(CarryIN.get_out(), four_adder.get_CI());
make_edge(four_adder.get_CO(), CarryOUT.get_in());</blockquote>
</pre>
<p>We’re almost ready to go. First, activate all the toggles in the low state so that everything starts at zero:</p>
<p>
<pre>
<blockquote>for (int i=0; i<4; ++i) {
    A[i].activate();
    B[i].activate();
}
CarryIN.activate();</blockquote>
</pre>
<p>Now I can start flipping toggles.  I’ve set digit and led to display only when requested by default, because I don’t want to see all the changes before this circuit reaches a steady state.  Let’s try 8+5:</p>
<p>
<pre>
<blockquote>A[3].flip();
B[0].flip();
B[2].flip();</blockquote>
</pre>
<p>Wait for the circuit to reach a steady state:</p>
<p>
<pre>
<blockquote>g.wait_for_all();</blockquote>
</pre>
<p>Now display the results:</p>
<p>
<pre>
<blockquote>Sum.display();
CarryOUT.display();</blockquote>
</pre>
<p>And here they are:</p>
<p>
<blockquote><strong>SUM: d<br />
CarryOUT: ( )</strong></p></blockquote>
<p>And with that, I’ll wrap up this blog by saying that the logic simulation example code is available as an example in Intel® TBB 4.0 Update 4, and that it has several other interesting features, like push button and constant signal input devices, NAND and NOR gates, and a D-latch circuit example.  Please let us know of other interesting use cases for the <code>or_node</code> and any other feedback you’d be willing to give.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Logic Simulation with the Intel® TBB Flow Graph, Part 2: Building bigger components</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/#comments</comments>
		<pubDate>Fri, 04 May 2012 17:00:05 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/</guid>
		<description><![CDATA[In Part 1, I described how to put together a basic logic gate using the Intel® Threading Building Blocks flow graph nodes or_node and multifunction_node. In this blog, I will assume the basic logic gates and_gate, or_gate and xor_gate exist, and use them to construct a four-bit adder. To begin with, I’ll first construct a [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/">Part 1</a>, I described how to put together a basic logic gate using the Intel® Threading Building Blocks flow graph nodes <code>or_node</code> and <code>multifunction_node</code>.  In this blog, I will assume the basic logic gates <code>and_gate</code>, <code>or_gate</code> and <code>xor_gate</code> exist, and use them to construct a four-bit adder.</p>
<p>To begin with, I’ll first construct a one-bit full adder as in Figure 2 below:</p>
<div id="attachment_47264" class="wp-caption aligncenter" style="width: 629px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig2.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig2-e1335816473799.png" alt="One-bit full adder" title="DLSfig2" width="619" height="308" class="size-full wp-image-47264" /></a><p class="wp-caption-text">Figure 2</p></div>
<p>The inputs are A, B, and a Carry-in bit; the outputs are the sum S and a Carry-out bit.  Here is the code for the <code>one_bit_adder</code> class:</p>
<pre>
<blockquote>
class one_bit_adder {
    broadcast_node < signal_t > A_port;
    broadcast_node < signal_t > B_port;
    broadcast_node < signal_t > CI_port;
    xor_gate < two_input > FirstXOR;
    xor_gate < two_input > SecondXOR;
    and_gate < two_input > FirstAND;
    and_gate < two_input > SecondAND;
    or_gate < two_input > FirstOR;
    graph&#038; my_graph;
    void make_connections() {
        make_edge(A_port, FirstXOR.get_in(0));
        make_edge(A_port, FirstAND.get_in(0));
        make_edge(B_port, FirstXOR.get_in(1));
        make_edge(B_port, FirstAND.get_in(1));
        make_edge(CI_port, SecondXOR.get_in(1));
        make_edge(CI_port, SecondAND.get_in(1));
        make_edge(FirstXOR.get_out(), SecondXOR.get_in(0));
        make_edge(FirstXOR.get_out(), SecondAND.get_in(0));
        make_edge(SecondAND.get_out(), FirstOR.get_in(0));
        make_edge(FirstAND.get_out(), FirstOR.get_in(1));
    }
public:
    one_bit_adder(graph&#038; g) :
        my_graph(g), A_port(g), B_port(g), CI_port(g), FirstXOR(g),
        SecondXOR(g), FirstAND(g), SecondAND(g), FirstOR(g)
    {
        make_connections();
    }
    one_bit_adder(const one_bit_adder&#038; src) :
        my_graph(src.my_graph), A_port(src.my_graph), B_port(src.my_graph),
        CI_port(src.my_graph), FirstXOR(src.my_graph), SecondXOR(src.my_graph),
        FirstAND(src.my_graph), SecondAND(src.my_graph), FirstOR(src.my_graph)
    {
        make_connections();
    }
    ~one_bit_adder() {}
    receiver < signal_t > &#038; get_A() { return A_port; }
    receiver < signal_t > &#038; get_B() { return B_port; }
    receiver < signal_t > &#038; get_CI() { return CI_port; }
    sender < signal_t > &#038; get_out() { return SecondXOR.get_out(); }
    sender < signal_t > &#038; get_CO() { return FirstOR.get_out(); }
};</blockquote>
</pre>
<p>This implementation is almost a straightforward translation of the gates and their connections into the flow graph format.  The one complication is the addition of the <code>broadcast_node</code>s for each of the input ports.  The reason for this is simply to enable connection to a single port from outside of the adder.  Since each of the inputs is connected to two gates inside of the <code>one_bit_adder</code> object, there is no single port associated with them automatically. Adding the <code>broadcast_node</code>s enables us to provide the methods <code>get_A</code>, <code>get_B</code> and <code>get_CI</code> that each return a single port capable of receiving data.  So, in looking at the diagram above, you can think of the three <code>broadcast_node</code>s as standing in for the black junction circles that the three inputs are connected to directly.</p>
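<p>To check the wiring against the truth table, the same five gates can be written as plain Boolean expressions. This is only a sketch with no flow graph involved; <code>adder_result</code> and <code>full_add</code> are hypothetical names, and the local variables are named after the gates in the class above.</p>

```cpp
#include <cassert>

// Result of adding three one-bit inputs.
struct adder_result {
    bool sum;
    bool carry_out;
};

// Hypothetical stand-alone function with the same gate wiring as
// one_bit_adder: two XORs, two ANDs and one OR.
inline adder_result full_add(bool a, bool b, bool carry_in) {
    bool first_xor  = (a != b);                  // FirstXOR
    bool first_and  = (a && b);                  // FirstAND
    bool second_xor = (first_xor != carry_in);   // SecondXOR, the sum S
    bool second_and = (first_xor && carry_in);   // SecondAND
    bool first_or   = (second_and || first_and); // FirstOR, the Carry-out
    adder_result r = { second_xor, first_or };
    return r;
}
```

<p>For example, <code>full_add(true, true, false)</code> gives a sum of 0 with a carry of 1, and <code>full_add(true, true, true)</code> gives a sum of 1 with a carry of 1.</p>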
<p>To make the <code>four_bit_adder</code> class, simply chain together a set of four <code>one_bit_adder</code>s and connect the Carry-out port of each adder to the Carry-in port of the next adder, as shown in Figure 3 below:</p>
<p>
<div id="attachment_47265" class="wp-caption aligncenter" style="width: 606px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig3.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig3-e1335816730363.png" alt="Four-bit adder" title="DLSfig3" width="596" height="475" class="size-full wp-image-47265" /></a><p class="wp-caption-text">Figure 3</p></div>
<p>This time, the class is even more straightforward to implement, because no <code>broadcast_node</code>s are needed; every input already has exactly one internal connection.</p>
<p>
<pre>
<blockquote>class four_bit_adder {
    graph&#038; my_graph;
    std::vector < one_bit_adder > four_adders;
    void make_connections() {
        make_edge(four_adders[0].get_CO(), four_adders[1].get_CI());
        make_edge(four_adders[1].get_CO(), four_adders[2].get_CI());
        make_edge(four_adders[2].get_CO(), four_adders[3].get_CI());
    }
 public:
    four_bit_adder(graph&#038; g) : my_graph(g), four_adders(4, one_bit_adder(g)) {
        make_connections();
    }
    four_bit_adder(const four_bit_adder&#038; src) :
        my_graph(src.my_graph), four_adders(4, one_bit_adder(src.my_graph))
    {
        make_connections();
    }
    ~four_bit_adder() {}
    receiver < signal_t > &#038; get_A(size_t bit) {
        return four_adders[bit].get_A();
    }
    receiver < signal_t > &#038; get_B(size_t bit) {
        return four_adders[bit].get_B();
    }
    receiver < signal_t > &#038; get_CI() {
        return four_adders[0].get_CI();
    }
    sender < signal_t > &#038; get_out(size_t bit) {
        return four_adders[bit].get_out();
    }
    sender < signal_t > &#038; get_CO() {
        return four_adders[3].get_CO();
    }
};</blockquote>
</pre>
<p>Here, the constructor makes a vector of exactly four adders, and connects the Carry-out ports to the Carry-in ports as appropriate.  The multi-bit inputs and outputs have port access methods that take a bit as a parameter.  So for example, to get the input port for bit 2 of input B, you would use <code>get_B(2)</code>.</p>
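<p>The carry chain can also be sketched without any graph nodes, which is handy for predicting what the simulated circuit should settle on. The names <code>four_bit_result</code> and <code>ripple_add</code> below are hypothetical; the loop simply applies the one-bit adder equations to bits 0 through 3 in order, feeding each Carry-out into the next Carry-in.</p>

```cpp
#include <cassert>

struct four_bit_result {
    unsigned int sum;   // low four bits of the result
    bool carry_out;     // carry out of bit 3
};

// Hypothetical ripple-carry model of four chained one-bit adders.
inline four_bit_result ripple_add(unsigned int a, unsigned int b, bool carry_in) {
    four_bit_result r = { 0u, carry_in };
    for (unsigned int bit = 0; bit < 4; ++bit) {
        bool x = ((a >> bit) & 1u) != 0;
        bool y = ((b >> bit) & 1u) != 0;
        bool c = r.carry_out;                      // Carry-in for this bit
        bool s = ((x != y) != c);                  // sum bit
        r.carry_out = (x && y) || ((x != y) && c); // Carry-out to the next bit
        if (s) r.sum |= (1u << bit);
    }
    return r;
}
```

<p>Here <code>ripple_add(8, 5, false)</code> yields a sum of 13 (hex <code>d</code>) with no carry, while <code>ripple_add(9, 9, false)</code> yields a sum of 2 with the carry set.</p>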
<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/05/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-3-putting-together-a-simulation/">Part 3</a>, I will present some interesting input and output devices to add to the logic simulation library, and with those, I’ll put together a small simulation that shows the <code>four_bit_adder</code> in action.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Logic Simulation with the Intel® TBB Flow Graph, Part 1: Using the or_node</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/#comments</comments>
		<pubDate>Thu, 03 May 2012 17:00:56 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[flow graph]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[or_node]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/</guid>
		<description><![CDATA[In this multi-part blog, I’m going to show you how to put together a simple logic simulation program using the Intel® Threading Building Blocks flow graph feature. Please note that this example does NOT demonstrate a practical approach to digital logic simulation. The purpose of the example is to demonstrate the use of several types [...]]]></description>
			<content:encoded><![CDATA[<p>In this multi-part blog, I’m going to show you how to put together a simple logic simulation program using the Intel® Threading Building Blocks <em>flow graph</em> feature. Please note that this example does NOT demonstrate a practical approach to digital logic simulation.  The purpose of the example is to demonstrate the use of several types of flow graph nodes and how they can be composed to make more interesting components.  I’ll start by designing basic logic gates that are composed of flow graph nodes.</p>
<p>Consider an AND gate.  In its simplest form, it takes two inputs, and produces a single output.  The first thing that comes to mind to represent this is the flow graph <code>function_node</code>:  it could take a pair as input, and a body that computes the logical AND operation on the items in the pair, and puts out the result as its output.  That might work, but let’s think a little more about how such a gate might receive its two input signals: a <code>function_node</code> takes a single argument, so I’d have to group the two inputs together.  However, both inputs will be coming from different senders, and may not be available at the same time. Should I preface the <code>function_node</code> with a <code>join_node</code>? Possibly, but there’s still a limitation with a <code>join_node</code>: it gathers together the inputs and when it has received the full complement, it then sends them along as a tuple.  But this still isn’t exactly the behavior I want.  What I really want is when either of the inputs becomes available, the <code>function_node</code> should be told about it, because it will need to change its output value when any of its input values change. </p>
<p>Thus, the first decision about gates is this: Gates are responsive: when any input changes, the gate will check if its output needs to change. To simplify this a little, and make our flow graph have to do a little less work, I’ll make this second decision: Gates are lazy; a gate will send data to its output port only when that data differs from the previous value sent to that output port.  This will certainly reduce the number of tasks doing redundant work in the graph. </p>
<p>So, on the input side, something reports changes on any input port, and on the output side, something produces output, or not, depending on if the output value has changed. Neither of these behaviors corresponds exactly to a <code>function_node</code>.  However, the new feature <code>multifunction_node</code> (formerly the <a href="http://software.intel.com/en-us/articles/intel-tbb-community-preview-features/">Community Preview feature</a> (CPF) <code>multioutput_function_node</code>) can certainly meet the output needs: it can optionally produce an output.  For the input, if the title of this blog hasn’t given it away already, my choice is the <code>or_node</code>.  The <code>or_node</code> will pass along any input it receives on any input port at any time, giving exactly the responsiveness I need.  The <code>or_node</code> is currently a CPF in Intel® TBB.</p>
<div id="attachment_47231" class="wp-caption aligncenter" style="width: 280px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig1.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/DLSfig1-e1335811781181.png" alt="gate template" title="DLSfig1" width="270" height="97" class="size-full wp-image-47231" /></a><p class="wp-caption-text">Figure 1</p></div>
<p>Figure 1 illustrates this basic logic gate design.  Note that the <code>or_node</code> takes a variable number of inputs – no need to limit it to two – and the <code>multifunction_node</code> takes a body that produces either no output or one output. In general, the <code>multifunction_node</code> can produce zero or more outputs of varying types, but for the gate implementation, zero or one output will suffice.  Let’s take a look at the actual code for this, the <code>gate</code> template class.</p>
<p>First, I set up a type <code>signal_t</code> to represent the signal data being transferred.  Since I’m allowing the gates to fire only when the output state changes, it helps to have an additional <code>undefined</code> state for initialization.</p>
<p>
<pre>
<blockquote>typedef enum { low=0, high, undefined } signal_t;</blockquote>
</pre>
<p>Next, I define a few potential input configurations to gates.  I could go all out and add <code>eight_input</code> gates, but I couldn’t dredge up a use for them from the dark and rarely visited corner of my brain where I keep the knowledge left over from a digital logic course so many years ago.</p>
<pre>
<blockquote>typedef tuple < signal_t > one_input;
typedef tuple < signal_t, signal_t > two_input;
typedef tuple < signal_t, signal_t, signal_t > three_input;
typedef tuple < signal_t, signal_t, signal_t, signal_t > four_input;
</blockquote>
</pre>
<p>Now I’m ready to set up the gate template.</p>
<pre>
<blockquote>template < typename GateInput >
class gate {
protected:
    typedef or_node < GateInput > input_port_t;
    typedef multifunction_node < typename input_port_t::output_type, tuple < signal_t > > gate_fn_t;
    typedef typename gate_fn_t::output_ports_type ports_type;
public:
    static const int N = std::tuple_size < GateInput > ::value;

    template < typename Body >
    gate(graph&#038; g, Body b) : my_graph(g), in_ports(g), gate_fn(g, 1, b) {
        make_edge(in_ports, gate_fn);
    }
    virtual ~gate() {}
    virtual gate&#038; operator=(const gate&#038; src) { return *this; }
    sender < signal_t > &#038; get_out() { return output_port < 0 > (gate_fn); }
    receiver < signal_t > &#038; get_in(size_t port) {
        return gate_helper < N > ::get_inport(in_ports, (int)port);
    }
protected:
    graph&#038; my_graph;
private:
    input_port_t in_ports;
    gate_fn_t gate_fn;
};</blockquote>
</pre>
<p>The class is templated by the input configuration, <code>GateInput</code>, so for example, I would pass in <code>two_input</code> if I wanted to make a gate with two inputs. Then I define two types. The first is <code>input_port_t</code>, the type of the <code>or_node</code> that receives the input configuration specified by <code>GateInput</code>. The second is <code>gate_fn_t</code>, the <code>multifunction_node</code> that takes the output from the <code>or_node</code>, performs the function of the gate, and outputs a single <code>signal_t</code> (or nothing).  These types are used to declare the actual graph nodes <code>in_ports</code> and <code>gate_fn</code>, in the private section of the class above.</p>
<p>The <code>gate</code> constructor initializes the two graph nodes, making them belong to a graph <code>g</code> that is passed in as a reference parameter.  Additionally, the constructor takes a function object <code>b</code> that performs the actual logical operation on the inputs to the gate, and determines what the new output will be.  So in the case of an AND gate, I would pass in a function object that computes a logical AND operation.  The constructor also completes this small component by connecting the two graph nodes with the <code>make_edge</code> function.</p>
<p>In order to connect this gate to other components, I’ve provided methods to access the input ports and the output port.  <code>get_in</code> takes a port number and returns a reference to an input port capable of receiving data, i.e. a <code>receiver&lt;signal_t&gt;&#038;</code> in the flow graph jargon.   It uses the <code>gate_helper&lt;N&gt;::get_inport</code> function shown below to extract the input port to the <code>or_node</code>.  <code>get_out</code> returns a reference to the output port of the <code>multifunction_node</code> which is capable of sending data, i.e. a <code>sender&lt;signal_t&gt;&#038;</code>.</p>
<p>
<pre>
<blockquote>template < int N >
struct gate_helper {
    template < typename TupleType >
    static inline receiver < signal_t > &#038; get_inport(or_node < TupleType > &#038; in_ports, int port) {
        if (N-1 == port) return input_port < N-1 > (in_ports);
        else return gate_helper < N-1 > ::get_inport(in_ports, port);
    }
};
template < >
struct gate_helper < 1 > {
    template < typename TupleType >
    static inline receiver < signal_t > &#038; get_inport(or_node < TupleType > &#038; in_ports, int port) {
        return input_port < 0 > (in_ports);
    }
};
</blockquote>
</pre>
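<p>The recursion in <code>gate_helper</code> is a general way to turn a runtime index into a compile-time one. Here is a minimal sketch of the same pattern on a plain <code>std::tuple</code> of <code>int</code>s, assuming a C++11 compiler; <code>tuple_picker</code> is a hypothetical name and is not part of the example code.</p>

```cpp
#include <cassert>
#include <tuple>

// Hypothetical helper in the style of gate_helper: steps through N-1, N-2,
// ..., 0 at compile time until the compile-time index matches the runtime one.
template <int N>
struct tuple_picker {
    template <typename Tuple>
    static int pick(const Tuple& t, int index) {
        if (N - 1 == index) return std::get<N - 1>(t);
        return tuple_picker<N - 1>::pick(t, index);
    }
};

// Base case: only element 0 is left, so return it without further checks.
template <>
struct tuple_picker<1> {
    template <typename Tuple>
    static int pick(const Tuple& t, int) {
        return std::get<0>(t);
    }
};
```

<p>As with <code>gate_helper</code>, an index outside the valid range falls through to element 0, so callers are expected to pass a valid port number.</p>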
<p>Now that I have a building block for creating a wide variety of logic gates, I’ll use it for designing an AND gate.  When creating the derived class <code>and_gate</code>, the main purpose is to define the functor that gets passed to the <code>gate_fn</code> object inside the <code>gate</code> base class.  <code>and_body</code> computes a logical AND operation over all the inputs to the gate, including undefined inputs, so the function is not completely trivial.</p>
<p>
<pre>
<blockquote>template < typename GateInput >
class and_gate : public gate < GateInput > {
    using gate < GateInput > ::N;
    using gate < GateInput > ::my_graph;
    typedef typename gate < GateInput > ::ports_type ports_type;
    typedef typename gate < GateInput > ::input_port_t input_port_t;
    class and_body {
        signal_t ports[N];
        signal_t state;
        bool touched;
    public:
        and_body() : state(undefined), touched(false) {
            for (int i=0; i < N; ++i) ports[i] = undefined;
        }
        void operator()(const typename input_port_t::output_type&#038; v, ports_type&#038; p) {
            ports[v.indx] = or_output_helper < N > ::get_or_output(v);
            signal_t new_state=high;
            size_t i=0;
            while (i < N) {
                if (ports[i] == low) {
                    new_state = low; break;
                }
                else if (ports[i] == undefined &#038;& new_state != low)
                    new_state = undefined;
                ++i;
            }
            if (!touched || state != new_state) {
                state = new_state;
                std::get < 0 > (p).try_put(state);
                touched = true;
            }
        }
    };
 public:
    and_gate(graph&#038; g) : gate < GateInput > (g, and_body()) {}
    and_gate(const and_gate < GateInput > &#038; src) : gate < GateInput > (src.my_graph, and_body()) {}
    ~and_gate() {}
};</blockquote>
</pre>
<p>The <code>and_body</code> keeps track of the states of the gate’s input ports and output port.  These are all initially <code>undefined</code>.  The <code>operator()</code> for <code>and_body</code> receives the <code>or_node</code> output in parameter <code>v</code>, which indicates that data was received on one of the input ports.  The input port that received data is specified in <code>v.indx</code>.  Accessing the data from that port is a little more challenging, as the entire input tuple is passed in <code>v.result</code>.  I wrote a helper function <code>or_output_helper&lt;N&gt;::get_or_output</code> to select the <code>v.indx</code>-th port of the tuple <code>v.result</code>.  This value is used to update the locally stored state of the appropriate port, and then the new output state is calculated.  The new state is checked to see if it differs from the old state, and if so, the new state is sent out on the appropriate output port of the <code>multifunction_node</code> (which in this case, since there is only one output, is always port zero).  Note also that the very first time a gate receives data, i.e. when <code>touched</code> is false, the new state is sent out even if it is not different from the initial state.  This is useful when the gate is a part of a larger circuit.  It allows any initial settings on input ports to propagate through the graph and register at any possible output devices that might exist.</p>
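<p>The two behaviors described here, three-valued AND and "send on first touch or on change", can be exercised without a flow graph. The sketch below is a simplified stand-in rather than the gate itself: <code>and_of</code> and <code>lazy_and</code> are hypothetical names, and <code>update</code> returns <code>true</code> where the real body would call <code>try_put</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum signal_t { low = 0, high, undefined };

// Three-valued AND: any low input forces low; otherwise any undefined
// input makes the result undefined; otherwise the result is high.
inline signal_t and_of(const std::vector<signal_t>& ports) {
    signal_t result = high;
    for (std::size_t i = 0; i < ports.size(); ++i) {
        if (ports[i] == low) return low;
        if (ports[i] == undefined) result = undefined;
    }
    return result;
}

// Hypothetical stand-in for the body's change detection. update() returns
// true when the caller should emit: on the first update, or whenever the
// computed output differs from the stored state.
struct lazy_and {
    std::vector<signal_t> ports;
    signal_t state;
    bool touched;
    explicit lazy_and(std::size_t n)
        : ports(n, undefined), state(undefined), touched(false) {}
    bool update(std::size_t port, signal_t value) {
        ports[port] = value;
        signal_t new_state = and_of(ports);
        if (!touched || state != new_state) {
            state = new_state;
            touched = true;
            return true;
        }
        return false;
    }
};
```

<p>On a two-input <code>lazy_and</code>, the first <code>update(0, high)</code> emits even though the output is still <code>undefined</code>; repeating it emits nothing; a later <code>update(1, low)</code> emits because the output drops to <code>low</code>.</p>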
<p>The helper function that extracts the <code>or_node</code> output is as follows:</p>
<p>
<pre>
<blockquote>template < int N >
struct or_output_helper {
    template < typename OrOutputType >
    static inline signal_t get_or_output(const OrOutputType&#038; out) {
        if (N-1 == out.indx) return std::get < N-1 > (out.result);
        else return or_output_helper < N-1 > ::get_or_output(out);
    }
};
template < >
struct or_output_helper < 1 > {
    template < typename OrOutputType >
    static inline signal_t get_or_output(const OrOutputType&#038; out) {
        return std::get < 0 > (out.result);
    }
};</blockquote>
</pre>
<p>Given an AND gate, it’s easy to see how to make OR gates and any other sort of basic logic gate from the base class <code>gate</code>.</p>
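<p>As an illustration of how little changes from gate to gate, here is a sketch of the state calculation an OR gate body might use. The <code>or_gate</code> body is not shown in this post, so this is an assumption: it is simply the dual of the AND calculation, and <code>or_of</code> is a hypothetical name.</p>

```cpp
#include <cassert>
#include <cstddef>

enum signal_t { low = 0, high, undefined };

// Three-valued OR over n inputs: any high input forces high; otherwise any
// undefined input makes the result undefined; otherwise the result is low.
inline signal_t or_of(const signal_t* ports, std::size_t n) {
    signal_t result = low;
    for (std::size_t i = 0; i < n; ++i) {
        if (ports[i] == high) return high;
        if (ports[i] == undefined) result = undefined;
    }
    return result;
}
```

<p>Under this definition, one <code>high</code> input is enough to settle the output even while the other inputs are still <code>undefined</code>, just as one <code>low</code> input settles an AND gate.</p>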
<p>As the <code>or_node</code> is currently a Community Preview feature, it’s a good time to have a look at it and give us your feedback.</p>
<p>In <a href="http://software.intel.com/en-us/blogs/2012/05/04/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-2-building-bigger-components/">Part 2</a> of this blog, I’ll show you how to put together a variety of basic logic gates to make a four-bit adder.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/03/digital-logic-simulation-with-the-intel-tbb-flow-graph-part-1-using-the-or_node/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Aggregator: a new Community Preview Feature in Intel® Threading Building Blocks</title>
		<link>http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/#comments</comments>
		<pubDate>Wed, 02 May 2012 17:00:48 +0000</pubDate>
		<dc:creator>Terry Wilmarth (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Community preview feature]]></category>
		<category><![CDATA[Concurrency control]]></category>
		<category><![CDATA[Intel TBB]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/</guid>
		<description><![CDATA[Intel® Threading Building Blocks (Intel® TBB) 4.0 Update 4 introduces a new Community Preview feature, the aggregator. An internal version of the aggregator has been in use in Intel® TBB for some time, appearing in the flow graph and concurrent priority queue implementations. An aggregator is like a mutex in that it enforces mutually exclusive [...]]]></description>
			<content:encoded><![CDATA[<p>Intel® Threading Building Blocks (Intel® TBB) 4.0 Update 4 introduces a new <a href="http://software.intel.com/en-us/articles/intel-tbb-community-preview-features/?wapkw=community+preview+feature">Community Preview feature</a>, the <em>aggregator</em>.  An internal version of the aggregator has been in use in Intel® TBB for some time, appearing in the flow graph and concurrent priority queue implementations. An aggregator is like a mutex in that it enforces mutually exclusive access to a critical section of program code.  However, it can perform better than a mutex in many cases. It differs significantly from a mutex in how it works, and that can have deeper implications on how it performs and how it can be used.  It does its magic by aggregating the critical sections from multiple threads into a single critical section executed by a single thread, which can have a significant impact on cache performance. </p>
<p>There are two modes of use for this feature: basic mode and expert mode.  Basic mode is straightforward and not much more complex than using a mutex. Expert mode requires some understanding of how the aggregator works, and additional coding, but can enable additional performance improvements. In this blog, I will first illustrate how to use the aggregator in the basic mode.  Then I’ll give a brief overview of how the aggregator works, followed by an example of how to use the aggregator in the expert mode.  Finally, I’ll examine the performance of the aggregator and suggest approaches to help decide whether or not to use it.</p>
<p><strong>Side-by-side Comparison of Basic Aggregator Usage with Mutex Usage</strong></p>
<p>In this simple example, I’ll compare the usage of a mutex with an aggregator to lock <code>push</code> and <code>pop</code> operations on a serial priority queue object of type <code>std::priority_queue</code>.  This example uses C++11 features, such as lambdas, but one could use function objects instead.  Fair warning: I’m interspersing code snippets below, because this blog format doesn’t allow for side-by-side code comparison.  Please don’t try to use both a mutex and an aggregator to protect the same code.</p>
<p>First, declare the priority queue.  I'll use a simple integer priority queue here:</p>
<p>
<pre>
<blockquote>typedef int value_type;
typedef std::less < value_type > compare_type;  // assumed comparator: largest element on top
typedef std::priority_queue < value_type, std::vector < value_type > , compare_type > pq_t;
pq_t my_pq;</blockquote>
</pre>
<p>Declare a mutex to protect <code>my_pq</code>:</p>
<p>
<pre>
<blockquote>spin_mutex my_mutex;</blockquote>
</pre>
<p>Alternatively, declare an aggregator to protect <code>my_pq</code>:</p>
<p>
<pre>
<blockquote>aggregator my_aggregator;</blockquote>
</pre>
<p>Declare an element to push/pop from queue:</p>
<p>
<pre>
<blockquote>value_type elem = 42;</blockquote>
</pre>
<p>Now, push an element on the queue using the mutex:</p>
<p>
<pre>
<blockquote>{
    tbb::spin_mutex::scoped_lock my_lock(my_mutex);
    my_pq.push(elem);
}</blockquote>
</pre>
<p>Or, push the element on the queue using the aggregator and a lambda expression:</p>
<p>
<pre>
<blockquote>my_aggregator.execute( [&#038;my_pq, &#038;elem](){
    my_pq.push(elem);
} );</blockquote>
</pre>
<p>Pop an element off the queue using the mutex:</p>
<p>
<pre>
<blockquote>bool result = false;
{
    tbb::spin_mutex::scoped_lock my_lock(my_mutex);
    if (!my_pq.empty()) {
        result = true;
        elem = my_pq.top();
        my_pq.pop();
    }
}</blockquote>
</pre>
<p>Pop an element off the queue using the aggregator:</p>
<p>
<pre>
<blockquote>bool result = false;
my_aggregator.execute( [&#038;my_pq, &#038;elem, &#038;result](){
    if (!my_pq.empty()) {
        result = true;
        elem = my_pq.top();
        my_pq.pop();
    }
} );</blockquote>
</pre>
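<p>For readers who want to try the mutex half of this comparison without TBB installed, the same locking pattern can be sketched with the C++11 standard library. This is an assumption-laden substitute, not the code used in this post: <code>locked_pq</code> is a hypothetical wrapper, and <code>std::mutex</code> with <code>std::lock_guard</code> stands in for <code>tbb::spin_mutex</code> and its <code>scoped_lock</code>.</p>

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Hypothetical wrapper (names are illustrative): a serial priority queue
// guarded by a standard mutex instead of tbb::spin_mutex.
class locked_pq {
    std::priority_queue<int> pq_;
    std::mutex mutex_;
public:
    void push(int elem) {
        std::lock_guard<std::mutex> lock(mutex_);  // released at end of scope
        pq_.push(elem);
    }
    // Returns false and leaves elem untouched when the queue is empty.
    bool try_pop(int& elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pq_.empty()) return false;
        elem = pq_.top();
        pq_.pop();
        return true;
    }
};
```

<p>Folding the emptiness check and the pop into one locked call matters for the same reason it does in the snippets above: checking <code>empty()</code> and calling <code>pop()</code> under separate locks would let another thread drain the queue in between.</p>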
<p><strong>How the Aggregator Works</strong></p>
<p>As we see above, the usage of the aggregator in basic mode is trivially different from using a mutex.  However, it is clearly working in a different way.  In order to execute a critical section, you pass it to an aggregator via the <code>execute</code> method.  When the <code>execute</code> method returns, the critical section has been executed, but how this happened is hidden inside the black box of the aggregator.  </p>
<p>Looking at the header file <code>aggregator.h</code> that defines the <code>aggregator</code>, these details become clear.  To use the aggregator in expert mode, you should have some familiarity with the header file, and I'll guide you through the most important features in the rest of this blog.</p>
<p>First note that <code>aggregator</code> inherits from a class <code>aggregator_ext</code> that takes a template parameter.  The <code>aggregator</code> class instantiates that template parameter with a simple handler defined in the header, <code>handler_type = internal::basic_handler</code>.  We will discuss this more later.</p>
<p>The <code>execute</code> method of <code>aggregator</code> takes a function body as parameter, and encapsulates <code>body</code> in a <code>basic_operation</code> object, which inherits from <code>aggregator_operation</code>.  These <code>aggregator_operation</code> objects are sent to the <code>aggregator_ext</code>’s <code>mailbox</code> where they may concurrently accumulate while they await execution.  One thread, the <em>active handler</em>, i.e. the first thread to place an <code>aggregator_operation</code> in the empty <code>mailbox</code>, will grab all the operations that have accumulated there, effectively emptying the <code>mailbox</code>.  It will then go through all the operations that it grabbed, and serially execute the function bodies stored in those objects.  The mechanism used to execute function bodies is specified by <code>aggregator_ext</code>’s template parameter, which in the default case is called <code>basic_handler</code>.</p>
<p>This <code>basic_handler</code> is straightforward in its functioning: it is passed the list of <code>aggregator_operation</code>s, and it loops through this list and handles each item.  It makes use of a few methods on <code>aggregator_operation</code> to do this properly: <code>next</code> is used to traverse to the next operation in the list, <code>start</code> prepares the operation to be handled, and <code>finish</code> is called after the operation is handled to inform the thread waiting on the execution of the operation that the operation is completed.  When all operations are handled, the active handler thread can leave the <code>aggregator</code>, since its own call to <code>execute</code> has been satisfied in the process.</p>
<p>The details of the synchronization that make this all possible can be found in <code>aggregator.h</code>.  We won’t explain them fully here, because we already have enough information to proceed to use the aggregator in expert mode.  It is enough to know that threads hand over critical sections to the aggregator, and one of these threads will execute all the operations serially on behalf of the other threads as a single critical section.</p>
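<p>To make the mechanism concrete, here is a minimal, self-contained sketch of the same idea in standard C++11.  It is not the code in <code>aggregator.h</code>: the class <code>mini_aggregator</code>, its <code>op_node</code> type and the <code>demo_concurrent_pushes</code> helper are hypothetical names invented for this illustration, and the sketch uses default (sequentially consistent) atomics and a simple busy flag without trying to match the real header's synchronization details.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical teaching class; NOT the TBB implementation.
class mini_aggregator {
    struct op_node {
        std::function<void()> body;   // the critical section to run
        op_node* next;                // link in the mailbox list
        std::atomic<bool> done;       // set by the handler once body has run
        op_node() : next(nullptr), done(false) {}
    };
    std::atomic<op_node*> mailbox;    // pending operations, kept as a lock-free stack
    std::atomic<bool> handler_busy;   // true while some thread is executing a batch
public:
    mini_aggregator() : mailbox(nullptr), handler_busy(false) {}

    template <typename Body>
    void execute(const Body& b) {
        op_node op;                   // lives on the calling thread's stack
        op.body = b;
        // Push our node onto the mailbox.
        op_node* head = mailbox.load();
        do {
            op.next = head;
        } while (!mailbox.compare_exchange_weak(head, &op));

        if (head != nullptr) {
            // Another thread will handle this operation; wait until it has.
            while (!op.done.load()) std::this_thread::yield();
            return;
        }
        // The mailbox was empty, so this thread becomes the active handler
        // as soon as any previous handler has finished its batch.
        while (handler_busy.exchange(true)) std::this_thread::yield();
        op_node* list = mailbox.exchange(nullptr);   // grab everything, emptying the mailbox
        while (list) {
            op_node* next = list->next;  // read first: the node may vanish once done is set
            list->body();
            list->done.store(true);
            list = next;
        }
        handler_busy.store(false);
    }
};

// Pushes threads*per_thread distinct values from several threads into a
// serial std::priority_queue and returns the final queue size.
std::size_t demo_concurrent_pushes(int threads, int per_thread) {
    std::priority_queue<int> pq;      // protected only by the aggregator
    mini_aggregator agg;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&pq, &agg, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                int v = t * per_thread + i;
                agg.execute([&pq, v] { pq.push(v); });
            }
        });
    for (std::thread& th : pool) th.join();
    return pq.size();
}
```

<p>Calling <code>demo_concurrent_pushes(4, 1000)</code> should leave 4000 elements in the queue.  Note how the handler reads <code>next</code> before it marks an operation as done: each node lives on the stack of the thread that submitted it, so it may disappear the moment that thread is released.</p>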
<p><strong>Using the Aggregator in Expert Mode</strong></p>
<p>I’ll use the same example as before, allowing threads to safely push and pop to a serial <code>std::priority_queue</code>.  The expert mode of aggregator allows the user to pass any sort of data in to the aggregator as an <code>aggregator_operation</code> via the <code>process</code> method (note the different method name – we were using <code>execute</code> in basic mode), along with an aggregating function object that is called by the active handler to perform the serial execution of operations.  In this case, I’ll pass data about a push or pop operation to the aggregator via <code>process</code>, and provide a custom function object to perform the operations.</p>
<p>First, create a class derived from <code>aggregator_operation</code> to hold the operation data.</p>
<p>
<pre>
<blockquote>class op_data : public aggregator_operation {
public:
    value_type* elem;
    bool success;
    bool is_push;
    op_data(value_type* e, bool push=false): elem(e), success(false), is_push(push) {}
};</blockquote>
</pre>
<p>Then, create a handler to pass in as the aggregator’s template parameter:</p>
<p>
<pre>
<blockquote>class my_handler_t {
    pq_t *pq;
public:
    my_handler_t() {}
    my_handler_t(pq_t *pq_) : pq(pq_) {}
    void operator()(aggregator_operation* op_list) {
        op_data* tmp;
        while (op_list) {
            tmp = (op_data*)op_list;
            op_list = op_list->next();
            tmp->start();
            // handle tmp here
            if (tmp->is_push) pq->push(*(tmp->elem));
            else {
                if (!pq->empty()) {
                    tmp->success = true;
                    *(tmp->elem) = pq->top();
                    pq->pop();
                }
            }
            // done handling tmp
            tmp->finish();
        }
    }
};</blockquote>
</pre>
<p>Now, to create an aggregator, use the <code>aggregator_ext</code> type name and pass this handler’s type in as the template parameter, and initialize the handler and pass it in as an argument to the constructor:</p>
<p>
<pre>
<blockquote>my_handler_t my_handler(&#038;my_pq);
aggregator_ext < my_handler_t > my_aggregator(my_handler);</blockquote>
</pre>
<p>To perform a push, simply create the <code>op_data</code> node with the push information and pass it to <code>process</code>:</p>
<p>
<pre>
<blockquote>op_data my_push_op(&#038;elem, true);
my_aggregator.process(&#038;my_push_op);</blockquote>
</pre>
<p>And to perform a pop:</p>
<p>
<pre>
<blockquote>bool result;
op_data my_pop_op(&#038;elem);
my_aggregator.process(&#038;my_pop_op);
result = my_pop_op.success;</blockquote>
</pre>
<p><strong>When to use Aggregator and why use Expert Mode?</strong></p>
<p>A good way to start is to compare the performance of your code using your current locking mechanism to a version of your code that uses an aggregator instead.  In practice, we (developers of TBB) have often found that a mutex is sufficient and outperforms aggregator when contention on the critical region is low. For higher contention, we often find that the use of the aggregator is justified.</p>
<p>The aggregator provides most of its performance improvements in hot cache execution of operations on a single thread. (Recall the <em>active handler</em>?)  Thus, the more concurrent contention on your critical region, the larger the aggregations will be that are assembled, and the greater the benefits of executing operations with a hot cache on a single thread.</p>
<p>If you do find that the basic aggregator improves your code’s performance, consider moving to the expert level.  To begin with, you can simply transform your code as I’ve shown in the expert example above.  This should result in better performance over the basic interface.  The reason for this is that, in the basic interface, the function object or lambda expression you wish to execute and all the references to data that you want that code to access are stored on the stack of the thread that originated the operation. Referring back to the basic example, this means that for each operation, we look up a different reference to the same priority queue.  But, in the expert example above, note that we store just a few data references in the <code>aggregator_operation</code>, and the code to execute the operation and references to the shared data (<code>my_pq</code>) are local to the aggregating functor and only need to be looked up once to handle all the operations in an aggregation.  This enhances the hot cache effect by reducing the quantity of non-local stack accesses.</p>
<p>The expert-level usage of aggregator shown above is quite straightforward.  However, you are free to handle operations in the aggregating handler in whatever manner you like.  Consider the aggregation of operations an opportunity to develop new and interesting serial algorithms.  This gives you a unique opportunity to make use of a kind of <em>lookahead</em> capability: you know the set of operations that you need to perform. For example, Intel® TBB’s <code>concurrent_priority_queue</code> handles the operations in two passes, performing some of them and postponing others, because some orderings of operations are more efficient than others.  The only rules for processing operations in the aggregating handler are that they should all be handled, and, in some cases, there should be some serial sequence of the operations that achieves the same result (i.e. sequential consistency).</p>
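<p>As a small illustration of lookahead, here is a hypothetical two-pass handler body that performs every push in a batch before any pop, so a pop never fails merely because a push in the same batch had not been handled yet.  This is a sketch of the general idea only, not the algorithm used by <code>concurrent_priority_queue</code>; <code>batch_op</code> and <code>handle_batch</code> are invented names, and the batch is passed as a plain <code>std::vector</code> so that the example compiles without TBB.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

typedef int value_type;
typedef std::priority_queue<value_type> pq_t;

// Plain stand-in for op_data (no aggregator_operation base class here).
struct batch_op {
    value_type* elem;
    bool success;
    bool is_push;
    batch_op(value_type* e, bool push) : elem(e), success(false), is_push(push) {}
};

// Hypothetical two-pass handler: all pushes first, then all pops.
// The result matches a serial order in which every push precedes every pop.
void handle_batch(pq_t& pq, std::vector<batch_op*>& batch) {
    for (std::size_t i = 0; i < batch.size(); ++i) {   // pass 1: pushes
        batch_op* op = batch[i];
        if (op->is_push) pq.push(*op->elem);
    }
    for (std::size_t i = 0; i < batch.size(); ++i) {   // pass 2: pops
        batch_op* op = batch[i];
        if (!op->is_push && !pq.empty()) {
            *op->elem = pq.top();
            pq.pop();
            op->success = true;
        }
    }
}
```

<p>Because the operations in one aggregation were submitted concurrently, no caller can observe which of them came first, so reordering them this way still corresponds to a valid serial sequence.</p>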
<p>I’d like to hear about your experiences using aggregator, so if you get a chance, give it a try, and let me know how it went!  You can comment here, or better yet, start a discussion on the <a href="http://software.intel.com/en-us/forums/intel-threading-building-blocks/">Intel® TBB forum</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-community-preview-feature-in-intel-threading-building-blocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TACC symposium and programming two SMP-on-a-chip devices</title>
		<link>http://software.intel.com/en-us/blogs/2012/04/26/tacc-symposium-and-programming-two-smp-on-a-chip-devices/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/04/26/tacc-symposium-and-programming-two-smp-on-a-chip-devices/#comments</comments>
		<pubDate>Fri, 27 Apr 2012 04:28:46 +0000</pubDate>
		<dc:creator>James Reinders (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Cilk Plus]]></category>
		<category><![CDATA[Intel MIC]]></category>
		<category><![CDATA[Knights Corner]]></category>
		<category><![CDATA[Knights Ferry]]></category>
		<category><![CDATA[many-core]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[SCC]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/04/26/tacc-symposium-and-programming-two-smp-on-a-chip-devices/</guid>
		<description><![CDATA[one presenter exclaimed “Time spent optimizing for MIC is time well spent because it optimizes your code for non-MIC processors at the same time.”]]></description>
			<content:encoded><![CDATA[<p>Real results for many-core processors illustrate the power of a familiar configuration (SMP) even when reduced to a single chip. SMP on-a-chip can use the same applications, same tools, offer the same flexibility and pose familiar challenges that are solved by familiar techniques and skills.</p>
<p><a href="http://www.tacc.utexas.edu/ti-hpcs12/program"><img class="size-full wp-image-47144 aligncenter" title="Screen shot 2012-04-26 at 8.05.50 PM" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/Screen-shot-2012-04-26-at-8.05.50-PM.png" alt="" width="500" /></a><a href="http://www.tacc.utexas.edu/ti-hpcs12/program"><img class="aligncenter size-full wp-image-47145" title="tacc" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/tacc.jpg" alt="" width="500" /></a></p>
<p>I recently attended a symposium, co-sponsored by TACC and Intel, at the Texas Advanced Computing Center (TACC) in Austin, where the programming of two many-core devices was discussed. One was a research chip designed to push some limits and allow interesting research on a device that lacks many things a product would require. The research chip is known as Intel’s <a href="http://techresearch.intel.com/ProjectDetails.aspx?Id=1">Single-Chip Cloud Computer (SCC)</a>. The other many-core device was a prototype of our new Intel Many Integrated Core (MIC) Architecture, the Knights Ferry co-processor. The deadline for papers precluded inclusion of results from pre-production Knights Corner co-processors, which will be the first Intel MIC co-processor products. There was a lot of whispering in the hallways about the excitement of starting work with Knights Corner co-processors.</p>
<p>The papers, and the half day tutorial, at the “<a href="http://www.tacc.utexas.edu/ti-hpcs12/program">TACC-Intel Highly Parallel Computing Symposium</a>” all had strong elements relating to familiar parallel programming challenges: scaling and vectorization. This is because both devices are built on Intel Pentium processor cores hooked together with their design for a connection fabric on the same piece of silicon.</p>
<p>Simply put, they are both SMP on-a-chip (symmetric multi-processors) devices, with somewhat different design goals.</p>
<p>At Intel, we have been convinced that putting a familiar generally programmable SMP on-a-chip is a good idea. It has a familiarity in programmability which proves to have many benefits. SCC was built for research into many facets of highly parallel devices. Knights Corner is designed for production usage and is optimized for power and highly parallel workloads. Knights Corner is well suited for HPC applications that already run on SMP systems. Presenter after presenter who talked about using the prototype Knights Ferry mentioned how applications “just worked.”</p>
<p>I like to say, “Programming is hard, and so is parallel programming.” It follows that making an SMP or an SMP on-a-chip get maximum performance may not quite be rocket science, but it is no walk in the park. So, there was plenty of room for the papers to discuss the challenges of tuning for any SMP system.</p>
<p>What was really striking was how optimizations for Knights Ferry co-processors were applicable to SMP systems in general. Several authors commented on how their work to get better scaling or better vectorization for Knights Ferry also improved the performance of the same code compiled to run on an Intel Xeon processor based SMP system.  This performance-reuse is very significant, and one presenter exclaimed “Time spent optimizing for MIC is time <em>well spent</em> because it optimizes your code for non-MIC processors at the same time.”</p>
<p>All the papers and presentations (including my keynote) are available on-line now at <a href="http://www.tacc.utexas.edu/ti-hpcs12/program">http://www.tacc.utexas.edu/ti-hpcs12/program</a></p>
<p>Here are some notes from a few of the talks:</p>
<p>Dr. Robert Harkness gave an engaging talk entitled “Experiences with ENZO on the Intel Many Integrated Core Architecture.” I enjoyed his comment that “we are always programming for the future” because they “never have enough compute power.” He looked at multiple programming models, but had the best results using the “dusty” MPI based program that he had running on an SMP before Knights Ferry. He did his work on MPICH 1.2.7p1 because Intel did not supply an MPI with the Knights Ferry systems. He said it was obsolete but very easy to build and use. He reported that one person (not a dedicated programmer) was able to build everything (a quarter million lines of code) in a single week without any application source code modifications at all. The week, it seems, was spent hunting down libraries and recompiling them, including MPICH. His results scaled very well.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide01.png"><img class="aligncenter size-full wp-image-47130" title="slide0" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide01.png" alt="" width="500" /></a></p>
<p>His conclusions (from slide 30 of his presentation) were: “Intel MIC is the best way forward for large-scale codes which cannot use the existing GPGPU model (even with directives).”</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide1.png"><img class="aligncenter size-full wp-image-47131" title="slide1" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide1.png" alt="" width="500" /></a></p>
<p>A talk by Theron Voran, with the National Center for Atmospheric Research, looked at using Knights Ferry for Climate Science. He started by saying "We have large bodies of code laying around. We don't want to rewrite in new languages for restrictive architectures." He had several good introduction slides including a comparison of accelerators vs. multicore and many-core devices.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/Slide21.png"><img class="aligncenter size-full wp-image-47133" title="Slide2" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/Slide21.png" alt="" width="500" /></a></p>
<p>Here the challenges of vectorization offered opportunities for future work. Compiler hints, loop restructuring and related activities should enhance performance on Xeon-based and MIC-based SMP systems, as should work on improving scalability on more and more cores. Even with these challenges, the authors noted “Relative ease in porting codes” (recompiling) and the belief that computational capabilities of MIC will be worthwhile.</p>
<p>Ryan Hulguin, with the University of Tennessee, looked at CFD solvers on Knights Ferry. He looked at two methods, one based on Euler equations (for inviscid fluid flows) and another based on the BGK model Boltzmann equation (for rarefied gas flows). Performance results showed OpenMP to be effective on Knights Ferry, and that the SMP programming challenges of vectorization and having good concurrency held true on Knights Ferry as well.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide4.png"><img class="aligncenter size-full wp-image-47134" title="slide4" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide4.png" alt="" width="500" /></a></p>
<p>A talk on Dense Linear Algebra Factorization, from David Hudak at the Ohio Supercomputing Center, talked about Heterogeneous Programming Challenges. David is a Wolverine working in a Buckeye world. My heart goes out to him. I really enjoyed his separation of short-term issues that distract us from the real long-term challenges that will stay with us.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide5.png"><img class="aligncenter size-full wp-image-47135" title="slide5" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide5.png" alt="" width="500" /></a></p>
<p>The talk compared a QR factorization implemented in OpenMP with a Cilk Plus implementation. Both performed well. The authors emphasized that the guidance to vectorize and use lots of tasks proved to work.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide31.png"><img class="aligncenter size-full wp-image-47138" title="slide3" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/slide31.png" alt="" width="500" /></a></p>
<p>I’ve written more than I set out to write, so I’ll stop here. The SCC related papers were very interesting as well, ranging from Tim Mattson’s overview of the program to papers showing research results from investigations using SCC. The other MIC related papers are all worthy as well, including an excellent paper on early experiences with MVAPICH2 doing Intra-MIC MPI communication. Amazing things you can do on an SMP on-a-chip… it runs a real Linux after all!</p>
<p>It is very common for demos to start with an ‘ssh’ (shell) to one of the Knights Ferry processors… and then run the application natively from the command line. SMP on-a-chip, indeed.  Too bad I can’t convince Intel to name it that.  Even if I did, it would probably be chipSMP™ model 8650plus XS. Never mind, Knights Corner is fine by me.</p>
<p>The papers and talks can be found at <a href="http://www.tacc.utexas.edu/ti-hpcs12/program">http://www.tacc.utexas.edu/ti-hpcs12/program</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/04/26/tacc-symposium-and-programming-two-smp-on-a-chip-devices/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Intel Announces the New Intel® SDK for OpenCL* Applications 2012</title>
		<link>http://software.intel.com/en-us/blogs/2012/04/25/intel-announces-the-new-intel-sdk-for-opencl-applications-2012/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/04/25/intel-announces-the-new-intel-sdk-for-opencl-applications-2012/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 11:38:33 +0000</pubDate>
		<dc:creator>Arnon Peleg (Intel)</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Server]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA["Intel OpenCL SDK"]]></category>
		<category><![CDATA["Intel OpenCL"]]></category>
		<category><![CDATA[openCL]]></category>
		<category><![CDATA[vcsource_product_oclsdk]]></category>
		<category><![CDATA[vcsource_type_event]]></category>
		<category><![CDATA[vcsource_type_news]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/04/25/intel-announces-the-new-intel-sdk-for-opencl-applications-2012/</guid>
		<description><![CDATA[In support of the recent announcement of the 3rd Generation Intel® Core™ Processors, Intel has released the Intel® SDK for OpenCL* Applications 2012. For the first time, OpenCL* developers using Intel® architecture can utilize compute resources across both Intel® Processors and Intel® HD Graphics 4000/2500.]]></description>
			<content:encoded><![CDATA[<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/OpenCL_Logo_RGB.jpg"><img class="size-thumbnail wp-image-47080 alignnone" title="OpenCL_Logo_RGB" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/OpenCL_Logo_RGB-150x150.jpg" alt="" width="64" height="64" /></a></p>
<p>In support of the recent announcement of the <a href="http://www.intel.com/content/www/us/en/processors/core/core-processor-family.html">3<sup>rd</sup> Generation Intel® Core™ Processors</a>, Intel has released the Intel® SDK for OpenCL* Applications 2012. For the first time, OpenCL* developers using Intel® architecture can utilize compute resources across both Intel® Processors and Intel® HD Graphics 4000/2500.</p>
<p>As someone who has closely followed the emergence of the OpenCL standard for the last couple of years, I found this announcement worth waiting for.  Less than a year ago, on this blog, I posted the news that the <a title="Permanent Link to Intel® OpenCL SDK 1.1 gold released" href="http://software.intel.com/en-us/blogs/2011/06/29/intel-opencl-sdk-11-gold-released/">Intel® OpenCL SDK 1.1 gold was released</a>.  This was the first production OpenCL implementation from Intel targeting Intel® processors on Windows* OS. This current announcement is special: the Intel SDK for OpenCL Applications 2012 now supports not only the CPU but also the Intel HD Graphics 4000/2500 for Windows* 7 users.  We’ve come a long way in a year.</p>
<p style="text-align: center;"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/product_overview.jpg"><img class="aligncenter size-medium wp-image-47079" title="product_overview" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/04/product_overview-300x300.jpg" alt="Introducing the Intel(R) SDK For OpenCL* Applications" width="170" height="170" /></a></p>
<p>OpenCL <a href="http://www.intel.com/content/www/us/en/processors/core/core-processor-family.html">on the 3<sup>rd</sup> Generation Intel® Core Processor Family</a> extends Intel’s line of tools and APIs on Intel platforms and adds interoperability with other graphics APIs like DirectX*, OpenGL* and Intel® Media SDK, directly on the Intel HD Graphics device.</p>
<p>So what else is new in this release?</p>
<ul>
<li>A Single OpenCL* platform enables shared context for OpenCL applications running on both the CPU and Intel HD Graphics 4000/2500. The OpenCL platform with both CPU and HD Graphics devices is available seamlessly on the <a href="http://www.intel.com/p/en_US/support/detect/graphics">Intel® HD Graphics Drivers</a>.</li>
<li>Interoperability with the <a href="http://www.intel.com/software/mediasdk">Intel Media SDK</a> with no memory copy overhead</li>
<li>Improved performance for OpenCL applications running on Intel® Xeon® Processors and Intel® Core™ Processors. This CPU support is also available for Linux* OS developers.</li>
<li>Intel® SDK for OpenCL* applications development tools includes an offline compiler and a step-by-step OpenCL Kernel debugger (for CPU) integrated in Microsoft Visual Studio* 2010 integrated development environment.</li>
<li>10 OpenCL code samples, three of them new, are now available for independent download.</li>
</ul>
<p>The list above is just a sample of what is available with this new SDK. I recommend you read <a href="http://software.intel.com/file/43384">the product brief</a> or watch the <a href="http://software.intel.com/en-us/videos/channel/visual-computing/new-intel%C2%AE-sdk-for-opencl-applications-2012/1571382381001">introduction video</a> to get started with this new SDK.</p>
<p><strong>Download the SDK for free at <a href="http://www.intel.com/software/opencl">www.intel.com/software/opencl</a> and begin optimizing your applications for the 3<sup>rd</sup> Generation Intel® Core™ Processors today.</strong></p>
<p>Don’t forget to follow us on Twitter at <a href="https://twitter.com/#!/IntelOpenCL">@IntelOpenCL</a></p>
<p>&nbsp;</p>
<p style="text-align: center;"><object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,47,0" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" height="300" width="345" id="flashObj"><param value="http://c.brightcove.com/services/viewer/federated_f9?isVid=1" name="movie" /><param value="#FFFFFF" name="bgcolor" /><param value="videoId=1571382381001&amp;playerID=741496470001&amp;playerKey=AQ~~,AAAArH1stHk~,LuRqJUw7MaeYQkat5frTpWWPINh71g7p&amp;domain=embed&amp;dynamicStreaming=true" name="flashVars" /><param value="http://admin.brightcove.com" name="base" /><param value="false" name="seamlesstabbing" /><param value="true" name="allowFullScreen" /><param value="true" name="swLiveConnect" /><param value="always" name="allowScriptAccess" /><embed pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" allowscriptaccess="always" swliveconnect="true" allowfullscreen="true" type="application/x-shockwave-flash" seamlesstabbing="false" height="300" width="345" name="flashObj" base="http://admin.brightcove.com" flashvars="videoId=1571382381001&amp;playerID=741496470001&amp;playerKey=AQ~~,AAAArH1stHk~,LuRqJUw7MaeYQkat5frTpWWPINh71g7p&amp;domain=embed&amp;dynamicStreaming=true" bgcolor="#FFFFFF" src="http://c.brightcove.com/services/viewer/federated_f9?isVid=1"></embed></object></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/04/25/intel-announces-the-new-intel-sdk-for-opencl-applications-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Video: Host-Based Provisioning Intel vPro clients using the ACU Wizard</title>
		<link>http://software.intel.com/en-us/blogs/2012/03/29/video-host-based-provisioning-intel-vpro-clients-using-the-acu-wizard/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/03/29/video-host-based-provisioning-intel-vpro-clients-using-the-acu-wizard/#comments</comments>
		<pubDate>Thu, 29 Mar 2012 22:12:40 +0000</pubDate>
		<dc:creator>Gael Holmes Hofemeier (Intel)</dc:creator>
				<category><![CDATA[Manageability & Security]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[ACU Wizard]]></category>
		<category><![CDATA[AMT]]></category>
		<category><![CDATA[SCS]]></category>
		<category><![CDATA[vpro]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/03/29/video-host-based-provisioning-intel-vpro-clients-using-the-acu-wizard/</guid>
		<description><![CDATA[There are two utilities that are packaged with the Intel SCS (Setup and Configuration Software) that are very simple to use and can help you enable Intel AMT quickly. Starting with Intel AMT 6.2, we have what is called "Host-based Provisioning." This means you can now configure and enable Intel AMT directly on the system [...]]]></description>
			<content:encoded><![CDATA[<p>There are two utilities packaged with the Intel SCS (Setup and Configuration Software) that are very simple to use and can help you enable Intel AMT quickly.  Starting with Intel AMT 6.2, we have what is called "Host-based Provisioning."  This means you can now configure and enable Intel AMT directly on the system - there is no need for tools that must be run from remote systems (no provisioning certificate is needed; no complicated server environment is needed either).  This is not to say that the process was necessarily complicated prior to AMT 6.2 (it can be as complicated or as simple as your environmental needs dictate) but, in my opinion, Host-based Provisioning (HBP) makes things a whole lot more straightforward.</p>
<p>The attached video demonstrates how this tool can be used (and where to go get it).  The user may decide that their environment calls for more configuration than what they get from this tool.  Here is a tip that the average developer may not know:</p>
<p>Along with the ACU Wizard, there is also a Configurator Command Line Interface (that's right - you can push scripts down to your client and have them run locally).  If you require Admin Control Mode rather than Client Control mode, there is a command for that too.  If you require <a href="How to create AMT Certificates using the AMT SDK and OpenSSL">TLS Communications</a>, you can apply that once your system has AMT enabled as well (if there is an API in the <a href="http://software.intel.com/en-us/articles/download-the-latest-intel-amt-software-development-kit-sdk/">SDK</a>, there is nothing you can't change about your AMT Configuration once Intel AMT has been enabled).</p>
<p>You can download the most current version of the Intel SCS <a href="http://software.intel.com/en-us/articles/download-the-latest-version-of-intel-amt-setup-and-configuration-service-scs/">HERE </a>(note that soon it will be refreshed with Intel SCS 8.0)</p>
<p>The following video demonstrates how to quickly configure an Intel AMT Client using the ACU Wizard.</p>
<p><iframe width="420" height="315" src="http://www.youtube.com/embed/2bxnbEfNYc8" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/03/29/video-host-based-provisioning-intel-vpro-clients-using-the-acu-wizard/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Video: Discovering your Intel® vPro Capabilities using the Intel® SCS Discovery Tool</title>
		<link>http://software.intel.com/en-us/blogs/2012/03/28/video-discovering-your-intel-vpro-capabilities-using-the-intel-scs-discovery-tool/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/03/28/video-discovering-your-intel-vpro-capabilities-using-the-intel-scs-discovery-tool/#comments</comments>
		<pubDate>Wed, 28 Mar 2012 22:24:45 +0000</pubDate>
		<dc:creator>Gael Holmes Hofemeier (Intel)</dc:creator>
				<category><![CDATA[Manageability & Security]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Intel AMT]]></category>
		<category><![CDATA[Intel SCS]]></category>
		<category><![CDATA[SCS Discovery]]></category>
		<category><![CDATA[vpro]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/03/28/video-discovering-your-intel-vpro-capabilities-using-the-intel-scs-discovery-tool/</guid>
		<description><![CDATA[I have already blogged quite a bit on this tool, but since it is pretty popular I thought I'd turn it into a video as well.  Running the  SCS Discovery tool is often my first suggestion to our developers who are having questions about why something about Intel AMT is not working for them.  In [...]]]></description>
			<content:encoded><![CDATA[<p>I have already <a href="http://software.intel.com/en-us/blogs/2012/01/24/how-to-run-the-scs-discovery-tool/">blogged</a> quite a bit on this tool, but since it is pretty popular I thought I'd turn it into a video as well.  Running the <a href="http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&amp;DwnldID=21144&amp;keyword=%22scs%22&amp;lang=eng">SCS Discovery tool</a> is often my first suggestion to developers who have questions about why something in Intel AMT is not working for them.  In order to help them, I need to see what we are working with, and the SCS Discovery tool lets us do just that. Watch this 13-minute video and see what it's all about.</p>
<p><iframe width="420" height="315" src="http://www.youtube.com/embed/xw0uTAH0Lj0" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/03/28/video-discovering-your-intel-vpro-capabilities-using-the-intel-scs-discovery-tool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dualbooting Windows 7 and Windows 8</title>
		<link>http://software.intel.com/en-us/blogs/2012/03/20/dualbooting-windows-7-and-windows-8/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/03/20/dualbooting-windows-7-and-windows-8/#comments</comments>
		<pubDate>Tue, 20 Mar 2012 22:08:52 +0000</pubDate>
		<dc:creator>Rami Radi (Intel)</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Intel SW Partner Program]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Power Efficiency]]></category>
		<category><![CDATA[Site News & Announcements]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[dual boot]]></category>
		<category><![CDATA[dual booting]]></category>
		<category><![CDATA[dualboot]]></category>
		<category><![CDATA[Windows 8]]></category>
		<category><![CDATA[Windows 8 Consumer Preview]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/03/20/dualbooting-windows-7-and-windows-8/</guid>
		<description><![CDATA[The Windows 8 Consumer Preview ISO image became public a few days ago, which is available here, so I am sure a lot of people are interested in trying it out on their development systems without replacing their current Windows 7 installation. If you've ever dual booted a system before, the procedure for doing it [...]]]></description>
			<content:encoded><![CDATA[<p>The Windows 8 Consumer Preview ISO image became public a few days ago and is available <a href="http://windows.microsoft.com/en-US/windows-8/iso">here</a>, so I am sure a lot of people are interested in trying it out on their development systems without replacing their current Windows 7 installation.</p>
<p>If you've ever dual booted a system before, the procedure for doing it for Windows 8 is not all that different. In summary, all you need to do is create a new partition for Windows 8, install it on that partition, and then edit your new boot menu if you want to keep Windows 7 as the default OS.</p>
<p><strong>Step One: Download and burn the Windows 8 Consumer Preview</strong></p>
<p>• Assuming that you downloaded the Consumer Preview ISO image from the link above, you can use the <a href="http://www.microsoftstore.com/store/msstore/html/pbPage.Help_Win7_usbdvd_dwnTool">Microsoft Windows 7 USB/DVD Download Tool</a> to burn the ISO image to either a DVD or a USB drive. The tool is free and very small, and the installation instructions on the site itself are very simple. Of course, if you prefer to use other burning software like ImgBurn, you can do that too.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot4.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot4-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45518" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot1.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot1-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45519" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot5.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot5-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45520" /></a></p>
<p><strong>Step Two: Create a New Partition</strong></p>
<p>• Before you start, make sure to back up your data and files. We will be creating new partitions and installing a new OS, so anything could go wrong, and you don't want to lose everything. Being paranoid, I like taking "bare metal" backups of my systems with a wonderful free and open source tool called <a href="http://redobackup.org/">Redo Backup</a>. A bare metal backup takes a complete image of your hard drive, with all of its partitions. That way, I am able to restore my entire system exactly the way it was if needed. Going into more detail about backups, however, is another topic.</p>
<p>• When you're ready, from within Windows 7, we will create some space for Windows 8 by using Windows' Disk Management. Click on the Start Menu and right click on "Computer", then click "Manage", and in the window that appears, click on "Disk Management" in the left sidebar.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot91.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot91-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45522" /></a></p>
<p>• Find your system hard disk in the graphical list that appears in the bottom pane. Right-click on it and then click "Shrink Volume".  20 GB is a reasonable size for the new Windows 8 partition, neither too small nor too big, so shrink the volume until you have at least 20 GB of free space at the end of the drive, and click OK. Of course, if you think you need more than 20 GB (if you are going to do intensive development and/or testing), or less than 20 GB (if you don’t have enough space on your Windows 7 partition), then please feel free to choose a different size.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot10.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot10-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45523" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot11.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot11-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45524" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot12.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot12-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45525" /></a></p>
<p>• Then, click on the "Unallocated" block of that drive that appears and click "New Simple Volume". Click Next on the next few windows until you reach the "Format Partition" window. Here, give it a volume label you'll recognize (like "Windows 8") and click Next. It should format the drive for you. Now you're all set to install Windows 8.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot13.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot13-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45527" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot14.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot14-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45528" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot151.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot151-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45530" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot16.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot16-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45531" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot17.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot17-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45532" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot18.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot18-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45533" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot19.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/screenshot19-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45534" /></a></p>
<p><strong>Step Three: Install Windows 8</strong></p>
<p>• Now reboot your system, and go into your BIOS settings (for most systems, you need to press F2 or DEL). Now make sure your computer is set to boot from CD or USB as a first priority (depending on what medium you have decided to use earlier). This may be different from system to system though. Now reboot.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_0013.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_0013-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45535" /></a></p>
<p>• Now you should boot into the Windows 8 installer. It looks very similar to the Windows 7 installer, so it should be familiar. Pick your language and hit "Install Now".</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/1.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/1-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45536" /></a></p>
<p>• Enter the Product Key available on the Windows 8 Consumer Preview download page.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/2.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/2-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45537" /></a></p>
<p>• Now choose "Custom" when asked what type of install you'd like to perform. Then find the new partition you created on the list of drives shown. Make sure it's the right one, because remember, you are about to write over whatever is on it.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/3.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/3-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45538" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/4.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/4-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45545" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/5.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/5-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45546" /></a></p>
<p>• Hit "Next" and let the installer do its thing. When you're done, your computer should reboot into Windows 8. It'll probably reboot one more time after it does, then you will see the Windows 8 Start screen.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/6.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/6-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45541" /></a></p>
<p><strong>Step Four: Make Windows 7 the Default OS Again</strong></p>
<p>• When you first boot up into Windows 8, you'll notice the new graphical boot menu that lets you choose between Windows 7 and Windows 8. Windows 8 will be the default, meaning that if you don't manually choose Windows 7 from the menu, your computer will boot into Windows 8 after 3 seconds. If this is not something you want, follow the steps below to make Windows 7 the default OS again.</p>
<p>• On the boot menu, click on the button at the bottom that says "Change Defaults or Choose Other Options", and hit "Choose the Default Operating System". From there, you can pick Windows 7 from the menu. From now on, your computer will boot into Windows 7 by default.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/78.jpg"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/78-300x167.jpg" alt="" width="300" height="167" class="alignnone size-medium wp-image-45543" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/1.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/1-300x168.png" alt="" width="300" height="168" class="alignnone size-medium wp-image-45548" /></a></p>
<p>That's it. Enjoy using the Windows 8 Consumer Preview on your dual-boot system.</p>
<p>Rami</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/03/20/dualbooting-windows-7-and-windows-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Video: Taking a look at the IMSS software on an Intel® vPro™ technology client (Intel® AMT)</title>
		<link>http://software.intel.com/en-us/blogs/2012/03/02/video-taking-a-look-at-the-imss-software-on-an-intel-vpro-technology-client-intel-amt/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/03/02/video-taking-a-look-at-the-imss-software-on-an-intel-vpro-technology-client-intel-amt/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 17:27:59 +0000</pubDate>
		<dc:creator>Gael Holmes Hofemeier (Intel)</dc:creator>
				<category><![CDATA[Intel SW Partner Program]]></category>
		<category><![CDATA[Manageability & Security]]></category>
		<category><![CDATA[Software Tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/03/02/video-taking-a-look-at-the-imss-software-on-an-intel-vpro-technology-client-intel-amt/</guid>
		<description><![CDATA[If you have either a Notebook or a Desktop equipped with Intel(R) vPro Technology and you are also working with one of its components, Intel(R) Active Management Technology, chances are you have the "Intel Management and Security Status" software installed on your system. In this video, we start this software and take a closer look [...]]]></description>
			<content:encoded><![CDATA[<p>If you have either a Notebook or a Desktop equipped with Intel(R) vPro Technology and you are also working with one of its components, Intel(R) Active Management Technology, chances are you have the "<strong><em>Intel Management and Security Status</em></strong>" software installed on your system.  In this video, we start this software and take a closer look at all the information that it provides.</p>
<p>The video is almost 13 minutes:</p>
<p><iframe width="420" height="315" src="http://www.youtube.com/embed/zPjOeAsYzUg" frameborder="0" allowfullscreen></iframe></p>
<p>Still have questions?  Please go to our <a href="http://software.intel.com/en-us/forums/intel-vpro-software-development/">Developer's Forum</a> and ask away!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/03/02/video-taking-a-look-at-the-imss-software-on-an-intel-vpro-technology-client-intel-amt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introducing Cloud IdaaS - Intel Cloud SSO</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/27/introducing-cloud-idaas-intel-cloud-sso/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/27/introducing-cloud-idaas-intel-cloud-sso/#comments</comments>
		<pubDate>Tue, 28 Feb 2012 00:20:02 +0000</pubDate>
		<dc:creator>George Jobi (Intel)</dc:creator>
				<category><![CDATA[Intel SW Partner Program]]></category>
		<category><![CDATA[Manageability & Security]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[IdaaS]]></category>
		<category><![CDATA[identity]]></category>
		<category><![CDATA[IPT]]></category>
		<category><![CDATA[SaaS]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[Ultrabooks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/27/introducing-cloud-idaas-intel-cloud-sso/</guid>
		<description><![CDATA[Today we are introducing Intel’s first enterprise focused SaaS offering that brings the best of Intel cloud offering ECA 360 and McAfee Cloud Security Suite product into a multi-tenant offering. The same core functionality that our on-prem product ECA 360 has Single-Sign-On (SSO) for cloud application, User Provisioning and Strong Auth is now supported in [...]]]></description>
			<content:encoded><![CDATA[<p>Today we are introducing Intel’s first enterprise-focused SaaS offering, which brings the best of Intel's cloud offering <a title="ECA360" href="http://software.intel.com/en-us/articles/Expressway-Cloud-Access-360-Identity-Federation/">ECA 360</a> and the <a href="http://www.mcafee.com/us/products/cloud-identity-manager.aspx">McAfee</a> Cloud Security Suite product into a multi-tenant offering. The same core functionality that our on-prem product ECA 360 has (Single Sign-On (SSO) for cloud applications, User Provisioning, and Strong Auth) is now supported in a fully flexible as-a-Service offering.</p>
<p><strong>Built on Salesforce.com for Salesforce.com and everyone else</strong></p>
<p>While looking at bringing this offering to market, we found an amazing partner in <a href="http://www.salesforce.com">Salesforce.com</a>, a major player in the cloud ecosystem that has been offering enterprise-grade cloud services for a while. What better choice than Salesforce.com, which has completely transformed the cloud SaaS space with its CRM offering?  They have been trailblazing the SaaS space with amazing agility and now, with the increasing adoption of Sales Cloud, Chatter, Heroku, and AppExchange, are transforming into a full cloud platform. The <a href="http://www.force.com/">Salesforce platform</a> has been gaining its share of attention with its <a href="http://blogs.salesforce.com/company/2011/10/if-you-build-it-and-a-whole-lot-of-people-work-really-really-hard-for-six-years-they-will-come.html">recently announced millionth download</a> and big ISV traction. With almost a decade of running cloud services for the full spectrum of customers, from organizations of a few people to Fortune 500 companies, Salesforce has earned trust in running “Cloud based services that are reliable and trusted”.</p>
<p><strong>Intel hardware ++</strong></p>
<p>Intel Cloud SSO service also brings together the fruits of Intel hardware and software integration to provide enhanced security. The service we are announcing today not only brings single sign-on capability, along with user account life cycle management and all the traditional ways of supporting strong auth, but also introduces support for context-aware security. We have integrated Intel Identity Protection Technology, allowing all Intel platforms that support IPT, like the cool <a href="http://www.intel.com/content/www/us/en/sponsors-of-tomorrow/ultrabook.html">Ultrabooks</a>, to provide enhanced security while accessing cloud applications. We have just gotten started. Stay tuned for more exciting hardware/software integrated enhancements in the future, as we do more integration with McAfee services such as <a href="http://www.mcafee.com/us/mcafee-labs/technology/global-threat-intelligence-technology.aspx">Global Threat Intelligence</a> and others.</p>
<p>For more info and to sign up for beta please visit <a href="http://intelcloudsso.com">http://intelcloudsso.com</a></p>
<p><a href="http://twitter.com/#!/search/realtime/intelcloudsso">http://twitter.com/#!/search/realtime/intelcloudsso</a></p>
<div id="attachment_45178" class="wp-caption aligncenter" style="width: 751px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/intelcloudsso_jobi.gif"><img class="size-full wp-image-45178 " title="intelcloudsso_jobi" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/intelcloudsso_jobi.gif" alt="" width="741" height="562" /></a><p class="wp-caption-text">Intel Cloud SSO</p></div>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/27/introducing-cloud-idaas-intel-cloud-sso/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Intel Tool Helps SW Developers Develop More Secure Applications</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/07/intel-tool-helps-sw-developers-develop-more-secure-applications/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/07/intel-tool-helps-sw-developers-develop-more-secure-applications/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 23:27:40 +0000</pubDate>
		<dc:creator>Robert Chesebrough (Intel)</dc:creator>
				<category><![CDATA[Manageability & Security]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Buffer Overflow]]></category>
		<category><![CDATA[Buffer Overrun]]></category>
		<category><![CDATA[Build Security In]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Common Weakness Evaluation]]></category>
		<category><![CDATA[Intel Compiler]]></category>
		<category><![CDATA[Intel® vPro™]]></category>
		<category><![CDATA[Mitigate Secure Bugs]]></category>
		<category><![CDATA[OS Command Injection]]></category>
		<category><![CDATA[owasp top 10]]></category>
		<category><![CDATA[scanf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[security layer]]></category>
		<category><![CDATA[sprintf]]></category>
		<category><![CDATA[static security analysis]]></category>
		<category><![CDATA[Ultrabook]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/07/intel-tool-helps-sw-developers-develop-more-secure-applications/</guid>
		<description><![CDATA[Developers are urged to find these kinds of bugs using tools such as Intel Static Security Analysis, and then make it a practice to validate all inputs and replace unsafe functions (strcpy, strncpy, strcat, and gets, among others) with safe counterparts.  To learn more about steps you can take as a developer to reduce your exposure to security attacks, go to the Department of Homeland Security's Build Security In website or visit the Common Weakness Enumeration site.
]]></description>
			<content:encoded><![CDATA[<p>There has been a steady occurrence of security breaches at prestigious companies over the last weeks, months and years.  These breaches are becoming far too frequent and, as the folks at Amazon and Zappos know, expensive.</p>
<p>A wide variety of ways exist for addressing these kinds of security challenges, and Intel offers technologies to assist in the battle.  For probably the most sane and scalable way of addressing security issues, at least for enterprise applications, I would recommend that you jump over to Blake Dournaee's (Intel) blogs "<a href="http://software.intel.com/en-us/blogs/2010/11/09/using-a-service-gateway-to-protect-against-the-owasp-top-10/">Using a Service Gateway to Protect against the OWASP Top 10</a>" and "<a href="http://software.intel.com/en-us/blogs/2011/02/10/how-about-a-security-layer/">How about a Security Layer?</a>".  The idea of a Security Layer on a Service Gateway is truly the most comprehensive way to tackle these kinds of security issues.</p>
<p>Nevertheless, some enterprise shops may be unwilling to re-architect their legacy systems using the security layer approach, and some developers are targeting client applications.  What tools and techniques can these developers use to mitigate security bugs?  For those developers I offer the following:</p>
<p> I had a chat with Julian Horn, who is Intel's architect on the compiler team for the Static Security Analysis (SSA) tool.  SSA comes as part of the Intel compiler (C/C++/Fortran) and is available for Linux* and Windows*.  SSA identifies various coding errors such as memory and resource leaks, pointer and array errors, incorrect use of OpenMP* directives, and incorrect use of Cilk Plus language features.  SSA also identifies security errors such as buffer overflows and boundary violations, use of uninitialized variables and objects, incorrect usage of pointers and dynamically allocated memory, dangerous use of unchecked input, arithmetic overflow and divide by zero, and misuse of string, memory, and formatting library routines.</p>
<p>I was curious what kinds of security flaws SSA could find. Specifically, I wanted to know if it could help developers mitigate any of the most dangerous software errors as identified by the <a href="http://cwe.mitre.org/top25/index.html#CWE-798">Common Weakness Enumeration</a> (CWE) community sponsored by Mitre.</p>
<p>After an email exchange with Julian, and poring over the descriptions of the top twenty-five security bugs as reported by CWE, I determined that the Intel SSA could help to mitigate at least two of the top five errors listed.  Coming in at number two is "OS Command Injection", and at number three is "Classic Buffer Overflow". How can SSA mitigate these errors?</p>
<p><strong>Identification and Mitigation of CWE Top Error #2 (OS Command Injection)</strong><strong><br />
</strong>OS Command Injection is an error type that really should be checked on both server and client side applications.  The essence of this error, or potential attack, is that, sometimes, your application is a bridge linking an outsider to the internals of your operating system. If your application simply passes un-trusted inputs to be fed into a command string that you pass to a system call, then your application can inadvertently wreak all kinds of havoc on the system. <em>The recommended mitigation step is to validate all inputs to your application</em>.</p>
<p>A simple-minded BAD example in C might be issuing a system call to delete a file whose name the user types in.</p>
<p><em>          // user inputs a filename to be deleted</em><br />
<em>          scanf("%s", str);                 // buffer overflow</em><br />
<em>          sprintf(cmd, "del %s", str);      // another buffer overflow</em><br />
<em>          system(cmd);                      // OS command injection, due to not validating the input</em></p>
<p>What happens if the user types *.* rather than a normal filename?  Since the input has not been validated and was passed right to the OS, deletions unintended by the developer would clearly occur.</p>
<p>Analyzing your code for un-validated input is known as <em>taint analysis</em> (<em>tainted input means un-validated input</em>).  CWE recommends doing a taint analysis to identify where in your code you are not validating input, and then taking steps to remove the taint.</p>
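<p>To make the mitigation concrete, here is a minimal sketch of what validating the filename might look like. The helper names <em>is_safe_filename</em> and <em>delete_user_file</em>, the 64-character limit, and the allow-list of characters are assumptions made for illustration, not anything SSA or CWE prescribes. The sketch also swaps the <em>system</em> call for the standard C <em>remove</em> function, so no command interpreter ever sees the input.</p>

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical allow-list check: accept only short names made of letters,
   digits, '.', '_' and '-'; reject empty input, "." and "..". */
int is_safe_filename(const char *name)
{
    size_t len = strlen(name);
    if (len == 0 || len > 64)
        return 0;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '.' && c != '_' && c != '-')
            return 0;   /* wildcards, spaces, slashes, '&' and so on */
    }
    return 1;
}

/* Reads one line from `in`, validates it, and deletes the named file.
   Returns 0 on success, -1 on bad input or a failed deletion. */
int delete_user_file(FILE *in)
{
    char name[128];
    if (fgets(name, sizeof name, in) == NULL)   /* bounded read */
        return -1;
    name[strcspn(name, "\r\n")] = '\0';         /* strip the line ending */
    if (!is_safe_filename(name))
        return -1;
    return remove(name) == 0 ? 0 : -1;          /* no shell involved */
}
```

<p>With this check in place, an input such as *.* is rejected before anything is deleted. A real application would likely need a stricter or different policy, for example restricting deletions to one known directory.</p>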
<p>Intel's Static Security Analysis tool uses a taint analysis algorithm to detect whether or not an unknown input has been compared against another value.  There are various rules under which taintedness is propagated from one variable to another.  One rule is that when a value is <em>compared </em>against another value this removes the taint.  If an untested value is used in a “dangerous” context, then you get an error reported by SSA.</p>
<p>The logic here is that a <em>comparison </em>is considered sufficient to sanitize the value.  The example below demonstrates the idea of a tainted variable, x.  When x is used blindly, with no comparisons done on it to check its validity, SSA flags this value as tainted:</p>
<p><em>          x = input;</em><br />
<em>          a[x] = 0;   // SSA identifies use of tainted value x</em></p>
<p>The example below uses a comparison operator to check the input value x, so it is considered untainted now by SSA:</p>
<p><em>          x = input;</em><br />
<em>          ok = (x &lt; 10);     // comparison un-taints the value x</em><br />
<em>          if (ok) a[x] = 0;</em></p>
<p>This "good" example might still have some issues, since the checking is not extensive (a negative x, if x is signed, would still pass), but at least the developer went to some effort to validate the input.</p>
<p>The key takeaway here is to use tools to find un-validated inputs and then add the necessary validation around each of these inputs.</p>
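<p>For comparison, here is a small sketch of a more complete check that tests both ends of the valid range before the array is touched. The names <em>checked_store</em>, <em>table</em>, and <em>TABLE_SIZE</em> are hypothetical and exist only for this illustration; whether SSA treats this exact pattern differently from the single comparison above is not something this sketch claims.</p>

```c
#include <stddef.h>

#define TABLE_SIZE 10          /* assumed array length for this sketch */

int table[TABLE_SIZE];

/* Stores value at index x only when x lies inside the array of n elements.
   Returns 1 if the store happened, 0 if the index was rejected. */
int checked_store(int a[], size_t n, long x, int value)
{
    if (x < 0 || (size_t)x >= n)   /* check the lower and the upper bound */
        return 0;
    a[x] = value;
    return 1;
}
```

<p>A call such as checked_store(table, TABLE_SIZE, x, 0) then replaces the bare assignment, and the caller can decide how to report a rejected index.</p>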
<p><strong>Identification and Mitigation of CWE Top Error #3 (Classic Buffer Overflow)</strong><strong><br />
</strong>Michael Howard &amp; David LeBlanc, in their book <em>Writing Secure Code</em>, 2nd edition, identify the buffer overflow (AKA buffer overrun) as public enemy number one.  The Common Weakness Enumeration list is kinder, listing this issue as the number three most dangerous error.  It is well known that certain <a href="http://tldp.org/HOWTO/Secure-Programs-HOWTO/dangers-c.html">"C" functions are unsafe</a> because they are vulnerable to buffer overflow attacks. These functions should be replaced with <a href="http://msdn.microsoft.com/en-us/library/bb288454.aspx">safe counterparts</a> : <em>strcpy</em>, <em>strncpy</em>, <em>strcat</em>, and <em>gets</em>, among others. </p>
<p>The buffer overflow terminology comes from the idea that if you continue to pour water into a finite sized container, the container will eventually overflow. In computer terms, the analogy means that copying too much text into a finite sized array<span style="text-decoration: line-through;"><del datetime="2012-02-06T10:10" cite="mailto:C%20Breshears">,</del></span> will cause the extra text in the buffer to spill over into areas of memory that the developer did not intend.  These areas of memory get corrupted with the excess text and malicious coders use this to exploit your application and potentially run malicious code within the confines of your application's process. I found the following buffer overflow example insightful, though I didn't want to copy it here in its entirety and will simply link<ins datetime="2012-02-06T10:10" cite="mailto:C%20Breshears"> </ins>to it instead.  It demonstrates how an overflow attack can occur and is found on an <a href="http://blogs.msdn.com/b/roberthorvick/archive/2004/01/16/59460.aspx">MSDN blog by Robert Horvick</a>.</p>
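<p>As a rough illustration of the container analogy, here is one portable way to copy text into a fixed-size array without spilling over, using the standard <em>snprintf</em> function. The helper name <em>copy_bounded</em> is hypothetical, and this is just one possible replacement for an unchecked <em>strcpy</em>, not the specific set of safe counterparts linked above.</p>

```c
#include <stdio.h>
#include <string.h>

/* Copies src into dst, which has room for dst_size bytes including the
   terminating null. Returns 0 if everything fit, -1 if the text was
   truncated or an argument was invalid. */
int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;
    int n = snprintf(dst, dst_size, "%s", src);  /* never writes past dst_size */
    if (n < 0 || (size_t)n >= dst_size)
        return -1;                               /* the source did not fit */
    return 0;
}
```

<p>Unlike <em>strcpy</em>, the copy stops at the end of the container, and the return value tells the caller that the water would have overflowed so it can reject the input rather than silently use a truncated string.</p>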
<p>Other strains of buffer overflow can occur in some types of formatted input. The biggest issue arises when the "%s" input format is used without a field width. This format specifier is generally regarded as unsafe. In the <a href="http://software.intel.com/sites/products/evaluation-guides/docs/intelparallelstudio-evaluationguide-ssa.pdf">Intel Parallel Studio XE evaluation guide on Static Security Analysis (SSA)</a>, there is a nice example of SSA detecting a buffer overflow in an <em>fscanf</em> call. In this case, SSA indicates that it found an "unsafe format specifier," which is essentially a condition that can lead to buffer overflow. The code snippet from this guide is as follows:</p>
<p><em>          // example that would allow buffer overflow</em><br />
<em>          char data[255];</em><br />
<em>          fscanf(dfile, "%s", data);</em><br />
<em>          if (strcmp(data, string) != 0) {</em><br />
<em>                fprintf(stderr, "parse: Expected %s, got %s \n", string, data);</em><br />
<em>          }</em></p>
<p>The call to <em>fscanf</em> uses a format string with a “%s” format specifier. This skips leading whitespace, then reads input characters up to the next whitespace character and stores the data in the array “data”. There is no guarantee that the number of characters read will not overflow the bounds of the array, so this statement could corrupt memory. SSA reported this as an error, and the developer should follow up by adding a field width to the format specifier to limit the number of characters read in. The width counts only the characters read and does not include the terminating null that <em>fscanf</em> appends, so a 255-byte array needs "%254s". The corrected code should be something like this:</p>
<p><em>          // example that corrects the undesired buffer overflow condition</em><br />
<em>          char data[255];</em><br />
<em>          fscanf(dfile, "%254s", data);</em><br />
<em>          if (strcmp(data, string) != 0) {</em><br />
<em>                fprintf(stderr, "parse: Expected %s, got %s \n", string, data);</em><br />
<em>          }</em></p>
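<p>A field width on "%s" limits the characters stored but does not count the terminating null, so the width has to be one less than the array size. One way to keep the two numbers from drifting apart is to derive both from a single constant. The sketch below does that with the preprocessor; <em>TOKEN_MAX</em>, <em>STR</em>, and <em>read_token</em> are names made up for this illustration and do not come from the evaluation guide.</p>

```c
#include <stdio.h>

#define TOKEN_MAX 254        /* longest token accepted, terminator excluded */
#define STR2(x) #x           /* turn the macro argument into a string */
#define STR(x) STR2(x)       /* expand TOKEN_MAX first, then stringify it */

/* Hypothetical helper: read one whitespace-delimited token into buf, which
 * must hold at least TOKEN_MAX + 1 bytes. Returns 1 if a token was read and
 * 0 on end of file or read error. Longer input is split, never overflowed. */
int read_token(FILE *in, char buf[TOKEN_MAX + 1])
{
    /* The format string expands to "%254s". */
    return fscanf(in, "%" STR(TOKEN_MAX) "s", buf) == 1;
}
```

<p>Note that an over-long token is not discarded: the characters beyond the width stay in the stream and come back on the next call, so the caller still has to decide whether a token of exactly <em>TOKEN_MAX</em> characters should be treated as an error.</p>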
<p>For similar tips on how to protect your code through defensive programming, read this article by McGraw &amp; Viega, <em><a href="http://www.ibm.com/developerworks/library/s-buffer-defend.html">Make your software behave: Preventing buffer overflows</a>.</em></p>
<p>The key takeaway here, in addition to validating all inputs, is to find unsafe “C” functions and format specifiers, and replace them with safe alternatives.</p>
<p><strong>What should a developer do?</strong><br />
The security bugs discussed above are two of the most dangerous and prevalent according to CWE. These bugs affect client applications that run on laptops, desktops, and ultrabooks, as well as enterprise applications on web servers, application servers, database servers and more. Developers are urged to find these kinds of bugs using tools such as <a href="http://software.intel.com/sites/products/evaluation-guides/docs/intelparallelstudio-evaluationguide-ssa.pdf">Intel Static Security Analysis</a>, and then make it a practice to validate all inputs and to replace unsafe functions (<em>strcpy</em>, <em>strncpy</em>, <em>strcat</em>, and <em>gets</em>, among others) with <a href="http://msdn.microsoft.com/en-us/library/bb288454.aspx">safe counterparts</a>. To learn more about steps you can take as a developer to reduce your exposure to security attacks, go to the Department of Homeland Security's <a href="https://buildsecurityin.us-cert.gov/bsi-rules/home/g1/816-BSI.html">Build Security In</a> website or visit the <a href="http://cwe.mitre.org/top25/index.html#CWE-798">Common Weakness Enumeration</a> site.</p>
<p>Oh, and did I mention that enterprise developers should jump over to Blake Dournaee's (Intel) blogs "<a href="http://software.intel.com/en-us/blogs/2010/11/09/using-a-service-gateway-to-protect-against-the-owasp-top-10/">Using a Service Gateway to Protect against the OWASP Top 10</a>" and "<a href="http://software.intel.com/en-us/blogs/2011/02/10/how-about-a-security-layer/">How about a Security Layer?</a>" to learn an even better way to secure your systems?</p>
<p>For more complete information about compiler optimizations, see our <a href="http://software.intel.com/en-us/articles/optimization-notice/">Optimization Notice</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/07/intel-tool-helps-sw-developers-develop-more-secure-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coarse-grained locks and Transactional Synchronization explained</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 22:55:02 +0000</pubDate>
		<dc:creator>James Reinders (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Haswell]]></category>
		<category><![CDATA[HLE]]></category>
		<category><![CDATA[RTM]]></category>
		<category><![CDATA[transactional memory]]></category>
		<category><![CDATA[TSX]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/</guid>
		<description><![CDATA[Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful.  I’ll do my best to explain them in this blog. In my blog “Transactional Synchronization in Haswell,” I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks.  Understanding coarse-grained locks and [...]]]></description>
			<content:encoded><![CDATA[<p>Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful.  I’ll do my best to explain them in this blog.</p>
<p>In my blog “<a href="../../../../2012/02/07/transactional-synchronization-in-haswell">Transactional Synchronization in Haswell</a>,” I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks.  Understanding coarse-grained locks and the concept of transactions are both key to understanding why Intel TSX matters.</p>
<p>Intel TSX may enhance performance of mutual exclusion other than simple coarse-grained locks, but I will focus on coarse-grained locking because it is common and Intel TSX allows highly concurrent accesses using only a simple locking mechanism.</p>
<p><strong>An example</strong></p>
<p>To motivate by illustration, let’s consider a simple hash table. Hash tables are used to map a <em>key</em> to a <em>key</em> and <em>value</em> pair in constant time on average. Two key operations are add (insert) and lookup (retrieve). Resizing and deletion are two more operations of general interest, but I will leave them for another time.</p>
<p>Designing a highly concurrent hash table is a non-trivial task, and there are many approaches to allow high levels of concurrency. All of these approaches add complexity to the program, and often to the data structures themselves.</p>
<p>The simplest approach is a <em>single lock</em> approach. In such an approach, every operation on the hash table starts by obtaining the lock for the table and concludes by releasing the lock. While the lock is held for the operation, no other task on the system can obtain the lock and therefore no hash table operation is allowed to proceed.</p>
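<p>The single lock approach can be sketched in a few lines of C. The code below is an illustration only, assuming POSIX threads and a chained hash table with string keys; the type and function names (<em>struct table</em>, <em>table_add</em>, <em>table_lookup</em>) are invented for this example. Every operation takes the one mutex on entry and releases it on exit, so operations on unrelated keys still wait for each other.</p>

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

struct entry {
    char *key;
    int value;
    struct entry *next;
};

struct table {
    pthread_mutex_t lock;            /* the single, coarse-grained lock */
    struct entry *buckets[NBUCKETS];
};

/* djb2-style string hash, reduced to a bucket index. */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s != '\0') {
        h = h * 33 + (unsigned char)*s++;
    }
    return h % NBUCKETS;
}

void table_init(struct table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (int i = 0; i < NBUCKETS; i++) {
        t->buckets[i] = NULL;
    }
}

/* Insert or update a key. Returns 0 on success, -1 if allocation fails. */
int table_add(struct table *t, const char *key, int value)
{
    int rc = 0;
    pthread_mutex_lock(&t->lock);          /* whole table is now off limits */
    struct entry **slot = &t->buckets[hash(key)];
    struct entry *e = *slot;
    while (e != NULL && strcmp(e->key, key) != 0) {
        e = e->next;
    }
    if (e != NULL) {
        e->value = value;
    } else {
        size_t len = strlen(key) + 1;
        char *copy = malloc(len);
        e = malloc(sizeof *e);
        if (e == NULL || copy == NULL) {
            free(e);
            free(copy);
            rc = -1;
        } else {
            memcpy(copy, key, len);
            e->key = copy;
            e->value = value;
            e->next = *slot;
            *slot = e;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}

/* Look up a key. Returns 1 and stores the value if found, else 0. */
int table_lookup(struct table *t, const char *key, int *value_out)
{
    int found = 0;
    pthread_mutex_lock(&t->lock);
    for (struct entry *e = t->buckets[hash(key)]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *value_out = e->value;
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return found;
}
```

<p>The appeal is that correctness is easy to see: there is exactly one place where mutual exclusion is decided. The cost is that two threads working on different buckets are serialized just as strictly as two threads working on the same one.</p>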
<p>Considering Figure 1, no concurrent operations are allowed, so each of the five operations shown would occur one at a time.</p>
<div style="text-align: center;"><p><img src="../../../../wordpress/wp-content/uploads/2012/01/Slide1.png" alt="" width="77%" /></p>
<p><strong>Figure 1: Five hash table operations requested</strong></p></div>
<p><strong>Solutions</strong></p>
<p>A common solution is to break the hash table into smaller regions, and have locks that apply to regions. While this can reduce contention, it still can create needless delays and it definitely complicates the coding and the data structure.</p>
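<p>To make the region idea concrete, here is a minimal sketch assuming POSIX threads, integer keys, and a fixed table size. The names (<em>struct striped_table</em>, <em>striped_add</em>, <em>striped_lookup</em>) and the choice of eight regions are assumptions made for this illustration. Each lock guards only the buckets that map to its region, so operations in different regions no longer wait for each other.</p>

```c
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 64
#define NREGIONS 8                   /* each lock guards 8 of the 64 buckets */

struct node {
    unsigned key;
    int value;
    struct node *next;
};

struct striped_table {
    pthread_mutex_t region_lock[NREGIONS];
    struct node *buckets[NBUCKETS];
};

static unsigned bucket_of(unsigned key)    { return key % NBUCKETS; }
static unsigned region_of(unsigned bucket) { return bucket % NREGIONS; }

void striped_init(struct striped_table *t)
{
    for (int i = 0; i < NREGIONS; i++) {
        pthread_mutex_init(&t->region_lock[i], NULL);
    }
    for (int i = 0; i < NBUCKETS; i++) {
        t->buckets[i] = NULL;
    }
}

/* Insert or update a key. Returns 0 on success, -1 if allocation fails. */
int striped_add(struct striped_table *t, unsigned key, int value)
{
    unsigned b = bucket_of(key);
    pthread_mutex_t *lock = &t->region_lock[region_of(b)];
    int rc = 0;

    pthread_mutex_lock(lock);        /* blocks only this bucket's region */
    struct node *n = t->buckets[b];
    while (n != NULL && n->key != key) {
        n = n->next;
    }
    if (n != NULL) {
        n->value = value;
    } else if ((n = malloc(sizeof *n)) != NULL) {
        n->key = key;
        n->value = value;
        n->next = t->buckets[b];
        t->buckets[b] = n;
    } else {
        rc = -1;
    }
    pthread_mutex_unlock(lock);
    return rc;
}

/* Look up a key. Returns 1 and stores the value if found, else 0. */
int striped_lookup(struct striped_table *t, unsigned key, int *value_out)
{
    unsigned b = bucket_of(key);
    pthread_mutex_t *lock = &t->region_lock[region_of(b)];
    int found = 0;

    pthread_mutex_lock(lock);
    for (struct node *n = t->buckets[b]; n != NULL; n = n->next) {
        if (n->key == key) {
            *value_out = n->value;
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(lock);
    return found;
}
```

<p>Even in this small form the extra bookkeeping shows. Two keys in the same region still serialize although they may sit in different buckets, and any operation that touches the whole table, such as a resize, would have to take every region lock in a consistent order.</p>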
<p>Such an approach is a prime example of taking a <strong>coarse-grained lock</strong> (a single lock for the entire hash table) and working to make it a finer-grained lock (multiple locks for smaller table sections). Coarse-grained locks are easier to use, easier to understand and easier to debug. The only disadvantage is that they tend to impede performance in a multithreaded environment. Multicore processors are increasing the likelihood of this being a problem, and help motivate new hardware assistance that lets programming stay simple more often.</p>
<p><strong>Transactional Synchronization (Intel TSX) as a solution</strong></p>
<p>The ideal would be to use the single lock (coarse-grained locking) because it is easy and not very error prone, but still have the performance of a fine-grained implementation. In our Figure 1 example, only one operation conflicts with another. This example does have more conflicts than would be expected in a real-world example.</p>
<p>Considering this example, three of the operations have no collision with any other operation, so the use of HLE (part of Intel TSX) on the single lock will completely elide the lock. In other words, the performance is very close to the performance of the code if no locking or unlocking code were present. The key, however, is that the operations are protected by the Intel TSX hardware, which has silently ensured that the protection intended by the lock is indeed assured.</p>
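<p>For readers who want to see what an HLE-enabled lock might look like in source code, here is a hedged sketch. It assumes a GCC-compatible compiler and relies on GCC's x86-specific <em>__ATOMIC_HLE_ACQUIRE</em> and <em>__ATOMIC_HLE_RELEASE</em> memory-order flags, which are not part of this post or of the specification it links to; where the compiler does not define them, the sketch falls back to an ordinary spinlock. The <em>elided_lock</em> type and its two functions are names invented for this illustration.</p>

```c
/* Assumption: GCC-compatible __atomic builtins. GCC for x86 defines the two
 * HLE flags; on other compilers or targets they are replaced by 0 here and
 * the code below is simply a plain test-and-set spinlock. */
#ifdef __ATOMIC_HLE_ACQUIRE
#define HLE_ACQUIRE __ATOMIC_HLE_ACQUIRE
#define HLE_RELEASE __ATOMIC_HLE_RELEASE
#else
#define HLE_ACQUIRE 0
#define HLE_RELEASE 0
#endif

typedef struct {
    int held;                        /* 0 = free, 1 = held */
} elided_lock;

void elided_lock_acquire(elided_lock *l)
{
    /* The exchange is the lock acquisition that the hardware may elide. */
    while (__atomic_exchange_n(&l->held, 1, __ATOMIC_ACQUIRE | HLE_ACQUIRE)) {
        /* Lock really is held: wait with plain reads before retrying. */
        while (__atomic_load_n(&l->held, __ATOMIC_RELAXED)) {
        }
    }
}

void elided_lock_release(elided_lock *l)
{
    /* The release must restore the value the lock had before acquisition. */
    __atomic_store_n(&l->held, 0, __ATOMIC_RELEASE | HLE_RELEASE);
}
```

<p>The calling code is the same as for any coarse-grained lock: acquire, operate on the hash table, release. Whether the lock is actually elided is decided by the processor at run time, which is what keeps the source compatible with processors that lack Intel TSX.</p>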
<p>The two operations that map to the same hash table entry will need to be staggered. This will occur even if we are unlucky enough to have them happen at the same time. In such a case, Intel TSX will detect that the lock was indeed needed and some locking overhead will be incurred. What would actually happen is that the colliding tasks will proceed into the protected code until the processor detects the conflict. At that point, both updates will abort their protected code (also called the transaction). The most common solution then is to have each task proceed but actually enforce the lock on the second try. This means that one task will win, and delay the other, until the operation is complete. The precise decision on how to handle the collision is either up to the processor implementation with HLE, or the programmer with RTM. The processor implementation for HLE will also be fairly simple and conservative, in order to preserve the semantics of the original lock and hence compatibility with processors that lack Intel TSX.</p>
<p><strong>Summary</strong></p>
<p>For a hash map, Intel TSX allows the right things to occur without losing the protection that the locks need to give. Intel TSX ensures the same results as the coarse-grained lock guarantees, but allows unrelated operations to proceed without the delays that the coarse-grained lock would have caused. For more information on Transactional Synchronization, see my blog on <a href="http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/">Intel TSX</a>.</p>
<p>Please check out the <a href="http://software.intel.com/file/41604">specification</a> and stay tuned for information about supporting tools from Intel and others in the coming months.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

