<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Parallel Programming</title>
	<atom:link href="http://software.intel.com/en-us/blogs/category/parallel/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 10 Feb 2012 03:07:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Myths about static analysis. The third myth - dynamic analysis is better than static analysis.</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-third-myth-dynamic-analysis-is-better-than-static-analysis/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-third-myth-dynamic-analysis-is-better-than-static-analysis/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 18:45:26 +0000</pubDate>
		<dc:creator>Andrey Karpov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Static code analysis]]></category>
		<category><![CDATA[static code analyzer]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-third-myth-dynamic-analysis-is-better-than-static-analysis/</guid>
		<description><![CDATA[While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. The third myth is: "Dynamic analysis performed by tools like valgrind for C/C++ is much better [...]]]></description>
			<content:encoded><![CDATA[<p>While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. </p>
<p>The third myth is: "Dynamic analysis performed by tools like valgrind for C/C++ is much better than static code analysis".</p>
<p>The statement is rather strange. Dynamic and static analyses are just two different methodologies which supplement each other. Programmers seem to understand it, but I hear it again and again that dynamic analysis is better than static analysis.</p>
<p>Let me list advantages of static code analysis.</p>
<h2>Diagnostics of all the branches in a program</h2>
<p>Dynamic analysis in practice cannot cover all the branches of a program. After these words, fans of valgrind tell me that one should create appropriate tests. They are right in theory. But anyone who has tried to create them understands how complicated and time-consuming it is. In practice, even good tests cover no more than 80% of program code.</p>
<p>It is especially noticeable in code fragments handling non-standard/emergency situations. If you take an old project and check it with a static analyzer, most errors will be detected in these very places. The reason is that even if the project is old, these fragments stay almost untested. Here is a brief example to show you what I mean (FCE Ultra project):</p>
<pre>fp = fopen(name,"wb");
int x = 0;
if (!fp)
  int x = 1;</pre>
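<p>The 'x' flag will not be equal to one if the file wasn't opened: the second "int x = 1;" declares a new variable that exists only inside the 'if' statement and hides the outer 'x', which stays zero. It is because of such errors that something goes wrong in programs: they crash or generate meaningless messages instead of adequate error messages.</p>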
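<p>As a minimal sketch of what the fragment above presumably intended, the flag can be set by plain assignment instead of a second declaration. The function name open_failed_flag and its return convention are assumptions made for illustration; they are not taken from FCE Ultra.</p>

```cpp
#include <cstdio>

// Hypothetical helper (not from FCE Ultra): returns 1 if the file could not
// be opened for writing, 0 otherwise.
int open_failed_flag(const char* name) {
    std::FILE* fp = std::fopen(name, "wb");
    int x = 0;
    if (!fp)
        x = 1;            // assigns to the outer 'x'; no new variable is declared
    else
        std::fclose(fp);  // the sketch only probes the open, so close right away
    return x;
}
```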
<h2>Scalability</h2>
<p>To be able to check large projects through dynamic methods regularly, you have to create a special infrastructure. You need special tests. You need to launch several instances of an application in parallel with different input data.</p>
<p>Static analysis scales much more easily. Usually you need only a multi-core computer to run a tool performing static analysis.</p>
<h2>Analysis at a higher level</h2>
<p>One of the advantages of dynamic analysis is that it knows which function is being called and with what arguments. Consequently, it can check whether the call is correct. Static analysis can't know this and can't check arguments' values in most cases. This is a disadvantage of the method. But static analysis works at a higher level than dynamic analysis. This allows a static analyzer to detect code that looks correct from the viewpoint of dynamic analysis but is in fact wrong. Here is a simple example (ReactOS project):</p>
<pre>void Mapdesc::identify( REAL dest[MAXCOORDS][MAXCOORDS] )
{
  memset( dest, 0, sizeof( dest ) );
  for( int i=0; i != hcoords; i++ )
    dest[i][i] = 1.0;
}</pre>
<p>Everything is good here from the viewpoint of dynamic analysis, while static analysis gives the <a href="http://www.viva64.com/en/d/0100/">alarm</a>: 'dest' is declared as an array parameter, so it is actually a pointer, and sizeof( dest ) is the size of that pointer. It is very suspicious that the number of bytes being cleared in an array coincides with the number of bytes the pointer consists of.</p>
<p>Here is another example from the Clang project:</p>
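<p>One possible way to make sizeof report the size of the whole matrix is to pass the array by reference. The sketch below is not the ReactOS fix; kMaxCoords, Real and the free function identify() with an explicit hcoords argument are stand-ins assumed for illustration.</p>

```cpp
#include <cstring>

constexpr int kMaxCoords = 5;  // hypothetical stand-in for MAXCOORDS
using Real = float;            // hypothetical stand-in for REAL

// The parameter is a reference to the whole array, so no decay to a pointer
// happens and sizeof(dest) is the size of all kMaxCoords * kMaxCoords elements.
void identify(Real (&dest)[kMaxCoords][kMaxCoords], int hcoords) {
    std::memset(dest, 0, sizeof(dest));
    for (int i = 0; i != hcoords; i++)
        dest[i][i] = 1.0f;
}
```

<p>With this signature the whole matrix is cleared before the diagonal is filled, and passing an array of a different size no longer compiles.</p>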
<pre>MapTy PerPtrTopDown;
MapTy PerPtrBottomUp;
void clearBottomUpPointers() {
  PerPtrTopDown.clear();
}
void clearTopDownPointers() {
  PerPtrTopDown.clear();
}</pre>
<p>Is there anything here dynamic analysis may find suspicious? Nothing. But a static analyzer can suspect there is something wrong. The error is this: inside clearBottomUpPointers() there must be this code: "PerPtrBottomUp.clear();".</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-third-myth-dynamic-analysis-is-better-than-static-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Myths about static analysis. The fourth myth - programmers want to add their own rules into a static analyzer.</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fourth-myth-programmers-want-to-add-their-own-rules-into-a-static-analyzer/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fourth-myth-programmers-want-to-add-their-own-rules-into-a-static-analyzer/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 18:45:17 +0000</pubDate>
		<dc:creator>Andrey Karpov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Static code analysis]]></category>
		<category><![CDATA[static code analyzer]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fourth-myth-programmers-want-to-add-their-own-rules-into-a-static-analyzer/</guid>
		<description><![CDATA[While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. The fourth myth is: "A static analyzer must enable users to add user-made rules. Programmers want [...]]]></description>
			<content:encoded><![CDATA[<p>While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. </p>
<p>The fourth myth is: "A static analyzer must enable users to add user-made rules. Programmers want to add their own rules."</p>
<p>No, they don't. They actually want to solve some tasks of searching for particular language constructs. It is not the same thing as creating diagnostic rules.</p>
<p>I have always answered that writing their own rules is not what programmers actually want. And I have never seen any alternative to having the analyzer's developers implement diagnostic rules at the request of programmers (<a href="http://www.viva64.com/en/b/0110/">an article on the subject</a>). I had a fruitful conversation with Dmitry Petunin recently. He is the director of an Intel department of compiler testing and software verification tool development. He broadened my understanding of this subject and voiced an idea I had been pondering but had failed to formulate.</p>
<p>Dmitry confirmed my belief that programmers wouldn't write diagnostic rules. The reason is very simple - it is very hard. Some static analysis tools enable users to extend the rule set. But it is done rather as a pure formality or for convenience of the tool's developers themselves. You need to know the subject very deeply to be able to develop new diagnostic rules. If an enthusiast without skill starts creating them, they will be of little use.</p>
<p>My understanding of the issue ended at this point. Dmitry, being more experienced than I am, helped me learn more. In brief, this is how the situation looks.</p>
<p>Indeed, programmers want to be able to search for some particular patterns/errors in their code. They really need it. For example, someone needs to find all the explicit conversions of the int type to float. This task cannot be solved by such tools as grep, since it is unknown what type the FOO() function will return in a "float(P-&gt;FOO())" -like construct. At this moment the programmer comes to the idea that he/she can implement a search for such constructs by adding his/her own check to the static analyzer.</p>
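<p>To illustrate why a text search falls short here, consider two classes whose FOO() returns different types: the same "float(p-&gt;FOO())" text is an int-to-float conversion for one and a no-op for the other. The types A and B and the helper converts_int_to_float() are hypothetical and exist only for this sketch; a real analyzer gets the same information from its own type system.</p>

```cpp
#include <type_traits>
#include <utility>

struct A { int   FOO() const { return 1; } };     // hypothetical type
struct B { float FOO() const { return 1.0f; } };  // hypothetical type

// True when float(p->FOO()) converts an int to float for a pointer to T.
// The answer depends on the declared return type, which plain text search
// over the call site cannot see.
template <typename T>
constexpr bool converts_int_to_float() {
    return std::is_same<decltype(std::declval<const T*>()->FOO()), int>::value;
}
```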
<p>This is where the key point lies. The programmer does not need to create his/her analysis rules. He/she needs to solve a particular issue. What he/she wants is a very small task from the viewpoint of static analysis mechanisms. It is like using a car to light cigarettes with its cigarette lighter.</p>
<p>That's why both Dmitry and I don't support the idea of providing users with API to handle the analyzer. It is an extremely difficult task from the viewpoint of development. Besides, people will hardly use more than 1% of it. So, it's irrational. It is easier and cheaper for a developer to implement users' wishes than create a complex API for add-ons or a special language of rule description.</p>
<p>The readers may say: "then make only 1% of the functionality in API available, and everyone will be happy". Yes, right. But look where the emphasis has moved: from developing custom rules we have come to the idea that we just need a tool similar to grep but possessing some additional information about program code.</p>
<p>There is no such tool yet. If you want to solve some task, write to me, and we will try to implement it in the PVS-Studio analyzer. For example, we have recently implemented several requests on searching for explicit type conversions: <a href="http://www.viva64.com/en/d/0199/">V2003</a>, <a href="http://www.viva64.com/en/d/0200/">V2004</a>, <a href="http://www.viva64.com/en/d/0201/">V2005</a>. It is much easier for us to implement such wishes than create and maintain an open interface. It's also easier for users themselves.</p>
<p>By the way, such a tool might appear some time later within the scope of Intel C++. Dmitry Petunin said they had discussed the possibility of creating a grep-like tool possessing knowledge about code structure and variable types. But it was discussed just in theory. I don't know whether or not they really intend to create this tool.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fourth-myth-programmers-want-to-add-their-own-rules-into-a-static-analyzer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Myths about static analysis. The fifth myth - a small test program is enough to evaluate a tool.</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fifth-myth-a-small-test-program-is-enough-to-evaluate-a-tool/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fifth-myth-a-small-test-program-is-enough-to-evaluate-a-tool/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 18:44:58 +0000</pubDate>
		<dc:creator>Andrey Karpov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Static code analysis]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fifth-myth-a-small-test-program-is-enough-to-evaluate-a-tool/</guid>
		<description><![CDATA[While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. The fifth myth: "You can easily evaluate capabilities of a static analyzer on a small test [...]]]></description>
			<content:encoded><![CDATA[<p>While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things.</p>
<p>The fifth myth: "You can easily evaluate capabilities of a static analyzer on a small test code".</p>
<p>This is how this statement looks in discussions on forums (this is a collective image):</p>
<p><i>I've written a special program, its size is 100 code lines. But the analyzer doesn't generate anything although all the warning levels are enabled. This [tool of yours] / [static analysis] in general is just rubbish.</i></p>
<p>It is not the static analysis methodology which is rubbish, but this approach to evaluating the usability of a particular tool. This way of studying a tool is wrong for two reasons:</p>
<p>1.</p>
<p>Programmers think they don't make simple mistakes. This phenomenon was discussed in <a href="http://www.viva64.com/en/b/0116/">Myth 2</a>. So they try to feed an analyzer with a tricky sample and feel happy secretly when the analyzer can't find the error. This game is interesting yet senseless.</p>
<p>You should understand that most errors are simple as hell, and static analyzers detect them very well. The paradox is that it's much more difficult to invent a simple mistake than a complicated one. Here is an example. Would you ever think of writing a sample like this?</p>
<pre>int threadcounts[] = { 1, kNumThreads };
for (size_t i = 0;
     i &lt; sizeof(threadcounts) / sizeof(threadcounts); i++) {</pre>
<p>I doubt it. It is hard to imagine anyone making such a silly mistake as writing "sizeof(threadcounts) / sizeof(threadcounts)", so such an example will never be created on purpose. The expression always equals one, so the loop body runs once regardless of the array's length; the divisor was presumably meant to be the size of one element. By the way, this fragment is taken not from a student's lab work, but from the Chromium project. It is diagnosed by the PVS-Studio analyzer very easily, of course.</p>
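<p>One common way to avoid spelling out the sizeof division at all is a small helper that deduces the element count from the array type. This is a sketch, not the Chromium fix; array_size(), sum_threadcounts() and the value of kNumThreads are assumptions made for illustration.</p>

```cpp
#include <cstddef>

// Deduces N from the array type. Passing a pointer instead of an array
// fails to compile, so the count cannot silently be wrong.
template <typename T, std::size_t N>
constexpr std::size_t array_size(const T (&)[N]) { return N; }

const int kNumThreads = 8;  // hypothetical value

// Visits every element of the array and returns the sum of the counts.
int sum_threadcounts() {
    int threadcounts[] = { 1, kNumThreads };
    int total = 0;
    for (std::size_t i = 0; i < array_size(threadcounts); i++)
        total += threadcounts[i];
    return total;
}
```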
<p>2.</p>
<p>Hand-written samples are arbitrary, and there are few of them. So you may get very different results depending on chance. You may invent 5 errors that will be successfully found by one analyzer and not found by another analyzer. Or you may create a program with five errors, and two analyzers will give opposite results for it. The sample for such an investigation is too small. To be able to compare and study tools with at least somewhat reliable results, you must write a program text with at least 500 different errors. An investigation based on 5-10 errors is not reliable.</p>
<p>Moreover, programmers expect to see diagnostic messages on errors of some particular type and forget about the rest. For example, almost all the programmers write one and the same sample with a memory release defect:</p>
<pre>void Foo()
{
  int *a = (int *)malloc(X);
  int *b = (int *)malloc(Y);
  //...
  free(a);
}</pre>
<p>Some analyzers detect this error, the others don't. For instance, PVS-Studio does not diagnose memory leaks currently. But it can find the following stuff:</p>
<pre>static int rr_cmp(uchar *a,uchar *b)
{
  if (a[0] != b[0])
    return (int) a[0] - (int) b[0];
  if (a[1] != b[1])
    return (int) a[1] - (int) b[1];
  if (a[2] != b[2])
    return (int) a[2] - (int) b[2];
  if (a[3] != b[3])
    return (int) a[3] - (int) b[3];
  if (a[4] != b[4])
    return (int) a[4] - (int) b[4];
  if (a[5] != b[5])
    return (int) a[1] - (int) b[5];
  if (a[6] != b[6])
    return (int) a[6] - (int) b[6];
  return (int) a[7] - (int) b[7];
}</pre>
<p>There must be "return (int) a[5] - (int) b[5];" instead of "return (int) a[1] - (int) b[5];".</p>
<p>Why does nobody write such examples? Note that PVS-Studio has found this error in the MySQL project.</p>
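<p>For comparison, here is a loop-based sketch of the same eight-byte comparison, which leaves no room for a mistyped index. It is not the MySQL code; the name rr_cmp_loop is an assumption made for illustration.</p>

```cpp
typedef unsigned char uchar;

// Compares the first 8 bytes of a and b. Like the original, it returns the
// difference of the first pair of bytes that differ, or the difference of
// the last pair if the first seven are equal.
int rr_cmp_loop(const uchar* a, const uchar* b) {
    for (int i = 0; i < 7; i++)
        if (a[i] != b[i])
            return (int) a[i] - (int) b[i];
    return (int) a[7] - (int) b[7];
}
```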
<p>The conclusion is, adequate investigation or comparison of tools can be carried out only with real projects. You take project A, test it with PC-Lint / Visual C++ / PVS-Studio / C++Test, study all the messages attentively, draw up a table of results (how many and which errors each analyzer has found). This is the only real investigation and comparison. For example: "<a href="http://www.viva64.com/en/a/0073/">Comparing the general static analysis in Visual Studio 2010 and PVS-Studio by examples of errors detected in five open source projects</a> ".</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/08/myths-about-static-analysis-the-fifth-myth-a-small-test-program-is-enough-to-evaluate-a-tool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coarse-grained locks and Transactional Synchronization explained</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 22:55:02 +0000</pubDate>
		<dc:creator>James Reinders (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Haswell]]></category>
		<category><![CDATA[HLE]]></category>
		<category><![CDATA[RTM]]></category>
		<category><![CDATA[transactional memory]]></category>
		<category><![CDATA[TSX]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/</guid>
		<description><![CDATA[Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful.  I’ll do my best to explain them in this blog. In my blog “Transactional Synchronization in Haswell,” I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks.  Understanding coarse-grained locks and [...]]]></description>
			<content:encoded><![CDATA[<p>Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful.  I’ll do my best to explain them in this blog.</p>
<p>In my blog “<a href="../../../../2012/02/07/transactional-synchronization-in-haswell">Transactional Synchronization in Haswell</a>,” I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks.  Understanding coarse-grained locks and the concept of transactions are both key to understanding why Intel TSX matters.</p>
<p>Intel TSX may enhance performance of mutual exclusion other than simple coarse-grained locks, but I will focus on coarse-grained locking because it is common and Intel TSX allows highly concurrent accesses using only a simple locking mechanism.</p>
<p><strong>An example</strong></p>
<p>To motivate by illustration, let’s consider a simple hash table. Hash tables are used to map a <em>key</em> to a <em>key</em> and <em>value</em> pair in constant time on average. Two key operations are add (insert) and lookup (retrieve). Resizing and deletion are two additional operations of general interest, but I will leave them for another time.</p>
<p>Designing a highly concurrent hash table is a non-trivial task, and there are many approaches to allow high levels of concurrency.  All these approaches add complexity to the program, and often to the data structures themselves.</p>
<p>The simplest approach is a <em>single lock</em> approach. In such an approach, every operation on the hash table starts by obtaining the lock for the table and concludes by releasing the lock. While the lock is held for the operation, no other task on the system can obtain the lock and therefore no hash table operation is allowed to proceed.</p>
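<p>A minimal sketch of the single lock approach, assuming a std::mutex guarding a std::unordered_map with string keys and int values. The class name CoarseLockedTable and its interface are hypothetical and chosen only for illustration.</p>

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Every operation takes the one table-wide lock for its whole duration,
// so no two operations on the table ever run at the same time.
class CoarseLockedTable {
public:
    void add(const std::string& key, int value) {
        std::lock_guard<std::mutex> guard(lock_);
        table_[key] = value;
    }

    // Returns true and stores the value in value_out if the key is present.
    bool lookup(const std::string& key, int& value_out) const {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = table_.find(key);
        if (it == table_.end())
            return false;
        value_out = it->second;
        return true;
    }

private:
    mutable std::mutex lock_;  // the single coarse-grained lock
    std::unordered_map<std::string, int> table_;
};
```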
<p>Considering Figure 1, no concurrent operations are allowed, so each of the five operations shown would occur one at a time.</p>
<div style="text-align: center;"><img src="../../../../wordpress/wp-content/uploads/2012/01/Slide1.png" alt="" width="77%" />
<p><strong>Figure 1: Five hash table operations requested</strong></p></div>
<p><strong>Solutions</strong></p>
<p>A common solution is to break the hash table into smaller regions, and have locks that apply to regions. While this can reduce contention, it still can create needless delays and it definitely complicates the coding and the data structure.</p>
<p>Such an approach is a prime example of taking a <strong>coarse-grained lock</strong> (a single lock for the entire hash table) and working to make it a finer grained lock (multiple locks for smaller table sections). Coarse-grained locks are easier to use, easier to understand and easier to debug.  The only disadvantage is that they tend to impede performance in a multithreaded environment. Multicore processors are increasing the likelihood of this being a problem, and help motivate new hardware assistance so that programming has a chance to stay simple more often than without assistance.</p>
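<p>The region-based scheme described above might look like the following sketch, in which each region has its own lock and its own bucket storage. The class name StripedTable, the region count of 16 and the use of std::hash to pick a region are assumptions made for illustration. Even in this small form, the extra bookkeeping compared with a single lock is visible.</p>

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Operations on keys that fall into different regions take different locks
// and can proceed concurrently; keys in the same region still serialize.
class StripedTable {
public:
    void add(const std::string& key, int value) {
        Region& r = region_for(key);
        std::lock_guard<std::mutex> guard(r.lock);
        r.entries[key] = value;
    }

    // Returns true and stores the value in value_out if the key is present.
    bool lookup(const std::string& key, int& value_out) {
        Region& r = region_for(key);
        std::lock_guard<std::mutex> guard(r.lock);
        auto it = r.entries.find(key);
        if (it == r.entries.end())
            return false;
        value_out = it->second;
        return true;
    }

private:
    static const std::size_t kRegions = 16;  // hypothetical region count

    struct Region {
        std::mutex lock;
        std::unordered_map<std::string, int> entries;
    };

    // Picks the region from the key's hash value.
    Region& region_for(const std::string& key) {
        return regions_[std::hash<std::string>()(key) % kRegions];
    }

    std::array<Region, kRegions> regions_;
};
```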
<p><strong>Transactional Synchronization (Intel TSX) as a solution</strong></p>
<p>What would be ideal is to use the single lock (coarse-grained locking) because it is easy and not very error prone, but still have the performance of a fine-grained implementation. In our Figure 1 example, only one operation conflicts with another. This example does have more conflicts than would be expected in a real-world example.</p>
<p>Considering this example, three of the operations have no collision with the other operations so the use of HLE (part of Intel TSX) on the single lock will completely elide the lock. In other words, the performance is very close to the performance of the code if no locking or unlocking code were present. The key however is that the operations are protected by the Intel TSX hardware, which has silently ensured that the protection intended by the lock is indeed assured.</p>
<p>The two operations that map to the same hash table entry will need to be staggered. This will occur even if we are unlucky enough to have them happen at the same time. In such a case, Intel TSX will detect that the lock was indeed needed and some locking overhead will be incurred. What would actually happen in such a case is that the colliding tasks will proceed into the protected code until the processor detects the conflict. At such a point, both updates will abort their protected code (also called the transaction). The most common solution then is to have each task proceed but actually enforce the lock on the second try. This means that one task will win, and delay the other, until the operation is complete. The precise decision on how to handle the collision is either up to the processor implementation with HLE, or the programmer with RTM. The processor implementation for HLE will also be fairly simple and conservative, in order to preserve the semantics of the original lock and hence compatibility with processors that lack Intel TSX.</p>
<p><strong>Summary</strong></p>
<p>For a hash map, Intel TSX allows for the right things to occur without losing the protection that the locks need to give. Intel TSX ensures the same results as the coarse-grained lock guarantees, but allows unrelated operations to proceed without delays that the coarse-grained locks would have caused. For more information on Transactional Synchronization, see my blog on <a href="http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/">Intel TSX</a>.</p>
<p>Please check out the <a href="http://software.intel.com/en-us/avx/">specification</a> and stay tuned for information about supporting tools from Intel and others in the coming months.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Transactional Synchronization in Haswell</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 22:54:33 +0000</pubDate>
		<dc:creator>James Reinders (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Haswell]]></category>
		<category><![CDATA[HLE]]></category>
		<category><![CDATA[RTM]]></category>
		<category><![CDATA[transactional memory]]></category>
		<category><![CDATA[TSX]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/</guid>
		<description><![CDATA[We have released details of Intel® Transactional Synchronization Extensions (TSX) for the future multicore processor code-named “Haswell”. The updated specification (Intel® Architecture Instruction Set Extensions Programming Reference) can be downloaded. In this blog, I’ll introduce Intel TSX and provide a little background. Please refer to The Transactional Synchronization Extensions Chapter (Chapter 8) in the manual [...]]]></description>
			<content:encoded><![CDATA[<p>We have released details of Intel® Transactional Synchronization Extensions (TSX) for the future multicore processor code-named “Haswell”. The updated specification (Intel® Architecture Instruction Set Extensions Programming Reference) can be <a href="http://software.intel.com/en-us/avx">downloaded</a>.</p>
<p>In this blog, I’ll introduce Intel TSX and provide a little background. Please refer to The Transactional Synchronization Extensions Chapter (Chapter 8) in the <a href="http://software.intel.com/en-us/avx/">manual</a> for additional information. These new synchronization extensions (Intel TSX) are useful in shared-memory multithreaded applications that employ lock-based synchronization mechanisms.</p>
<p>In a nutshell, Intel TSX provides a set of instruction set extensions that allow programmers to specify regions of code for transactional synchronization. Programmers can use these extensions to achieve the performance of fine-grain locking while actually programming using coarse-grain locks. I have written a simple illustrative example in my blog “<a href="http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/">Coarse-grained locks and Transactional Synchronization explained</a>.”</p>
<p>Locks are a low-level programming construct (close to the hardware), so any discussion of Intel TSX will be low level too. How Intel TSX might affect higher-level programming methods, or enable new programming models, is beyond the scope of my blog but I will briefly comment on it at the end of this blog.</p>
<p><strong>Why is this useful? </strong></p>
<p>With transactional synchronization, the hardware can determine dynamically whether threads need to serialize through lock-protected critical sections, and perform serialization only when required. This lets the processor expose and exploit concurrency that would otherwise be hidden due to dynamically unnecessary synchronization.</p>
<p>At the lowest level with Intel TSX, programmer-specified code regions (also referred to as transactional regions) are executed transactionally. If the transactional execution completes successfully, then all memory operations performed within the transactional region will appear to have occurred instantaneously when viewed from other logical processors. A processor makes architectural updates performed within the region visible to other logical processors only on a successful commit, a process referred to as an atomic commit.</p>
<p>These extensions can help achieve the performance of fine-grain locking while using coarser grain locks. These extensions can also allow locks around critical sections while avoiding unnecessary serializations. If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization. Even though the software uses lock acquisition operations on a common lock, the hardware is allowed to recognize this, <a href="http://www.merriam-webster.com/dictionary/elide">elide</a> the lock, and execute the critical sections on the two threads without requiring any communication through the lock if such communication was dynamically unnecessary.</p>
<p><strong>Intel TSX Interfaces</strong></p>
<p>Intel TSX provides two software interfaces. The first, called Hardware Lock Elision (HLE), is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) that is used to specify transactional regions. HLE is compatible with the conventional lock-based programming model. Software written using the HLE hints can run on both legacy hardware without TSX and new hardware with TSX. The second, called Restricted Transactional Memory (RTM), is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) that allows programmers to define transactional regions in a more flexible manner than is possible with HLE. Unlike the HLE extensions, but just like most new instruction set extensions, the RTM instructions will generate an undefined instruction exception (#UD) on older processors that do not support RTM. RTM also requires the programmer to provide an alternate code path for when the transactional execution is not successful.</p>
<p>In summary: “Intel Transactional Synchronization Extensions (Intel TSX) comes in two flavors: HLE and RTM. Hardware Lock Elision (HLE) is legacy compatible. Restricted Transactional Memory (RTM) offers flexibility but requires the programmer to provide an alternative code path for when transactional execution is not successful.”</p>
<p>The <a href="http://software.intel.com/en-us/avx/">specification</a> describes these extensions in detail and outlines various programming considerations to get the most out of them.</p>
<p><strong>Intel TSX Applicability</strong></p>
<p>Intel TSX targets a certain class of shared-memory multi-threaded applications; specifically multi-threaded applications that actively share data. Intel TSX is about allowing programs to achieve fine-grain lock performance without requiring the complexity of reasoning about fine-grain locking.</p>
<p>However, if there is high <em>data</em> contention the algorithm would need to change in order to have an opportunity for high scalability. There are no magic bullets that can solve the problem, since true high data contention implies that the algorithm is effectively serialized.</p>
<p><strong>Transactional Programming?</strong></p>
<p>How Intel TSX might affect higher-level programming methods, or enable new programming models, is beyond the scope of my blog. Several experimental compiler implementations, not related specifically to Intel TSX, are available, including <a href="http://gcc.gnu.org/wiki/TransactionalMemory">gcc 4.7</a>, which will have an experimental implementation. We can expect language standards committees to review proposals on how to add transactional models at the language level (Intel has supported the creation of the <a href="https://sites.google.com/site/tmforcplusplus/C%2B%2BTransactionalConstructs-1.1.pdf">Draft Specification of Transaction Language Constructs for C++</a>). Intel TSX may enable a more efficient implementation of some transactional models than would be possible without it. Much work remains to focus on real-world examples of usages and applications to develop and refine future usage. Good luck to all involved!</p>
<p>While Intel TSX may enable efficient implementations of new programming models, it does not require a new programming model and does not propose a new programming model. Intel TSX provides hardware-supported transactional-execution extensions to ease the development and improve the performance of existing programming models.</p>
<p><strong>Summary</strong></p>
<p>Intel TSX provides extensions that allow programmers to specify regions of code for transactional execution. Programmers can use these extensions to achieve higher performance with less effort, for example by achieving fine-grain locking performance while programming with coarser-grain locks. This is a big help and therefore big news for programmers.</p>
<p>Please check out the <a href="http://software.intel.com/en-us/avx/">specification</a> and stay tuned for information about supporting tools from Intel and others in the coming months.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sweet 16?</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/06/sweet-16/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/06/sweet-16/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 21:59:29 +0000</pubDate>
		<dc:creator>Clay Breshears (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Power Efficiency]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/06/sweet-16/</guid>
		<description><![CDATA[Have we already hit the maximum number of cores that can be put in our processors? Or have the needs of the user and developer communities been served at sixteen cores?]]></description>
			<content:encoded><![CDATA[<p>I just saw the article "<a title="AMD calls end to core growth on server chips" href="http://news.techworld.com/data-centre/3334884/amd-calls-end-ot-core-growth-on-server-chips/">AMD calls end to core growth on server chips</a>" at Techworld.com. The gist of the article is that AMD has decided to produce server chips with no more than 16 cores. There were some interesting future directions outlined and hinted at by the end of the article, too.</p>
<p>What seemed most disturbing to me was the limit on the number of cores being self-inflicted. Surely we can't have reached the maximum number of cores that are possible to squeeze onto a chip? The whole "right turn" idea to add cores rather than try to cool processors reaching rocket engine temperatures was less than 10 years ago. I'm not sure where the physics starts to overshadow Moore's Law, but I thought I'd  heard that a few more generations of smaller wire sizes in processor dies were still possible. So why not push more and more cores into the same package?</p>
<p>It might be that the average server application (and, perhaps even more so, consumer applications) can't scale well beyond some fixed number of cores. How many cores does it take to type and post a tweet or update your Facebook status or to watch a streaming video? Would any of those tasks be faster or somehow enhanced if there were twice the number of cores available?</p>
<p>If we stop increasing the core counts in the next 5 years, how will new chips keep fulfilling the ever-growing hunger for more performance by consumers? Maybe it won't be about faster and faster application execution, but more about lower energy consumption while maintaining a level of performance. I guess at some point we'll stop being concerned about Gigahertz or core counts because all processors will be able to do many of the same tasks in about the same amount of time.</p>
<p>I do know that power consumption is going to be a major driving design force as HPC moves closer toward Exascale platforms.  Thus, if the THX-1138 processor draws twice as much power as the CFM602 processor, I would be more likely to build my system with the latter.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/06/sweet-16/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vectorization - Find out what it is, Find out More!</title>
		<link>http://software.intel.com/en-us/blogs/2012/01/31/vectorization-find-out-what-it-is-find-out-more/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/01/31/vectorization-find-out-what-it-is-find-out-more/#comments</comments>
		<pubDate>Tue, 31 Jan 2012 17:58:33 +0000</pubDate>
		<dc:creator>Shannon Cepeda (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[MIC]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[simd]]></category>
		<category><![CDATA[vectorization]]></category>
		<category><![CDATA[webinar]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/01/31/vectorization-find-out-what-it-is-find-out-more/</guid>
		<description><![CDATA[One of my performance focus areas for this year is vectorization. I am excited to start creating more content and spreading the message about this technology, as it has been a little bit underappreciated in the past. So to kick things off, I am going to launch a blog series and a 1-hour overview webinar. [...]]]></description>
			<content:encoded><![CDATA[<p>One of my performance focus areas for this year is vectorization. I am excited to start creating more content and spreading the message about this technology, as it has been a little bit underappreciated in the past. So to kick things off, I am going to launch a blog series and a 1-hour overview webinar.</p>
<p>-------------<br />
<strong>First, information about the webinar.</strong> I will be hosting this with my colleague Wendy Doerner on Feb 15th at 9AM PST. We will cover how to get started with vectorization, including examples and resources. To register or view the abstract, click this link:<br />
<a href="https://www1.gotomeeting.com/register/761784545">https://www1.gotomeeting.com/register/761784545</a><br />
If you attend the event live, you will also have the opportunity to request a followup from one of our vectorization experts!</p>
<p>-------------<br />
<strong>For the blog series I will answer 3 questions:<br />
What is Vectorization?<br />
Who Can Use It?<br />
What Are the Benefits?</strong></p>
<p>Today I will start with the first question: <em>What is Vectorization?</em></p>
<p>Vectorization is a method for achieving parallelism inside a single processor core. Vectorizing is done by using special instructions called SIMD (Single Instruction, Multiple Data) operations. SIMD instructions, and the hardware that goes along with them, have been present in Intel processors for over a decade. (Remember those commercials in the mid-90s with people dancing in bunny suits promoting MMX™ Technology? MMX was a set of SIMD instructions). The way that SIMD instructions work is that they operate on several pieces of data in parallel.<br />
In the typical (non-vectorized) case, when you add together 2 variables, each is stored in its own CPU register. When you perform an operation on them, such as addition, the 2 register quantities are added and the result is stored back into a register. Using a SIMD instruction, you can fill a register with multiple variables to be added, which is called "packing" the register. With the most recent SIMD instruction set, Intel® Advanced Vector Extensions (Intel® AVX), which is available on Intel® Microarchitecture Codename Sandy Bridge processors, you can pack up to 32 data elements into one register. The number of elements allowed depends on the size of the element - in Intel® AVX, for example, registers are 256 bits wide, so each can hold 32 8-bit integers, or 8 32-bit floats, or 4 64-bit floats, etc. These data elements can all be combined with another packed register full of elements, allowing you to perform multiple operations on multiple pieces of data at once. For instance, adding 2 packed SIMD registers would produce multiple results, which would be stored into a packed register as well. Being able to do these operations at once rather than one right after the other can result in significant performance gains for the right type of code.</p>
<p><img alt="" src="http://software.intel.com/file/41291" title="SIMD Add" class="alignnone" width="543" height="211" /></p>
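<p>The lane arithmetic and the packed add can be sketched in portable C++. This is only an illustration of the idea, not actual SIMD code: the names kRegisterBits, lanes and add_packed are hypothetical, the 256-bit width is taken as a constant, and whether the inner loop really becomes a single SIMD instruction depends on the compiler and its options.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed register width for this sketch: 256 bits.
constexpr std::size_t kRegisterBits = 256;

// Number of elements of type T that fit into one register of that width.
template <typename T>
constexpr std::size_t lanes() {
    return kRegisterBits / (8 * sizeof(T));
}

// Element-wise add, written so that each outer iteration handles one
// "register" worth of floats. A vectorizing compiler may turn the inner
// loop into a packed add; this code itself is plain scalar C++.
void add_packed(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    constexpr std::size_t kLanes = lanes<float>();
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    // Remainder elements that do not fill a whole register.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

<p>With the usual 4-byte float and 8-byte double, lanes&lt;float&gt;() is 8 and lanes&lt;double&gt;() is 4, and a 1-byte integer type gives 32.</p>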
<p>And addition is not the only operation possible on a packed register! Each set of SIMD instructions includes many different operations, with more being added in upcoming processor generations.</p>
<p>But that takes us to the next topic, who can use vectorization, which we'll cover in the next blog. Feel free to ask questions in the comments of this blog series too; I might turn the questions into future entries. Thanks for reading!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/01/31/vectorization-find-out-what-it-is-find-out-more/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Amdahl&#039;s Law for Energy Efficient Performance Estimation?</title>
		<link>http://software.intel.com/en-us/blogs/2012/01/26/using-amdahls-law-for-energy-efficient-performance-estimation/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/01/26/using-amdahls-law-for-energy-efficient-performance-estimation/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 21:24:03 +0000</pubDate>
		<dc:creator>Clay Breshears (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Power Efficiency]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/01/26/using-amdahls-law-for-energy-efficient-performance-estimation/</guid>
		<description><![CDATA[While trying to find an answer to my previous question, I stumbled across the paper "Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era" (Computer, Dec. 2008, pp. 24-31) by Dong Hyuk Woo and Hsien-Hsin S. Lee (Georgia Institute of Technology). The title had me thinking that this might be an investigation into finding [...]]]></description>
			<content:encoded><![CDATA[<p>While trying to find an answer to my <a href="http://software.intel.com/en-us/blogs/2012/01/18/how-would-you-define-energy-efficient/">previous question</a>, I stumbled across the paper "Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era" (<em>Computer</em>, Dec. 2008, pp. 24-31) by Dong Hyuk Woo and Hsien-Hsin S. Lee (Georgia Institute of Technology). The title had me thinking that this might be an investigation into finding a metric or upper bound on how energy efficient an application could be. It didn't quite turn out to be that simple, but the findings are interesting.</p>
<p>The authors look to evaluate which of three possible processor core architectures might be best for parallel execution that minimizes energy consumption. The three model core arrangements are 1) multi-core (several large processing cores on a single chip), 2) manycore (lots and lots of simpler, more power efficient cores), and 3) a combination of a single large core with many simpler cores. The first is like the current dual-, quad-, and hexa-core processors, the second is akin to a GPU, and the third is a hybrid conglomeration of a large core sitting on a GPU.</p>
<p>For the purposes of the model formulas, the maximum power consumption of a single large core is normalized to 1 and the power consumption of an idle processor is an added variable, <em>k</em>. For the first architecture, the new variable is used in the traditional Amdahl's Law formula as a multiplier to the serial percentage of time multiplied by the (n-1) idle cores. After some simple algebraic manipulation, the authors derive a formula for estimating the average power consumption, in watts (<em>W</em>), for a parallel application with <em>n </em>cores and the stated percentages of parallel and serial work. A similar derivation is done for the manycore model with the power consumption per simple core being 0.25 of the large core. With the hybrid model, the assumption used to derive a corresponding formula is that the large single core handles the serial execution and the simpler cores do the parallel work.</p>
<p>Since a measure of the watts consumed is provided by the model, the authors then compute performance per watt (<em>Perf/W</em>) by computing the original Amdahl's formula divided by the formula to compute <em>W</em>. In order to compare the three models to each other, a power budget is imposed, which sets the number of cores available for each model.</p>
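<p>For the multi-core model, the description above can be turned into formulas along the following lines. This is my own reconstruction from the stated assumptions (busy core power normalized to 1, idle core power <em>k</em>, <em>n</em> cores), with <em>f</em> as an assumed symbol for the parallel fraction of the work; it is a sketch and is not copied from the paper.</p>

```latex
% Serial phase: duration (1-f), one busy core plus (n-1) idle cores.
% Parallel phase: duration f/n, all n cores busy.
\begin{align}
\mathit{Perf} &= \frac{1}{(1-f) + \dfrac{f}{n}} \\[4pt]
W &= \frac{(1-f)\,\bigl[1 + (n-1)k\bigr] + \dfrac{f}{n}\cdot n}{(1-f) + \dfrac{f}{n}}
   = \frac{1 + (n-1)\,k\,(1-f)}{(1-f) + \dfrac{f}{n}} \\[4pt]
\frac{\mathit{Perf}}{W} &= \frac{1}{1 + (n-1)\,k\,(1-f)}
\end{align}
```

<p>Under these assumptions <em>Perf/W</em> reaches its maximum of 1 only when <em>f</em> = 1, that is, when there is no serial phase during which the other (n-1) cores sit idle.</p>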
<p>The conclusions drawn from comparing the models with various numbers of cores and fractions of parallel execution of the overall execution time are probably the most interesting part of the article. For example, the first result reported is that to achieve the highest <em>Perf/W</em> value in the multi-core model, the parallelization must scale linearly. If the application doesn't scale linearly, the processor (model) must dissipate more energy than the serial version since the idle power of the extra cores scales linearly.</p>
<p>The ultimate result of the paper was that the hybrid model, one large core and many small cores, was the most power scalable. The manycore option does well with high amounts of parallelism and lower power budgets (fewer total cores), but as that budget increases, the number of simple cores increases and the effective serial execution performance does not. The hybrid model, with the single large core in place of several simpler cores, can more efficiently handle the serial portions of the execution (than one simple core out of the dozens sitting idle).</p>
<p>As I was reading the paper I could identify which was the model of current standard multi-core processors available in abundance today. The manycore model could easily be a GPU or MIC accelerator by itself. The hybrid model suggested the combination of a manycore accelerator and a dual-core processor (in the absence of  heterogeneous core chips). I wondered where vector hardware fits into the three models. Considering just the vector registers alone might suggest it would be an instance of the manycore model. However, these registers are part of a larger core, which makes me think of the hybrid model. Maybe they are a second level of parallel execution that isn't accounted for in the three models.</p>
<p>It's a good paper. But I have a couple of quibbles. First, the only way that parallel execution on the multi-core model underachieves the serial equivalent execution is if the sequential code is run on a single core system. A later comment in the paper makes me think that this is the assumption, but it's not too clear. This assumption is not valid in the real world. For a true apples-to-apples comparison, the serial code needs to be run on a multi-core processor, too. If that were the case, I contend that the parallel execution consumes less energy.</p>
<p>For example, assume that we have an execution time of 10 time units (let's call them <em>moops</em>). On a quad core processor running the serial code we would have one core running full speed for 10 <em>moops </em>and the other three cores generating an aggregate 30 <em>moops </em>of idle consumption. If the algorithm is 50% parallel, we would have 5 <em>moops </em>of full power consumption in serial, 5 <em>moops </em>of full consumption in parallel across four cores, and 15 <em>moops </em>total of idle consumption. Even if the code is 10% parallel there would only be 27 <em>moops </em>of total idle consumption. Any level of (perfect) parallelism is going to prove to consume less energy than the serial equivalent on the same system. Am I missing something?</p>
<p>Note that I included '(perfect)' at the end of the previous paragraph. There will always be overhead in parallel computations and this will expand the execution time of the parallel portions and, consequently, the full consumption time of the execution (e.g., the 50% parallel portion above might require 5.4 <em>moops </em>of full consumption).</p>
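<p>The <em>moops </em>bookkeeping above is easy to check with a few lines of code. The sketch below assumes perfect parallelism (no overhead), a busy core drawing power 1 and an idle core drawing power <em>k</em>; the names CoreTime, core_time and energy are hypothetical and exist only for this illustration.</p>

```cpp
#include <cassert>

// Aggregate core-time, in moops, summed over all cores.
struct CoreTime {
    double busy;  // core-moops spent at full power
    double idle;  // core-moops spent idle
};

// serial_time: run time of the purely serial code, in moops.
// cores: number of cores on the chip.
// parallel_fraction: share of the serial run time that parallelizes perfectly.
CoreTime core_time(double serial_time, int cores, double parallel_fraction) {
    assert(cores >= 1);
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    const double serial_part = serial_time * (1.0 - parallel_fraction);
    const double parallel_work = serial_time * parallel_fraction;
    // Serial phase: one core busy, the other (cores - 1) idle.
    // Parallel phase: all cores busy for parallel_work / cores moops each,
    // which adds up to parallel_work busy core-moops and no idle time.
    return CoreTime{serial_part + parallel_work, serial_part * (cores - 1)};
}

// Energy with busy power normalized to 1 and idle power k per core.
double energy(const CoreTime& t, double k) {
    return t.busy + k * t.idle;
}
```

<p>For 10 <em>moops </em>on four cores this gives 30, 15 and 27 idle core-<em>moops </em>for 0%, 50% and 10% parallel code, with 10 busy core-<em>moops </em>in every case, so for any positive <em>k</em> the energy drops as the parallel fraction grows.</p>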
<p>Second, Amdahl's Law is an estimate of speedup. Speedup is a dimensionless number. That is, I divide the execution time of the serial code by the time of the parallel execution to get a simple number since the <em>moops </em>of the two quantities cancel each other out in that calculation. If I need 10 <em>moops </em>of serial time versus 6.35 <em>moops </em>of parallel time, I get a 1.57X speedup. 1.57 whats? ('X' is not a unit.) Speedup is a metric of relative performance, but it's not what I really think of when I think of performance.</p>
<p>To me "performance" is more absolute. Typically this is some countable quantity like number of transactions, floating-point operations, or feet traveled. It can be associated within a time unit measure, too, like transactions per <em>moop</em>, floating-point operations per second, or furlongs per fortnight. Thus, the metrics of transactions per watt or flops per watt or feet per watt make sense to me. Improvements that raise the performance value or lower the watt value show a trend in the right direction for achieving better energy efficient performance.</p>
<p>I'm still not able to quite wrap my head around the efficacy of speedup per watt (or even speedup per joule, which is also used in the Woo and Lee paper) as an absolute measure of energy efficient performance. It may be that I'm reading too much into this and the metrics are simply used to compare the three architectural models described (within the assumptions given). Perhaps it is simply just a model after all.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/01/26/using-amdahls-law-for-energy-efficient-performance-estimation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Programming is easier than separating 2 corks</title>
		<link>http://software.intel.com/en-us/blogs/2012/01/06/parallel-programming-is-easier-than-separating-2-corks/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/01/06/parallel-programming-is-easier-than-separating-2-corks/#comments</comments>
		<pubDate>Fri, 06 Jan 2012 23:47:54 +0000</pubDate>
		<dc:creator>Clay Breshears (Intel)</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[cork trick]]></category>
		<category><![CDATA[EAPF]]></category>
		<category><![CDATA[SC11]]></category>
		<category><![CDATA[Tom Murphy]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/01/06/parallel-programming-is-easier-than-separating-2-corks/</guid>
		<description><![CDATA[I've known Prof. Tom Murphy for a few years now. Whenever we were at a conference or other event together and had dinner, he invariably would ask the wait staff if they had two corks he could have. If the place served wine, it wasn't too difficult to find two corks that were the same size [...]]]></description>
			<content:encoded><![CDATA[<p>I've known Prof. Tom Murphy for a few years now. Whenever we were at a conference or other event together and had dinner, he invariably would ask the wait staff if they had two corks he could have. If the place served wine, it wasn't too difficult to find two corks that were the same size or close.</p>
<p>Upon receiving the corks, Tom would demonstrate his "cork trick" and mystify everyone that had not seen it before. After going through it three or four times he would hand the corks back to our server and have them try to do it. They would go away, sometimes showing others, as they struggled to figure out the trick. If they actually tried to recreate the solution as Tom had been able to do, they always came back before we had paid the check and triumphantly demonstrated their dexterity.</p>
<p>At SC11, the Educational Alliance for a Parallel Future (<a href="http://www.eapf.org">EAPF</a>) commissioned some corks with the organization's logo. Tom wandered around part of the conference urging attendees to try his cork trick. The tagline he used was that "Parallel Programming is easier than the cork trick." You can see a short video of his efforts to bring a little magic to the SC11 proceedings<a href="http://link.brightcove.com/services/player/bcpid741496472001?bckey=AQ~~,AAAArH1stHk~,LuRqJUw7MaeY_bnKu-CFpxLmWqzXqxwQ&amp;bctid=1337973843001"> here</a>.</p>
<p>If you meet Tom with some corks in his pocket and he brings them out to show you the cork trick, be aware that he will never show you the solution. (He says he really likes me, but I had to figure it out for myself.) Like most problems you encounter in life, very few are impossible to solve; it is just that you don't have a solution, yet.</p>
<p>Parallel programming is the same. It may seem difficult and impossible to figure out, but that only means you haven't discovered the key that will allow you to wrap your brain around the concepts.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/01/06/parallel-programming-is-easier-than-separating-2-corks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Looking Ahead to 2012</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/21/looking-ahead-to-2012/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/21/looking-ahead-to-2012/#comments</comments>
		<pubDate>Wed, 21 Dec 2011 17:58:23 +0000</pubDate>
		<dc:creator>Shannon Cepeda (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Intel Cilk Plus]]></category>
		<category><![CDATA[Intel Many Integrated Core]]></category>
		<category><![CDATA[Intel MIC]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/21/looking-ahead-to-2012/</guid>
		<description><![CDATA[Well, I reflected on 2011 in my last blog, so now it's time to look ahead. My basic role will remain unchanged - I help users of our Intel® Software Development Products to achieve better performance on their applications. I will still be updating our training materials and videos for the latest mainstream Intel processors. [...]]]></description>
			<content:encoded><![CDATA[<p>Well, I reflected on 2011 in <a href="http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/">my last blog</a>, so now it's time to look ahead.  My basic role will remain unchanged - I help users of our Intel® Software Development Products to achieve better performance on their applications.  I will still be updating our training materials and videos for the latest mainstream Intel processors.  And I will be helping customers to discover the latest <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">Intel® VTune™ Amplifier XE</a> features.  And, since parallelism is a common path to performance, I will still be a big advocate of <a href="http://software.intel.com/en-us/articles/intel-tbb/">Intel® Threading Building Blocks (TBB)</a>.  In fact, I plan to create some new training around TBB 4.0 and the flow graph feature.  But I will also be ramping up on some new focus areas for me:</p>
<p>• <strong><a href="http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html">Intel® Many Integrated Core Architecture</a> (MIC)</strong><br />
MIC (pronounced "Mike") is a new architecture that uses many low-power, single-threaded Intel® processor cores working together to provide a high degree of parallelism.  As more customers begin using the Knights Ferry development kit and the developer tool package we currently have available for MIC, I will start studying the architecture and conducting some experiments too.  My first project will be to try to identify the most important performance aspects of MIC applications and how our tools can help developers measure them.</p>
<p>• <strong>Vectorization</strong><br />
Vectorization is parallelism within one CPU core, using special hardware that can work on more than one piece of data at once.  First off I plan to write a series of blogs explaining what vectorization is, who should be interested in it, how to achieve it, and why it's important.  Then I hope to work with some customers directly on this.  One reason why vectorization will be of big interest to me is that it will definitely be a big help to performance on the MIC architecture.</p>
<p>• <strong><a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">Intel® Cilk Plus</a></strong><br />
Fitting hand in hand with the 2 focus areas above is Intel® Cilk Plus, our open source parallelism model that also includes vectorization support.  Although I have worked with and evangelized Cilk Plus a lot this past year, I had mainly been looking at the parallelism part.  Next year, I will spend more time on the vectorization part of Cilk Plus.  The other reason to focus here is that Cilk Plus code, like TBB code, will also run on the MIC architecture.  </p>
<p>So get ready to hear more from me on the above topics!  And Happy Holidays!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/21/looking-ahead-to-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scalable Memory Pools: community preview feature</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 13:05:33 +0000</pubDate>
		<dc:creator>Anton Malakhov (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[memory pool]]></category>
		<category><![CDATA[scalable allocator]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[TBB 4.0]]></category>
		<category><![CDATA[tbbmalloc]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/</guid>
		<description><![CDATA[In TBB 4.0, we introduced new community preview feature (CPF) – the scalable memory pools. See the TBB Reference Manual (D.4) for formal and detailed description. In this blog, we will present them less formally and discuss what changes can be made. Motivation We had vague requests from customers to implement a memory pool (Wikipedia [...]]]></description>
			<content:encoded><![CDATA[<p>In TBB 4.0, we introduced a new community preview feature (<a title="About Community Preview Features" href="http://software.intel.com/en-us/articles/intel-tbb-community-preview-features/">CPF</a>) – the scalable memory pools. See the TBB <a href="http://threadingbuildingblocks.org/documentation.php">Reference Manual</a> (D.4) for a formal and detailed description. In this blog, we will present them less formally and discuss what changes can be made.</p>
<h2>Motivation</h2>
<p style="text-align: justify;">We had vague requests from customers to implement a memory pool (Wikipedia calls it a <a href="http://en.wikipedia.org/wiki/Region-based_memory_management">region</a>) or some of its properties in the TBB scalable memory allocator. We summarized these requests and general information on memory pools from the Internet and got the following compilation of major properties and abilities:</p>
<ul>
<li>Memory pools basically do the same job as standard memory allocators but additionally group memory objects under the umbrella of a specific pool instance, which enables:
<ul>
<li>fast deallocation of all the memory at once on pool destruction or for the sake of further reuse</li>
<li>less memory fragmentation and related synchronization between independent groups</li>
</ul>
</li>
<li>Memory pools allow more control over acquisition and release of memory resources, and may have user-specific sources of memory:
<ul>
<li>memory chunk/buffer of a fixed size</li>
<li>redirection to a specific memory provider, e.g. standard or custom implementation of malloc, big memory pages, memory tied to specific NUMA node, IPC shmem regions.</li>
</ul>
</li>
</ul>
<p style="text-align: justify;">To squeeze out more performance and to fight memory fragmentation, some specific implementations allocate objects of a fixed size only (so-called object pools, e.g. <a href="http://www.boost.org/doc/libs/1_48_0/libs/pool/doc/html/index.html">boost::pool</a>; Wikipedia calls this a <a title="Wiki" href="http://en.wikipedia.org/wiki/Memory_pool">memory pool</a>) or are unable to deallocate individual objects ("arena allocators"). In our implementation, we tried to provide more general functionality in a thread-safe and scalable way. For that purpose, the implementation of the memory pools is based on the TBB scalable memory allocator and so has similar speed and memory consumption properties. Later we may address more specific use cases, based on the feedback.</p>
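<p>For readers who have not met the idea before, here is a toy illustration of the simplest kind of pool mentioned above, an arena over a fixed buffer that can only free everything at once. It is a single-threaded sketch written for this explanation; the class ToyArena and its methods are hypothetical and have nothing to do with how the TBB pools are implemented.</p>

```cpp
#include <cassert>
#include <cstddef>

// Toy bump allocator over a caller-supplied buffer. Not thread-safe.
// The buffer is assumed to be suitably aligned for any object type.
class ToyArena {
public:
    ToyArena(void* buffer, std::size_t size)
        : base_(static_cast<char*>(buffer)), size_(size), used_(0) {
        assert(buffer != nullptr);
    }

    // Hands out the next free bytes, or nullptr when the buffer is exhausted.
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        if (bytes > size_ - used_) return nullptr;  // also guards the rounding below
        const std::size_t rounded = (bytes + align - 1) / align * align;
        if (rounded > size_ - used_) return nullptr;
        void* result = base_ + used_;
        used_ += rounded;
        return result;
    }

    // Individual objects cannot be freed; everything is released at once.
    void recycle() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    char* base_;
    std::size_t size_;
    std::size_t used_;
};
```

<p>Releasing all objects costs a single assignment regardless of how many allocations were made, which is the "fast deallocation of all the memory at once" property; the price is that nothing can be returned individually.</p>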
<h2>Usage</h2>
<p style="text-align: justify;">Our memory pools API consists of two classes for thread-safe memory management: <em>tbb::fixed_pool</em> and <em>tbb::memory_pool</em>. The first one is for the simple case when an already allocated memory block is used for allocation of smaller objects. The second one uses a user-specified memory provider to obtain big chunks of memory where smaller objects reside. As opposed to fixed_pool, memory_pool is able to grow on demand and relinquish unused chunks back to the provider.</p>
<p>Both classes provide familiar methods for allocation and deallocation:</p>
<pre name="code" class="cpp:nogutter:nocontrols">void *ptr = my_pool.malloc( (size_t) 10 );  // allocate 10 bytes
ptr = my_pool.realloc( ptr, (size_t) 12 );  // extend the allocation to 12 bytes
my_pool.free( ptr );                        // deallocate it</pre>
<p>Additionally, there is a method which deallocates all the memory at once, i.e. it is a faster equivalent to a series of calls to my_pool.free() for each pointer obtained in this pool by previous calls to my_pool.malloc():</p>
<pre name="code" class="cpp:nogutter:nocontrols">my_pool.recycle();  // Frees all the memory in the pool for reuse</pre>
<p>Please note that it is not thread-safe to call this method concurrently with other methods on the same instance (similar to the clear() method in containers).<br />
We also provide an allocator class that is STL-compliant (except for the absence of a default constructor) to enable pools inside STL containers:</p>
<pre name="code" class="cpp:nogutter:nocontrols">typedef tbb::memory_pool_allocator&lt;int&gt; pool_allocator_t;
std::list&lt;int, pool_allocator_t&gt; my_list( (pool_allocator_t( my_pool )) );</pre>
<p>Now, the only thing that holds us back from the first experiment with this new feature of TBB is the question of how to create ‘my_pool’. First, we need to enable this feature and include the header:</p>
<pre name="code" class="plain:nogutter:nocontrols">#define TBB_PREVIEW_MEMORY_POOL 1
#include "tbb/memory_pool.h"</pre>
<p>If you want to create a memory pool on top of your own memory block, pass its address and size in bytes to the constructor of the tbb::fixed_pool class, as in the following excerpt:</p>
<pre name="code" class="cpp:nogutter:nocontrols">char buffer[1024*1024];
// The casts below are just to show the types of arguments.
tbb::fixed_pool my_pool( (void*)buffer, (size_t)1024*1024*sizeof(char) );</pre>
<p style="text-align: justify;">The maximum amount of memory which can be allocated from the pool declared above is limited by the size of the buffer minus some space for control structures. If you want to avoid this limitation, use the tbb::memory_pool template class, specifying a memory provider (which will be discussed later) as its template argument:</p>
<pre name="code" class="cpp:nogutter:nocontrols">tbb::memory_pool&lt; std::allocator&lt;char&gt; &gt; my_pool(/*optionally: allocator instance*/);</pre>
<p style="text-align: justify;">You can specify any STL-compatible allocator as the memory provider (though this is subject to change). It will provide (big) memory chunks for my_pool when necessary. The destructor of the memory_pool class releases all the memory chunks back to the memory provider.</p>
<p>Let’s consolidate our knowledge in one artificial example:</p>
<pre name="code" class="cpp">// Link this with tbbmalloc library
#define TBB_PREVIEW_MEMORY_POOL 1
#include "tbb/memory_pool.h"
#include &lt;list&gt;
#include &lt;stdio.h&gt;

int main() {
    static char buf[1024*1024*4]; // buffer for interim data
    tbb::fixed_pool interim_pool(buf, sizeof(buf)); // pool for temporary objects
    tbb::memory_pool&lt; std::allocator&lt;char&gt; &gt; result_pool; // pool to store the results

    typedef tbb::memory_pool_allocator&lt;int&gt; result_allocator_t; // interface to STL containers
    std::list&lt;int, result_allocator_t&gt; result_list( (result_allocator_t( result_pool )) );

    for(int result = 0, i = 0; i &lt; 100; i++, result = 0) {
        for(int j = 0; j &lt; 1000000; j++) {
            int *p = (int*)interim_pool.malloc(4);
            if( p ) result++; // really dummy :)
        }
        // in real application, here can be some processing of allocated objects
        result_list.push_back(result); // no memory fragmentation here - separate pool
        interim_pool.recycle(); // free all the interim objects
        printf("%d\n", result); // should be the same number on each iteration
    }
} // all the memory is released back implicitly</pre>
<p style="text-align: justify;">The simple part is done, and I hope that you are interested enough to proceed with more complex questions, and tell us what you think about it.</p>
<p style="text-align: justify;">Someone may want to know whether it is possible to construct a pool in memory allocated from another pool. It is possible, but one should take care to destroy the inner pool before the outer pool is destroyed or recycle() is called on it. Do you know a good reason to enable such nesting?</p>
<h2>Memory provider interface</h2>
<p style="text-align: justify;">From an API designer's perspective, the memory provider is the most questionable part of the scalable pools API. And since it is still a community preview feature, you are welcome to influence its design. Curious readers might want to ask questions like the following:</p>
<ul>
<li>what are the requirements for the template argument?</li>
<li>why is std::allocator used as a memory provider?</li>
<li>why is the type used with std::allocator in the examples above “char”?</li>
</ul>
<p style="text-align: justify;">The template argument of tbb::memory_pool accepts a memory provider class which satisfies the minimal requirements of an STL-compatible allocator according to the latest C++11 standard: <strong>allocate </strong>and<strong> deallocate</strong> methods, and a <strong>value_type</strong> definition.</p>
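<p style="text-align: justify;">As a rough illustration of these requirements, here is a minimal sketch of a custom provider that simply forwards to std::malloc and std::free. The class name malloc_provider is made up for this example and is not part of TBB. The sketch assumes that nothing beyond the three members listed above is needed, and it does not cover how memory_pool reacts to a failed allocation.</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;cstdlib&gt;

// Hypothetical provider (not part of TBB): forwards chunk requests to the C heap.
class malloc_provider {
public:
    typedef char value_type;   // granularity of requests: one byte

    // n is counted in units of sizeof(value_type)
    value_type *allocate(std::size_t n) {
        return static_cast&lt;value_type*&gt;(std::malloc(n * sizeof(value_type)));
    }
    // the size argument is not needed by std::free
    void deallocate(value_type *ptr, std::size_t /*n*/) {
        std::free(ptr);
    }
};</pre>
<p>If those assumptions hold, it could be used as tbb::memory_pool&lt;malloc_provider&gt; my_pool; in place of std::allocator&lt;char&gt;.</p>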
<p style="text-align: justify;">Using std::allocator and compatible classes is perhaps the most straightforward way to enable memory_pool anywhere. However, from an efficiency standpoint it probably does not make much sense, because such allocators are by design intended for rather small objects, while a memory provider should operate with megabytes. For users who don’t care what the memory provider is, it would be better to provide a default one that maps to the system-default way of mapping memory.</p>
<p style="text-align: justify;">And finally, TBB memory pools don’t really need the type of allocation (i.e. <strong><em>char</em></strong> in the declaration of tbb::memory_pool&lt;std::allocator&lt;<strong>char</strong>&gt;&gt;), but rather need to know the granularity of requests to the memory provider. This is not only a specification of the argument types for allocate and deallocate; our implementation also uses this information to determine the size of memory requests to the memory provider. For example, consider big pages, which can be mapped only in chunks of megabytes:</p>
<pre name="code" class="cpp">// A custom memory provider for memory_pool (Linux-specific; needs &lt;sys/mman.h&gt;)
class big_pages {
public:
    typedef char value_type[2*1024*1024];
    void *allocate(size_t pages) {
        return mmap(0x0UL, pages*2*1024*1024, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); // fd is -1 for anonymous mappings
    }
    // the pointer type requirement is also actually relaxed
    void deallocate(void *ptr, size_t pages) {
        munmap(ptr, pages*2*1024*1024);
    }
};
// usage:
tbb::memory_pool&lt;big_pages&gt; my_pool;</pre>
<h2>Some food for thought</h2>
<p>The way granularity is specified in line 4 of the above example is not straightforward and can be viewed as confusing. This is the price of an STL-compliant interface for the memory provider, and we are not sure if it has more pros than cons:</p>
<ul>
<li>STL compatibility is meant to allow reuse of widely implemented memory allocators.
<ul>
<li>On the other hand, these allocators are usually intended for small allocations, while a pool will need memory chunks of at least hundreds of kilobytes.</li>
</ul>
</li>
<li>In theory, it allows easy nesting of memory pools using our memory_pool_allocator class.
<ul>
<li>But we found that nesting of pools in some other implementations does not mean reusing the memory allocated by the parent pool, but rather a hierarchy of pool objects.</li>
<li>And such nesting is not yet supported anyway.</li>
</ul>
</li>
<li>It is easier to remember requirements that are based on a well-known standard interface.</li>
<li>Granularity is a property of the memory provider and must be passed along with it.</li>
</ul>
<p style="text-align: justify;">As an alternative interface, we are considering specifying the granularity explicitly, but in a separate trait class which would need to be specialized only for memory providers with an allocation granularity &gt; 1. It is even possible to keep STL compatibility using metaprogramming magic, e.g. by defining the granularity as sizeof(value_type) if value_type is defined.</p>
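<p style="text-align: justify;">To make the trait idea more concrete, here is one possible sketch. All names in it (provider_granularity, units_for, byte_provider, huge_page_provider) are hypothetical and not part of TBB. It assumes the rule stated above: the granularity is sizeof(value_type) when the provider defines value_type, and 1 otherwise. The helper units_for shows how a byte count could be rounded up to whole granularity units.</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;

// Helper that maps any type to void; used to detect a nested value_type.
template&lt;typename T&gt; struct always_void { typedef void type; };

// Primary template: a provider without value_type has byte granularity.
template&lt;typename Provider, typename Enable = void&gt;
struct provider_granularity {
    static const std::size_t value = 1;
};

// Selected only when Provider::value_type exists.
template&lt;typename Provider&gt;
struct provider_granularity&lt;Provider,
        typename always_void&lt;typename Provider::value_type&gt;::type&gt; {
    static const std::size_t value = sizeof(typename Provider::value_type);
};

// Number of granularity units needed to cover the given byte count (rounded up).
template&lt;typename Provider&gt;
std::size_t units_for(std::size_t bytes) {
    const std::size_t g = provider_granularity&lt;Provider&gt;::value;
    return (bytes + g - 1) / g;
}

// Two hypothetical providers, reduced to what the trait looks at.
struct byte_provider {};                                          // no value_type
struct huge_page_provider { typedef char value_type[2*1024*1024]; };</pre>
<p>With these definitions, a 3 MB request maps to two units for huge_page_provider and to 3*1024*1024 units for byte_provider. A provider could still specialize provider_granularity explicitly if its granularity is unrelated to any value_type.</p>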
<p style="text-align: justify;">Another question is how to introduce alignment in the interface of memory pools. Basically, it can be either separate aligned_malloc() and aligned_realloc() methods, or an optional argument for the malloc() and realloc() methods.</p>
<p>Also, are the suggested class names good, or do we need to find better names (for instance, "fixed_region" and "dynamic_region" to align with Wikipedia's terminology)?</p>
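<p style="text-align: justify;">For a feel of what aligned allocation involves, the sketch below layers it on top of any pool that offers malloc() and free(), using the common over-allocate-and-shift technique. It is only an illustration under stated assumptions, not how TBB would implement it: the names pool_aligned_malloc and pool_aligned_free are made up, plain_pool is a stand-in so that the sketch does not depend on TBB, and the alignment is assumed to be a power of two.</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;cstdlib&gt;
#include &lt;cstring&gt;
#include &lt;stdint.h&gt;

// Stand-in for a pool: any class with malloc(size_t) and free(void*) would do.
struct plain_pool {
    void *malloc(std::size_t size) { return std::malloc(size); }
    void free(void *ptr) { std::free(ptr); }
};

// Hypothetical helper; alignment must be a power of two.
template&lt;typename Pool&gt;
void *pool_aligned_malloc(Pool &amp;pool, std::size_t size, std::size_t alignment) {
    // Over-allocate: worst-case shift plus room to save the raw pointer.
    char *raw = static_cast&lt;char*&gt;(pool.malloc(size + alignment - 1 + sizeof(void*)));
    if (!raw) return 0;
    uintptr_t start = reinterpret_cast&lt;uintptr_t&gt;(raw + sizeof(void*));
    uintptr_t aligned = (start + alignment - 1) &amp; ~(uintptr_t(alignment) - 1);
    char *result = raw + sizeof(void*) + (aligned - start);
    // Remember the raw pointer just below the aligned block.
    std::memcpy(result - sizeof(void*), &amp;raw, sizeof(raw));
    return result;
}

// Frees a block obtained from pool_aligned_malloc on the same pool.
template&lt;typename Pool&gt;
void pool_aligned_free(Pool &amp;pool, void *ptr) {
    if (!ptr) return;
    char *raw;
    std::memcpy(&amp;raw, static_cast&lt;char*&gt;(ptr) - sizeof(void*), sizeof(raw));
    pool.free(raw);
}</pre>
<p>A built-in method or an optional argument could avoid the extra bookkeeping per allocation, which is one reason the choice of interface matters.</p>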
<h2>Feedback is very welcome</h2>
<p>We are very eager to hear what you think about the above and how it could be used in your projects.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/19/scalable-memory-pools-community-preview-feature/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My 5 Favorite New Intel® Software Development Product Features of 2011</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/#comments</comments>
		<pubDate>Fri, 16 Dec 2011 18:41:39 +0000</pubDate>
		<dc:creator>Shannon Cepeda (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Intel Cilk Plus]]></category>
		<category><![CDATA[Intel Cluster Studio XE]]></category>
		<category><![CDATA[Intel Software Development Products]]></category>
		<category><![CDATA[Intel VTune Amplifier XE]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[TBB 4.0]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/</guid>
		<description><![CDATA[It's been a big year for us in the Intel Developer Products Division. We released Intel® Cluster Studio XE and Intel® Parallel Studio XE Service Pack 1. We continued to plan and design our products to provide support for the compute continuum. And of course we worked to grow our community of developers. Throughout the [...]]]></description>
			<content:encoded><![CDATA[<p>It's been a big year for us in the Intel Developer Products Division. We released <a href="http://software.intel.com/en-us/articles/intel-cluster-studio-xe/">Intel® Cluster Studio XE</a> and <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-xe/">Intel® Parallel Studio XE Service Pack 1</a>. We continued to plan and design our products to provide support for the compute continuum. And of course we worked to grow our community of developers. Throughout the year there have been several new features and developments in some of my favorite products - below I list my personal top 5 and tell you why. This list is of course heavily biased by my particular area of expertise (performance) and is by no means a complete list of all the new products or features that went into Intel® Software Development products in 2011!  So, without further ado, my favorites:</p>
<p>5. <a href="http://software.intel.com/en-us/articles/intel-cilk-plus-open-source/">Intel® Cilk Plus open source port to GCC</a> - <a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">Intel® Cilk Plus</a> was announced in 2010, and an open source specification has been out since late 2010 as well. However this year we began, along with the open source community, to port Cilk Plus to GCC. Some of the first items ported were the parallelism keywords, which is significant to me because it makes our Cilk Plus parallelism model available to a greater audience.</p>
<p>4. <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe">Intel® VTune™ Amplifier XE</a> and <a href="http://software.intel.com/en-us/articles/intel-inspector-xe/">Intel® Inspector XE</a> MPI Support - In the new Cluster Studio XE product, VTune Amplifier XE and Inspector XE are now MPI-enabled. This is important because we are beginning to see more hybrid programming in the HPC and cluster world - which means the applications use a combination of MPI and another threading model (such as OpenMP, Cilk Plus, or <a href="http://software.intel.com/en-us/articles/intel-tbb/">Intel® Threading Building Blocks</a>). We have an existing product, <a href="http://software.intel.com/en-us/articles/intel-trace-analyzer/">Intel® Trace Analyzer and Collector</a>, that analyzes MPI efficiency for a cluster app, but analyzing performance of an individual process running on an MPI rank was more difficult. Now we make it easier to use VTune Amplifier XE or Inspector XE to analyze the threading model used within a rank, which helps us support more cluster customers. </p>
<p>3. <a href="http://drdobbs.com/tools/231900177">Intel® Threading Building Blocks Flow Graph</a> - I was introduced to flow graph this year, when I worked with my colleague Victoria Gromova to create some TBB labs for Intel Developer Forum. Victoria wanted to highlight flow graph as one of the new features of <a href="http://threadingbuildingblocks.org/">TBB 4.0</a>. Flow graph is a new construct that supports many more types of control algorithms, like dependency graphs, event-based models or reactive-based flows. In short, it opens up TBB to more customers while maintaining or improving the TBB performance we have come to expect. </p>
<p>2. <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-xe/#whatsnew">VTune Amplifier XE attach to running process on Linux*</a> - This is a great example of our development team responding to customer feedback. Being able to analyze a running process for a defined period of time (instead of launching it) has been requested by many of our clients. We first got this implemented on Windows*, then this September provided the feature for Linux* in <a href="http://softtalkblog.com/2011/09/13/intel-parallel-studio-xe-2011-service-pack-1-is-released/">Intel® Parallel Studio XE Service Pack 1</a>. I have already been visiting some users who requested this and it is great to be able to share that the feature they have been asking for is here!</p>
<p>1. <a href="http://software.intel.com/en-us/blogs/2011/06/27/what-weve-been-doing-to-make-performance-analysis-easier-on-intel-microarchitecture-codename-sandy-bridge/">VTune Amplifier XE interface for Intel® Microarchitecture Codename Sandy Bridge</a> - For readers of my blog this one should not be a surprise! I have created <a href="http://software.intel.com/en-us/articles/two-part-webinar-and-two-videos-posted-all-covering-sandy-bridge-performance-tuning/">quite a bit of training material </a>on these new Sandy Bridge features. We now provide an analysis type for Sandy Bridge that helps users easily identify the most common software performance issues at the microarchitectural level, and it includes pre-coded metrics, thresholds, and issue highlighting for usability. This is my favorite new feature because, even though I am not a developer, I got to help a little with making this interface by helping define some performance metrics and thresholds and validating them on workloads. It is very cool to see my contributions in the product.</p>
<p>There you have it! I hope you have a chance to try out some of our new product features now or in the coming year. Let us know your favorites, or your requests.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/16/my-5-favorite-new-intel-software-development-product-features-of-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>As fall Idaho twins, so falls Twin Falls, ID</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/14/as-fall-idaho-twins-so-falls-twin-falls-id/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/14/as-fall-idaho-twins-so-falls-twin-falls-id/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 23:24:06 +0000</pubDate>
		<dc:creator>Clay Breshears (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Site News & Announcements]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/14/as-fall-idaho-twins-so-falls-twin-falls-id/</guid>
		<description><![CDATA[A chapter has closed on my career here at Intel. I hope this post isn't too maudlin.]]></description>
			<content:encoded><![CDATA[<p>If you've read <a href="http://software.intel.com/en-us/blogs/2011/12/14/the-last-show-parallel-programming-talk-130-parallel-manifold-with-jim-dempsey/">Kathy's blog and Show Notes</a> for <em>Parallel Programming Talk #130</em>, then you know the sad news. This was the last show we'll be doing. Kathy and I are moving on to different duties within Intel. Ironically, over the last three months I've had quite a few people tell me that they had just found the show and were enjoying the episodes that they had seen. Luckily for them, there will always be this and the previous 129 episodes available for <a href="http://www.intel.com/software/parallelprogrammingtalk">online viewing</a>.</p>
<p>I want to thank Kathy Farrel for all her hard work in organizing and taking the lead on the last 40 shows. She had some fresh ideas during our collaboration and I enjoyed working with her. Kathy was taking on the role of Parallel Programming Community Manager and it seemed like she was asking me at least one question every day about the subject.  She started out a bit camera shy and tongue-tied. But she made steady improvement, started to explore parallel programming topics on her own, and soon got comfortable with the hosting duties. I am impressed with her drive and tenacity. She will do more great things in her <a title="vPro Developer Web Site" href="http://software.intel.com/en-us/vPro">new role</a>.</p>
<p>Aaron Tersteeg deserves a big "Thank You" for developing <em>Parallel Programming Talk </em>back in 2008, first on blogtalkradio and then as an Intel Software Network web video show. When he first approached me about participating, he described it as "<a href="http://www.cartalk.com">Car Talk</a>" but focused around parallel programming topics. We tried to keep things both informal and informative. I was relegated to Aaron's monitor when we started the video shows (as he is in Oregon and I am in Illinois), which led to some light-hearted moments, some accidents, and some experimentation as we played with this restriction.</p>
<p>(My favorite anecdote from the early video days was when Aaron ran into a fan of the show at a Portland Trailblazers game. The fan knew he looked familiar and then asked if he was Clay Breshears.)</p>
<p>The production crew will always have my undying respect and appreciation. Jerry Makare and Josh Bancroft ushered the video era into existence and have always driven the technology and production values to deliver a higher quality product. They were always up for a challenge and conquered many of them during the show's run. I also appreciate all the work that the video production interns--Chris Davis and Anthony Lopez--did for the show. They worked tirelessly behind the scenes moving, setting up, and tearing down equipment, they did some of the editing chores, and they were always great fun during those rare times I was in town for a live shoot.</p>
<p>And finally, I want to thank the fans and casual viewers of the show. Thank you for all your support, questions, and comments. Without you we would have shut down soon after we started. During my tenure as co-host of <em>Parallel Programming Talk</em>, I got to meet some of the superstars of the field, got to hear more about cool parallel languages and approaches to parallel programming, and got to see some cool tools that are useful in making parallel programming, debugging and tuning much easier. I hope that you enjoyed hearing from the experts and finding out about new technology at least half as much as I did.</p>
<p>Keep writing parallel code and be good to each other.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/14/as-fall-idaho-twins-so-falls-twin-falls-id/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Last Show - Parallel Programming Talk #130 - Parallel Manifold with Jim Dempsey</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/14/the-last-show-parallel-programming-talk-130-parallel-manifold-with-jim-dempsey/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/14/the-last-show-parallel-programming-talk-130-parallel-manifold-with-jim-dempsey/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 19:43:59 +0000</pubDate>
		<dc:creator>Kathy Farrel (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Clay Breshears]]></category>
		<category><![CDATA[Jim Dempsey]]></category>
		<category><![CDATA[Kathy Farrel]]></category>
		<category><![CDATA[Parallel Manifold]]></category>
		<category><![CDATA[ParallelProgrammingTalk]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/14/the-last-show-parallel-programming-talk-130-parallel-manifold-with-jim-dempsey/</guid>
		<description><![CDATA[It was sad when "Friends" ended, and who can forget the endings of "MASH" and "Jerry Seinfeld"? So it goes with ISN's "Parallel Programming Talk" show.  Our last show was a fitting end to 130 radio and Web TV programs about all things Parallel. It wasn’t a “montage” or “retrospective” show but simply a great [...]]]></description>
			<content:encoded><![CDATA[<p>It was sad when "Friends" ended, and who can forget the endings of "MASH" and "Jerry Seinfeld"? So it goes with ISN's "Parallel Programming Talk" show.  Our last show was a fitting end to 130 radio and Web TV programs about all things Parallel. It wasn’t a “montage” or “retrospective” show but simply a great interview with one of our good friends and favorite guests, ISN Black Belt Developer Jim Dempsey. We discussed a new concept of Jim’s – Parallel Manifold – which you will be hearing more about on <a href="http://software.intel.com/en-us/blogs/2011/10/14/have-your-cake-and-eat-it-too/">Jim’s blog</a> and in his technical articles. Thanks to Jim for making time for us.</p>
<p>Before I sign off here (on PPT – I am still here in a <a title="vPro Developer Web Site" href="http://software.intel.com/en-us/vPro">new role</a> and my blog will continue) I want to thank my co-host <a href="http://software.intel.com/en-us/blogs/author/clay-breshears/feed/">Clay Breshears</a> for his support on this ride. His expertise, encouragement and exceptional sense of humor contributed heavily to the success of the show. Thanks to our outstanding guests for sharing the info that has drawn our audience. Special thanks to our crew: Jerry Makare, <a title="http://intel.com/software/media" href="http://intel.com/software/media">Technical Director, Intel Software Videos</a>, Producer <a href="http://tinyscreenfuls.com/">Josh Bancroft</a> and Videographer Chris Davis, for early morning shoots, patience through the technical challenges and for their continuing support (new phrase - for the love of Skype).</p>
<p>Extra thanks to <a href="http://mediumtall.wordpress.com/">Aaron Tersteeg</a>, whose brainchild "Parallel Programming Talk" was and is. He is one of the most forward thinking people I know anywhere, whose wisdom and energy fuels a great deal of the goodness here at the Intel Software Network.  </p>
<p>Enjoy the video:<br />
<object id="flashObj" width="640" height="360" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,47,0"><param name="movie" value="http://c.brightcove.com/services/viewer/federated_f9?isVid=1&#038;isUI=1" /><param name="bgcolor" value="#FFFFFF" /><param name="flashVars" value="videoId=1305106771001&#038;playerID=741496472001&#038;playerKey=AQ~~,AAAArH1stHk~,LuRqJUw7MaeY_bnKu-CFpxLmWqzXqxwQ&#038;domain=embed&#038;dynamicStreaming=true" /><param name="base" value="http://admin.brightcove.com" /><param name="seamlesstabbing" value="false" /><param name="allowFullScreen" value="true" /><param name="swLiveConnect" value="true" /><param name="allowScriptAccess" value="always" /><embed src="http://c.brightcove.com/services/viewer/federated_f9?isVid=1&#038;isUI=1" bgcolor="#FFFFFF" flashVars="videoId=1305106771001&#038;playerID=741496472001&#038;playerKey=AQ~~,AAAArH1stHk~,LuRqJUw7MaeY_bnKu-CFpxLmWqzXqxwQ&#038;domain=embed&#038;dynamicStreaming=true" base="http://admin.brightcove.com" name="flashObj" width="640" height="360" seamlesstabbing="false" type="application/x-shockwave-flash" allowFullScreen="true" allowScriptAccess="always" swLiveConnect="true" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed></object></p>
<p>Direct Video Link: http://software.intel.com/en-us/blogs/2011/12/14/the-last-show-parallel-programming-talk-130-parallel-manifold-with-jim-dempsey/</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/14/the-last-show-parallel-programming-talk-130-parallel-manifold-with-jim-dempsey/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MIC: Stepping-stone to Quantum Computing?</title>
		<link>http://software.intel.com/en-us/blogs/2011/12/14/mic-stepping-stone-to-quantum-computing/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/12/14/mic-stepping-stone-to-quantum-computing/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 17:59:41 +0000</pubDate>
		<dc:creator>Clay Breshears (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[MIC]]></category>
		<category><![CDATA[QRAM]]></category>
		<category><![CDATA[quantum computation]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/12/14/mic-stepping-stone-to-quantum-computing/</guid>
		<description><![CDATA[I was reading Quantum Computing for Computer Scientists by Noson S. Yanofsky and Mirco A. Mannucci while I was on the treadmill last night. I started out reading the description of Shor's algorithm (for factoring integers) and thought that implementing this on a classical computer (in parallel, of course) would make an interesting problem for the Intel [...]]]></description>
			<content:encoded><![CDATA[<p>I was reading <em><a href="http://www.cambridge.org/us/knowledge/isbn/item1174708/?site_locale=en_US">Quantum Computing for Computer Scientists</a> </em>by Noson S. Yanofsky and Mirco A. Mannucci while I was on the treadmill last night. I started out reading the description of <a href="http://en.wikipedia.org/wiki/Shor%27s_algorithm">Shor's algorithm</a> (for factoring integers) and thought that implementing this on a classical computer (in parallel, of course) would make an interesting problem for the Intel Threading Challenge contest.</p>
<p>But what really caught my imagination was the first section of Chapter 7, "Programming Languages," that briefly described the Quantum Random Access Machine (QRAM) model of quantum computation. In addition to the few paragraphs that were devoted to this model, there was a picture that showed the relationship of a classic computer to a quantum computing device. Each part was simply a box with data/instructions passed from the classic to the quantum and data (results) passed from quantum to the classic side.</p>
<p>This setup looked familiar and it came to me during my cool down: this is how a system equipped with MIC would work. That is, your Intel Core processor does some initial computation to set up data, the data is passed over to the MIC (along with the computation instructions to be executed), and the results from the MIC can be returned to the Core side for use.</p>
<p>I know that MIC processors (and other GPU-like devices) don't have the same computational power as a quantum processor could have. However, the data-parallel and SIMD execution modes are similar to how a quantum device could take a superposition of all potential input data and execute a single computation step to arrive at a measurable result. This similarity got me thinking that MIC devices could be the first steps taken by the industry to better understand, prepare for and program effective quantum computations.</p>
<p>I don't know if we will ever see commodity quantum computation devices. I doubt they'll be developed within my lifetime, at least. Even so, I am nothing short of astounded when I look back at how far computer technology has come since I wrote my first COBOL program on an IBM mainframe. </p>
<p>Knowing I should "never say never," how about on the day after I get my qPad(TM) quantum tablet device, I come back and comment on this blog post to say I was mistaken about how quickly quantum computation entered our lives? If it's anywhere in the cloud-o-sphere (and you know once these bits get pushed out, they never go away), I'll find it with the qSearch app, which will be based on the algorithm outlined in section 6.4 of Yanofsky's and Mannucci's book.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/12/14/mic-stepping-stone-to-quantum-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

