<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Asaf Shelly</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/asaf-shelly/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Why you should use Procedural and OOP in every application</title>
		<link>http://software.intel.com/en-us/blogs/2012/04/30/why-you-should-use-procedural-and-oop-in-every-application/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/04/30/why-you-should-use-procedural-and-oop-in-every-application/#comments</comments>
		<pubDate>Mon, 30 Apr 2012 17:51:25 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[Guest Blog]]></category>
		<category><![CDATA[Parallel Programing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/04/30/why-you-should-use-procedural-and-oop-in-every-application/</guid>
		<description><![CDATA[Almost everyone wants to do architecture and almost everyone wants to do the UI. It means that every programmer has an opinion about the architecture and infrastructures in use. When you export an API for your system you get more opinions and when your product is an infrastructure (ex. Microsoft) you have too many opinions about [...]]]></description>
			<content:encoded><![CDATA[<p>Almost everyone wants to do architecture and almost everyone wants to do the UI. This means that every programmer has an opinion about the architecture and infrastructures in use. When you export an API for your system you get more opinions, and when your product is an infrastructure (e.g. Microsoft) you get too many opinions about the architecture.</p>
<p>People's architecture is usually tilted towards what they are experienced with. The architecture is based on a paradigm and people usually continue from there. The most prominent paradigm 20 years ago was Procedural Programming. Today Object Oriented Programming is the dominant one. This means that people start their system design with OOD and then ask "what's next?". In an attempt to pull people away from automatically using OOD I wrote a post called <a href="http://software.intel.com/en-us/blogs/2008/08/22/flaws-of-object-oriented-modeling/">Flaws of Object Oriented Modeling</a>, and a followup called <a href="http://software.intel.com/en-us/blogs/2011/10/21/flaws-of-object-oriented-modeling-continue/">Flaws of Object Oriented Modeling Continue</a>.</p>
<p>The truth is usually in between A and B. In this case the truth is 'all of the above'. There are many programming paradigms to employ in a single application. Failing to do so will damage code manageability, response to changes in requirements, flow management, the ability to integrate a new UI, and more. There is a long-running argument between OOP supporters and Procedural Programming supporters. Every once in a while someone steps in and says that you should use more esoteric things such as MVC, Aspect Oriented Programming, Pipeline, etc. Too often you hear people suggesting what they just read about or learned about during a single session at an event. People will always want to try new things and show you that they know something special. There are other cases of course in which people are really experienced with several paradigms, or are experienced with a specific paradigm and can immediately spot where it best applies. These are the people you should be paying attention to.</p>
<p>What I am really looking for is a collection of Paradigm Patterns. Just as you would use a Design Pattern as a programming technique, I suggest that you also employ a Paradigm Pattern as a design technique. So where do we find these patterns? Google doesn't know...</p>
<p>There is a list of paradigms in Wikipedia (see <a href="http://en.wikipedia.org/wiki/Programming_paradigm">Programming Paradigms</a>) but it is only a list and not a pattern. A pattern should have a clear definition of how you identify where it applies, and a clear definition of how to use it. Unless someone reading these lines comments with a reference, I am starting a new collection, so I will now try to create a rough list based on my own personal experience.</p>
<p>The items below are short and simple so that we don't need a full definition of the paradigm in order to understand the pattern. Obviously I will start with OOP and Procedural Programming and we'll build it from there.</p>
<p>First of all let's start with the definition of a paradigm: a Programming Paradigm defines the boundaries of programming and design. From the paradigm we derive the definition of software Components, Interfaces, Programming Rules, and more. For example, C# code is divided into DLL files differently from the way C++ code is divided into DLL files. A C# DLL has a class as an Interface, while a C++ DLL prefers global functions as an Interface. The decision when to use a goto in your C++ code is also derived from the programming paradigm.</p>
<h4>Object Oriented Programming</h4>
<p>OOP is very commonly used because it allows developers to work on the same project without any interactions between them.<br />
<strong>Use</strong>: When you have multiple programmers who can't understand each other, for example one is managing an SQL database and another is doing audio processing. OOD works great for Top Level Design.<br />
<strong>Don't</strong>: When you have several developers who need to share implementation specifics, for example if you need to write a keyboard driver don't break it into fragments which hide implementation specifics from a developer working on the driver.</p>
<h4>Procedural Programming</h4>
<p>This paradigm is used for dividing a process into procedures. For example your day is a process. What you do from the time you park the car until you start reading emails is a procedure.<br />
<strong>Use</strong>: When there is a complex operation which includes dependencies between operations and a need for clear visibility of different application states ('SQL loading', 'SQL loaded', 'Network online', 'No audio hardware', etc). This is usually appropriate for application startup and shutdown.<br />
<strong>Don't</strong>: When there are many simple independent tasks to perform. Also don't use to manage UI.</p>
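<p>To make the startup case a bit more concrete, here is a rough sketch in Python (picked only for brevity). The state names follow the examples above; the function names and the idea of keeping a log of states are my own assumptions for illustration, not a prescribed design.</p>

```python
from enum import Enum


class AppState(Enum):
    STARTING = "Starting"
    SQL_LOADING = "SQL loading"
    SQL_LOADED = "SQL loaded"
    NETWORK_ONLINE = "Network online"
    READY = "Ready"


def load_sql(log):
    # Hypothetical procedure: a real one would open the database here.
    log.append(AppState.SQL_LOADING)
    log.append(AppState.SQL_LOADED)


def bring_network_online(log):
    # This procedure depends on the previous one having completed.
    if AppState.SQL_LOADED not in log:
        raise RuntimeError("network start requires a loaded database")
    log.append(AppState.NETWORK_ONLINE)


def startup():
    # The process is a fixed sequence of procedures, and the
    # application state is visible after every step.
    log = [AppState.STARTING]
    load_sql(log)
    bring_network_online(log)
    log.append(AppState.READY)
    return log


states = startup()
print(", ".join(state.value for state in states))
```

<p>The point of the sketch is only that the order of operations and the current state are explicit and easy to read from top to bottom.</p>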
<h4>Model View Controller</h4>
<p>This paradigm is often used by developers who don't even know it exists. The idea behind it is the clear division between <strong>View:</strong> the data representation to the user; <strong>Model:</strong> the data / document / storage / a virtual representation of a storage (using business logic); and <strong>Controller:</strong> the user's interaction with the system. All this basically means that there is a separation between how the data is presented to the user, what the user can do with the data, and what the data really looks like. An example is Microsoft Word: the view is a text document, the controller allows printing, and the storage can be an rtf file, a doc file, or an XML based docx file.<br />
<strong>Use</strong>: Almost anytime you provide UI. Employing this paradigm allows very rapid integration of a completely new UI and fast responses to changes in UI requirements.<br />
<strong>Don't</strong>: If there is no UI, or when there is very close coupling between what the UI can do and the business logic (usually when creating a UI engine).</p>
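<p>Here is a minimal sketch of that separation in Python. All the names (Document, render_plain, render_shouting, Controller) are made up for the example; the only thing it tries to show is that the view can be swapped without touching the model.</p>

```python
class Document:
    """Model: holds the data and knows nothing about presentation."""

    def __init__(self, text=""):
        self.text = text


def render_plain(model):
    """View 1: shows the text as it is stored."""
    return model.text


def render_shouting(model):
    """View 2: a completely different presentation of the same model."""
    return model.text.upper()


class Controller:
    """Controller: what the user is allowed to do with the data."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def type_text(self, new_text):
        self.model.text += new_text

    def show(self):
        return self.view(self.model)


document = Document()
controller = Controller(document, render_plain)
controller.type_text("hello")
first = controller.show()

# Integrating a "new UI" means replacing only the view.
controller.view = render_shouting
second = controller.show()
print(first, second)
```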
<h4>Distributed</h4>
<p>You don't really need to use servers to have a distributed model. This paradigm states that there is no dependency between components, just as with OOP, with the addition that there is also no dependency on infrastructure, and that object-to-object interaction should be kept to a minimum.<br />
<strong>Use</strong>: Whenever different platforms or infrastructures are used and when components are completely independent of each other. For example the interaction between User Mode and Kernel Mode is usually Distributed.<br />
<strong>Don't</strong>: When data sharing has huge overhead, for interconnected modules, and between UI and Business Logic.</p>
<h4>Pipeline</h4>
<p>A Pipeline is usually made of several software components which are completely independent of each other. In this model there is usually a single data object sent from one component to the next. Most pipelines can operate completely asynchronously, which makes this model best for audio and video playback. When arranging components from left to right, quite often a Pipeline has more than one component in a segment. For example decoding an MPEG Audio frame and an MPEG Video frame are two separate tasks which are independent of each other, so they are both performed in the same timeslot of the Pipeline.<br />
<strong>Use</strong>: When little or no UI interaction takes place, with Audio and Video playback and encoding, and when you have a chain of operations each dealing with a different technology. For example: Read XML file, Search items, Create Records, Save to SQL server.<br />
<strong>Don't</strong>: When having multiple input types or events because there should be a different Pipeline for every type of input. Also when there is no clear correlation between Event and Response.</p>
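<p>A chain like the one in the example above can be sketched with Python generators. To keep it short I assume plain "name=value" text lines instead of a real XML file, and a simple list instead of an SQL server; the stage names are hypothetical.</p>

```python
def read_lines(text):
    # Stage 1: read the input (stands in for "Read XML file").
    for line in text.splitlines():
        yield line.strip()


def search_items(lines, keyword):
    # Stage 2: pass on only the items we are looking for.
    for line in lines:
        if keyword in line:
            yield line


def create_records(lines):
    # Stage 3: turn each matching line into a record.
    for line in lines:
        name, _, value = line.partition("=")
        yield {"name": name, "value": int(value)}


def save(records, store):
    # Stage 4: the sink (stands in for "Save to SQL server").
    for record in records:
        store.append(record)
    return store


source = "disk=3\nram=8\ndisk_cache=1"
store = save(create_records(search_items(read_lines(source), "disk")), [])
print(store)
```

<p>Each stage only knows about the single data object it receives and the one it passes on, so a stage can be replaced without the others noticing.</p>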
<h4>Layers</h4>
<p>This paradigm divides the system into components just like OOP, with a huge difference - the system is divided into coherent layers. Each layer may employ OOP or Procedural internally but between layers there is a clear and simple interface. The ground rule is that requests only go from top to bottom, so a component within a lower layer cannot call a component within a higher layer. The only way to be serviced by a higher layer is by starting a new process / request. Usually the layers behave as a Distributed Pipeline.<br />
<strong>Use</strong>: When OOP can be divided into Layers, when creating a system that has UI and hardware interaction, and in very large scale systems. Example: Windows NT (and Windows 8) Kernel.<br />
<strong>Don't</strong>: Inside a Pipeline, in a small application or component, and don't create a Layer Engine inside a Layer.</p>
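<p>The ground rule of Layers can be sketched like this in Python. The three layer classes and the fixed sensor reading are assumptions made only for the example. Note that each layer holds a reference to the layer below it and never to the layer above.</p>

```python
class HardwareLayer:
    """Bottom layer: has no reference to any higher layer."""

    def read_sensor(self):
        return 21  # hypothetical fixed reading, for illustration only


class LogicLayer:
    """Middle layer: may call down into the hardware layer only."""

    def __init__(self, lower):
        self._lower = lower

    def temperature_report(self):
        return {"celsius": self._lower.read_sensor()}


class UiLayer:
    """Top layer: may call down into the logic layer only."""

    def __init__(self, lower):
        self._lower = lower

    def render(self):
        report = self._lower.temperature_report()
        return "Temperature: {} C".format(report["celsius"])


# Requests travel from top to bottom: UI, then logic, then hardware.
ui = UiLayer(LogicLayer(HardwareLayer()))
message = ui.render()
print(message)
```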
<p>Sounds like enough for now. You are welcome to comment with any thoughts.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/04/30/why-you-should-use-procedural-and-oop-in-every-application/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ACER Ultrabook Review</title>
		<link>http://software.intel.com/en-us/blogs/2012/04/03/acer-ultrabook-review/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/04/03/acer-ultrabook-review/#comments</comments>
		<pubDate>Wed, 04 Apr 2012 02:17:41 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Ultrabook]]></category>
		<category><![CDATA[Guest Blog]]></category>
		<category><![CDATA[review]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/04/03/acer-ultrabook-review/</guid>
		<description><![CDATA[Not too long ago we heard about Ultrabook machines and X86 Windows 7 systems operating on solar cells indoors and now we have Ultrabooks popping up everywhere. Between the possible options I decided that I am going to keep my DELL Latitude laptop as a workstation for now but still get a new Ultrabook for [...]]]></description>
			<content:encoded><![CDATA[<p>Not too long ago we heard about Ultrabook machines and X86 Windows 7 systems operating on solar cells indoors and now we have Ultrabooks popping up everywhere. Between the possible options I decided that I am going to keep my DELL Latitude laptop as a workstation for now but still get a new Ultrabook for a different reason.</p>
<p>I initially thought that my DELL Latitude would be good for everything. It is lighter than my previous laptop and it is very powerful. In time I found myself using an iPad for many of the simpler tasks. For example, attending a conference was becoming an issue with my Latitude because it was heavy and I had to be careful with it because the cover is plastic. Eventually I started using the iPad because it has a metal cover, it turns on very fast, and it is not as heavy. This made my life easier until I found myself having to edit Word documents or -god forbid- open Visual Studio. This is way beyond the purpose of an iPad. Even trying to Remote-Desktop to my server proved to be worse than starting my Latitude laptop.</p>
<p>As I was watching an Ultrabooks demo when attending Intel's IDF event, the first thing that came to mind was that I would finally be able to get something that is light enough to carry, simple enough to open - use - and close, and would still have a decent keyboard and run all my existing applications.</p>
<p>For this reason I got the <a href="http://us.acer.com/ac/en/US/content/aspire-s3-ultrabook">ACER Aspire S3</a>. This looks like one of the slimmer, lightweight Ultrabook models. As a workstation I would probably take one that has a backlit keyboard and more USB ports, for example. You can see the list of Ultrabooks by different manufacturers here: <a href="http://www.intel.com/content/www/us/en/ultrabook/shop-ultrabook.html">Ultrabook List</a>. The ACER Aspire S3 is one of the models better suited for what I was looking for. It is impressive for meetings, simple and easy to carry for full day events, travelling, and coffee-shop startup meetings where you only want a PowerPoint presentation and some Internet access without taking too much table space.</p>
<p>I have only started installation so I will cover performance in another post, after I am done setting up the system and have started using it. I can tell you already that since this is not a workstation for me I did not get a system with an SSD drive. This means that I expect performance to be medium. On the other hand there is a small hidden SSD drive on the Ultrabook to allow fast Hibernation. This is interesting to test but my point is that I am not going to give it an easy time and I am going to compare performance with my workstation laptop - a DELL Latitude with Core i7, 8GB RAM and a 256GB SSD drive. I can already tell you that this Core i5 Ultrabook is already winning on boot time and sleep / wakeup time, so I am not even going to compare that.</p>
<p>Right now all I can show you is that I got this device with the help of Christina Green and Yair Weissler who were really helpful and understanding in the process. Eventually what I got was this huge box when I was expecting a slim Ultrabook:</p>
<p><img class="alignnone size-medium wp-image-45598" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1748-300x225.jpg" alt="" width="412" height="309" /></p>
<p>Then I opened the large box to find out that it is mostly empty and has a smaller box inside it:</p>
<p><img class="alignnone size-full wp-image-45599" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1750.jpg" alt="" width="320" height="240" /></p>
<p>This was thinner than the box I got for my older laptop but still looks big enough. Inside it there was another box:</p>
<p><img class="alignnone size-full wp-image-45600" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1752.jpg" alt="" width="320" height="240" /></p>
<p>... and in it a really small box:</p>
<p><img class="alignnone size-full wp-image-45601" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1753.jpg" alt="" width="320" height="240" /></p>
<p>That's more like it. Now it's getting exciting and there it is, an even thinner laptop. My first Ultrabook:</p>
<p><img class="alignnone size-full wp-image-45604" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1754.jpg" alt="" width="320" height="240" /></p>
<p>You want to know how thin it really is?</p>
<p>Here it is compared to my Nokia C3 and a WD external USB drive:</p>
<p><img class="alignnone size-full wp-image-45606" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1758.jpg" alt="" width="320" height="240" /></p>
<p>If you ask yourself, the answer is yes - it is the same height as the mobile 2.5'' drive:</p>
<p><img class="alignnone size-full wp-image-45607" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/IMG_1759.jpg" alt="" width="320" height="240" /></p>
<p>The base of the Ultrabook (without the display) is the same as my Nokia C3 which is a thin device.</p>
<p>The only performance tests I have for now are turning on from Hibernation and turning on from Sleep. When the lid is closed the device goes to sleep in about 2 seconds. When the lid is opened the system wakes up again. If the Ultrabook is in sleep mode for too long it automatically Hibernates to save battery life. Sleep mode took almost nothing from the battery overnight. If you want to save battery life then simply decrease the display backlight power.</p>
<p>Here is resume from Hibernate: <span style="color: #808080"><em>(click to watch)</em></span></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/MVI_1761.wmv"><img class="alignnone size-full wp-image-45612" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/MVI_1760.AVI-Capture.jpg" alt="" width="320" height="240" /></a></p>
<p>In case you are wondering that was 7 seconds.</p>
<p>Here is the resume from Sleep. I am not even going to count that in seconds:<span style="color: #808080"> <em>(click to watch)</em></span></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/MVI_1763.wmv"><img class="alignnone size-full wp-image-45617" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/MVI_1763.AVI-Capture.jpg" alt="" width="320" height="240" /></a></p>
<p>That would be it for now. Now I am installing Visual Studio 2008 (and then 2010). All I can tell you right now is that it is relatively fast, considering not having an SSD drive. Much faster than my older workstation laptop which I got 4 years ago.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/04/03/acer-ultrabook-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/MVI_1761.wmv" length="2861473" type="video/asf" />
<enclosure url="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/03/MVI_1763.wmv" length="1373431" type="video/asf" />
		</item>
		<item>
		<title>Pre-Release Parallel Programming and Architecture Video Series</title>
		<link>http://software.intel.com/en-us/blogs/2012/02/23/pre-release-parallel-programming-and-architecture-video-series/</link>
		<comments>http://software.intel.com/en-us/blogs/2012/02/23/pre-release-parallel-programming-and-architecture-video-series/#comments</comments>
		<pubDate>Thu, 23 Feb 2012 19:27:58 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Guest Blog]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2012/02/23/pre-release-parallel-programming-and-architecture-video-series/</guid>
		<description><![CDATA[Before I tell you the whole story here is a backstage clip: Behind The Scenes (Player) / (YouTube) / (Download) The series is a 'six pack' of videos starting from the basic introduction and covering everything you need to know in order to understand parallel computing concepts and methodologies. here is the list of chapters: 1. Change [...]]]></description>
			<content:encoded><![CDATA[<p>Before I tell you the whole story here is a backstage clip:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/Intro-Movie-14-Encoded-00.wmv"><img class="alignnone size-full wp-image-45035" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/Intro-Movie-14-Encoded-00_Thumb-Edit00.jpg" alt="Parallel Computing Academy - Behind The Scenes" width="263" height="185" /><br />
Behind The Scenes (Player)</a> / <a href="http://youtu.be/w3y_D9W0t8Q">(YouTube)</a> / <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/Intro-Movie-13.wmv">(Download)</a></p>
<p>The series is a 'six pack' of videos starting from the basic introduction and covering everything you need to know in order to understand parallel computing concepts and methodologies. Here is the list of chapters:</p>
<p>1. <strong>Change In Mindset</strong>: Introduction to parallel programming. First of all we need to agree that parallel computing is really easy. This video will show you that you already know parallel computing. The video also talks about the value that parallel design has for the product.</p>
<p>2. <strong>Management</strong>: Task Management, Workers, and Ownership are covered as key concepts of parallel design. In order to understand the system well we need to be able to visualize the moving parts so this video has a collection of animations demonstrating how things work under the hood.</p>
<p>3. <strong>System API</strong>: System APIs are the programmer's hammer and knife. This video talks about the commonly used APIs and also covers System APIs which are dedicated to parallel computing. Programming techniques are demonstrated with very important tips.</p>
<p>4. <strong>Flow Patterns</strong>: A major part of what's missing in Object Oriented Design is the concept of <em>Flow</em>. Here we cover System models, methods for handling resources and Flow Control concepts.</p>
<p>5. <strong>Phase State Programming</strong>: The majority of programmers don't really know it but they code their applications to have different operation Modes. For example Word 2010's Protected View which has an 'Enable Editing' button to switch to another mode. Most times applications have a collection of global variables defining the different states and phases of the application. It is time we call things by their name and reflect it properly in our design.</p>
<p>6. <strong>Advanced Topics</strong>: This short chapter covers advanced concepts, system models, and AVX.</p>
<p>The videos are supported with animations whenever an animation has anything to contribute.</p>
<p><span style="text-decoration: underline"><strong>On a personal note</strong></span></p>
<p>This was a huge undertaking. A great amount of effort was required for this series of videos to come to life. It started with a weekly call while working on and refining the topics and presentations, and the work went on for over a year.</p>
<p>Attending the IDF was real fun but I couldn't really relax. I had a flight right after the event taking me to Oregon. The week after the IDF we had a couple of days of video shoot. And then there is the little thing with the video editing, green-screen, animations.....</p>
<p>The series is based on many years of teaching parallel computing, mainly for academy graduates and experienced developers and architects. Some of the information can be found in my classes, some in my Microsoft Tech-Ed presentations, and some in previous posts of this blog. This series of videos is the first place that has it all collected into one coherent flow. It is known that if a book you write, or a presentation you give helps you arrange things in your mind then the content is good. So, I am very happy with what we have.</p>
<p>The first video is about to be released. I will add a post with a link to it when it is published.</p>
<p>Here is the video we shot on-site during the IDF event just before we went to shoot this series of videos:</p>
<p><a href="http://software.intel.com/en-us/videos/channel/intel-academic-community/teach-parallel-at-idf-2011-asaf-shelly/1193082728001">Watch Video Preview at IDF</a></p>
<p>You can watch one of my Tech-Ed Sessions in full video here:</p>
<p><a href="mms://mms.asyncop.com/users/asyncop/video/Parallel%20Programming%20Tutorial%20450K.wmv">Parallel Programming For Embedded</a></p>
<p>The animations in the video series have background music which I created using a MIDI editor, here is the file: <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/2011-12-26-22.mid">Background Music</a>.</p>
<p>You can use the RSS feed to be notified when the videos are published. I will post a complementary blog post for every video with details and when possible slides and images.</p>
<p>The first video in the series is relatively non-technical and is suitable for both developers and product management. My wife, who works in HR, could follow it and enjoy it. The animations are simple and straightforward for the first video. The chapters that follow use some conventions in the animations, for example for abstract objects such as a thread.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2012/02/23/pre-release-parallel-programming-and-architecture-video-series/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/Intro-Movie-14-Encoded-00.wmv" length="5861770" type="video/asf" />
<enclosure url="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/Intro-Movie-13.wmv" length="13549671" type="video/asf" />
<enclosure url="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2012/02/2011-12-26-22.mid" length="3055" type="audio/midi" />
		</item>
		<item>
		<title>Mathematical Parallelization By Compilers</title>
		<link>http://software.intel.com/en-us/blogs/2011/10/21/mathematical-parallelization-by-compilers/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/10/21/mathematical-parallelization-by-compilers/#comments</comments>
		<pubDate>Fri, 21 Oct 2011 16:42:57 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Guest Blog]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/10/21/mathematical-parallelization-by-compilers/</guid>
		<description><![CDATA[This is not to say that compilers can automatically parallelize code. I would however really like to see that happen and here is an interesting and reliable way to parallelize operations. If a compiler can use this method of thinking then it can also be used as hints for developers writing code today. C and [...]]]></description>
			<content:encoded><![CDATA[<p>This is not to say that compilers can automatically parallelize code. I would however really like to see that happen and here is an interesting and reliable way to parallelize operations. If a compiler can use this method of thinking then it can also be used as hints for developers writing code today.</p>
<p>C and C++ languages are based on mathematical expressions. So much so that 1; is a legal statement in C++. Other languages such as C#, Java, VB and Delphi also use mathematical operations to perform actions. For example:<br />
MyInteger = GetCount() + GetLength()<br />
Both are function calls that do some work.</p>
<p>Many mathematical operations are interchangeable for example X = 1 + 2 is the same as X = 2 + 1. This means that:<br />
  X = CallFunc_A() + CallFunc_B()<br />
is the same as:<br />
  X = CallFunc_B() + CallFunc_A()</p>
<p>This is a hint telling us that CallFunc_A and CallFunc_B can run in parallel (unless internal resources are shared).</p>
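<p>Done by hand, the hint looks roughly like the sketch below. I am using Python and its standard thread pool only to keep it short; call_func_a and call_func_b are stand-ins with made-up return values, and the sketch assumes that they share no internal resources.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def call_func_a():
    return 1  # stand-in for some real work


def call_func_b():
    return 2  # stand-in for some real work


with ThreadPoolExecutor(max_workers=2) as pool:
    # Both operands of the addition are started before either is needed.
    future_a = pool.submit(call_func_a)
    future_b = pool.submit(call_func_b)
    # The addition itself is where we have to wait for both results.
    x = future_a.result() + future_b.result()

print(x)
```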
<p>A more complex example would be: X = 3 * (1 + 2), and the hint now is that one operation needs to complete before the other can continue.<br />
Here is the code equivalent:<br />
  X = CallFunc_C() * ( CallFunc_A() + CallFunc_B() )</p>
<p>Here is another:<br />
  X = CallFunc_C( CallFunc_A() + CallFunc_B() )</p>
<p>These concepts apply to logical operations as well.</p>
<p>It looks like, as a general rule, when we close parentheses we have a Conjunction Point (a Join). It makes sense because we don't really need to make sure that an operation is complete until we need its return value.</p>
<p>The question we have left now is how we can cancel an operation when its result is no longer needed. For example, X = A or B: what if we execute A and B in parallel and B returns TRUE while A is still executing?</p>
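<p>Here is how the two more complex expressions could look when the join is written out by hand. This is a sketch in Python with made-up stand-in functions (call_func_c_with is my name for the variant of CallFunc_C that takes an argument), assuming again that the functions share no resources.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def call_func_a():
    return 1


def call_func_b():
    return 2


def call_func_c():
    return 3


def call_func_c_with(argument):
    return argument * 10


with ThreadPoolExecutor(max_workers=3) as pool:
    a = pool.submit(call_func_a)
    b = pool.submit(call_func_b)
    c = pool.submit(call_func_c)

    # Join point: the sum needs both return values.
    inner = a.result() + b.result()

    # X = C() * (A() + B()): C ran in parallel with A and B,
    # and is joined only when the multiplication needs it.
    x1 = c.result() * inner

    # X = C(A() + B()): C cannot even start before the join,
    # because its argument is the joined value.
    x2 = call_func_c_with(inner)

print(x1, x2)
```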
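<p>One possible answer, and only one of several, is cooperative cancellation: every operation receives a flag and checks it from time to time. The sketch below uses Python threads, which cannot be stopped from the outside, so the slow operation has to poll the flag itself. The names slow_a, fast_b and parallel_or are hypothetical.</p>

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_a(cancel):
    # Would take about 2 seconds, but gives up once the flag is set.
    for _ in range(200):
        if cancel.is_set():
            return None  # cancelled, the result is no longer needed
        time.sleep(0.01)
    return False


def fast_b(cancel):
    return True


def parallel_or(operations):
    cancel = threading.Event()
    result = False
    with ThreadPoolExecutor(max_workers=len(operations)) as pool:
        futures = [pool.submit(op, cancel) for op in operations]
        for future in as_completed(futures):
            if future.result():
                # One TRUE decides the OR; tell the others to stop.
                result = True
                cancel.set()
                break
    return result


started = time.monotonic()
x = parallel_or([slow_a, fast_b])
elapsed = time.monotonic() - started
print(x)
```

<p>The cost of this approach is that every operation has to be written with cancellation in mind, which is exactly the kind of thing a compiler could not add for us without help.</p>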
<p>Your thoughts?</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/10/21/mathematical-parallelization-by-compilers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Flaws of Object Oriented Modeling Continue</title>
		<link>http://software.intel.com/en-us/blogs/2011/10/21/flaws-of-object-oriented-modeling-continue/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/10/21/flaws-of-object-oriented-modeling-continue/#comments</comments>
		<pubDate>Fri, 21 Oct 2011 16:36:38 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[guest blogger]]></category>
		<category><![CDATA[Software Design]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/10/21/flaws-of-object-oriented-modeling-continue/</guid>
		<description><![CDATA[I am posting this as a response to all comments to a previous post called “Flaws of Object Oriented Modeling”. That post created a live discussion that also continued to forums on other websites. It seems that we got so used to OOP and OOD that it sounds like it is the only way to go, [...]]]></description>
			<content:encoded><![CDATA[<p>I am posting this as a response to all comments to a previous post called “<a href="http://software.intel.com/en-us/blogs/2008/08/22/flaws-of-object-oriented-modeling/">Flaws of Object Oriented Modeling</a>”.</p>
<p>That post created a lively discussion that also continued on forums on other websites. It seems that we got so used to OOP and OOD that it sounds like it is the only way to go, making it difficult to ask questions about it. The lively discussion just shows that there are many schools of programming even though OOP was considered the last big thing (with C++, Java, C#, etc.).</p>
<p>Here are a few arguments that I am addressing. (Obviously if you agree with the article I am just going to nod).</p>
<p><span style="color: #666699;"><em>“One of the best possible solutions to manage complexity is to divide it”</em></span><br />
-   I agree. The problem is that OOD is not only dividing, it is also hiding.</p>
<p><span style="color: #666699;"><em>“Better question: is execution flow relevant?”</em></span><br />
<span style="color: #666699;"><em>“Maybe it is not necessary to control or care about the whole execution flow”</em></span><br />
<span style="color: #666699;"><em>“When you send a message to an object and you don't get the desired result then maybe you are delegating into the wrong object”</em></span><br />
-   This just goes to show you that inheritance is strictly enforced but some other things that have greater importance to the product are not. Something huge is missing.</p>
<p><span style="color: #666699;"><em>“With the understanding that object-oriented modeling, or rather modeling in general”</em></span><br />
-   OOD is just one way of modeling. The argument is that <span style="text-decoration: underline;">on its own</span> it is a really bad way of modeling.</p>
<p>Just to explain where I come from, I can do OOP. Here is an example:<br />
<a href="http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=%5bo%5d+Open-Source%5cWinUSB+Component">http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=%5bo%5d+Open-Source%5cWinUSB+Component</a></p>
<p>I can also do procedural programming:<br />
<a href="http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=%5bo%5d+Open-Source%5cProject+Publisher%5cProjectPublisher.cs">http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=%5bo%5d+Open-Source%5cProject+Publisher%5cProjectPublisher.cs</a></p>
<p>These were C#. Here is some C++:<br />
<a href="http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=%5bo%5d+Open-Source%5cWinModules">http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=%5bo%5d+Open-Source%5cWinModules</a></p>
<p>I am not saying that OOD is bad. I am saying that it is not enough. Actually it is far from enough. There are so many things missing there.<br />
It is clear to me that people want answers and not just problems. We are going to cover this in a series of videos. They run an estimated 3 to 4 hours in total, and a good portion is dedicated to design. We shot these videos at Intel’s studios in Oregon right after the IDF 2011 event. Use the RSS feed to be notified when the first segment is edited and released.<br />
There are a few words about this here: <a href="http://software.intel.com/en-us/videos/black-belt-developer-asaf-shelly-at-idf-2011/">http://software.intel.com/en-us/videos/black-belt-developer-asaf-shelly-at-idf-2011/</a> (timestamp 06:20). You can also see Clay Breshears there (for whoever was wondering who he was).</p>
<p>Let me start by saying that it is absolutely necessary to control the execution flow and here is the reason:</p>
<p>An application starts with an SRS document which defines the Requirements, or “the things that the application needs to do”. Moving on we have OOD which defines the objects that we use. Can you see the point? Something is missing. How do you go from “Action Required” to “Software Components”?<br />
The answer is that you have an architect in between and that architect is doing something informal which is almost an art. Can you count the projects that failed because the architecture was perfect OOD but could not keep up with changing requirements? Can you prove to the product owner that the code meets the SRS list without any definition of flow? The Test Driven Design methodology, which claims that you can create a product only by detecting faults, failed so badly that I am starting to think that Darwin's Evolution is only proof that there is a god… (Can’t wait to read your comments about that :)</p>
<p>A very important thing to say: Software modeling is not OOD. The argument is not whether to use OOD or not to model your system. The argument is that OOD is a part of system modeling but was sold to us as The way to do software design.</p>
<p>Here are some problems I find with OOD and OOP:</p>
<ol>
<li>People think that by using OOD they have a good definition of their systems, but they don’t respond to the SRS list of requirements which is the point of existence for the application.</li>
<li>OOD defines “what the application is” with no regard to “what the application does”. This leads to situations in which changing requirements are reflected only in the code. In other words, only the programmer really knows what’s going on in the code and how requirements are met.</li>
<li>OOP enforces the use of objects to create a good software block diagram, but it allows other bad things such as calls ping-ponging from one object to another, which makes bug tracking an impossible mission, and thus we get Test Driven Design as an aftereffect.</li>
<li>OOP usually comes with strange rules that produce an even nicer software block diagram but completely kill performance and requirement traceability such as “keep methods up to three lines of code”.</li>
<li>In OOP and OOD, a method and an object do not know the operations that they take part in. This is VERY bad because you cannot cancel an operation if you have no definition of what an operation is. OOP today uses the Stack and the Thread to define the operations. This produces an unbelievably bad User Experience. “Application Not Responding” is simply a direct result of OOP.</li>
</ol>
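<p>To make the last point concrete, here is a minimal sketch of what it could look like when a method is told which operation it is serving. The <code>Operation</code> and <code>Document</code> classes and the <code>print_pages</code> method are hypothetical names invented for this illustration; they are not taken from any existing framework.</p>

```python
import threading


class Operation:
    """Hypothetical explicit operation that methods can consult."""

    def __init__(self, name):
        self.name = name
        self._cancelled = threading.Event()

    def cancel(self):
        # Safe to call from another thread, e.g. a UI thread.
        self._cancelled.set()

    @property
    def cancelled(self):
        return self._cancelled.is_set()


class Document:
    def __init__(self, pages):
        self.pages = pages

    def print_pages(self, operation):
        """Handle pages one by one, checking the operation between steps."""
        printed = []
        for page in self.pages:
            if operation.cancelled:
                break  # the method knows its operation, so it can stop early
            printed.append(page)
        return printed


def demo():
    doc = Document(["p1", "p2", "p3"])
    op = Operation("print")
    first = doc.print_pages(op)   # runs to completion
    op.cancel()
    second = doc.print_pages(op)  # stops before doing any work
    return first, second
```

<p>The design choice here is only that the operation is a first-class value passed along the call chain, instead of being implied by the Stack and the Thread.</p>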
<p>I am not saying that OOP is bad. It is nice. Just as we had procedural programming and then procedures became methods in an object, we need to have objects as part of something more important.</p>
<p>If Water is an object and Sugar is an object, what is Mix? I can also mix Milk and Coffee and it is the same Mix. Water and Milk implement the same Interface. Sugar and Coffee implement the same Interface.</p>
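<p>Here is one possible way to sketch this example in code. The interface names <code>Liquid</code> and <code>Additive</code> and the free function <code>mix</code> are my own assumptions for the illustration. Note that Mix ends up as a procedure that belongs to neither object.</p>

```python
from abc import ABC, abstractmethod


class Liquid(ABC):
    """Assumed interface shared by Water and Milk."""

    @abstractmethod
    def name(self):
        """Return a human-readable name."""


class Additive(ABC):
    """Assumed interface shared by Sugar and Coffee."""

    @abstractmethod
    def name(self):
        """Return a human-readable name."""


class Water(Liquid):
    def name(self):
        return "water"


class Milk(Liquid):
    def name(self):
        return "milk"


class Sugar(Additive):
    def name(self):
        return "sugar"


class Coffee(Additive):
    def name(self):
        return "coffee"


def mix(liquid, additive):
    """The same Mix works for any Liquid and any Additive."""
    return f"{additive.name()} mixed into {liquid.name()}"
```

<p>Calling <code>mix(Water(), Sugar())</code> and <code>mix(Milk(), Coffee())</code> runs the same procedure on different objects, which is the part that a pure object model has no natural place for.</p>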
<p>When we look at this example we see where OOD is no longer describing the world; rather, it is bending the world. When I started with OOD I was sold on the idea that this methodology describes the world the way we think in everyday life. I now know that this is clearly not the case. OOD is only better than Procedural Programming... in many but not all ways.</p>
<p>Let’s talk about Flows and Layers, Phases and States, Requirements and User Experience, Objects and Graphics. Today, using OOD, we address only some of these issues and sometimes use UML to loosely define what’s missing.</p>
<p>Asaf</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/10/21/flaws-of-object-oriented-modeling-continue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stop Saying &quot;Lock Free Solution&quot;</title>
		<link>http://software.intel.com/en-us/blogs/2011/09/21/stop-saying-lock-free-solution/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/09/21/stop-saying-lock-free-solution/#comments</comments>
		<pubDate>Wed, 21 Sep 2011 21:24:44 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Guest Blog]]></category>
		<category><![CDATA[Parallel Computing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/09/21/stop-saying-lock-free-solution/</guid>
		<description><![CDATA[I am writing this at the airport, just coming back from the Intel IDF event. I keep hearing that we have "Lock Free" solutions for all sorts of problems. I think that this is a really bad choice of words. Let me try to explain why: My story starts with a friend who bought a [...]]]></description>
			<content:encoded><![CDATA[<p>I am writing this at the airport, just coming back from the Intel IDF event. I keep hearing that we have "Lock Free" solutions for all sorts of problems. I think that this is a really bad choice of words. Let me try to explain why:</p>
<p>My story starts with a friend who bought a sports car. It was a really good car with a very noisy engine. A new car like that is very expensive so he just got an old one. It worked fine, but then one day as we drove to a party the car just stopped. We all got out and popped the hood to look at the engine. Lovely engine. Shiny, but it wasn't running. We didn't want to be late so eventually we found the solution. We all went to the back of the car and started pushing. I admit that it wasn't going that fast but we got to the party. Later we found out that the car was out of gas. Two weeks later we went to see a movie and then again the car suddenly stopped working. This time we checked the meter and it looked like the car had enough gas. So again we popped the hood and looked at the engine. Eventually we found a solution. We all got behind the car and started pushing. We had a few problems going up the hill so we went around it. Later we found out that this time it was the electrical system. The third time the car stopped working I told my friend that we needed to find a "Push-Free" solution. I just thought that a "Push-Free" solution would be more efficient. So now when we have car trouble we try to actually fix it... and we stopped calling it a Push-Free solution. Now we just call it a fix.</p>
<p>Sure, it is faster and simpler to just start pushing the car instead of really understanding the problem, and you know what, pushing the car can get you around almost any problem that you may have.</p>
<p>Locks can work around almost any problem that you may have. Just do me a favor and don't call it a solution.</p>
<p>When you solve a problem in your application don't call it a "Lock-Free" solution... Let's call it a solution.</p>
<p>** the friend in the story is as fictional as the need to use locks outside an infrastructure mechanism.</p>
<p>Here is more: <a href="http://software.intel.com/en-us/blogs/2009/03/24/locks-are-bad/">Locks Are Bad</a></p>
<p>Here is me at the IDF event:</p>
<p><div id="attachment_36661" class="wp-caption alignnone" style="width: 650px"><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/IMG_9137.jpg"><img class="size-full wp-image-36661" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/09/IMG_9137.jpg" alt="[Asaf Shelly Intel IDF]" width="640" height="480" /></a><p class="wp-caption-text">Asaf Shelly Intel IDF</p></div>Asaf</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/09/21/stop-saying-lock-free-solution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Performance Presentation: Concepts Behind Parallel Computing and Extended CPU Instructions</title>
		<link>http://software.intel.com/en-us/blogs/2011/04/13/performance-presentation-concepts-behind-parallel-computing-and-extended-cpu-instructions/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/04/13/performance-presentation-concepts-behind-parallel-computing-and-extended-cpu-instructions/#comments</comments>
		<pubDate>Wed, 13 Apr 2011 16:50:30 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[AVX]]></category>
		<category><![CDATA[Guest Blog]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[Sandy Bridge]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/04/13/performance-presentation-concepts-behind-parallel-computing-and-extended-cpu-instructions/</guid>
		<description><![CDATA[As you may have already read in a previous post called Personal Review of Intel Under-NDA Sandy-Bridge Event I held the last session in an Intel-Under-NDA event. The presentation was called Performance and covered the different aspects of parallel computing and also the new Sandy Bridge AVX Instructions. I have introduced this new feature in a previous post [...]]]></description>
			<content:encoded><![CDATA[<p>As you may have already read in a previous post called <a href="http://software.intel.com/en-us/blogs/2011/02/28/personal-review-of-intel-under-nda-sandy-bridge-event/">Personal Review of Intel Under-NDA Sandy-Bridge Event</a> I held the last session in an Intel-Under-NDA event. The presentation was called Performance and covered the different aspects of parallel computing and also the new Sandy Bridge AVX Instructions. I have introduced this new feature in a previous post called <a href="http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/">Visual Studio 2010 Built-in CPU Acceleration</a>. The goal of the presentation is to provide a better perspective of all the new and advanced tools that Intel has provided in the last few years. The most important thing to do before you decide to use these new features is to understand the feature and understand how it applies to your application.</p>
<p>As promised during the session and in my last post, although with a slight delay, this post and the few following it will contain selected slides along with what you would have heard during the session had you been there. No NDA material is exposed.</p>
<h1><span style="color: #3366ff;">Performance</span></h1>
<p>As always we should start with a few words about this presentation and its goals. The idea behind this presentation is to help you, the audience, understand the concepts behind parallel computing and extended CPU instructions. I know that I have a good presentation before me when I notice that writing the presentation and reviewing it helps me organize my thoughts. This means that the material is edited correctly and has a refined message which I myself had never seen before this point. Perspective is very important to me. It can be the difference between a good architecture and an architecture which will have to be modified before the first release of the product. Too often I see people avoiding parallel programming because they cannot guarantee that they can pick the right path.</p>
<p>The presentation begins with the less technical slides so we get used to the graphics and the presenter's voice.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/04/Slide51.png"><img class="size-medium wp-image-33234 alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/04/Slide51-300x227.png" alt="Why Parallel" width="300" height="227" /></a></p>
<p>Going back to the 1970's, there were IC (Integrated Circuit) chips that ran at a few MHz. This was new because it meant that it was possible to dispatch data very fast compared to physical switching. By dispatching data I mean a single bit and up to several bits. This technology, called <a href="http://en.wikipedia.org/wiki/Transistor-transistor_logic">TTL</a>, demanded that several chips work in parallel in order to get any work done because every chip had its own dedicated functionality. If RS232 communication required XOR operations for data integrity then there had to be an XOR TTL chip on the board. A Diskette Drive using XOR for data verification needed another XOR TTL chip. Processors were expensive at the time and did very little.</p>
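<p>For readers who have not met XOR-based integrity checks, here is a minimal software sketch of the idea that those dedicated chips implemented in hardware. The function names are my own, and this is a generic illustration rather than the exact scheme of any particular RS232 device or diskette controller.</p>

```python
from functools import reduce
from operator import xor


def parity_bit(byte):
    """Even-parity bit: equal to the XOR of all data bits in the byte."""
    return bin(byte).count("1") % 2


def xor_checksum(data):
    """XOR all bytes together into a single check byte."""
    return reduce(xor, data, 0)


def is_intact(data_with_checksum):
    """A block followed by its own checksum XORs to zero if nothing changed."""
    return xor_checksum(data_with_checksum) == 0
```

<p>The sender appends <code>xor_checksum(block)</code> to the block and the receiver runs <code>is_intact</code> on what arrives. Any single flipped bit makes the result non-zero.</p>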
<p>At the beginning of the 1980's Intel released a new generation of CPU chips which had a 16 bit address bus and a 16 bit internal data bus. The low price (and a few other factors) made it the main processor for the home PC. It was still very slow so there were assisting chips on the board such as DMA, communication chips, etc., but now the CPU could perform the integrity check for RS232 communication without the need for a dedicated chip. Slowly but surely the CPU became more and more powerful, allowing floating point instructions and Packed Operations, first as an external co-processor and eventually as an internal component. With this, the need for 'smart' chips working side by side with the CPU decreased. If we compare the complexity and power of the CPU vs. its peripherals from 1980 till 2000, the peripherals show a rapid relative decline: peripherals today are much slower than the main CPU, whereas a 1980 PC might have had a device faster than the CPU helping it with the hard work.</p>
<p>There are two main reasons for the CPU becoming much stronger than the peripherals. The first is that it is so simple to increase the CPU clock by improving silicon technology. The faster the CPU, the less help it needs from peripheral devices. This causes the main CPU to take over the roles of many devices and reduces the peripheral functionality to <a href="http://en.wikipedia.org/wiki/OSI_model">OSI model</a> 'layer 2' only. The second reason for CPUs taking over is that it is simpler to design and redesign software than hardware, and whereas hardware may become obsolete, software maintains compatibility over the years and across different board manufacturers.</p>
<p>One day we woke up and found out that if we keep increasing CPU speed we will need new cooling technology to be invented. Such technology is possible but eventually the CPU speed will reach a critical point at which it will no longer be possible to cool it. On another track and with no relation to this, network cards began offering 'offloading' features, which means that the network card provides Layer 2 processing and also Layer 3 and even Layer 4 functionality. Graphics cards also started providing advanced hardware acceleration features. The USB bus also performs Layer 3 processing in hardware.</p>
<p>We find ourselves with many smart peripherals again. This is the result of the fact that technology has advanced to the point where peripherals today may have more processing power than an old Pentium processor, combined with the fact that CPU speed has reached its critical point. We had CPUs at 4GHz and then went back to 3GHz and below.</p>
<p>All of this brings us back to the parallel world:<br />
* We can no longer buy a new CPU and expect it to make our software work faster just because it is a newer CPU<br />
* Peripherals today are more powerful than a CPU was 10 years ago<br />
* The Internet has re-invented distributed systems and scalability does not stop with a single machine</p>
<p>There are new tools and new libraries, new design patterns, new programming models and even new languages, all created or re-invented for parallel programming, all to assist us programmers with understanding this problematic area of parallel programming and solving this riddle.</p>
<p>This is all very nice and interesting but I have to tell you that a bigger question has been troubling me and I would really like to hear the answer for that:</p>
<p>Why would the original designers of systems base their designs on parallel operations? Parallel hardware is not a good enough excuse to justify Fork. UNIX had a built-in system call named Fork. This call split a process into two separate processes. Actually the entire system design was based on it and all processes were Forked from the main system process. This automatically copied file handles, security attributes, etc. Why would the system designers of UNIX support Fork when the system was written using Assembly?! You don't add nice-to-have features in such scenarios. Moreover, did you notice that the original user interface was a 'DOS'-like text console that interacted with several processes in parallel? Why would users want that?</p>
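<p>If you have never used Fork, the following minimal sketch shows the behavior described above: one process becomes two, and the child automatically receives copies of the parent's file handles. It uses Python's <code>os.fork</code>, which is available on UNIX-like systems only. The function name <code>fork_and_read</code> and the message text are made up for this illustration.</p>

```python
import os


def fork_and_read():
    """Fork once; the child writes to a pipe it inherited from the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: both pipe descriptors were copied by the fork.
        try:
            os.close(read_fd)
            os.write(write_fd, b"hello from child")
        finally:
            os._exit(0)  # leave the child without running parent cleanup
    # Parent process: close the unused end, then read what the child sent.
    os.close(write_fd)
    data = os.read(read_fd, 1024)
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the child
    return data
```

<p>Nothing was passed to the child explicitly. It could write to the pipe only because the handle was duplicated as part of the fork.</p>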
<p>See? Something is wrong here. How is it possible that parallel design was a common practice back then if it is so impossible to understand? This just doesn't add up!</p>
<p>We can talk about how parallel programming is the future of computing and this does make an interesting talk for a coffee break but there is more to it. One of the most important aspects of parallel design is User Experience. User Experience is the product! It is a combination of two things: the User Interface which is the graphics and animations, and the Business Logic and how the application behaves. Have you ever clicked Print instead of Save and had to wait 10 seconds for the Printer Selection dialog to appear so you could close it? This is bad User Experience. You will never find this in a good computer game. The difference between User Interface and User Experience is a result of a parallel design.</p>
<p>Learning that 1970's software designs were based on parallel methodologies is still surprising for me, even when I know that these new concepts came from the 1970's: Services, Web-Server, Cluster, Terminal Services (Remote-Desktop), Transaction, Distributed Computing, Cloud, Fork, Join, and a few others...</p>
<p>I will try to look into it in the following blog posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/04/13/performance-presentation-concepts-behind-parallel-computing-and-extended-cpu-instructions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Personal Review of Intel Under-NDA Sandy-Bridge Event</title>
		<link>http://software.intel.com/en-us/blogs/2011/02/28/personal-review-of-intel-under-nda-sandy-bridge-event/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/02/28/personal-review-of-intel-under-nda-sandy-bridge-event/#comments</comments>
		<pubDate>Mon, 28 Feb 2011 17:50:14 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[2nd Generation Intel® Core™ Processor Family]]></category>
		<category><![CDATA[AVX]]></category>
		<category><![CDATA[guest blogger]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[Sandy Bridge]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/02/28/personal-review-of-intel-under-nda-sandy-bridge-event/</guid>
		<description><![CDATA[Hi All, Early this month Intel held an event about the Sandy Bridge architecture and other near future developments. Attendees signed an NDA before entering the event and the material presented was really interesting to hear. The room was packed to no room with what looked to me like over 500 people sitting and listening. The [...]]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>Early this month Intel held an event about the Sandy Bridge architecture and other near-future developments. Attendees signed an NDA before entering the event and the material presented was really interesting to hear. The room was packed to capacity with what looked to me like over 500 people sitting and listening.</p>
<p>The event started with a few very detailed presentations about CPU internal behavior including detailed block diagrams and animations of data flowing between CPU Cores. Some of the information can be found here: <a href="http://software.intel.com/en-us/articles/sandy-bridge/">http://software.intel.com/en-us/articles/sandy-bridge/</a> and some of it I am afraid I cannot publicly reveal. We learned about future plans, how cores interact with each other under the Sandy Bridge architecture, and about the special place that RAM has which will allow faster data access with fewer bottle-necks between RAM and CPU Cores. The diagrams also covered how AVX and graphics are improved with this new architecture and how AVX advances SSE to the next level. See AVX and Media SDK under the <a href="http://software.intel.com/en-us/articles/sandy-bridge/" target="_blank">Sandy-Bridge</a> link above.</p>
<p>There were also details about CPU performance and how the CPU manages the Cores' clock speed. The new management is expected to reduce power consumption on one hand and increase overall performance and responsiveness on the other. I am not clear on whether this is public or not. I should however warn you that someone who attended the event is still trying to solve a problem with Windows' accurate performance counters (QueryPerformanceCounter API), which seem to be unpredictable now that every core has its own internal speed and the performance frequency (QueryPerformanceFrequency API) is no longer valid. This means that if your embedded / mobile / medical real-time application / driver is measuring time with accuracy that is better than 1 millisecond under Windows XP and Hyper-V Windows XP then you should verify that you have a solution for this. Maurice Zamir and I are still waiting for an answer after he sent the question to selected people at Intel and I forwarded it to someone from Microsoft's kernel R&amp;D team. Currently a boot.ini switch should solve this but I am not clear on the implications. See the following Microsoft KB article about <strong>/usepmtimer</strong>: <a href="http://support.microsoft.com/kb/895980">http://support.microsoft.com/kb/895980</a>.</p>
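<p>To see why an invalid frequency matters, here is a small arithmetic sketch. It does not call the Windows APIs. It only reproduces the usual calculation of dividing a counter difference by the reported frequency, and all the numbers in it are invented for the illustration.</p>

```python
def elapsed_seconds(start_ticks, end_ticks, ticks_per_second):
    """Convert a counter difference to seconds using a tick frequency."""
    return (end_ticks - start_ticks) / ticks_per_second


# Assumed values: the system reports 3,000,000 ticks per second.
REPORTED_FREQUENCY = 3_000_000
# Hypothetical case: the core actually counted at half that rate.
ACTUAL_FREQUENCY = 1_500_000

measured = elapsed_seconds(0, 3000, REPORTED_FREQUENCY)  # what the code believes
real = elapsed_seconds(0, 3000, ACTUAL_FREQUENCY)        # what really elapsed
```

<p>With these made-up numbers the application believes that 1 millisecond passed while 2 milliseconds really did, which is exactly the kind of error that breaks sub-millisecond timing.</p>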
<p>In the second half of the day the group split into two tracks: software and hardware. It was interesting for me to find that I knew everyone presenting on the software track. Intel's session about Intel's tools was presented by Guy, with whom I presented Parallel Programming at Microsoft's TechEd in Israel in 2008. Another technical session was presented by someone from Intel's software development teams, and I remember meeting him two months ago when I was sent to Intel by Microsoft as a consultant to help with the design of a kernel driver dealing with graphics (which wasn't even mentioned at the under-NDA event). There were also sessions about real-time systems, including a session about Windows Embedded presented by a Microsoft Partner Account Manager.</p>
<p>My session was the last on the software track and as promised I will blog about the details of my presentation with selected slides included. Some of the demos were live coding demonstrations of a previous post: <a href="http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/">http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/</a>.</p>
<p>Since I was presenting, I had the pleasure of a 'bring my wife to work' day, and since we were both out of the house we also brought the youngest attendee to the event: my six-month-old son, who btw is the only attendee who did not sign an NDA, so he is allowed to talk about all the secret content that was there.</p>
<p style="text-align: center;">
<div class="mceTemp" style="text-align: center;">
<dl>
<dt><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/02/IMG_6174-640x480.jpg"><img class="size-full wp-image-24618 " src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/02/IMG_6174-640x480.jpg" alt="an image of me and my son" width="448" height="336" /></a></dt>
<dd>"I wonder what's Visual Studio 2010..."</dd>
</dl>
</div>
<p style="text-align: left;">The following posts will detail the presentation's content.</p>
<p style="text-align: left;">Asaf</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/02/28/personal-review-of-intel-under-nda-sandy-bridge-event/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visual Studio 2010 Built-in CPU Acceleration</title>
		<link>http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/#comments</comments>
		<pubDate>Tue, 21 Dec 2010 00:04:48 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Power Efficiency]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Asaf Shelly]]></category>
		<category><![CDATA[AVX]]></category>
		<category><![CDATA[compiler]]></category>
		<category><![CDATA[Game Development]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[simd]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[Visual Studio 2010]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/</guid>
		<description><![CDATA[Writing the sample code for this post I was amazed myself to see how simple it was to reach over 20 times performance improvement with so little effort.    The motivation is a very heavy video processing algorithm created for HD TV. This means hi-resolution which means many pixels to compute and it also means that [...]]]></description>
			<content:encoded><![CDATA[<p>Writing the sample code for this post I was amazed myself to see how simple it was to reach over 20 times performance improvement with so little effort.   </p>
<p>The motivation is a very heavy video processing algorithm created for HD TV. This means high resolution, which means many pixels to compute, and it also means that every object in a frame contains several times more pixels than in a regular video image. After getting working C# code (you can try it here: <a href="http://www.asyncop.com/clearimage/">http://www.asyncop.com/clearimage/</a>) there was a need for a C++ DirectShow codec implementation. The C++ implementation worked very well with a <strong>640</strong> x <strong>480</strong> video resolution and 30 frames per second. This meant 640 x 480 x 30  =  9,216,000 pixels to process every second. On top of that the purpose of the algorithm is to produce a 'photoshop'-like effect by detecting important facial features and smoothing out non-important imperfections. This requires massive amounts of processing power. Moving towards HD means at least 1080 x 720 x 30  =  23,328,000 pixels to process, but it also means that every object in the image is made of approximately 4 times more pixels and every object is surrounded by more pixels. Processing power requirements increase dramatically.   </p>
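<p>The arithmetic above is easy to check with a few lines of code. This is only a sketch of the raw pixel counts, using the resolutions and frame rate quoted in the text. The function name is my own.</p>

```python
def pixels_per_second(width, height, fps):
    """Raw number of pixels that must be processed every second."""
    return width * height * fps


sd = pixels_per_second(640, 480, 30)    # 9,216,000
hd = pixels_per_second(1080, 720, 30)   # 23,328,000
ratio = hd / sd                         # about 2.5 times more raw pixels
```

<p>Note that the raw pixel count grows by roughly 2.5 times, and the larger per-object neighborhoods mentioned above come on top of that.</p>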
<p>Here is an example of the output as generated by the less sophisticated C# online implementation:   </p>
<p>   <strong>Source:</strong>   </p>
<p><strong>      <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/12/634191287412064000Src.png"><img class="alignnone size-full wp-image-22301" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/12/634191287412064000Src.png" alt="CandleLight Source Image" width="640" height="480" /></a> </strong>   </p>
<p>   <strong>Product:</strong>   </p>
<p><strong><strong>     </strong> <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/12/634191287412064000Out.png"><img class="alignnone size-full wp-image-22302" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/12/634191287412064000Out.png" alt="CandleLight Product Image" width="640" height="480" /></a></strong>   </p>
<p>I will now survey the potential alternatives we considered with pros and cons for each possible alternative.   </p>
<p>When I became an Intel Black-Belt this year I received an impressive glass certification plate, licenses to Intel software products and a brand new laptop. Dell Latitude 6410 with Core i-7 running 64 bit version of Windows 7. My old laptop had a dual-core CPU with a slow hard-drive so I didn't even try to install Visual Studio 2010 on it. The new laptop is a completely different story. The new machine is running Visual Studio 2010 with no problems. In fact an image of my old laptop is running as Virtual PC inside my new laptop and it is executing much faster than the original device (easily done, email me and I'll tell you how).   </p>
<p>The C++ DirectShow implementation of the algorithm was first tested on this new Core i-7 machine. This finally allowed live encoding of a 640 x 480 video input but was far from enough for live encoding of HD ('beta' input) video. Here are the possible alternatives we have considered:   </p>
<p><strong><span style="text-decoration: underline;">Logical Optimizations of the code</span>: </strong>This means selecting the best effective internal resolution, color sensitivity, brightness compensation, linear adjustments etc. <strong>Reasoning: </strong>This is a way to eliminate pointless calculations and data processing which is not really relevant to the viewer. <strong>Pros: </strong>This will allow maximum performance for a given hardware / infrastructure. <strong>Cons:</strong> Requires great effort to produce, the high risk of bugs requires a long QA period, and overlooking a scenario is very costly, which means a long design process. <strong>Product Quality:</strong> Very Good. <strong>Estimated Cost:</strong> 6 months.   </p>
<p><strong><span style="text-decoration: underline;">Parallel Software Design</span>:</strong> The algorithm is written sequentially. A good parallel design will allow best utilization of CPU Cores with minimal dependencies and inter-locking. <strong>Reasoning:</strong> Every CPU has multiple Cores. Making the most of the CPU cores can produce performance increase of several multiples. <strong>Pros: </strong>CPUs are expected to grow in the number of cores, which will allow performance improvements by simply buying new hardware. A good parallel design will also prepare the software for use on multiple machines for example as part of an HPC array. <strong>Cons: </strong>This process is sensitive to, and takes into account, the usage of internal data structures. This means that every change in internal usage of data members and cache will result in a redesign, so this stage is the last stage of development, after the mathematics and all other features are fully closed. It is also possible that this will not be enough, and this solution has a resolution barrier per number of CPU Cores. <strong>Product Quality: </strong>Good, but not as good as with logical optimizations. <strong>Estimated Cost: </strong>2 weeks for every fix to the algorithm.   </p>
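<p>As a rough illustration of the parallel design option, here is a minimal sketch that splits a frame into horizontal bands and hands each band to a worker. It is not the actual algorithm: the per-pixel work is a stand-in (inverting 8-bit gray values), all names are assumptions, and in standard Python the threads mainly show the decomposition, since pure Python code does not run in parallel on multiple cores.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def process_band(band):
    """Stand-in per-pixel work: invert 8-bit gray values."""
    return [[255 - px for px in row] for row in band]


def split_into_bands(frame, bands):
    """Cut the frame into at most `bands` groups of consecutive rows."""
    size = max(1, -(-len(frame) // bands))  # ceiling division, at least 1
    return [frame[i:i + size] for i in range(0, len(frame), size)]


def process_frame_parallel(frame, workers=4):
    """Process independent bands concurrently, then stitch them back."""
    bands = split_into_bands(frame, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_band, bands))  # keeps band order
    return [row for band in results for row in band]
```

<p>This only works this simply because each band is independent. As soon as a pixel needs its neighbors from another band, the bands share data, and that is where the sensitivity to internal data structures described above comes in.</p>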
<p><strong><span style="text-decoration: underline;">Distribution</span>: </strong>It is possible to take the implementation of the algorithm as is and distribute it with minimal parallel optimizations. <strong>Reasoning:</strong> The idea is that we use enough computers to get enough processing power even if the algorithm is not optimized at all. Higher resolution required? Just add more machines. <strong>Pros:</strong> Very easy to increase processing power. <strong>Cons:</strong> Requires the software to support distribution. Also it is possible that there is a barrier due to the massive amounts of data exchanged between machines. Latency is not guaranteed. <strong>Product Quality: </strong>Poor. Great increase in storage, power consumption, heat generated, system costs, and LAN infrastructure. <strong>Estimated Cost:</strong> 2 months of R&amp;D and dramatic increase in product costs.   </p>
<p>At this point we went back to the drawing board. We tried to think of a way to enter the market with what we have or with minimal R&amp;D efforts until we have a more concrete opportunity. We moved our efforts to the business arena, waiting for a good reason to invest in R&amp;D. As you can see above this is not a small effort. But then...   </p>
<p>Then I came across the Visual Studio 2010 CPU Acceleration support. The following steps are amazingly simple and fast to implement. Let's start by adding this option to the list:   </p>
<p><strong><span style="text-decoration: underline;">Visual Studio 2010 Built-in CPU Acceleration</span>: </strong>This option means that we try to get the best performance possible without having to go into the implementation details. <strong>Reasoning: </strong>Basically you can take my implementation of the algorithm and, with minimal code modifications and project settings, try to get the most out of it. <strong>Pros:</strong> Fast, simple to implement, out-of-the-box, very little chance for bugs and bugs are relatively simple to detect. <strong>Cons:</strong> This is not a real optimization to the code. The code will still perform redundant operations and the parallel optimization is far from optimal. <strong>Product Quality:</strong> As good as the latest implementation that we currently have. <strong>Estimated Cost:</strong> Faster than writing this blog post (less than a day).   </p>
<p>To simplify the process I will now create a new project and write down the steps I am taking one by one. I will not use the real algorithm code. Instead I will take a simple implementation of an operation. A sample video frame of 1280 x 720 pixels is used as input. For every pixel we will calculate the square root, multiply the result by itself, and calculate the square root again, so pixel <em>P = sqrt( P ); P = P * P;  P = sqrt( P );</em>.   </p>
<p>Here is the simple code:</p>
<p><span style="color: #0000ff;">void</span> LoadImage(<span style="color: #0000ff;">float</span>* pixels);<br />
<span style="color: #0000ff;">const int </span>Width = 1280;<br />
<span style="color: #0000ff;">const int</span> Height = 720;<br />
<span style="color: #0000ff;">const int </span>SEC = 120;</p>
<p><span style="color: #0000ff;">float</span>* pixels = <span style="color: #0000ff;">new float</span>[Width * Height];<br />
LoadImage( pixels );</p>
<p><span style="color: #0000ff;">for</span>(<span style="color: #0000ff;">int</span> frames = 0; frames&lt;30*SEC; frames++)<br />
{<br />
   <span style="color: #0000ff;">for</span> (<span style="color: #0000ff;">int</span> y = 0; y &lt; Height; y++)<br />
   {<br />
      <span style="color: #0000ff;">int</span> Y = y * Width;<br />
      <span style="color: #0000ff;">for</span> (<span style="color: #0000ff;">int</span> x = 0; x &lt; Width; x++)<br />
      {<br />
         pixels[Y + x] = sqrt( pixels[Y + x] );<br />
         pixels[Y + x] = pixels[Y + x] * pixels[Y + x];<br />
         pixels[Y + x] = sqrt( pixels[Y + x] );<br />
      }<br />
   }<br />
}</p>
<p>The code generates a call to a CRT function that does the sqrt calculation. Actually there are several internal calls until the code returns. Total processing time for this is <strong><span style="color: #ff0000;">57.361 sec </span></strong>(57 seconds and 361 milliseconds). Here is the assembly generated. The assembly code is disorganised because this is compiled as <em>Release</em> and not <em>Debug</em> so that compiler optimizations are in effect.   </p>
<p><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;">01231055 fld dword ptr [esi]<br />
01231057 <span style="color: #0000ff;">call _CIsqrt</span> (1231910h)<br />
0123105C fstp dword ptr [ebp-0Ch]<br />
0123105F fld dword ptr [ebp-0Ch]<br />
01231062 <span style="color: #0000ff;">fmul </span>st(0),st<br />
01231064 fstp dword ptr [ebp-0Ch]<br />
01231067 fld dword ptr [ebp-0Ch]<br />
0123106A <span style="color: #0000ff;">call _CIsqrt</span> (1231910h)<br />
0123106F fstp dword ptr [ebp-0Ch]<br />
01231072 fld dword ptr [ebp-0Ch]<br />
01231075 add esi,4<br />
01231078 dec edi<br />
01231079 fstp dword ptr [esi-4]<br />
0123107C jne main+55h (1231055h)</span></span></span></p>
<p><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"> </span></span></span></p>
<p><span style="color: #000000;"><strong><span style="text-decoration: underline;">First optimization: OpenMP</span></strong></span></p>
<p><span style="color: #000000;">Go to <em>Project Properties / Configuration Properties / C/C++ / Language </em>and set "OpenMP Support" to "Yes".</span></p>
<p><span style="color: #000000;">Here is the modified code:</span></p>
<p><span style="color: #0000ff;">for</span>(<span style="color: #0000ff;">int</span> frames = 0; frames&lt;30*SEC; frames++)<br />
{<br />
   <strong><span style="color: #0000ff;">#pragma</span> omp parallel</strong><br />
   <strong><span style="color: #0000ff;">#pragma</span> omp <span style="color: #0000ff;">for</span></strong><br />
   <span style="color: #0000ff;">for</span> (<span style="color: #0000ff;">int</span> y = 0; y &lt; Height; y++)<br />
   {<br />
      <span style="color: #0000ff;">int</span> Y = y * Width;<br />
      <span style="color: #0000ff;">for</span> (<span style="color: #0000ff;">int</span> x = 0; x &lt; Width; x++)<br />
      {<br />
         pixels[Y + x] = sqrt( pixels[Y + x] );<br />
         pixels[Y + x] = pixels[Y + x] * pixels[Y + x];<br />
         pixels[Y + x] = sqrt( pixels[Y + x] );<br />
      }<br />
   }<br />
}</p>
<p><span style="color: #000000;">All four CPU Cores are continuously busy and we get a time of <strong><span style="color: #ff0000;">32.276 sec</span></strong>. If we optimized the code (as mentioned in the section above) we should be closer to 1/3 or even 1/4 of the original processing time.</span></p>
<p><span style="color: #000000;">OpenMP is simple and most of you have already heard of it before. Now here is something even simpler that would probably surprise you.</span></p>
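<p><span style="color: #000000;">As a side note, the two directives can usually be merged into a single <em>#pragma omp parallel for</em>. The following is only a minimal sketch of that form: the function name <em>processFrameOpenMP</em> and its parameters are made up for illustration and are not part of the project described in this post.</span></p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the original project): applies the
// sqrt / square / sqrt sequence to one frame, sharing rows among threads.
void processFrameOpenMP(float* pixels, int width, int height)
{
    assert(pixels != 0 && width > 0 && height > 0);

    // Combined directive: creates the thread team and divides the
    // iterations of the row loop among its threads. Variables declared
    // inside the loop body are private to each thread.
    #pragma omp parallel for
    for (int y = 0; y < height; y++)
    {
        float* row = pixels + y * width;   // start of this row
        for (int x = 0; x < width; x++)
        {
            float p = std::sqrt(row[x]);
            p = p * p;
            row[x] = std::sqrt(p);
        }
    }
}
```

<p><span style="color: #000000;">Each row is written by exactly one thread, so no locking is needed. If OpenMP support is switched off the pragma is ignored and the same function simply runs on one core.</span></p>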
<p><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="color: #000000;"> </span></span></span></span></p>
<p><span style="color: #000000;"><strong><span style="text-decoration: underline;">Second optimization: SSE2</span></strong></span></p>
<p><span style="color: #000000;">Instead of adding SSE support to the OpenMP version I will revert to the original code (before OpenMP) and add SSE support. SSE stands for <em>Streaming SIMD Extensions</em>, where SIMD is <em>Single Instruction Multiple Data</em>. There are several versions of these extensions, the latest today called AVX. The idea is that most CPU (assembly) operations operate on a single data variable either float, byte, word, double word, etc. SIMD operations started with MMX and allow for example incrementing the value by one for eight variables with a single CPU Instruction. This works by using a special CPU Register which can store more than a single variable. Here is the procedure:</span></p>
<p><span style="color: #000000;">Go to <em>Project Properties / Configuration Properties / C/C++ / Code Generation </em>and set "Enable Enhanced Instruction Set" to "Streaming SIMD Extensions 2".</span></p>
<p><span style="color: #000000;">This activates the SIMD feature in the compiler. The source is the original code, unchanged; rebuilding the application simply produces new machine code in which the CRT function calls are replaced with single CPU instructions. Now the assembly code looks like this:</span></p>
<p><span style="color: #000000;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="color: #0000ff;">00D71055 movss xmm0,dword ptr [eax]<br />
00D71059 cvtps2pd xmm0,xmm0<br />
00D7105C sqrtsd xmm0,xmm0<br />
00D71060 cvtpd2ps xmm0,xmm0<br />
00D71064 cvtss2sd xmm0,xmm0<br />
00D71068 mulsd xmm0,xmm0<br />
00D7106C cvtpd2ps xmm0,xmm0<br />
00D71070 cvtss2sd xmm0,xmm0<br />
00D71074 sqrtsd xmm0,xmm0<br />
00D71078 cvtpd2ps xmm0,xmm0<br />
00D7107C movss dword ptr [eax],xmm0<br />
</span>00D71080 add eax,4<br />
00D71083 dec ecx<br />
00D71084 jne main+55h (0D71055h)</span></span></span></span><span style="color: #000000;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"> </span></span></span></span><span style="color: #000000;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"><span style="font-family: Verdana; color: #808080; font-size: x-small;"> </span></span></span></span></p>
<p><span style="color: #000000;">The part of the code that remained the same belongs to the <em>for</em> loop. We made zero modifications to the source code. Now let's see how the new code behaves. The original version took <strong><span style="color: #ff0000;">57.361 sec</span></strong>. Running the same application with SSE2 turned on resulted in the amazing time of <strong><span style="color: #ff0000;">15.320 sec</span></strong>!</span></p>
<p><span style="color: #000000;">Microsoft Visual Studio 2010 can also automatically generate AVX code. For this you will need to cancel SSE2 support by setting "Enable Enhanced Instruction Set" to "Not Set", then go to <em>Project Properties / Configuration Properties / C/C++ / Command Line </em>and manually add "/arch:AVX" under "Additional Options".</span></p>
<p><span style="color: #000000;">You can find out more about these commands here: <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a></span></p>
<p><span style="color: #000000;"> </span></p>
<p><span style="color: #000000;">But wait... there is even more! The compiler does not use all of these instructions automatically. What we just did above was to ask the compiler to use the CPU acceleration for square root computation, one value at a time. The CPU can also perform this operation on several floating point variables at the same time, using the same CPU Core.</span></p>
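<p><span style="color: #000000;">To make the idea of one instruction working on several values more concrete, here is a small sketch that adds 1 to four floats with a single packed addition. The helper name <em>addOneToFour</em> is invented for this illustration; the intrinsics themselves come from <em>&lt;xmmintrin.h&gt;</em>.</span></p>

```cpp
#include <cassert>
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...

// Hypothetical helper: increments four consecutive floats in one step.
void addOneToFour(float* values)
{
    assert(values != 0);

    __m128 v    = _mm_loadu_ps(values);   // load 4 floats (no alignment required)
    __m128 ones = _mm_set1_ps(1.0f);      // register holding {1, 1, 1, 1}
    v = _mm_add_ps(v, ones);              // one instruction, four additions
    _mm_storeu_ps(values, v);             // write the 4 results back
}
```

<p><span style="color: #000000;">Calling it on an array holding 1, 2, 3, 4 leaves 2, 3, 4, 5 in the array.</span></p>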
<p><span style="color: #000000;"><strong><span style="text-decoration: underline;">SSE Packed Operations</span></strong> </span></p>
<p><span style="color: #000000;">We add to our code the support for CPU Intrinsic functions by adding the header </span><span style="font-family: Verdana; color: #0000ff; font-size: x-small;"><span style="font-family: Verdana; color: #0000ff; font-size: x-small;"><span style="font-family: Verdana; color: #0000ff; font-size: x-small;">#include </span></span></span><span style="font-family: Verdana; color: #a31515; font-size: x-small;"><span style="font-family: Verdana; color: #a31515; font-size: x-small;"><span style="font-family: Verdana; color: #a31515; font-size: x-small;">&lt;emmintrin.h&gt;</span></span></span><span style="color: #000000;">.</span></p>
<p><span style="color: #000000;">Now we modify the code so that the buffer of pixels which is an array of floating point values is used as an array of <em>Packed Floating Point</em> values. Packed means that there are several different and unrelated values packed into a single variable. Here is the modified code:</span></p>
<p><span style="color: #0000ff;">for</span>(<span style="color: #0000ff;">int</span> frames = 0; frames&lt;30*SEC; frames++)<br />
{<br />
    <span style="color: #0000ff;">for</span> (<span style="color: #0000ff;">int</span> y = 0; y &lt; Height; y++)<br />
    {<br />
        <span style="color: #0000ff;">int</span> Y = y * Width;<br />
        <span style="color: #0000ff;">for</span> (<span style="color: #0000ff;">int</span> x = 0; x &lt; Width;<span style="color: #ff9900;"> <strong>x += 4</strong></span>)<br />
        {<br />
            (*(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x)) =<span style="color: #339966;"> <strong><span style="color: #0000ff;">_mm_sqrt_ps</span></strong></span>(*(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x));<br />
            (*(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x)) = <span style="color: #0000ff;"><strong>_mm_mul_ps</strong></span>(*(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x), *(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x));<br />
            (*(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x)) = <span style="color: #0000ff;"><strong>_mm_sqrt_ps</strong></span>(*(<span style="color: #0000ff;">__m128</span>*)(pixels+Y+x));<br />
        }<br />
    }<br />
}</p>
<p><span style="color: #000000;">The type <span style="color: #0000ff;">__m128</span> is recognized by the compiler as a packed type of four float values. Now the CPU will operate on four pixels per instruction instead of one, so we advance the <em>for loop</em> variable <strong>x</strong> by 4 instead of 1. You can learn more about these intrinsic functions by pressing F1 in Visual Studio 2010 or by searching the MSDN Library. There is a full listing of supported operations.</span></p>
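<p><span style="color: #000000;">One caveat: dereferencing a <em>float*</em> as <em>__m128*</em> assumes the address is 16-byte aligned, and a plain <em>new float[]</em> is not guaranteed to provide that on every platform. The sketch below is a more defensive variant under that assumption. It uses unaligned loads and stores, keeps the value in a register between the three operations, and handles a pixel count that is not a multiple of four. The function name <em>processPackedUnaligned</em> is hypothetical.</span></p>

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>   // SSE intrinsics: _mm_loadu_ps, _mm_sqrt_ps, ...

// Hypothetical helper: sqrt / square / sqrt over 'count' floats,
// four at a time, without requiring 16-byte alignment.
void processPackedUnaligned(float* pixels, int count)
{
    assert(pixels != 0 && count >= 0);

    int i = 0;
    for (; i + 4 <= count; i += 4)
    {
        __m128 v = _mm_loadu_ps(pixels + i);   // unaligned load of 4 pixels
        v = _mm_sqrt_ps(v);                    // 4 square roots
        v = _mm_mul_ps(v, v);                  // 4 multiplications
        v = _mm_sqrt_ps(v);                    // 4 square roots
        _mm_storeu_ps(pixels + i, v);          // unaligned store
    }

    // Scalar tail for the last 0..3 pixels.
    for (; i < count; i++)
    {
        float p = std::sqrt(pixels[i]);
        p = p * p;
        pixels[i] = std::sqrt(p);
    }
}
```

<p><span style="color: #000000;">Unaligned access can be slower than aligned access on some CPUs, so this trades a little speed for safety. Allocating the buffer with 16-byte alignment is the other way to go.</span></p>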
<p><span style="color: #000000;">When we execute the code now we get the amazing result of <strong><span style="color: #ff0000;">3.729 sec</span></strong>!</span></p>
<p><span style="color: #000000;">Adding OpenMP support reduces this even further, down to <strong><span style="color: #ff0000;">2.246 sec</span></strong>. Such a reduction means that two CPU Cores are enough, so we can tell OpenMP to use only two cores and still get the same results, freeing two cores for other applications. (We do this by using </span><span style="font-family: Verdana; color: #0000ff; font-size: x-small;">#include </span><span style="font-family: Verdana; color: #a31515; font-size: x-small;">&lt;omp.h&gt;</span><span style="color: #000000;"> and calling </span><span style="font-family: Verdana; font-size: x-small;"><em>omp_set_num_threads(2);</em></span><span style="color: #000000;"> before our code).</span></p>
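<p><span style="color: #000000;">For illustration, the combination of OpenMP and packed SSE could look roughly like the sketch below. It limits the thread count with the <em>num_threads(2)</em> clause on the directive, which is an alternative to calling <em>omp_set_num_threads(2)</em>. The function name <em>processFrameTwoThreads</em> is an assumption of this sketch, as is the requirement that the width is a multiple of four (true for 1280).</span></p>

```cpp
#include <cassert>
#include <xmmintrin.h>   // SSE intrinsics

// Hypothetical helper: rows are shared between two threads, and each
// thread processes its rows four pixels at a time.
void processFrameTwoThreads(float* pixels, int width, int height)
{
    assert(pixels != 0 && width > 0 && height > 0);
    assert(width % 4 == 0);   // assumption: no partial group at the end of a row

    // Limit this parallel region to two threads.
    #pragma omp parallel for num_threads(2)
    for (int y = 0; y < height; y++)
    {
        float* row = pixels + y * width;
        for (int x = 0; x < width; x += 4)
        {
            __m128 v = _mm_loadu_ps(row + x);   // 4 pixels, alignment not required
            v = _mm_sqrt_ps(v);
            v = _mm_mul_ps(v, v);
            v = _mm_sqrt_ps(v);
            _mm_storeu_ps(row + x, v);
        }
    }
}
```

<p><span style="color: #000000;">The clause affects only this loop, while <em>omp_set_num_threads</em> changes the default for the parallel regions that follow it.</span></p>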
<p><span style="color: #000000;"> </span></p>
<p><span style="color: #000000;"><strong><span style="text-decoration: underline;">Conclusion</span></strong></span></p>
<p><span style="color: #000000;">Instead of doing massive code redesign and optimization, which requires a good understanding of the logic behind the code and is very bug prone, we can change a few compiler switches and replace C Runtime (CRT) Library function calls with CPU intrinsics, without even understanding what the code does. In the example above this reduced the execution time from <strong><span style="color: #ff0000;">57.361</span></strong> seconds to <strong><span style="color: #ff0000;">2.246 </span></strong>seconds. This can be achieved very fast by a programmer who has very little idea of the code implementation.</span></p>
<p><span style="color: #000000;">Fast and simple code changes resulted in a roughly 25-fold increase in performance, with two CPU Cores left free for other work. This means the equivalent of a cluster of <strong>50</strong> machines!</span></p>
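<p><span style="color: #000000;">The speedup of each step can be worked out directly from the timings reported in this post. The tiny helper below is only a convenience for that arithmetic; the constant and function names are invented here.</span></p>

```cpp
#include <cassert>

// Timings in seconds, copied from the measurements reported in this post.
const double kBaselineSec    = 57.361;  // original code
const double kOpenMPSec      = 32.276;  // OpenMP only
const double kScalarSSE2Sec  = 15.320;  // SSE2 compiler switch only
const double kPackedSSESec   = 3.729;   // packed SSE intrinsics
const double kPackedOpenMPSec = 2.246;  // packed SSE intrinsics + OpenMP

// Speedup factor: how many times faster the optimized run is.
double speedup(double baselineSeconds, double optimizedSeconds)
{
    assert(optimizedSeconds > 0.0);
    return baselineSeconds / optimizedSeconds;
}
```

<p><span style="color: #000000;">With these numbers the factors come out at about 1.8 for OpenMP alone, 3.7 for the SSE2 switch, 15.4 for packed SSE, and 25.5 for packed SSE with OpenMP.</span></p>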
<p><span style="color: #000000;">You were looking for parallel programming to help your application perform? Don't underestimate AVX!</span></p>
<p><span style="color: #000000;"> </span></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/12/20/visual-studio-2010-built-in-cpu-acceleration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Multi-core increases Software Reliability</title>
		<link>http://software.intel.com/en-us/blogs/2010/11/06/how-multi-core-increases-software-reliability/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/11/06/how-multi-core-increases-software-reliability/#comments</comments>
		<pubDate>Sat, 06 Nov 2010 22:26:43 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Mobility]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[Enterprise Architecture]]></category>
		<category><![CDATA[guest blogger]]></category>
		<category><![CDATA[multi-core]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/11/06/how-multi-core-increases-software-reliability/</guid>
		<description><![CDATA[There are several aspects of software reliability. One aspect for example can be detected by the compiler by doing simple type checking. On the complete opposite side there is business logic and its faults. This is what we really want to focus on. Business logic is the hardest part to debug and verify. The rain-forests [...]]]></description>
			<content:encoded><![CDATA[<p>There are several aspects of software reliability. One aspect for example can be detected by the compiler by doing simple type checking. On the complete opposite side there is business logic and its faults. This is what we really want to focus on.</p>
<p>Business logic is the hardest part to debug and verify. The rain-forests lost a branch or two for the documentation created to test and verify business logic. Sometimes the reliability of business logic is just nice to have, sometimes it is important for a company not to release failing software but there are situations where you just cannot afford to release a bad product for example the software managing a medical device.</p>
<p>Statistically it is almost impossible to find all bugs in a system. The application has many states and many flows and for each of these there are many possible values for parameters. The branching is practically infinite. Testers take the worst cases, the extreme cases, the most common scenarios, etc. If your device is a smart-card storing a user's allergy information and you have a Flash-write bug once every 100,000,000 writes, it will be almost impossible to detect with any testing protocol. On the other hand, if you have 1,000,000 users and the device is saving data every day, that is 1,000,000 writes a day, or one failure roughly every 100 days, so you have a few people with damaged data every year. This is also valid for the PC side software analyzing the device's data.</p>
<p>PC software can have the same statistics as any other software but on the PC side we have better tools for testing and verification so you might think that PC applications have an easier life... except that:</p>
<ul>
<li>Everybody knows that PC software has automated solutions that do a good job covering the majority of common cases, so there isn't really any point in inventing the testing wheel.</li>
<li>The PC software is using tested and verified components and we are using the latest edition of Intel Processor with the most expensive server money can buy, so what can go wrong?</li>
<li>The product is a smart-card device so the business focus is on smart-card devices. Most efforts go in that direction including patents, strategy, finances, etc. The PC application is less important for the business so it might even be out-sourced to a different company or even a company in a different country.</li>
<li>Device software developers and firmware programmers usually pay more attention to certain details than PC software developers. First and foremost because of the infrastructure: PC applications usually throw exceptions when something is wrong whereas Firmware has only a few possible return values. To tell you the truth, after a while you get used to it. When I am writing a device driver I keep an eye on everything but when I write a .Net C# application I trust the system to throw an exception of an unknown type, one that I will never be able to predict, so I might as well let it just blow...</li>
</ul>
<p>Let's take the worst case scenario. You might be surprised (or not) to hear that this is also the most common scenario. Here it is:</p>
<p>Our company makes smart-card devices. We don't really care about the PC application. Our investors didn't pay us to write PC applications and we don't see any business value in holding such intellectual property. Our team is made of smart-card experts not PC experts so we might as well give the PC software task to a company who can guarantee results without having to waste our R&amp;D efforts.</p>
<p>We took a company in Romania to do data collection from our device, store it in a database and then present it to the user when required. Then we found out that they were having problems with the data analysis algorithms. This was not part of the original requirement but our competition already has the feature working so we have to support it. It looks like they are database experts but know nothing about algorithms. We had four candidates for the algorithm and eventually decided to go with the German team.</p>
<p>After four months of work we had a working solution and sent it to the Romanian company for integration and testing. Everything looked fine until we got to the limited launch for a group of 5,000 users. It looks like there are some errors in the diagrams which we cannot attribute to any input data.</p>
<p>It is possible that the algorithm does not cover all cases correctly by design. It is possible that there was an error in coding the algorithm even though the mathematics works fine on paper, for example someone used an <strong>int</strong> (signed integer) instead of an <strong>unsigned int</strong>. It is also possible that there is a memory access violation, for example if the algorithm writes past the end of a buffer of exactly 1024 bytes and another buffer right after it was being used by the GUI thread at that time. Whichever the case, most companies would just give up and accept the fact that they have a faulty product. Maybe look for someone else to do the work for another six months, maybe pay someone to verify the algorithm, pay someone to code-review it... None of these solutions is good enough for this year's financial graphs.</p>
<p>Such things usually shock a company, delay finances, multiply product costs several times over, delay release dates, kill the product just as it is entering the market, cause the company to make internal changes, the investors to demand changes, and so on.</p>
<p>If you could take a look at the German company writing the algorithm you would find a single person who is an expert in smart-card data analysis, with over 20 years of experience in the field working for major companies, first as an employee and then as a freelancer. This single person has all the knowledge and does all the coding. Most times you don't need to be an expert in multi-threading to provide such a solution, so the algorithm does the internal business logic serially. This is perfectly acceptable because it takes no more than one second to evaluate all the data on the device, and one second is what the product teams decided is acceptable for the user.</p>
<p>If we had gone ahead with any other algorithm provider it would probably have been the same situation, with a small exception: the bugs found in each implementation of the algorithm are different. So for example it is possible that one implementation would provide bad data when the date range is exactly 21 days and 6 hours, and another would repeat the last result if the input value is exactly -5,475.</p>
<p>How about dramatically reducing the risk beforehand by asking all four companies to provide their algorithm? We can test each algorithm for validity after we have it and we can replace one algorithm with another if we find a critical bug.</p>
<p>How about using all four implementations at the same time?</p>
<p>Run all four implementations of the algorithm in parallel. On a multi-core CPU with enough free cores this adds very little to the elapsed time, so we can maintain the one second limit even with four different instances all running in parallel. We can then compare the results in a kind of voting. If three implementations gave the same result then it is the correct one. An implementation that gave a result different from all others will be marked as a bug and ignored. It is also possible to identify the mistake, because we know which input made which implementation disagree. With a single implementation of the algorithm it takes an expert to detect such data integrity errors. Most importantly such bugs are now low priority, not critical. We have a medium-priority bug only if the voting is even, 2 vs. 2, and even then we know that the data is damaged. The only critical error is when 3 or more implementations produce the same erroneous value, which is extremely rare unless something is really wrong with the theory behind the algorithm.</p>
<p>When it comes to medical devices, automotive industry, heavy machinery, military devices, and many others, software reliability means life or death. If the device tells your doctor that you received 0.01 cc of pink injection after the car accident instead of the real 0.1 cc that you received, well... let's just say that software saves lives because doctors rely on it and consider it dependable.</p>
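<p>The voting step can be sketched in a few lines. This is only an illustration under stated assumptions, not code from any of the vendors: the names <em>VoteResult</em> and <em>Voter</em> are hypothetical, every implementation is assumed to be a function from an integer input to an integer result, and error handling (an implementation that throws or never returns) is left out. It relies on <em>Parallel.For</em> from .Net 4:</p>
<pre name="code" class="c-sharp">using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class VoteResult
{
   public bool HasMajority;   // true when more than half of the implementations agree
   public int Value;          // the most common result (suspect when HasMajority is false)
   public List&lt;int&gt; Dissenters = new List&lt;int&gt;();   // indexes of implementations that disagreed
}

public static class Voter
{
   public static VoteResult Run(IList&lt;Func&lt;int, int&gt;&gt; implementations, int input)
   {
      // Each implementation writes only to its own slot, so no locking is needed.
      int[] results = new int[implementations.Count];
      Parallel.For(0, implementations.Count, i =&gt; { results[i] = implementations[i](input); });

      // The largest group of identical results wins the vote.
      var winner = results.GroupBy(r =&gt; r).OrderByDescending(g =&gt; g.Count()).First();

      VoteResult vote = new VoteResult();
      vote.Value = winner.Key;
      vote.HasMajority = winner.Count() * 2 &gt; results.Length;
      for (int i = 0; i &lt; results.Length; i++)
      {
         if (results[i] != vote.Value) vote.Dissenters.Add(i);   // log these as low-priority bugs
      }
      return vote;
   }
}</pre>
<p>With four implementations, three or four matching results set <em>HasMajority</em> and the odd one out lands in <em>Dissenters</em>. A 2 vs. 2 split leaves <em>HasMajority</em> false, which is the "we know that the data is damaged" case. Real results are often floating point, in which case the comparison would need a tolerance instead of exact equality.</p>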
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/11/06/how-multi-core-increases-software-reliability/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Design Your System To Automatically Test Itself</title>
		<link>http://software.intel.com/en-us/blogs/2010/09/24/design-your-system-to-automatically-test-itself/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/09/24/design-your-system-to-automatically-test-itself/#comments</comments>
		<pubDate>Fri, 24 Sep 2010 16:24:11 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Enterprise Architecture]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Service Oriented Architecture]]></category>
		<category><![CDATA[SOA]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/09/24/design-your-system-to-automatically-test-itself/</guid>
		<description><![CDATA[On average programmers write only 10 lines of code per day and that includes copy-pasting from the Internet. Sure, every system can be designed using just four blocks and every programmer can write over 1000 lines of code every day. What's stopping us from working so fast are the bugs. Beginner programmers write the code very fast, then things [...]]]></description>
			<content:encoded><![CDATA[<p>On average programmers write only 10 lines of code per day and that includes copy-pasting from the Internet. Sure, every system can be designed using just four blocks and every programmer can write over 1000 lines of code every day. What's stopping us from working so fast are the bugs.</p>
<p>Beginner programmers write the code very fast, then things break and you go back to debug the problems as they appear. Soon enough you learn to test for bugs before you release your product. At that point you learn that you cannot test everything, it is hard to simulate the real-world environment, and the worst part is that adding new features adds new bugs. Some of these new bugs cause older and tested features to break again.</p>
<p>There are several approaches we take:</p>
<ul>
<li><strong>Built In Test (BIT) or Self Test</strong>: Before the system is started we want to make sure that things are really as we assume in our code. For example that the RAM is not damaged, our database does exist and it is not empty or corrupt, etc. This way we start our system in a controlled environment.</li>
<li><strong>Automated Testing</strong>: We write a collection of scripts that run over features in our application. If something breaks we can find it on the next test. These tests are written side-by-side with the code. In a way it means writing the code twice in two different languages: once telling the computer what to do, and a second time describing what should have happened. The first is usually object-oriented code and the second is commonly procedure based.</li>
<li><strong>Non-Automated Testing</strong>: We have a person sit in front of the machine and perform tasks that users are expected to perform. This is mainly for testing a flow in the application and the higher level business logic.</li>
<li><strong>Code Level Verification</strong>: Coders need to verify parameters, pointers, handles, catch exceptions, etc. This means verifying the behaviour of internal software modules according to the Interfaces as described by the system design.</li>
<li><strong>Application Level Verification</strong>: The application maintains different states of operation and different phases in the execution flow. For example: You cannot unload the application until all communication has completed. You don't expect a 'cancel' request if there isn't any ongoing operation. A Deadlock is also handled at application level because the protected resource is damaged.</li>
</ul>
<p>All the methods above require a high degree of understanding of the logic behind operations. In other words the programmer writing these tests must understand the Interfaces and know the limitations of other components in the system. For example if you want to use printf(...) you must know that passing a NULL pointer can crash the application (or raise an exception, depending on the platform), so you have to either verify the pointer or catch the exception. There are reasons why we have more than a single programmer coding the entire system, such as specialization - you are a database expert and I am a media expert, skill - an expert will code the drivers and the less experienced will code the test application, and so on. A direct result is that most programmers cannot know all the limitations of the infrastructure that they are using. For example you may test printf(...) for a NULL parameter but can you make sure that the output does not exceed a single line on screen?</p>
<p>The display driver knows if the text exceeded a single line. Has anyone ever thought of testing that? Who is to make sure that the text inside a button will not wrap around?</p>
<p>Automated tests are written by developers. These tests verify the system externally, for example by entering different user inputs and testing the end result on screen. You have to know the application very well in order to write an automated test. You have to find your way through to activate the correct window and then send the input that might cause the application to misbehave. Misbehaviour is a very broad term, and documents of many pages are written for every application in order to define it.</p>
<p><strong>Here is an alternative:</strong></p>
<p>Every application has 3 fundamental layers: Input, Output, and Data Manipulation. Another representation is:</p>
<p>- User Interface: The Interface that the application is exposing and the services that it is providing.</p>
<p>- Business Logic: What the application actually does. This is some data manipulation, data cache, or data generation.</p>
<p>- Infrastructure Interface: The collection of Interfaces and services which are used by the application.</p>
<p>In a way every application is translating data from input to output.</p>
<p>The title says that we can automatically generate automated tests. This is most relevant for Task Oriented systems in which tasks are chained from input to output. Object Oriented design suffers in this aspect since it is too easy to have multiple objects all receiving services from each other. That is why the pattern I am about to show you is becoming relevant for us.</p>
<p>We can start with <strong>a simple example:</strong></p>
<p>The user is selecting a textbox in my application, typing a few words and pressing <em>Enter</em>. As a result the GUI layer will send the event "New Text" on field ID "Email Address" to the User Registration layer. This layer understands the meaning of the given field and will add this to the rest of the user details that were entered. Finally when the user clicks "Submit" all the user details are sent to the next layer - the database manager. The database manager will verify the information and save it to the database if correct or reply with an error code. If the save was successful the database manager will open up a dialog with all the details and a success icon.</p>
<p>Verifying the application requires understanding of this behaviour. For example you can use an empty email address, or an address that has two @ characters. What about an email address which is too long?</p>
<p>We enter bad input and expect the process to fail. If we enter an email address which is too long then the success dialog might exceed the length of a single line and wrap around. What happens when we add language support? Different letters have different widths. The success dialog might have far less space in German than it would in English for the input of</p>
<p>"Street Address:" _______________ because in German there might be room for fewer characters left after</p>
<p>"Straße und Hausnummer:" _______</p>
<p><strong>What we can do:</strong></p>
<p>The Success Dialog knows when there was not enough room for the text. If this dialog can report a detailed warning when something like that happens then we can know that there is a problem. In this case the dialog is the output. The input is from the User Registration component. Just as the Success Dialog is the only one to know how the text is displayed, the User Registration component knows what might be illegal information, overly long data, and other inputs that need verification. The programmer who is writing the User Registration layer knows what is not legal according to the Interface and what is not expected to be commonly used, for example <a href="mailto:test@@@.com">test@@@.com</a> is illegal and <a href="mailto:#@test.com">#@test.com</a> is legal but is not commonly expected.</p>
<p><strong>Rules are:</strong></p>
<p>* Every layer in the system can report a warning when something unexpected but non-fatal happens. For example: The user tried to save a document but the file name was rejected by the operating system. The layer can for example replace a single -<strong>"-</strong> character with two -<strong>'- -'-</strong> characters.</p>
<p>* Every layer in the system can generate illegal or extreme-condition data for the next layer. For example the mouse driver can generate triple clicks, fast mouse moves, multiple drag operations, etc.</p>
<p>* Every operation has an ID in the system, with originator, target, warnings, completion status, etc.</p>
<p>When the application is running it will log any warning generated from any layer. The system can enter Test Mode in which every layer can in its turn generate illegal and extreme data for the next layer.</p>
<p>The advantage is clear. For every module the programmer coding the business logic is also the one to verify the warnings and the one to generate testing outputs. When you are writing the module you know exactly what it can and can't do. Don't add a comment that nobody is going to read that says "Be careful, we allow # signs as part of an email address because it is legal, but some servers will reject it. Please verify". Instead you can write the code that emits such addresses, whether from a list or randomly generated. If you don't write this generation side by side with the business logic then you will probably forget the internals which are so important for such outputs.</p>
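<p>Here is a rough sketch of how the rules could look in code. All the names (<em>ITestableLayer</em>, <em>WarningLog</em>, <em>UserRegistrationLayer</em>, <em>SuccessDialog</em>) are made up for illustration, the single-line limit is reduced to a simple character count, and the registration layer is wired directly to the dialog for brevity, skipping the database manager:</p>
<pre name="code" class="c-sharp">using System;
using System.Collections.Generic;

// Hypothetical contract: in Test Mode a layer emits illegal or extreme data for the next layer.
public interface ITestableLayer
{
   IEnumerable&lt;string&gt; GenerateTestOutputs();
}

// Collects non-fatal warnings from any layer, tagged with the operation ID.
public class WarningLog
{
   public readonly List&lt;string&gt; Entries = new List&lt;string&gt;();

   public void Warn(string layer, int operationId, string message)
   {
      Entries.Add(layer + " [op " + operationId + "]: " + message);
   }
}

// The programmer of this layer knows which addresses are illegal or unusual.
public class UserRegistrationLayer : ITestableLayer
{
   public IEnumerable&lt;string&gt; GenerateTestOutputs()
   {
      yield return "";                                   // empty address
      yield return "test@@@.com";                        // illegal
      yield return "#@test.com";                         // legal but not commonly expected
      yield return new string('a', 300) + "@test.com";   // too long
   }
}

// Only the dialog knows whether the text fits, so it is the one to report it.
public class SuccessDialog
{
   private readonly WarningLog log;
   private readonly int maxChars;   // assumption: line capacity measured in characters

   public SuccessDialog(WarningLog log, int maxChars)
   {
      this.log = log;
      this.maxChars = maxChars;
   }

   public void Show(int operationId, string text)
   {
      if (text.Length &gt; maxChars)
         log.Warn("SuccessDialog", operationId, "text does not fit on a single line");
      // ... display the dialog as usual ...
   }
}

public static class TestMode
{
   // Feeds every generated output of one layer into the next and returns the warnings.
   public static List&lt;string&gt; Run(int maxChars)
   {
      WarningLog log = new WarningLog();
      SuccessDialog dialog = new SuccessDialog(log, maxChars);
      int operationId = 0;
      foreach (string email in new UserRegistrationLayer().GenerateTestOutputs())
         dialog.Show(operationId++, email);
      return log.Entries;
   }
}</pre>
<p>Calling <em>TestMode.Run(40)</em> logs a single warning, for the over-long address. Shrinking the limit, as a translation to German effectively does, makes more of the generated inputs trip the warning without anyone writing a new test.</p>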
<p>Such output generation is simple to code and therefore is also simple to review. Now if you change your system from English to German you can enter automated testing mode to find out for example that someone hard-coded a patch to bypass a case in which the street name is over 15 characters and the number is over 10000, even if they did add a note in the code...</p>
<p>As a general rule a good system design will allow all automated tests to be generated automatically. If this is not the case then your design is flawed or incomplete. Every test that cannot be generated by a component or layer in the system usually indicates that two or more layers or components have been fused together, usually because of an old design which was extended over the years.</p>
<p>Let me know how this works for you.</p>
<p>Asaf</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/09/24/design-your-system-to-automatically-test-itself/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Architecture: From Adhocracy To Bureaucracy</title>
		<link>http://software.intel.com/en-us/blogs/2010/08/19/parallel-architecture-from-adhocracy-to-bureaucracy/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/08/19/parallel-architecture-from-adhocracy-to-bureaucracy/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 23:11:17 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[Enterprise Architecture]]></category>
		<category><![CDATA[guest blogger]]></category>
		<category><![CDATA[hpc]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Service Oriented Architecture]]></category>
		<category><![CDATA[SOA]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/08/19/parallel-architecture-from-adhocracy-to-bureaucracy/</guid>
		<description><![CDATA[These past few weeks I have been busy to the extreme. Some good events and some are less happy events. The most important of which is having our first baby boy. You can really test your time management skills when you have no time... A wise man once told me that there is always time: [...]]]></description>
			<content:encoded><![CDATA[<p>These past few weeks I have been busy to the extreme. Some events were good and some were less happy. The most important of them is having our first baby boy. You can really test your time management skills when you have no time... A wise man once told me that there is always time: 24 hours per day. Having enough time depends on what you plan to do with it. People can't work 24 hours a day. There is something else other than time, things like concentration, focus, commitment... These depend on how interesting your job is and what else you would like to do. If you really try you might be able to put that aside and still invest enough effort in your work. There is another thing which is more interesting for this discussion: Interruptions.</p>
<p>We called our baby boy Gad. Gad is one of the sons of Jacob in the bible (<a href="http://en.wikipedia.org/w/index.php?title=Gad_(Biblical_figure)&amp;redirect=yes" target="_blank">See wikipedia</a>), he was number 7 in the list of 12. In Hebrew the meaning of the name Gad is luck and when adding the value of the Hebrew letters we get 7 (<a href="http://en.wikipedia.org/wiki/Gematria" target="_blank">Gematria</a>). We are fortunate to have him. He is a good boy at the age of 3 and a half weeks. (Here he is at half that age: <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/08/GadShelly.wmv">GadShelly</a>). We didn't name him Gad because of good luck. Everyone around us is giving special and unique names to their kids, so we figured it would be best to give him the simplest name we could find and let him cast his own character into this name.</p>
<p>Good boy or not, he is always seeking attention. A month ago my cellular phone was interrupting me when I was trying to put all my focus into something. If right now while I am writing these lines I had a phone call, I would probably not even get up to see who it is. I'll get back to them. Everyone who knows me already knows to call me from a familiar number so I can get back to them. Before I sat down to write these lines I made myself a cup of tea. It is cold by now because I found myself going over emails and then writing this post. I have been wanting to get up to do something about this cup of tea, and it is still on my to do list. If I get up for something else hopefully I will remember to take the cup with me. I did however get up immediately when Gad was making a few noises. Love and care had nothing to do with it. There is something else there.</p>
<p>Babies are very demanding but this is not because they interrupt more often or more rapidly. There are days when my phone rings too often for me to do any work if I answered it. People come back from vacation, it is the end of the year and everyone needs to clean things up, etc. My cell phone plays music I like. Calls from family members can have a different tune. This is just like the CPU having Interrupts with several different priorities. This priority is according to the source of the event. If I want to know the urgency of the event I would have to pick up the phone and ask, or engage in a higher level protocol, just like a PCI Interrupt can be a mouse click or an out-of-battery event. If you call me two consecutive times I figure this has to be important so I will probably pick it up. If it wasn't then the next time you call me two consecutive times I will ignore it...</p>
<p>Babies have a very low tolerance level. A baby can sleep for an hour and then wake up, look around and say "aa", "ooa", "we-a". This means something like "I am awake", "Food", "Food". If you do not respond then immediately they would start crying which means "Help me!",  "Help me!". If you do not respond to this then ten seconds after this the crying escalates to  "Emergency!",  "Emergency!". If you responded in time then the crying stops, as long as things are going in the general direction. When the baby figures out "Hey, this guy is just picking me up, where is the food?" we go back to escalation.</p>
<p>Right after child birth, we were at the hospital for two more days, and Gad was with us the whole time. They let you keep the baby with you if you want. I was with my wife so we took shifts taking care of him. For anyone who would prefer to get some sleep, the hospital has a nursery staffed by nurses where you can leave the baby and come back for feeding whenever you want. We were in that room for our baby's physical tests. A long and wide room full of tiny little babies. Statistically you would have 2 to 6 babies crying at any given time. Taking care of them are 3 to 5 nurses. Some of them do physical tests on other babies so they are 'out of resources' to handle every baby's "Emergency!" call. It seems that most times babies would call "Emergency!" "Emergency!" and a minute later would forget about it and go back to sleep, or maybe they just get tired and conserve their energy for something else.</p>
<p>Some mothers ask the nurses to call them when the baby is crying, so if a baby is crying for long enough and there is a free nurse the mother would get a call. This makes the nurse team into a dispatcher. Actually they are a collection of dispatchers with different tasks and responsibilities. If the mother didn't come to breastfeed for a long time, the nurses would feed the baby using a bottle. In this system every one of the clients defines its event as high priority and the server decides who is really high priority by other parameters such as persistence, and by understanding the client's internals.</p>
<p>Imagine a web server which is now out of resources, flooded with requests. Some requests have to be ignored. Simple requests such as "give me your home page" can be answered because they are cheap to handle. A request for "Get all my friends list" requires some database action and the database is too busy. The request is ignored and the client will have a "refresh my list" button enabled so the user can ask again later. When a client is saving a document, this request will not give up so easily and the connection will remain open for up to 120 seconds. The web server will see a persistent connection and will figure that this has to be important if the client is willing to waste time for the action to complete. This describes a system which by design states that if an event is really important then the client would invest resources and efforts on making sure it is served. Do you know of such a system?</p>
<p>This sounds like a good plan for a system. Clients are eventually operated by people. If you have an HPC Cluster and the user can set the priority of his own tasks... well... In modern Hebrew there is a saying which loosely translates to "The cook does not testify for his bakery". Maybe some student has a hot date and wants his tasks to finish first and that would slow down other tasks which are more important in the bigger picture. Priorities are dependent on the source of the request, but persistence and the amount of effort you are willing to invest in a request are also not really good factors. It is possible that a young kid playing an online computer game is more persistent than the CEO of a big company trying to access his online bank account.</p>
<p>There are many systems that deal with a variety of requests from different people. Most of these systems are run by people and they make the final decision, but most times there are clear-cut rules as to what should and should not happen and how to handle a given request. Sometimes the system would initiate a follow-up interaction with a client in an attempt to understand the urgency and necessity of the request. Such a system is called a "Bureaucracy".</p>
<p>A Bureaucracy is both good and bad. A national organization without bureaucratic rules will end up being corrupted. On the other hand a bureaucratic organization can sometimes be detached because the person making the request does not fit the profile of a high priority client, which is a result of an imperfect rule base. Sometimes it is good that the system is indifferent to clients because an online game is not as important as police communications but as an individual you cannot see the big picture. If my HPC cluster could ask the student for the reason for the urgency of his task and there would be something like "Personal" -&gt; "Romance" -&gt; "First Date", then my HPC cluster would know that it is not as important as "Personal" -&gt; "Romance" -&gt; "Wife's Birthday". Maybe this is too detailed but you get the point.</p>
<p>Today our software operates huge systems that justify <a href="http://en.wikipedia.org/wiki/Bureaucracy" target="_blank">Bureaucracy</a> but their design is more similar to an <a href="http://en.wikipedia.org/wiki/Adhocracy" target="_blank">Adhocracy</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/08/19/parallel-architecture-from-adhocracy-to-bureaucracy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/08/GadShelly.wmv" length="1051805" type="video/x-ms-wmv" />
		</item>
		<item>
		<title>Automatic Parallelization: Design Pattern</title>
		<link>http://software.intel.com/en-us/blogs/2010/07/01/automatic-parallelization-design-pattern/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/07/01/automatic-parallelization-design-pattern/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 23:23:58 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[design pattern]]></category>
		<category><![CDATA[guest blogger]]></category>
		<category><![CDATA[Parallel Computing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/07/01/automatic-parallelization-design-pattern/</guid>
		<description><![CDATA[It is a well known fact that using more CPU cores will result in more processing power. The problem is that most algorithms have some degree of dependency. For example: X[n] = (X[n-1] * 12) + 11 Y[n] = (Sqrt( X[n] * X[n-1] ) + 15) * 33 See the part in red: X[n-1]. This means [...]]]></description>
			<content:encoded><![CDATA[<p>It is a well known fact that using more CPU cores will result in more processing power. The problem is that most algorithms have some degree of dependency. For example:</p>
<p>X[n] = (<span style="color: #ff0000;">X[n-1]</span> * 12) + 11<br />
Y[n] = (Sqrt( X[n] * X[n-1] ) + 15) * 33</p>
<p>See the part in red: <span style="color: #ff0000;">X[n-1]</span>. This means that X[n] is dependent on X[n-1], in other words you cannot evaluate a given X unless the previous X was already evaluated.</p>
<p>Normally what we do is start with the first iteration and keep going to the last. This way we can make sure that every given X[n] has the previous X[n-1] already prepared and given. Like this:</p>
<pre name="code" class="c-sharp">// X[0] and Y[0] must be initialized before the loop, because X[n] reads X[n-1].
for (int n = 1; n &lt; 10000; n++)
{
   X[n] = (X[n-1] * 12) + 11;
   Y[n] = (Math.Sqrt( X[n] * X[n-1] ) + 15) * 33;
}</pre>
<p><em>The code is given in C# for readability and can easily be implemented using C++ with a wrapper layer.</em></p>
<p>The code above works but it is difficult to make it run in parallel. The following design pattern can help us solve this.</p>
<p>For simplicity instead of using complex data structures for the algorithm we will use a simple structure with only a single double (floating point) as a member:</p>
<pre name="code" class="c-sharp">      public class DoubleValue
      {
         public double Value;
      }</pre>
<p>The code has two separate functions: One to calculate Y[n] and another to calculate X[n]:</p>
<pre name="code" class="c-sharp">      public static DoubleValue CalcX(int n)
      {
         DoubleValue retval = new DoubleValue();
         if (0 == n) retval.Value = 51;
         else retval.Value = Math.Pow(Math.Sqrt((X[n - 1].Value * X[n - 1].Value) + 2), 1.00001);
         return (retval);
      }

      public static DoubleValue CalcY(int n)
      {
         DoubleValue retval = new DoubleValue();
         if (0 == n) retval.Value = 0;
         else retval.Value = Math.Pow(Math.Sqrt((X[n].Value * X[n - 1].Value) + 12), 1.001);
         return (retval);
      }</pre>
<p><em>Math.Pow(A, B) raises A to the power of B; Math.Sqrt calculates the square root.</em></p>
<p>We create an array of results so every X[n] and Y[n] is evaluated only once.</p>
<p>The design pattern works by consuming data on demand instead of preparing results in advance.  A special kind of array is required for this purpose. The idea is that when calculating any X[n] which requires X[n-1] we start with X[n] and suspend X[n] to resolve X[n-1] if it was not already prepared. This makes the algorithm evaluate data on demand.</p>
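<p>To make the idea concrete before getting to the real thing, here is a stripped-down sketch of an evaluate-on-demand array. It is not the implementation used in this post: the <em>OnDemandArray</em> name is made up, it stores plain doubles, and it leans on <em>Lazy&lt;T&gt;</em>, which assumes .Net 4. By default <em>Lazy&lt;T&gt;</em> runs the calculation once and makes any other thread asking for the same item wait until the value is ready:</p>
<pre name="code" class="c-sharp">using System;

public class OnDemandArray
{
   private readonly Lazy&lt;double&gt;[] items;

   public OnDemandArray(int count, Func&lt;int, double&gt; calc)
   {
      items = new Lazy&lt;double&gt;[count];
      for (int i = 0; i &lt; count; i++)
      {
         int n = i;   // private copy so each item remembers its own index
         items[i] = new Lazy&lt;double&gt;(() =&gt; calc(n));
      }
   }

   // Reading an item evaluates it on first use and returns the stored value afterwards.
   public double this[int n]
   {
      get { return items[n].Value; }
   }

   public bool IsEvaluated(int n)
   {
      return items[n].IsValueCreated;
   }
}

public static class OnDemandDemo
{
   public static OnDemandArray BuildX(int count)
   {
      OnDemandArray X = null;
      // Same recurrence as CalcX: asking for X[n] pulls in X[n-1] when it is missing.
      X = new OnDemandArray(count, n =&gt; n == 0
         ? 51
         : Math.Pow(Math.Sqrt((X[n - 1] * X[n - 1]) + 2), 1.00001));
      return X;
   }
}</pre>
<p>After <em>OnDemandDemo.BuildX(10)</em>, reading item 5 evaluates items 0 to 5 and leaves items 6 to 9 untouched. Note that asking for the far end of a long chain first recurses once per item, so this naive version is only suitable for short chains.</p>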
<p>Before I show you a simple implementation of the on-demand evaluation pattern I will start with the 'user' code:</p>
<pre name="code" class="c-sharp">public static DoubleValue CalcX(int n)
{
   DoubleValue retval = new DoubleValue();
   if (0 == n) retval.Value = 51;
   else
   {
      for (int i = 0; i &lt; 100; i++)
      {
         retval.Value = Math.Pow(Math.Sqrt((X[n - 1].Value * X[n - 1].Value) + 2), 1.00001);
      }
   }
   return (retval);
}</pre>
<p>We are looping 100 times as a pseudo workload just so we can measure times, otherwise the code executes too fast to show the difference.</p>
<pre name="code" class="c-sharp">public static DoubleValue CalcY(int n)
{
   DoubleValue retval = new DoubleValue();
   if (0 == n) retval.Value = 0;
   else
   {
      for (int i = 0; i &lt; 5000; i++)
      {
         retval.Value = Math.Pow(Math.Sqrt((X[n].Value * X[n - 1].Value) + 12), 1.001);
      }
   }
   return (retval);
}</pre>
<p>I use 5000 iterations for any Y calculation and 100 iterations for any X calculation to demonstrate an algorithm in which the dependency is a small portion of the calculation.</p>
<p>Here is how we use the algorithm to calculate 10,000 iterations (the value of numberOfItems):</p>
<pre name="code" class="c-sharp">      const int numberOfItems = 10000;

private void button_Serial_Click(object sender, EventArgs e)
{
   double val = 0;
   for (int i = 0; i &lt; numberOfItems; i += 4)
   {
      val = Y[i].Value;
   }
   val = Y[numberOfItems - 1].Value;
}</pre>
<p>The code is calling for "val = Y[numberOfItems - 1].Value;". With numberOfItems set to 10000, Y[9,999] is the last value in the array and requires all X values to be evaluated first. This is after we evaluated every 4th Y value by using i += 4. This is to show you that values which are not really required are completely skipped: all X values must be evaluated but only about a quarter of the Y values are. The code above is the serial version and here is the parallel version of the code:</p>
<pre name="code" class="c-sharp">private void button_Parallel_Click(object sender, EventArgs e)
{
   for (int i = 0; i &lt; numberOfItems; i += 4)
   {
      Y.Invoke(i);
   }
   Y.Complete();
   double val = Y[numberOfItems - 1].Value;
}</pre>
<p>The code above calls Invoke(i), which queues this iteration as pending on the thread pool. After all required iterations are pending we call Complete(), which waits until all values are evaluated. This time the different iterations of Y run in parallel. According to the design pattern, when Y[n] requires X[n] the code calculating Y[n] is suspended until X[n] is evaluated. Likewise, X[n] is suspended until X[n-1] has been evaluated. This ensures that only required calculations are actually performed and that dependencies are respected when needed, while the code still runs free when possible.</p>
<p>Finally we get to the demo code.</p>
<p>The user code below measures the running times. Results are machine dependent. On my laptop I got 2500ms for the serial code and 1437ms for the parallel code. We don't get a twofold performance gain, partly because of iteration overhead in .Net and, most importantly, because the dependencies prevent the code from running fully in parallel, which is what we expected. This specific sample is easy to read and works on Visual Studio 2008. A greater performance increase is expected when using lambda expressions with Visual Studio 2010.</p>
<p>Here is the user code:</p>
<pre name="code" class="c-sharp">private void button_Go_Click(object sender, EventArgs e)
{
   X = new ResultList&lt;DoubleValue&gt;(numberOfItems, CalcX);
   Y = new ResultList&lt;DoubleValue&gt;(numberOfItems, CalcY);
   DateTime start = DateTime.Now;
   double val = 0;
   for (int i = 0; i &lt; numberOfItems; i += 10)
   {
      val = Y[i].Value;
   }
   val = Y[numberOfItems - 1].Value;
   label1.Text = (DateTime.Now - start).TotalMilliseconds.ToString();
   for (int i = 0; i &lt; numberOfItems; i += 1000)
   {
      listBox1.Items.Add("Y[" + i.ToString() + "] = " + Y[i].Value.ToString());
   }

   X = new ResultList&lt;DoubleValue&gt;(numberOfItems, CalcX);
   Y = new ResultList&lt;DoubleValue&gt;(numberOfItems, CalcY);
   start = DateTime.Now;
   for (int i = 0; i &lt; numberOfItems; i += 10)
   {
      Y.Invoke(i);
   }
   Y.Complete();
   val = Y[numberOfItems - 1].Value;
   label1.Text = label1.Text +"\n"+ (DateTime.Now - start).TotalMilliseconds.ToString();
   for (int i = 0; i &lt; numberOfItems; i += 1000)
   {
      listBox1.Items.Add("Y[" + i.ToString() + "] = " + Y[i].Value.ToString());
   }
}</pre>
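<p>A side note on the measurement itself: on many systems DateTime.Now only advances in steps of several milliseconds, so System.Diagnostics.Stopwatch is usually a better fit for this kind of timing. The following is a minimal sketch and not part of the original demo; the Timing class and the MeasureMilliseconds helper are names made up for illustration:</p>
<pre name="code" class="c-sharp">using System;
using System.Diagnostics;

static class Timing
{
   // Hypothetical helper: runs the given work once and returns the elapsed milliseconds.
   public static double MeasureMilliseconds(Action work)
   {
      Stopwatch watch = Stopwatch.StartNew();
      work();
      watch.Stop();
      return watch.Elapsed.TotalMilliseconds;
   }
}</pre>
<p>With such a helper the serial part of the demo could be timed as Timing.MeasureMilliseconds(delegate { val = Y[numberOfItems - 1].Value; }), assuming val is declared before the call.</p>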
<p>The array used in the code above is of type ResultList, a special list type designed to accommodate the design pattern. The code was written for simplicity:</p>
<pre name="code" class="c-sharp">class ResultList&lt;T&gt; where T: new()
{
   public ResultList(int count, Calculate calculator)
   {
      Calculator = calculator;
      Values = new T[count];
      LockObjects = new object[count];
      for (int i = 0; i &lt; LockObjects.Length; i++) { LockObjects[i] = new object(); }
   }
   public delegate T Calculate(int iteration);
   public Calculate Calculator = null;
   protected T[] Values = null;
   protected object[] LockObjects = null;
   protected List&lt;IAsyncResult&gt; PendingOps = new List&lt;IAsyncResult&gt;();

   public T this[int n]
   {
      get { return (Execute(n)); }
   }
   protected T Execute(int n)
   {
      // Serialize evaluation of item n so that it is calculated at most once.
      lock (LockObjects[n])
      {
         if (null == Values[n]) Values[n] = Calculator(n);
         return (Values[n]);
      }
   }
   public T Invoke(int n)
   {
      // Take the item lock; for a pending item it is held until Complete() stores the result.
      Monitor.Enter(LockObjects[n]);
      T retval = Values[n];
      if (null == retval)
      {
         PendingOps.Add(Calculator.BeginInvoke(n, null, n));
      }
      else
      {
         // Already evaluated: nothing is pending, so release the lock right away.
         Monitor.Exit(LockObjects[n]);
      }
      return (retval);
   }
   public void Complete()
   {
      for (int i = 0; i &lt; PendingOps.Count; i++)
      {
         int n = ((int)PendingOps[i].AsyncState);
         Values[n] = Calculator.EndInvoke(PendingOps[i]);
         Monitor.Exit(LockObjects[n]);
      }
      PendingOps.Clear();
   }
}</pre>
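<p>For comparison, the "evaluate once, only when someone asks for the value" part of the pattern can also be sketched with Lazy&lt;T&gt;, assuming .NET 4 is available. This is not the ResultList implementation above, only a simplified stand-in: the LazyDemo class is hypothetical, the values are plain doubles, and the formula is a reduced version of CalcX without the Math.Pow step and the artificial loop:</p>
<pre name="code" class="c-sharp">using System;

static class LazyDemo
{
   // Hypothetical stand-in for ResultList: one Lazy&lt;double&gt; per item.
   static Lazy&lt;double&gt;[] X;

   public static double Run(int count)
   {
      X = new Lazy&lt;double&gt;[count];
      for (int i = 0; i &lt; count; i++)
      {
         int n = i; // copy of the loop variable, captured by the lambda below
         // Nothing is calculated here; the lambda runs on the first read of X[n].Value.
         X[n] = new Lazy&lt;double&gt;(() =&gt; n == 0 ? 51 : Math.Sqrt(X[n - 1].Value * X[n - 1].Value + 2));
      }
      // Walk forward so that each item only needs its already evaluated predecessor.
      double last = 0;
      for (int i = 0; i &lt; count; i++) last = X[i].Value;
      return last;
   }
}</pre>
<p>Constructed this way, Lazy&lt;T&gt; lets only one thread run the calculation for a given item while other readers wait for the result, which is the role the per-item lock objects play in ResultList.</p>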
<p>Finally here is the part which is algorithm specific:</p>
<pre name="code" class="c-sharp">const int numberOfItems = 10000;

public class DoubleValue
{
   public double Value;
}

static ResultList&lt;DoubleValue&gt; X;
static ResultList&lt;DoubleValue&gt; Y;

public static DoubleValue CalcX(int n)
{
   DoubleValue retval = new DoubleValue();
   if (0 == n) retval.Value = 51;
   else
   {
      for (int i = 0; i &lt; 100; i++)
      {
         retval.Value = Math.Pow(Math.Sqrt((X[n - 1].Value * X[n - 1].Value) + 2), 1.00001);
      }
   }
   return (retval);
}

public static DoubleValue CalcY(int n)
{
   DoubleValue retval = new DoubleValue();
   if (0 == n) retval.Value = 0;
   else
   {
      for (int i = 0; i &lt; 5000; i++)
      {
         retval.Value = Math.Pow(Math.Sqrt((X[n].Value * X[n - 1].Value) + 12), 1.001);
      }
   }
   return (retval);
}

private void Form1_Load(object sender, EventArgs e)
{
   X = new ResultList&lt;DoubleValue&gt;(numberOfItems, CalcX);
   Y = new ResultList&lt;DoubleValue&gt;(numberOfItems, CalcY);
}</pre>
<p>I would like to hear your opinion about this. Do you find this method useful for your application? Where in the application? Pros and cons are welcome as well.</p>
<p>Generally speaking I see this type of design pattern as the mid-range future of parallel computing and hopefully there are more to come. I refer to the concept here as a design pattern, although it is actually a class of patterns which should have specific patterns for many particular cases, as is suitable for a real design pattern.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/07/01/automatic-parallelization-design-pattern/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My Conclusions on Parallel Computing</title>
		<link>http://software.intel.com/en-us/blogs/2010/04/09/my-conclusions-on-parallel-computing/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/04/09/my-conclusions-on-parallel-computing/#comments</comments>
		<pubDate>Fri, 09 Apr 2010 19:10:29 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Guest Blog]]></category>
		<category><![CDATA[http://AsyncOp.com]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[user experience]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/04/09/my-conclusions-on-parallel-computing/</guid>
		<description><![CDATA[We have been dealing with parallel computing for some while now. Some of the ideas we had at the start proved to be wrong while others are only becoming relevant in the near future. No doubt about it, parallel computing was pushed and forced into the mainstream of computing just as Object Oriented was in [...]]]></description>
			<content:encoded><![CDATA[<p>We have been dealing with parallel computing for quite a while now. Some of the ideas we had at the start proved to be wrong, while others will only become relevant in the near future. No doubt about it, parallel computing was pushed and forced into the mainstream of computing just as Object Oriented was in the previous millennium.</p>
<p><strong>Some History: Hardware</strong></p>
<p>The first to deal with parallel computing were hardware developers, because the hardware supports multiple devices working at the same time, with different operation rates and response times. Hardware design is also <em>Event Driven</em> because devices work independently and issue an Interrupt event when required. The computer hardware we know today is fully parallel; however, it is centralized, with a single CPU (Central Processing Unit) and multiple peripheral devices.</p>
<p><strong>Some History: Kernel</strong></p>
<p>The next to support parallel computing was the software infrastructure, which in modern operating systems is the Kernel. The Kernel must support multiple events coming in the form of Hardware Interrupts and propagated upwards as Software Events. Kernels are commonly distributed in design, as several Drivers can communicate with each other. The centralized object in the system allows communication between the drivers and supports synchronization, but is not supposed to contribute to the application's business logic in any way.</p>
<p><strong>Some History: Network</strong></p>
<p>UNIX is based on services. A Service is a way to call a function over a network. Network technologies required a distributed design in which every element is completely parallel to the next and there is no single 'processor unit' as the system's master. UNIX took this to the next level with technologies such as services, pipes, sockets, mailslots, Fork and more. At a time when programming was tedious work, developing an operating system to support Fork meant extensive effort. Still, UNIX had built-in support for that mechanism, which solves so many problems... Only we forgot how to use it, and I don't remember seeing a new system design that had Fork in it.</p>
<p><strong>Some History: Applications</strong></p>
<p>When I had just started with C programming and had just found out about threads, I tried doing things in parallel just to see how it works. The result was, as you can imagine, far worse. The application ran much slower, there were "Random Bugs" and the code looked terrible. The explanation I got was that there is only one CPU and the different threads compete over it. No Multi-Core CPU means that there is no ROI (return on investment) for using multiple threads and for the large effort required for a parallel design. The only reason to use a thread is when you really have to, for example when there is a need to wait for hardware or a network buffer.</p>
<p><strong>Parallel Computing Today</strong></p>
<p>A few years ago CPUs reached a hardware limitation which would have required special cooling. At this point the race to reduce silicon size and increase clock frequency ended. Instead of spending massive amounts of silicon on the CPU for advanced algorithms to improve instruction pre-fetch, smaller and simpler CPUs are used and there is room for more CPUs on the same silicon wafer. We got the Multi-Core CPU, which practically means several CPUs in the same computer.</p>
<p>At first the cores of a Multi-Core CPU were simpler than the single core one. These cores also operated at a much lower frequency, which meant that an application designed for single task operation took a massive performance hit when moving to a new computer, for the first time ever.</p>
<p>Parallel Computing has become mainstream. We started with a long series of lectures about parallel computing. It seemed that people wanted to know about this subject, but there was so much overhead that Parallel Computing simply scared people away. There is a steep ramp to climb before you can be a good parallel programmer, just as there is for object oriented programming. This meant that team leaders and architects were at the same level as beginner programmers, or perhaps had a very small advantage. Add to this the fact that massive amounts of code were already written for a single core CPU, and that a real advantage can be achieved only after at least some re-write. The last but most important reason to reject parallel computing was that it is easier and <span style="text-decoration: underline;">cheaper</span> to buy another machine than to make the best out of the CPU cores. This was actually a boost for Cloud Computing.</p>
<p><strong>Who is doing Parallel Computing</strong></p>
<p>There are several types of parallel computing. The hardware is parallel, so the Kernel is parallel. With this type of parallelism every worker is doing something else, and workers own their resources instead of sharing them. For a long while now DSP (Digital Signal Processing) chips have been Multi-Core CPUs so that the algorithms executed on these chips can run faster. Algorithms and DSP chips are evaluated by MIPS, which is millions of instructions per second. Gaining a performance increase with an algorithm means either using fewer instructions or adding more worker CPU cores. PCs also run algorithms such as face recognition, image detection, image filtering, motion detection, and more. The transition from a single core CPU to a Multi-Core CPU was fast and simple.</p>
<p>An algorithm's increase in performance is relative to the amount of computation per data item: the more computation, the more cores can be used. Image Blending (fade) is an example of an algorithm which cannot benefit from more than a single core. Take an image and blend each pixel with the corresponding pixel of another image. Each pixel should be read from RAM, then a simple addition and shift right are performed, and then the result should be written back to RAM. The CPU can operate at a rate of 3GHz and the RAM at 1GHz. For each pixel in the image we: Read pixel A, Read pixel B, Add, Shift, Write result pixel. Add another core and the CPU cores will mutually block on access to the memory. This is also true for Databases and database algorithms such as sort algorithms, linked lists, etc. For this reason the new Multi-Core CPUs have extensive support for parallel access to memory.</p>
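<p>The blend loop described above can be sketched as follows. This is only an illustration: it assumes two 8-bit pixel buffers of equal length and a 50/50 blend, and the Blend class and Average method are hypothetical names:</p>
<pre name="code" class="c-sharp">static class Blend
{
   // Blends two equally sized 8-bit pixel buffers 50/50.
   public static byte[] Average(byte[] a, byte[] b)
   {
      byte[] result = new byte[a.Length];
      for (int i = 0; i &lt; a.Length; i++)
      {
         // Per pixel: two reads, one add, one shift right, one write.
         result[i] = (byte)((a[i] + b[i]) &gt;&gt; 1);
      }
      return result;
   }
}</pre>
<p>The loop body has three memory accesses for only two arithmetic operations, which is why adding cores mostly adds contention on the memory rather than speed.</p>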
<p><strong>Parallel Computing ROI</strong></p>
<p>Parallel Computing is the new future for computers. Object Oriented is no longer the new buzz word. I keep telling people that before they make an Object Oriented Design for their systems they should make flow charts. Good OOD is based on good system flow charts, whether you write them down or do it in your head as an art.</p>
<p>We all used to think that User Interface is the product and OOD is the way to do it. It now looks like we were wrong:</p>
<p><strong>User Experience is the product and Parallel Design is the way to do it</strong>. User Experience (UX) is not User Interface (UI). User Interface defines what the product would look like, or in other words UI defines what the product <span style="text-decoration: underline;">is</span>. Object Oriented Design defines what the code looks like, or in other words OOD defines what the code <span style="text-decoration: underline;">is</span>. Parallel Computing defines how the code works, or in other words Parallel Computing defines what the code <span style="text-decoration: underline;">does</span>. User Experience defines how the application behaves, or in other words User Experience defines what the application <span style="text-decoration: underline;">does</span>.</p>
<p>I am not using a C++ library because it is using linked-lists. I am using that library because it can sort.</p>
<p>I am not buying a product because it looks like I want it to look, for this I can buy a framed picture instead. I am buying a product because it is doing something I need and it is not doing what I do not need.</p>
<p>Parallel Computing is the basis for User Experience. Even if you have a single core it is better to have a good parallel design. As customers you know this: you don't want to accidentally hit "Print" instead of "Save" and then wait 5 seconds as punishment for the dialog to open so you can close it. (see <a href="http://www.asyncop.com/MTnPDirEnum.aspx?treeviewPath=">minute 43 for demo video</a>)</p>
<p>Today we have so many good resources and tools. Now is the time to learn how to work in parallel and produce good products with good UX.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/04/09/my-conclusions-on-parallel-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hardware support for Locks</title>
		<link>http://software.intel.com/en-us/blogs/2010/02/22/hardware-support-for-locks/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/02/22/hardware-support-for-locks/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 21:00:00 +0000</pubDate>
		<dc:creator>Asaf Shelly</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[guest blogger]]></category>
		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/02/22/hardware-support-for-locks/</guid>
		<description><![CDATA[Locks are a problematic mechanism because they can potentially slow down the system. Sometimes you just need them, usually when working with low-level API and the lower levels of an infrastructure. There are four basic ways for using a lock: * Spin-lock : will retain the CPU core until a condition is met * Atomic [...]]]></description>
			<content:encoded><![CDATA[<p>Locks are a problematic mechanism because they can potentially slow down the system. Sometimes you just need them, usually when working with low-level API and the lower levels of an infrastructure.</p>
<p>There are four basic ways of using a lock:</p>
<p>* Spin-lock : will retain the CPU core until a condition is met</p>
<p>* Atomic Operation : single operation using a predefined CPU Op-Code</p>
<p>* Kernel Lock : such as MUTEX which can be automatically unlocked if the thread terminates</p>
<p>* Fast-Lock : such as Critical Section which is light-weight and is more sensitive to bugs</p>
<p>Locks are implemented internally in one of three ways: preventing tasks from switching, disabling Interrupts, or using an Atomic Operation.</p>
<p>Atomic Operations use an internal lock. This lock is system wide and every time a core uses an Atomic Operation it slows down all the other cores. There is similar behavior with disabling Interrupts. This is because the CPU cannot know what exactly it is that we are locking. The lock object is not really connected to the resource / buffer, so the system uses the global lock.</p>
<p>Now that we have things like <strong><a href="http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/">NUMA</a></strong>, which identifies different RAM modules, I would expect to also have some form of hardware acceleration for locks.</p>
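<p>To make the first two kinds from the list concrete, here is a minimal C# sketch of an Atomic Operation and of a Spin-lock built on top of one. It is an illustration only: SimpleSpinLock and LockDemo are hypothetical names, and production code would normally use the locks the platform already provides:</p>
<pre name="code" class="c-sharp">using System.Threading;

// Minimal spin-lock sketch (hypothetical type, for illustration only).
class SimpleSpinLock
{
   private int taken = 0; // 0 = free, 1 = held

   public void Enter()
   {
      // Atomic compare-and-swap from 0 to 1; the core stays busy until it succeeds.
      while (Interlocked.CompareExchange(ref taken, 1, 0) != 0)
      {
         Thread.SpinWait(20);
      }
   }

   public void Exit()
   {
      Interlocked.Exchange(ref taken, 0); // atomic release
   }
}

static class LockDemo
{
   public static int AtomicCounter = 0;
   public static int GuardedCounter = 0;
   static readonly SimpleSpinLock Guard = new SimpleSpinLock();

   static void Work()
   {
      for (int i = 0; i &lt; 100000; i++)
      {
         Interlocked.Increment(ref AtomicCounter); // a single atomic operation
         Guard.Enter();
         GuardedCounter++;                         // plain increment under the spin-lock
         Guard.Exit();
      }
   }

   public static void Run()
   {
      Thread a = new Thread(Work);
      Thread b = new Thread(Work);
      a.Start(); b.Start();
      a.Join(); b.Join();
   }
}</pre>
<p>Both counters end up with the same value after Run(), but every pass through the loop pays for several atomic operations, which is the cost the paragraph above is concerned with.</p>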
<p>First of all there is no reason to use a global lock if different cores access different physical RAM modules.</p>
<p>There is also no reason to use a CPU wide lock if two cores are running completely different applications and will never use the same lock objects. The CPU hardware can have some acceleration in which lock objects such as Critical Section and un-named MUTEX will only use the CPU wide lock if the same process is running on two different cores. Otherwise the lock should be internal to the Core.</p>
<p>If I could go even further, I would have the lock object related to the buffer in a hardware table, so that only the thread that holds the lock has read / write access permission and the other threads or processes have no page access permissions.</p>
<p>Whichever the solution may be, there should be some hardware support for lock objects. Memory mapped files are just not enough.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/02/22/hardware-support-for-locks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

