<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Intel® Software Network (FR) &#187; Acceler8</title>
	<atom:link href="http://software.intel.com/fr-fr/blogs/category/acceler8/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/fr-fr/blogs</link>
	<description></description>
	<lastBuildDate>Mon, 14 May 2012 06:49:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Create a Ubuntu 11.04 LiveUSB to use Intel® Parallel Studio XE</title>
		<link>http://software.intel.com/fr-fr/blogs/2012/05/14/create-a-ubuntu-1104-liveusb-to-use-intel-parallel-studio-xe/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2012/05/14/create-a-ubuntu-1104-liveusb-to-use-intel-parallel-studio-xe/#comments</comments>
		<pubDate>Mon, 14 May 2012 06:49:52 +0000</pubDate>
		<dc:creator>Xavier Hallade (Intel)</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[ISN France]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2012/05/14/create-a-ubuntu-1104-liveusb-to-use-intel-parallel-studio-xe/</guid>
		<description><![CDATA[You need a license for Intel® Parallel Studio XE for Linux and at least a 4 GB USB key. Get an ISO image of Ubuntu 11.04. Create a new Ubuntu 11.04 LiveUSB, with persistence mode enabled (you can specify a size of 1 MB for the persistence file; you will overwrite it with a ~3 GB file [...]]]></description>
			<content:encoded><![CDATA[<p>You need a <a href="https://registrationcenter.intel.com/RegCenter/AutoGen.aspx?ProductID=1538&amp;AccountID=&amp;EmailID=&amp;ProgramID=&amp;RequestDt=&amp;rm=EVAL&amp;lang=">license</a> for Intel® Parallel Studio XE for Linux and at least a 4 GB USB key.</p>
<p>Get an <a href="http://releases.ubuntu.com/natty/ubuntu-11.04-desktop-amd64.iso">ISO image</a> of Ubuntu 11.04.</p>
<p>Create a new Ubuntu 11.04 LiveUSB with persistence mode enabled (you can specify a size of 1 MB for the persistence file; you will overwrite it with a ~3 GB file in the next step).</p>
<p>To do that from Windows, you can use <a href="http://www.linuxliveusb.com/">LiLi</a> :<br />
<a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/screenshot-lili.png"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/screenshot-lili-180x300.png" alt="" width="180" height="300" class="aligncenter size-medium wp-image-675" /></a><br />
but any other tool, such as UNetbootin or Universal USB Installer, works as well.</p>
<p>Download <a href="http://intel-software-academic-program.com/download/ubuntu/casper-rw.zip">casper-rw.zip</a> and unzip casper-rw to the root of your freshly built Ubuntu LiveUSB :<br />
<a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/screenshot-casper.png"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/screenshot-casper-300x155.png" alt="" width="300" height="155" class="aligncenter size-medium wp-image-674" /></a><br />
It is a persistence file that contains an installation of Parallel Studio XE 2011 SP1 Update1.</p>
<p>Create a new folder at the root of your key, named "intel-licenses", and put your .lic file inside it.</p>
<p>Now you are ready to boot from this LiveUSB and use the Intel® tools directly to accelerate your code <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2012/05/14/create-a-ubuntu-1104-liveusb-to-use-intel-parallel-studio-xe/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Feedback on the Acceler&#039;8 contest</title>
		<link>http://software.intel.com/fr-fr/blogs/2012/02/01/retour-dexprience-concours-acceler8/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2012/02/01/retour-dexprience-concours-acceler8/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 16:53:41 +0000</pubDate>
		<dc:creator>rimaxime</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2012/02/01/retour-dexprience-concours-acceler8/</guid>
		<description><![CDATA[The latest edition of the Acceler'8 contest ended a little over a month ago. Unlike after the previous contest, we have not published an article. We should do so at some point. It was an interesting part of the previous contest. Everyday constraints quickly take over again. It took me a little [...]]]></description>
			<content:encoded><![CDATA[<p>The latest edition of the Acceler'8 contest ended a little over a month ago. Unlike after the previous contest, we have not published an article.</p>
<p>We should do so at some point. It was an interesting part of the previous contest.</p>
<p>Everyday constraints quickly take over again. It took me a little while to decide to write this post.</p>
<p><em><br />
<div id="attachment_626" class="wp-caption aligncenter" style="width: 310px"><a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/IMAG0105.jpg"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/IMAG0105-300x179.jpg" alt="Image_SSD" width="300" height="179" class="size-medium wp-image-626" /></a><p class="wp-caption-text">A small photo of the SSD we won <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p></div><br />
</em></p>
<p>This edition was an intense event. Unlike in the previous edition, we had to work on a well-known problem.<br />
It required a lot of research. It was far from easy, but oh so fascinating.</p>
<p>In the end we went with a simple implementation of the <a href="http://alexeigor.wikidot.com/kadane">Kadane 2D</a> algorithm to find the coordinates of the maximum subarray within an array containing negative and positive values.</p>
<p>This time, we were no longer discovering parallel programming. We decided to dive into C++ and use the <a href="http://software.intel.com/en-us/articles/intel-tbb/">Intel Threading Building Blocks</a> library.</p>
<p>At first glance we were pleasantly surprised. We found this library to be an excellent compromise between power and flexibility in configuring threads.<br />
This contest was a real opportunity to discover this product. I can only recommend using it.</p>
<p>We made one mistake. My teammate lost almost a whole weekend trying to improve the performance of a section of code that in fact fully met our requirements.</p>
<p>In the end, the bottleneck of our application was in reading the file.</p>
<p>We realized this, unfortunately too late, thanks to the <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">Intel VTune</a> tool. We discovered too late that we could use it on the MTL.<br />
So our one regret is not having had easier access to this tool. Even on the MTL, using it was tricky and fairly tedious.</p>
<p><em><br />
<div id="attachment_632" class="wp-caption aligncenter" style="width: 310px"><a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/Intel_SW-IntelVTuneAmplifierXEOverview4161.jpg"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/Intel_SW-IntelVTuneAmplifierXEOverview4161-300x168.jpg" alt="Intel Vtune" width="300" height="168" class="size-medium wp-image-632" /></a><p class="wp-caption-text">Intel Vtune</p></div><br />
</em></p>
<p>So during this edition, we discovered the ecosystem offered by Intel, which we had known only by name. Last time, discovering parallel programming had taken all our time.<br />
With these tools, we would certainly have worked around the problem we ran into without difficulty. They are much simpler to use than I originally thought. In the end, a few clicks were enough to get a complete view of our application's execution. I can really imagine their value for optimizing more complex systems.</p>
<p>I would like to use this post to suggest an idea for the next edition.<br />
We would have liked to have a virtual machine with all the Intel tools installed for the duration of the contest. It would have made our lives easier and encouraged us to discover them.<br />
I am sure we still have a lot to learn about them.</p>
<p>We really appreciated the opening to international participants. It gave the contest a different dimension and different stakes.<br />
The community of developers that was brought together and the enthusiasm on the English forums were impressive.<br />
We also really appreciated the mutual help and sharing that prevailed (notably the matrix generators). It is thanks to some of these contributions that we managed to use Intel VTune; otherwise we would have missed it entirely.</p>
<p>This time, with file-reading code that was not optimized, we finished 23rd.<br />
Although that is respectable, all that remains is to do better in the next edition.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2012/02/01/retour-dexprience-concours-acceler8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Maximum Subarray Problem - Simple Parallelization and Optimizations</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 13:26:53 +0000</pubDate>
		<dc:creator>neshone</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/</guid>
		<description><![CDATA[University of Novi Sad Faculty of Technical Sciences, Department of Computing and Control authors: Predrag Ilkic, Nenad Jovanovic Date: October 15th 2011 - November 15th 2011 Introduction: This article is an explanation of our method for solving the maximum subarray problem during the Intel Acceler8 contest. The team consisted of one fourth year university student [...]]]></description>
			<content:encoded><![CDATA[<p><span lang="EN"></p>
<p align="center"><span lang="EN"><font size="2">University of Novi Sad</font>
</p>
<p align="center"><font size="2">Faculty of Technical Sciences, Department of Computing and Control</font></p>
<p align="center"><font size="2">authors: Predrag Ilkic, Nenad Jovanovic</font></p>
<p align="center"><font size="2">Date: October 15th 2011 - November 15th 2011</font></p>
<h2 class="MsoNormal"><span><font size="5">Introduction:</font></span><span><span lang="EN"></h2>
<p>This article explains our method for solving the maximum subarray problem during the Intel Acceler8 contest. The team consisted of one fourth-year university student and one fourth-grade high school student.</p>
<p>The problem was the following: in a given matrix, find the subarray (by its corner coordinates) that has the maximum sum of elements. The algorithm used for the solution was the well-known Kadane's algorithm adapted for two-dimensional arrays. We started the project using OpenMP, but at one point we switched to pthreads because we got much better results. The algorithm wasn't too complex to implement, so the transition went quite smoothly. The only downside of the pthreads solution was that we switched to it much too late, so we never had the time to try out some of our ideas that failed with OpenMP. The compiler used was gcc; no other compilers were tested.</p>
<p><span lang="EN"></p>
<h2>1 - matrix parsing</span></span></span></span></span></h2>
<p><span lang="EN"></p>
<p>Matrix parsing and loading was done sequentially. During the OpenMP phase of the project we tried to parallelize the matrix parsing and loading, but without success. The timings were about the same at best, so that approach was dropped. No parsing parallelization was attempted with pthreads due to lack of time.</p>
<p>Matrix parsing was done using the standard fread() function, loading one chunk at a time into memory. The data was parsed first by counting the spaces to get the column count and then by counting the line breaks to get the row count. After acquiring the matrix size, we allocated the appropriate amount of memory. The matrix was then laid out in memory according to the number of rows and columns (whether or not rows&gt;columns; this affects the complexity of the solution, which is O(rows*rows*columns)). This saved us from transposing the matrix later. The loading was done using our own integer-reading routine, which we found much faster than the standard fscanf(). The matrix from the file itself was never actually stored in memory: we used the integers as they were read to generate the prefix sum matrix, and only the prefix sum matrix was processed in this solution.</p>
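<p>As an illustration of keeping only prefix sums, here is a minimal C++ sketch. It is not the authors' code: the class name PrefixSumBuilder, its methods, and the choice of running sums down each column are assumptions made for this example.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: stores column-wise running sums instead of raw values.
// ps_[(r + 1) * cols_ + c] holds the sum of column c over rows 0..r, and the
// first stored row is all zeros.
class PrefixSumBuilder {
public:
    explicit PrefixSumBuilder(std::size_t cols) : cols_(cols), ps_(cols, 0) {}

    // Feed the next row of raw values; only the running sums are kept.
    void addRow(const std::vector<int>& row) {
        const std::size_t prev = ps_.size() - cols_;  // start of the last stored row
        ps_.resize(ps_.size() + cols_);
        for (std::size_t c = 0; c < cols_; ++c)
            ps_[prev + cols_ + c] = ps_[prev + c] + row[c];
    }

    // Sum of column c over rows r1..r2 (inclusive), in constant time.
    long long columnSum(std::size_t c, std::size_t r1, std::size_t r2) const {
        return ps_[(r2 + 1) * cols_ + c] - ps_[r1 * cols_ + c];
    }

private:
    std::size_t cols_;
    std::vector<long long> ps_;
};
```

<p>With such a layout, the sum of any column between two rows is a single subtraction, which is what the inner loop of a 2D Kadane search needs for each row pair.</p>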
<p></span><span lang="EN"></p>
<h2>2 - parallelization and optimizations</h2>
<p><span lang="EN"></p>
<p>The whole parallelization was done in OpenMP and later translated literally to pthreads. Since the Kadane algorithm consists of three nested for loops, the parallelization was done on the outermost loop to decrease overhead and to increase data locality and the cache hit rate. All iterations of the outer loop are completely independent, so parallelizing the algorithm in this way scales very well. We tried different types of scheduling with OpenMP. Static scheduling was the least effective, while dynamic and guided scheduling gave very similar results, so we proceeded with dynamic scheduling: each iteration is picked up by one of the worker threads, and a thread that finishes an iteration takes the next unprocessed one until all iterations are done. After the transition to pthreads, dynamic scheduling was kept and no other scheduling types were tried due to the lack of time mentioned above. Every worker thread collects its own results, from which the final result is picked later. This was done to eliminate any thread synchronization that might slow down performance.</p>
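<p>The scheduling scheme described above can be sketched as follows. This is an illustration only: it uses C++11 std::thread and std::atomic instead of raw pthreads, and the function name dynamicMax and its parameters are hypothetical.</p>

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs work(i) for every i in [0, iterations) on numThreads threads and
// returns the largest value produced. Iterations are handed out one at a
// time from a shared atomic counter (dynamic scheduling). Each thread keeps
// its own best value; the values are merged only after all threads finish.
long long dynamicMax(std::size_t iterations, unsigned numThreads,
                     const std::function<long long(std::size_t)>& work,
                     long long identity) {
    std::atomic<std::size_t> next{0};
    std::vector<long long> localBest(numThreads, identity);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            long long best = identity;
            for (;;) {
                const std::size_t i = next.fetch_add(1);  // claim the next iteration
                if (i >= iterations) break;
                const long long v = work(i);
                if (v > best) best = v;
            }
            localBest[t] = best;  // one write per thread, no lock needed
        });
    }
    for (auto& w : workers) w.join();
    long long best = identity;
    for (long long v : localBest)
        if (v > best) best = v;
    return best;
}
```

<p>The only shared state touched during the search is the iteration counter, which matches the goal of avoiding synchronization on the results.</p>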
<p><span lang="EN"></p>
<h2>3 - results</h2>
<p><span lang="EN"></p>
<p>For starters, here are some matrix parsing and loading timings:</p>
<table border="1" cellspacing="0" cellpadding="3" width="20%" align="left">
<tr>
<td width="50%" align="center">matrix size</td>
<td align="center">timing</td>
</tr>
<tr>
<td width="50%" align="center">1000x1000</td>
<td align="center">39.35 ms</td>
</tr>
<tr>
<td width="50%" align="center">2000x2000</td>
<td align="center">153.2 ms</td>
</tr>
<tr>
<td width="50%" align="center">5000x5000</td>
<td align="center">984.52 ms</td>
</tr>
<tr>
<td width="50%" align="center">8000x8000</td>
<td align="center">2466.63 ms</td>
</tr>
</table>
<p></P></span></p>
<h2></span>&nbsp;</h2>
<h2></span></span>&nbsp;</h2>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><span lang="EN"></p>
<p>And here are the timings of the whole algorithm:</p>
<p></span></p>
<table border="1" cellspacing="0" cellpadding="3" width="30%" align="left">
<tr>
<td align="center">matrix size</td>
<td width="33%" align="center">number of cores</td>
<td width="33%" align="center">timing</td>
</tr>
<tr>
<td align="center">1000x1000</td>
<td width="33%" align="center">1</td>
<td width="33%" align="center">1.523 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">8</td>
<td width="33%" align="center">0.223 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">16</td>
<td width="33%" align="center">0.132 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">24</td>
<td width="33%" align="center">0.104 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">32</td>
<td width="33%" align="center">0.089 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">0.082 s</td>
</tr>
<tr>
<td align="center">2000x2000</td>
<td width="33%" align="center">1</td>
<td width="33%" align="center">12.093 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">8</td>
<td width="33%" align="center">1.683 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">16</td>
<td width="33%" align="center">0.891 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">24</td>
<td width="33%" align="center">0.646 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">32</td>
<td width="33%" align="center">0.525 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">0.465 s</td>
</tr>
<tr>
<td align="center">5000x5000</td>
<td width="33%" align="center">20</td>
<td width="33%" align="center">11.886 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">30</td>
<td width="33%" align="center">7.584 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">6.041 s</td>
</tr>
<tr>
<td align="center">8000x8000</td>
<td width="33%" align="center">20</td>
<td width="33%" align="center">52.050 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">30</td>
<td width="33%" align="center">34.685 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">26.234 s</td>
</tr>
</table>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><font size="1"></font>&nbsp;</p>
<p><span lang="EN"></p>
<p>No fancy graphs or pictures <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><span lang="EN"></p>
<h2>4 - conclusion</h2>
<p><span lang="EN"></p>
<p>As the results show, the speedup on smaller matrices is considerably lower because the sequential part of the code takes up a large share of the execution time. As the matrix gets larger, the speedup becomes almost linear.</p>
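<p>To put numbers on this, the timings in the table can be turned into a speedup and an estimated serial fraction. The sketch below uses the Karp-Flatt metric, which is our choice for this illustration and is not mentioned in the original analysis; the function names are hypothetical.</p>

```cpp
// Speedup: single-core time divided by the time measured on p cores.
double speedup(double t1, double tp) { return t1 / tp; }

// Karp-Flatt metric: the serial fraction implied by a measured speedup s
// on p cores.
double serialFraction(double s, int p) {
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p);
}
```

<p>With the 40-core rows of the table, this gives a speedup of about 18.6 for the 1000x1000 matrix (1.523 s versus 0.082 s) and about 26.0 for the 2000x2000 matrix (12.093 s versus 0.465 s), that is, estimated serial fractions of roughly 3% and 1.4%. The larger matrices have no single-core timing, so the same calculation cannot be made for them.</p>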
<p>The code, makefile and readme file from the contest are attached. The password for the archive is "secret". We tried to comment the code well, so it should be easy to understand. You can download all of it <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/12/solution.zip">here</a>.</p>
<p>Finally, we'd like to say that the contest was a pleasure, a great experience, and a great chance to learn something new and useful as well as to try it out properly (the 40-core MTL was great :)).</p>
<p></span></span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Maximum Subarray Problem using TBB and Pipelines</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/12/16/maximum-subarray-problem-using-tbb-and-pipelines/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/12/16/maximum-subarray-problem-using-tbb-and-pipelines/#comments</comments>
		<pubDate>Fri, 16 Dec 2011 10:01:17 +0000</pubDate>
		<dc:creator>ph0b</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/12/16/maximum-subarray-problem-using-tbb-and-pipelines/</guid>
		<description><![CDATA[Algorithm The classic Kadane 2D algorithm has a complexity of O(r²c), where r is the number of rows and c the number of columns. We use it when there are more columns than rows, but instead of transposing the matrix for the opposite case, we developed a second algorithm that is O(c²r). It's basically a transposition [...]]]></description>
			<content:encoded><![CDATA[<h2 style="margin-top:20px">Algorithm</h2>
<p>The classic Kadane 2D algorithm has a complexity of O(r²c), where r is the number of rows and c the number of columns.</p>
<p>We use it when there are more columns than rows, but instead of transposing the matrix for the opposite case, we developed a second algorithm that is O(c²r). It is basically a transposition of the Kadane 2D algorithm:</p>
<pre>
      for (size_t colStartIndex = 0; colStartIndex != numberOfCols; ++colStartIndex)
         for (size_t rowIndex = 0; rowIndex != numberOfRows; ++rowIndex)
            for (size_t colEndIndex = colStartIndex; colEndIndex != numberOfCols; ++colEndIndex)
               // do one step of Kadane's 1D maximum subarray search (one step for one row);
               // each Kadane 1D search is associated with a (colStartIndex, colEndIndex) pair
</pre>
<p>The position of the loop over rows is carefully chosen. It is not the innermost loop because that would give very bad cache locality, and it cannot be parallelized, so we did not make it the outermost loop either.</p>
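<p>For readers who want to see the loop skeleton above filled in, here is a minimal single-threaded C++ sketch. It is not the code from the archive: the names Rect and maxSubarrayCCR, the row-major std::vector storage, and the explicit per-row prefix sums are assumptions made for this example.</p>

```cpp
#include <cstddef>
#include <vector>

struct Rect {
    long long sum;
    std::size_t top, left, bottom, right;  // inclusive corner coordinates
};

// matrix is row-major, rows x cols, and must not be empty.
Rect maxSubarrayCCR(const std::vector<int>& matrix, std::size_t rows, std::size_t cols) {
    // Per-row prefix sums: pre[r * (cols + 1) + c] is the sum of row r, columns 0..c-1.
    std::vector<long long> pre(rows * (cols + 1), 0);
    for (std::size_t r = 0; r != rows; ++r)
        for (std::size_t c = 0; c != cols; ++c)
            pre[r * (cols + 1) + c + 1] = pre[r * (cols + 1) + c] + matrix[r * cols + c];

    Rect best{matrix[0], 0, 0, 0, 0};
    std::vector<long long> running(cols);     // current Kadane sum per colEnd
    std::vector<std::size_t> startRow(cols);  // first row of that run per colEnd
    for (std::size_t colStart = 0; colStart != cols; ++colStart) {
        for (std::size_t c = colStart; c != cols; ++c) running[c] = 0;
        for (std::size_t row = 0; row != rows; ++row) {
            const long long* p = &pre[row * (cols + 1)];
            for (std::size_t colEnd = colStart; colEnd != cols; ++colEnd) {
                const long long rowSum = p[colEnd + 1] - p[colStart];
                // One step of Kadane's 1D search for the pair (colStart, colEnd).
                if (running[colEnd] > 0) {
                    running[colEnd] += rowSum;
                } else {
                    running[colEnd] = rowSum;
                    startRow[colEnd] = row;
                }
                if (running[colEnd] > best.sum)
                    best = Rect{running[colEnd], startRow[colEnd], colStart, row, colEnd};
            }
        }
    }
    return best;
}
```

<p>The innermost loop walks along one row of prefix sums, which keeps memory accesses contiguous, while each colStart iteration is independent of the others and is therefore the natural place to parallelize.</p>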
<h2 style="margin-top:25px">Pipelining</h2>
<p>We assumed that reading the file was a really slow and sequential process, and our work is based on that assumption. The other teams have proven that it wasn't true, on the MTL at least.</p>
<p>Our goal was to start searching for the maximum subarray before the file was entirely read. We modified our implementations of both algorithms to work on slices of the matrix, delivered in order by the pipeline.</p>
<p>The pipeline reads slices of the input file serially; each slice is then parsed in parallel and sent to the search.</p>
<p>We chose to work on file slices smaller than the L3 cache and used circular buffers to avoid memory reallocation.</p>
<p>The O(c²r) algorithm doesn't need to remember the previous slice of the matrix: only two values, indexed by colStartIndex and colEndIndex, are needed to continue the Kadane 1D search. So the memory footprint is independent of the number of rows.</p>
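<p>The idea of carrying only two values per (colStartIndex, colEndIndex) pair can be sketched as below. This is a simplified illustration, not the pipeline stage from the archive: it consumes one row at a time instead of a slice, and the names PairState and StreamingSearch are hypothetical. We assume the two values are the running Kadane sum and the row where the current run started.</p>

```cpp
#include <cstddef>
#include <vector>

// The two values kept for each (colStart, colEnd) pair.
struct PairState {
    long long running = 0;     // sum of the current Kadane 1D run
    std::size_t startRow = 0;  // row where that run started
};

class StreamingSearch {
public:
    explicit StreamingSearch(std::size_t cols)
        : cols_(cols), state_(cols * cols), pre_(cols + 1, 0) {}

    // Consumes the next row of the matrix; earlier rows are not stored.
    void addRow(const std::vector<int>& row) {
        for (std::size_t c = 0; c != cols_; ++c)
            pre_[c + 1] = pre_[c] + row[c];
        for (std::size_t colStart = 0; colStart != cols_; ++colStart) {
            for (std::size_t colEnd = colStart; colEnd != cols_; ++colEnd) {
                PairState& s = state_[colStart * cols_ + colEnd];
                const long long rowSum = pre_[colEnd + 1] - pre_[colStart];
                if (s.running > 0) {
                    s.running += rowSum;
                } else {
                    s.running = rowSum;
                    s.startRow = rowsSeen_;
                }
                if (!hasBest_ || s.running > bestSum_) {
                    hasBest_ = true;
                    bestSum_ = s.running;
                }
            }
        }
        ++rowsSeen_;
    }

    long long bestSum() const { return bestSum_; }
    std::size_t stateSize() const { return state_.size(); }  // depends on cols only

private:
    std::size_t cols_;
    std::vector<PairState> state_;
    std::vector<long long> pre_;  // prefix sums of the current row
    std::size_t rowsSeen_ = 0;
    bool hasBest_ = false;
    long long bestSum_ = 0;
};
```

<p>However many rows are fed in, the state stays at one entry per column pair, which is why the search can start before the whole file has been read.</p>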
<p>Here is the pipeline definition for the O(c²r) implementation :</p>
<pre>
    tbb::parallel_pipeline(
            ntoken,
            // stage 1: read a chunk of the file (n rows)
            tbb::make_filter(
                tbb::filter::serial_in_order,
                InputFileReader(file, textSlices, ntoken)
            )
            &amp;
            // stage 2: parse this chunk to fill the corresponding matrix,
            // with a prefix sum done on each line
            tbb::make_filter(
                tbb::filter::parallel,
                TextLinesToArray(matrixes, /*prefixedSum=*/true)
            )
            &amp;
            // stage 3: add the matrix to the search
            tbb::make_filter(
                tbb::filter::serial_in_order,
                MaximumSubarraySearchCCRStep(&amp;kadaneSavedResults, &amp;numberOfRows, &amp;result)
            )
        );
</pre>
<h2 style="margin-top:25px">Parallelization</h2>
<p>We already parallelized the parsing of the matrix using the pipeline. For the search itself, we used tbb::parallel_reduce() on the outermost loop (on colStartIndex for both algorithms). Given the triangular nature of the problem (a later start index leaves fewer end indices to scan), we defined our own TBB range type.<br />
The size of this range is computed from its length multiplied by its position. The grain size of the parallel_reduce is divided by the number of rows for the O(c²r) algorithm and by the number of columns for the other one. The range is split at its first third instead of at its middle.<br />
This is a bit tricky, but a range should stay as simple as possible because its split function is called a very large number of times. None of our attempts at smarter splitting with a bit more calculation paid off.<br />
We didn't have the time to try using parallel tasks directly instead of ranges and parallel_reduce. That might perform much better, allowing better task splitting with less overhead.</p>
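<p>A range that splits at its first third might look like the sketch below. It only imitates the shape of a TBB range (splitting constructor, empty(), is_divisible()) and compiles without TBB: SplitTag stands in for tbb::split, and TriangleRange and triangleWork are hypothetical names. The size and grain size adjustments described above are not modelled.</p>

```cpp
#include <cstddef>

struct SplitTag {};  // stand-in for tbb::split, so the sketch needs no TBB headers

class TriangleRange {
public:
    TriangleRange(std::size_t begin, std::size_t end, std::size_t grain)
        : begin_(begin), end_(end), grain_(grain) {}

    // Splitting constructor: r keeps the first third, the new range takes the rest.
    TriangleRange(TriangleRange& r, SplitTag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 3), end_(r.end_), grain_(r.grain_) {
        r.end_ = begin_;
    }

    bool empty() const { return begin_ == end_; }
    // Require at least 3 iterations so that neither part ends up empty.
    bool is_divisible() const { return end_ - begin_ >= 3 && end_ - begin_ > grain_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }

private:
    std::size_t begin_, end_, grain_;
};

// Work of outer iterations [b, e) when iteration i scans n - i end indices.
std::size_t triangleWork(std::size_t b, std::size_t e, std::size_t n) {
    std::size_t w = 0;
    for (std::size_t i = b; i < e; ++i) w += n - i;
    return w;
}
```

<p>For 90 outer iterations, splitting at the first third gives 2265 and 1830 units of work to the two parts, whereas a split in the middle would give 3060 and 1035. The split itself costs one division and one addition.</p>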
<p>Our program scaled very well, but there is a significant overhead that doesn't depend on the number of cores. These are some results we had on 40 cores:</p>
<ul>
<li>1000x1000 : 1.64s user 0.06s system 2022% cpu 0.084 total</li>
<li>2000x2000 : 11.69s user 0.09s system 3138% cpu 0.375 total</li>
<li>4000x4000 : 108.18s user 0.25s system 3684% cpu 2.943 total</li>
<li>8000x8000 : 758.06s user 1.21s system 3865% cpu 19.642 total</li>
<li>10000x10000 : 1431.10s user 1.79s system 3895% cpu 36.781 total</li>
</ul>
<p>The source code is here : <a href='http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/maxSubarraySearch-TBB-Pipelines.zip'>maxSubarraySearch-TBB-Pipelines.zip</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/12/16/maximum-subarray-problem-using-tbb-and-pipelines/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Maximum Subarray Problem using PThreads</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/12/02/maximum-subarray-problem-using-pthreads/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/12/02/maximum-subarray-problem-using-pthreads/#comments</comments>
		<pubDate>Fri, 02 Dec 2011 10:32:00 +0000</pubDate>
		<dc:creator>spoii</dc:creator>
				<category><![CDATA[Acceler8]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/12/02/maximum-subarray-problem-using-pthreads/</guid>
		<description><![CDATA[Maximum Subarray Problem Parallelization using PThreads Catalin Ionut Fratila, Vlad-Marian Spoiala University Politehnica of Bucharest Faculty of Automatic Control and Computers, Computer Science Department The algorithm we used for solving the maximum subarray problem was Kadane's 2D algorithm. An implementation of the algorithm is presented here: http://alexeigor.wikidot.com/kadane. Our implementation [...]]]></description>
			<content:encoded><![CDATA[<p><TITLE>Maximum Subarray Problem Parallelization using PThreads</TITLE></p>
<p>
<H1 ALIGN="CENTER">Maximum Subarray Problem Parallelization using PThreads</H1><br />
<P ALIGN="CENTER"><STRONG>Catalin Ionut Fratila, Vlad-Marian Spoiala</STRONG><br />
<BR><I>      University Politehnica of Bucharest</I><br />
<BR><FONT SIZE="-1">      Faculty of Automatic Control and Computers, Computer Science Department</FONT><br />
</P><br />
<HR></p>
<p><P><br />
The algorithm we used for solving the maximum subarray problem was Kadane's 2D algorithm. An implementation of the algorithm is presented here: http://alexeigor.wikidot.com/kadane.</p>
<p><P><br />
Our implementation was done in C. Parallelization was done using PThreads. We started to develop solutions in both TBB and PThreads, but decided to proceed with the PThreads because we obtained better results.</p>
<p><P><br />
We used Intel's icc compiler for generating the executable. Using the icc compiler gave us a 3-5% performance improvement over the same code compiled using gcc. Other compilers were not tested.</p>
<p><P><br />
Profiling was done locally using the Linux command line utility perf and Intel VTune. Perf was used to obtain raw profiling data that was of interest (number of cache misses, number of cycles etc.), while VTune was used for obtaining more complex profiling data like hotspots or concurrency information.</p>
<p><P></p>
<p><H1><A NAME="SECTION00010000000000000000"><br />
Parsing the input</A><br />
</H1><br />
Since the input file does not contain the dimensions of the matrix, we need to estimate them. To do this we read the first line and determine the number of columns and the size of the first line. We then determine the total size of the file by doing an fseek to the end. We estimate the number of rows by dividing the total size of the file by the size of the first line. </p>
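<p>The estimation step can be sketched as follows. This is an illustration under assumptions: it uses a C++ std::istream with seekg/tellg in place of the fseek-based C code described above, and the names Dimensions and estimateDimensions are hypothetical.</p>

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

struct Dimensions {
    std::size_t cols;
    std::size_t estimatedRows;
};

// Reads the first line to count the columns, then estimates the row count as
// total size / size of the first line. The stream is rewound before returning.
Dimensions estimateDimensions(std::istream& in) {
    std::string first;
    std::getline(in, first);
    std::istringstream tokens(first);
    std::size_t cols = 0;
    int value;
    while (tokens >> value) ++cols;

    const std::size_t lineBytes = first.size() + 1;  // + 1 for the newline
    in.clear();                                      // in case getline hit end of file
    in.seekg(0, std::ios::end);
    const std::streamoff totalBytes = in.tellg();
    in.seekg(0, std::ios::beg);
    return Dimensions{cols, static_cast<std::size_t>(totalBytes) / lineBytes};
}
```

<p>The estimate is exact only when every line has the same length as the first one. If the first line happens to be longer than the others, the row count is underestimated, which is the case where the storage has to grow afterwards.</p>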
<p><P><br />
Our first attempt at reading the numbers used fscanf. We obtained better results by replacing fscanf with strtok. In our final version we gave up strtok as well and parsed the line by hand, using a minimal and fast, but unsafe, version of the atoi function. Our final version of the read function was 2-3 times faster than using fscanf to get the numbers from the line. VTune's hotspots analysis was used here to determine where most of the time was spent in the read function.</p>
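<p>A hand-written parser of the kind described above could look like this. It is a sketch, not the authors' function: the names parseInt and parseLine are hypothetical, and the input is assumed to be well formed (decimal integers with an optional minus sign, no overflow).</p>

```cpp
#include <cstddef>
#include <vector>

// Minimal, unchecked integer parser: no overflow detection and no error
// reporting, hence "unsafe". Returns a pointer to the first character after
// the number.
inline const char* parseInt(const char* p, int& out) {
    bool negative = false;
    if (*p == '-') {
        negative = true;
        ++p;
    }
    int value = 0;
    while (*p >= '0' && *p <= '9') {
        value = value * 10 + (*p - '0');
        ++p;
    }
    out = negative ? -value : value;
    return p;
}

// Parses one NUL-terminated line of integers; any character that is neither
// a digit nor '-' is treated as a separator.
std::vector<int> parseLine(const char* line) {
    std::vector<int> values;
    const char* p = line;
    while (*p != '\0' && *p != '\n') {
        if (*p != '-' && (*p < '0' || *p > '9')) {
            ++p;
            continue;
        }
        int v;
        p = parseInt(p, v);
        values.push_back(v);
    }
    return values;
}
```

<p>Skipping the locale handling, format-string interpretation and error checks that fscanf performs is where such a routine gains its speed, at the cost of undefined results on malformed input.</p>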
<p><P><br />
If our initial estimate of the total number of lines is smaller than the actual number of lines, we reallocate the memory area reserved for storing the matrix. </p>
<p><P><br />
We did not attempt to parallelize the read function.</p>
<p><P></p>
<p><H1><A NAME="SECTION00020000000000000000"><br />
Parallelization</A><br />
</H1><br />
Splitting the workload between threads was done with regard to the number of rows each thread receives. The total workload is represented by the total number of row pairs obtained from the 2 outer for loops in the Kadane 2D algorithm. We have a total of M * (M + 1) / 2 row pairs, where M is the number of rows. We try to obtain a balanced split of the pairs among the threads. This is done by assigning iterations of the outermost loop to each thread so that the number of pairs that each thread receives is close to M * (M + 1) / 2 / NT, where NT is the number of threads.<br />
For a large number of threads and a small number of rows, some threads might not receive any work.</p>
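<p>One way to compute such a split is sketched below. The greedy rule used here (advance each boundary while that brings the cumulative pair count closer to the target) is an assumption for this illustration, as is the function name splitRowPairs; the original code may balance the pairs differently.</p>

```cpp
#include <cstddef>
#include <vector>

// Returns numThreads + 1 boundaries: thread t handles the outer iterations
// [bounds[t], bounds[t + 1]). Outer iteration i contributes m - i row pairs.
std::vector<std::size_t> splitRowPairs(std::size_t m, std::size_t numThreads) {
    const std::size_t total = m * (m + 1) / 2;
    std::vector<std::size_t> bounds(numThreads + 1, m);
    bounds[0] = 0;
    std::size_t assigned = 0;  // pairs handed out so far
    std::size_t i = 0;
    for (std::size_t t = 1; t < numThreads; ++t) {
        // The first t threads together should hold about t / numThreads of the pairs.
        const std::size_t target = total * t / numThreads;
        // Take iteration i only if that moves the cumulative count closer to the target.
        while (i < m && 2 * assigned + (m - i) <= 2 * target) {
            assigned += m - i;
            ++i;
        }
        bounds[t] = i;
    }
    return bounds;
}
```

<p>For M = 6 and NT = 3 this yields the boundaries 0, 1, 3, 6, that is, 6, 9 and 6 row pairs per thread. For M = 2 and NT = 4 two of the four threads receive an empty range, which is the situation mentioned above.</p>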
<p><P><br />
Each thread computes a partial result for the iterations it was assigned. This partial result is composed of the maximum subarray sum value and the corresponding coordinates in the original matrix. The master thread goes through the partial results and selects the one with the maximum value for the subarray sum.</p>
<p><P><br />
When running on the MTL, each thread is forced to run on a certain core using the pthread_setaffinity_np function. Although this improves speedup significantly when using a large number of cores on the MTL, using this function on other machines caused a decrease in performance. Because of this, its use is restricted to running our code on the MTL.</p>
<p><P><br />
The threads are joined using a semaphore when the workload is completed: the main thread does NT waits on the semaphore, while each thread that finishes its workload (including the main thread) does a post operation on the semaphore. This was used to prevent serialization overhead from calling pthread_join for each thread from the main thread.</p>
<p><P></p>
<p><H1><A NAME="SECTION00030000000000000000"><br />
Other optimizations</A><br />
</H1><br />
Since the problem is of O(M^2*N) complexity we transpose the matrix if the number of rows exceeds the number of columns.</p>
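<p>A minimal sketch of this transposition step, assuming a row-major matrix held in one contiguous buffer. The names <code>Matrix</code>, <code>transpose</code> and <code>make_wide</code> are hypothetical and only serve the illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix stored in one contiguous buffer (hypothetical type).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;  // element (r, c) lives at data[r * cols + c]
};

Matrix transpose(const Matrix& in) {
    assert(in.data.size() == in.rows * in.cols);
    Matrix out{in.cols, in.rows, std::vector<int>(in.data.size())};
    for (std::size_t r = 0; r < in.rows; ++r) {
        for (std::size_t c = 0; c < in.cols; ++c) {
            out.data[c * out.cols + r] = in.data[r * in.cols + c];
        }
    }
    return out;
}

// Transposes the matrix when it has more rows than columns, so that the
// quadratic factor of the O(M^2*N) cost applies to the smaller dimension.
// Returns true when a transposition happened, so the caller knows that the
// row and column coordinates of the result must be swapped back.
bool make_wide(Matrix& m) {
    if (m.rows <= m.cols) return false;
    m = transpose(m);
    return true;
}
```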
<p><P><br />
The matrix is stored as a contiguous chunk of memory. This improves data locality.</p>
<p><P><br />
The three loops are optimized by reducing the number of operations needed for memory accesses. We use 3 pointers to access the numbers in the matrix: 2 are incremented sequentially (we add 1 to the address) while the third is incremented non-sequentially (we add N to the address, where N is the number of columns).</p>
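<p>The sketch below shows one way the three loops of Kadane 2D can walk a contiguous matrix with pointers. It is an illustration under assumptions, not the code from the archive: the roles given to the three pointers (<code>row</code>, <code>cell</code>, <code>acc</code>) and the names <code>Best</code> and <code>kadane2d</code> are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Best rectangle found so far; corners are inclusive (hypothetical type).
struct Best {
    long long sum;
    std::size_t top, left, bottom, right;
};

// Sequential Kadane 2D over a row-major matrix of M rows and N columns.
Best kadane2d(const int* matrix, std::size_t M, std::size_t N) {
    assert(M > 0 && N > 0);
    Best best{std::numeric_limits<long long>::min(), 0, 0, 0, 0};
    std::vector<long long> colsum(N);
    for (std::size_t top = 0; top < M; ++top) {
        std::fill(colsum.begin(), colsum.end(), 0LL);
        const int* row = matrix + top * N;  // third pointer: advances by N
        for (std::size_t bottom = top; bottom < M; ++bottom, row += N) {
            const int* cell = row;           // first pointer: advances by 1
            long long* acc = colsum.data();  // second pointer: advances by 1
            long long running = 0;
            std::size_t start = 0;
            for (std::size_t col = 0; col < N; ++col, ++cell, ++acc) {
                *acc += *cell;  // column sum over rows top..bottom
                if (running > 0) {
                    running += *acc;
                } else {
                    running = *acc;  // restart the 1D Kadane scan here
                    start = col;
                }
                if (running > best.sum) {
                    best = Best{running, top, start, bottom, col};
                }
            }
        }
    }
    return best;
}
```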
<p><P></p>
<p><H1><A NAME="SECTION00040000000000000000"><br />
Results</A><br />
</H1><br />
We present the results for three matrices of sizes 2000 x 2000, 5000 x 5000 and 6000 x 6000 when running on the MTL. Time is expressed in seconds. These are the results of single runs on machines that were not exclusively reserved for the job, so there might be some variation in time compared to running on an exclusively reserved batch node.</p>
<p><a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/test2000.png"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/test2000-300x156.png" alt="" width="300" height="156" class="aligncenter size-medium wp-image-546" /></a><br />
<P></p>
<p><a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/test5000.png"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/test5000-300x150.png" alt="" width="300" height="150" class="aligncenter size-medium wp-image-548" /></a><br />
<P><br />
<a href="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/test6000.png"><img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/test6000-300x150.png" alt="" width="300" height="150" class="aligncenter size-medium wp-image-550" /></a><br />
<P></p>
<p><H1><A NAME="SECTION00050000000000000000"><br />
Code</A><br />
</H1></p>
<p>The code can be downloaded here: <a href='http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/solution1.zip'>solution</a><br />
Password for the archive is "secret"<br />
<P></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/12/02/maximum-subarray-problem-using-pthreads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Maximum Subarray Problem - Algorithmic Optimizations</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/11/28/the-maximum-subarray-problem-algorithmic-optimizations/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/11/28/the-maximum-subarray-problem-algorithmic-optimizations/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 08:55:10 +0000</pubDate>
		<dc:creator>candreolli</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[ISN France]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/11/28/the-maximum-subarray-problem-algorithmic-optimizations/</guid>
		<description><![CDATA[Acceler8 contest Acceler8 contest Andreolli Cédric - Garcia Pascal - Templé Arthur Date: October 15th 2011 - November 15th 2011 Abstract: This report explains the approach we used for resolving the ``Maximum Subarray Problem'' during the Intel Acceler8 contest. We are two students in fourth year and a teacher at INSA of Rennes. The idea [...]]]></description>
			<content:encoded><![CDATA[
<p><P><br />
<H1 ALIGN="CENTER">Acceler8 contest</H1><br />
<P ALIGN="CENTER"><STRONG><SPAN CLASS="textbf">Andreolli</SPAN> Cédric - <SPAN CLASS="textbf">Garcia</SPAN> Pascal - <SPAN CLASS="textbf">Templé</SPAN> Arthur</STRONG><br />
</P><br />
<BR><P ALIGN="CENTER"><B>Date:</B> October 15th 2011 - November 15th 2011</P></p>
<p><HR></p>
<p><H3>Abstract:</H3><br />
<DIV CLASS="ABSTRACT"><br />
This report explains the approach we used to solve the "Maximum Subarray Problem" during the <SPAN CLASS="textit">Intel Acceler8</SPAN> contest. We are two fourth-year students and a teacher at <SPAN CLASS="textit">INSA</SPAN> of Rennes.<br />
The idea of the contest was to build an algorithm able to scale on computers with a large number of cores. Here are the different steps of the development process. We hope you will enjoy reading it as much as we enjoyed the contest.<br />
</DIV><br />
<P></p>
<p><BR></p>
<p><H1><A NAME="SECTION00020000000000000000"><br />
Introduction</A><br />
</H1></p>
<p><H2><A NAME="SECTION00021000000000000000"><br />
The maximum subarray problem</A><br />
</H2><br />
The first step was, of course, to understand the problem. The <SPAN CLASS="textit">maximum subarray problem</SPAN> is a well-known algorithmic problem. It consists of finding the rectangle (a submatrix) with the maximum sum in a matrix of integers.</p>
<p><P><br />
A lot of documentation about this problem can be found on the <SPAN CLASS="textit">Internet</SPAN>. We started by studying and testing some algorithms we found, such as the <SPAN CLASS="textit">Kadane</SPAN> algorithm. </p>
<p><P><br />
We chose to use <SPAN CLASS="textit">C++</SPAN> to solve the problem. <SPAN CLASS="textit">C++</SPAN> has the advantage of being quite low level when you need it, while also giving access to higher-level objects such as vectors, lists, etc. Besides, we are currently learning this language at <SPAN CLASS="textit">INSA</SPAN>, so it was a good opportunity to practice.</p>
<p><P></p>
<p><H2><A NAME="SECTION00022000000000000000"><br />
OpenMP</A><br />
</H2><br />
For the parallelization part, we chose to use <SPAN CLASS="textit">OpenMP</SPAN>. We made this choice for two main reasons. </p>
<p><P><br />
First, the video tutorial was really easy to understand and went over a lot of features that were really helpful for what we planned to do. Furthermore, the <SPAN CLASS="textit">MTL</SPAN> did not require a lot of specific settings to work with <SPAN CLASS="textit">OpenMP</SPAN>, and we thought it was a good idea not to lose time on compilation problems. </p>
<p><P><br />
The second reason is that none of us had ever used it, and it is always interesting to discover new libraries. One of the main benefits of <SPAN CLASS="textit">OpenMP</SPAN> is that it allows you to parallelize your code incrementally. With very few changes, you can improve the speed of a sequential program.</p>
<p><P><br />
Finally, <SPAN CLASS="textit">OpenMP</SPAN> has the advantage of adding very few lines of code. For example, you do not have to use multiple semaphores or mutexes to protect your critical data.</p>
<p><P></p>
<p><H2><A NAME="SECTION00023000000000000000"><br />
Different tasks</A><br />
</H2><br />
Once we had finished exploring the possibilities of <SPAN CLASS="textit">OpenMP</SPAN>, we decided to split the project into independent tasks.<br />
The next sections explain those tasks. You will also find the program documentation in the <code>doc</code> directory.</p>
<p><P></p>
<p><H1><A NAME="SECTION00030000000000000000"><br />
Reading the files</A><br />
</H1></p>
<p><H2><A NAME="SECTION00031000000000000000"><br />
Problems encountered</A><br />
</H2><br />
As we decided to work with <SPAN CLASS="textit">C++</SPAN>, we started by writing a really simple file reader. We used the basic <SPAN CLASS="textit">STL</SPAN> operations at first and proceeded as follows:</p>
<p><BR><br />
<IMG WIDTH="567" HEIGHT="153" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img2.png" ALT="\begin{lstlisting}<br />
std::ifstream file(fileName, std::ios::in);<br />
std::string line;...<br />
...ringstream::in);<br />
while(tmp&gt;&gt;num){<br />
//Parsing code here<br />
}<br />
}<br />
}<br />
\end{lstlisting}"><br />
<BR><br />
But actually, the line:<br />
<BR><br />
<IMG WIDTH="566" HEIGHT="25" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img3.png" ALT="\begin{lstlisting}<br />
while(tmp&gt;&gt;num){<br />
\end{lstlisting}"><br />
<BR><br />
performed really badly. We therefore wrote our own integer parser, which turned out to be much faster after a few tests. We then started thinking about parallelizing this step.</p>
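<p>As an illustration of what a hand-written integer parser can look like, here is a minimal sketch. It is not the parser from our package: the name <code>parse_int</code> and its cursor-based interface are hypothetical, and it assumes well-formed input (no overflow or error checking).</p>

```cpp
#include <cassert>
#include <cstddef>

// Parses one optionally signed decimal integer starting at cursor, skipping
// leading spaces, and leaves cursor on the first character after the digits.
// Assumes well-formed input: there is no overflow or error checking.
inline int parse_int(const char*& cursor) {
    assert(cursor != nullptr);
    while (*cursor == ' ') ++cursor;
    bool negative = false;
    if (*cursor == '-') {
        negative = true;
        ++cursor;
    }
    int value = 0;
    while (*cursor >= '0' && *cursor <= '9') {
        value = value * 10 + (*cursor - '0');
        ++cursor;
    }
    return negative ? -value : value;
}
```

<p>Such a function avoids the locale handling and stream state checks performed by the stream extraction operator, which is a plausible reason for the speed difference we observed.</p>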
<p><P><br />
We first took some time to see what was going on when we read a file and parsed it into a vector. We could see that the cores were absolutely not busy during this operation.<br />
As there were a lot of hard drive accesses, the cores spent most of their time waiting for data to arrive. This was not optimal, but we could not see a way to parallelize the file reading itself because there is only one bus between the hard drive and the main memory.</p>
<p><P></p>
<p><H2><A NAME="SECTION00032000000000000000"><br />
The way we resolved it</A><br />
</H2><br />
Finally, after more tries, we realized that loading the whole file into main memory was quite fast. We call this array of characters <code>buffer</code>.<br />
We then came up with a trick that uses parallelization to speed up the parsing of the files.<br />
<BR><br />
Once the file is in main memory, running through it is really fast. So we decided to make two passes over the whole file:<br />
the first to get the matrix dimensions, the second to parse the file.</p>
<p><P></p>
<p><H3><A NAME="SECTION00032100000000000000"><br />
Getting the matrix dimensions</A><br />
</H3><br />
At this point, we wanted to spend as little time as possible on this step, but as the <code>buffer</code> is an array of characters, it is really difficult to parallelize the whole process.<br />
So we decided to read the first line sequentially (until a '\n' is found) and count the number of columns of the matrix using the white spaces.<br />
As the input file must respect some specifications, we are sure that the number of columns of the matrix is equal to the number of spaces on a line plus one.<br />
Once this step is over, we just need to run through the rest of the file to count the '\n' characters. This last step can easily be parallelized with<br />
<SPAN CLASS="textit">OpenMP</SPAN> and a <code>#pragma omp for</code> directive.</p>
<p><P><br />
During this process, we record the addresses of the new lines in a vector named <code>addressTab</code>.</p>
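<p>A sequential sketch of this first pass is given below. It assumes single spaces between numbers and stores offsets into the buffer instead of raw addresses; the names <code>Dimensions</code>, <code>scan_dimensions</code> and <code>lineStarts</code> are hypothetical (<code>lineStarts</code> plays the role of <code>addressTab</code>), and the OpenMP parallelization of the newline count is left out.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of the first pass over the buffer (hypothetical type).
struct Dimensions {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::size_t> lineStarts;  // offset of the first character of each row
};

// Assumes numbers are separated by single spaces and rows by '\n'.
Dimensions scan_dimensions(const std::string& buffer) {
    Dimensions d{0, 0, {}};
    if (buffer.empty()) return d;

    // Columns: number of spaces on the first line, plus one.
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < buffer.size() && buffer[i] != '\n'; ++i) {
        if (buffer[i] == ' ') ++spaces;
    }
    d.cols = spaces + 1;

    // Rows: one row starts at offset 0, and one after every '\n' that is
    // followed by more data.
    d.lineStarts.push_back(0);
    for (std::size_t k = 0; k < buffer.size(); ++k) {
        if (buffer[k] == '\n' && k + 1 < buffer.size()) {
            d.lineStarts.push_back(k + 1);
        }
    }
    d.rows = d.lineStarts.size();
    return d;
}
```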
<p><P></p>
<p><H3><A NAME="SECTION00032200000000000000"><br />
Parsing the file</A><br />
</H3><br />
Once we have the matrix dimensions, we also have a <code>vector</code> (<code>addressTab</code>) which contains the addresses of all new lines in <code>buffer</code>.<br />
We can now parallelize the file parsing. Depending on the number of cores we have, we split <code>buffer</code> into different parts based on the newline addresses (see Figure 1).</p>
<p><P></p>
<p><DIV ALIGN="CENTER"><A NAME="fig:parsing"></A><A NAME="79"></A><br />
<TABLE><br />
<CAPTION ALIGN="BOTTOM"><STRONG>Figure 1:</STRONG><br />
The file parsing parallelization</CAPTION><br />
<TR><TD><br />
<DIV ALIGN="CENTER"><br />
<IMG WIDTH="436" HEIGHT="276" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/readfile1.png" ALT="Image readfile1"><br />
</DIV></TD></TR><br />
</TABLE><br />
</DIV></p>
<p><A NAME="file"></A>The goal of the process is to fill a two-dimensional <code>vector</code>. We will call this <code>vector</code> <code>matrix</code>. The elements in <code>addressTab</code> correspond to the addresses of the beginning of each new line in <code>buffer</code>, and as a core is always responsible for at least one entire line, it can put the parsed numbers at the correct position in <code>matrix</code>.<br />
<BR><br />
<P><br />
Our algorithm for solving the maximum subarray problem is faster if the matrix of <em>n</em> rows and <em>m</em> columns is such that <em>n</em> &le; <em>m</em>. So we have two different functions for generating an optimal matrix (<code>readLinesOrdered</code> and <code>readLinesReversed</code>).<br />
We did not merge these two functions into one because of optimization concerns: doing so would have added a test for every number we had to put in <code>matrix</code>. On big files, this could have wasted a significant amount of time.</p>
<p><P></p>
<p><H1><A NAME="SECTION00040000000000000000"><br />
The one dimension algorithm</A><br />
</H1><br />
We do not have a lot to say here.<br />
We started to work on this algorithm and wrote some parallelized functions for it, but as it is already a very fast sequential algorithm (<br />
<em>O(n)</em> complexity), the improvements were not significant. In fact, the time needed to solve the problem was negligible compared to the time needed to read and parse the file.</p>
<p><P><br />
The two-dimensional problem was hard enough, so we decided not to spend more time on this particular case.</p>
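<p>For reference, the sequential one-dimensional Kadane algorithm can be sketched as follows. This is the textbook version, not the code from our package; the names <code>Segment</code> and <code>kadane1d</code> are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Best contiguous segment; indices are inclusive (hypothetical type).
struct Segment {
    long long sum;
    std::size_t first;
    std::size_t last;
};

// One pass over the values: keep extending the running sum while it is
// positive, otherwise restart the segment at the current element.
Segment kadane1d(const std::vector<int>& values) {
    assert(!values.empty());
    Segment best{values[0], 0, 0};
    long long running = 0;
    std::size_t start = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (running > 0) {
            running += values[i];
        } else {
            running = values[i];
            start = i;
        }
        if (running > best.sum) {
            best = Segment{running, start, i};
        }
    }
    return best;
}
```

<p>Each element is read once and only a few scalars are updated, which explains why there is little left to gain from parallelizing this case.</p>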
<p><P></p>
<p><H1><A NAME="SECTION00050000000000000000"><br />
The two dimensions algorithm</A><br />
</H1></p>
<p><H2><A NAME="SECTION00051000000000000000"><br />
The Kadane algorithm</A><br />
</H2><br />
As said before, the two-dimensional maximum subarray problem is a well-known problem and documentation about it can be found on the <SPAN CLASS="textit">Internet</SPAN>.<br />
We decided to work on the <SPAN CLASS="textit">Kadane</SPAN> algorithm, which is quite simple to understand.<br />
We started by parallelizing this algorithm. The two-dimensional <SPAN CLASS="textit">Kadane</SPAN> algorithm is a generalization of the one-dimensional case.<br />
It is based on three nested <code>for</code> loops.<br />
<BR><br />
<P><br />
As we chose to use <SPAN CLASS="textit">OpenMP</SPAN>, we had two main approaches. One was to use the <code>#pragma omp for</code> directive, the second one was to use the <code>#pragma omp task</code> one.<br />
We started with the easy solution (the <code>#pragma omp for</code>). After running some tests, we could see that the cores were not busy throughout the whole process. That is why we came up with a solution based on the <code>#pragma omp task</code> directive.<br />
<BR><br />
<P><br />
The idea was to compute the number of tasks we wanted to create (let's call it <code>numberOfTasks</code>) and launch the tasks. Each task is responsible for computing the maximal sum (and the associated coordinates) for a part of the matrix.<br />
The first <code>for</code> loop in the <SPAN CLASS="textit">Kadane</SPAN> algorithm iterates over the rows of the matrix.<br />
The parallelization is done by dividing this loop into <code>numberOfTasks</code> tasks. Each thread assigned to a task runs a <code>for</code> loop, but instead of incrementing by one, it increments by <code>numberOfTasks</code>.<br />
Here is the corresponding part of the code (where <em>n</em> is the number of rows of the matrix):<br />
<BR><br />
<IMG WIDTH="566" HEIGHT="191" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img9.png" ALT="\begin{lstlisting}<br />
...<br />
for (unsigned int rowStart = taskNumber; rowStart &lt; n; ro...<br />
...<br />
...<br />
for (...) {<br />
...<br />
for(...) {<br />
...<br />
}<br />
...<br />
}<br />
...<br />
}<br />
\par<br />
\end{lstlisting}"><br />
<BR></p>
<p><P><br />
It is important to create more tasks than there are cores. Indeed, the tasks do not all have the same computation time, so with only one task per core we would have to wait for the longest tasks at the end of the function. Increasing the number of tasks reduces this waiting time because the tasks are shorter.</p>
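<p>To see how the strided loop spreads the work, the sketch below counts the row pairs each task would process. It does not use OpenMP and is not our contest code: the function name <code>pairs_per_task</code> is hypothetical, and the cost model (a starting row <code>rowStart</code> costs <code>n - rowStart</code> row pairs when no pruning occurs) is an assumption.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each task, counts the (rowStart, rowEnd) pairs it would visit when
// task taskNumber handles rowStart = taskNumber, taskNumber + numberOfTasks, ...
// Without pruning, a starting row rowStart is paired with n - rowStart rows.
std::vector<unsigned long long> pairs_per_task(std::size_t n,
                                               std::size_t numberOfTasks) {
    assert(numberOfTasks > 0);
    std::vector<unsigned long long> work(numberOfTasks, 0);
    for (std::size_t taskNumber = 0; taskNumber < numberOfTasks; ++taskNumber) {
        for (std::size_t rowStart = taskNumber; rowStart < n;
             rowStart += numberOfTasks) {
            work[taskNumber] += n - rowStart;
        }
    }
    return work;
}
```

<p>For <em>n</em> = 8 and 4 tasks, the counts are 12, 10, 8 and 6: the tasks are not equal, so a core that picks up a short task needs further tasks to stay busy.</p>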
<p><P></p>
<p><H2><A NAME="SECTION00052000000000000000"><br />
Our algorithm</A><br />
</H2></p>
<p><P><br />
After having parallelized the two-dimensional <SPAN CLASS="textit">Kadane</SPAN> algorithm, we started to work on the algorithm itself.<br />
First we had the idea of creating a new one-dimensional array (called <code>maxSumStartingAtRow</code>) which contains, for each index <code>i</code>, an upper bound on the maximal sum you can obtain for a sub-matrix lying between row <code>i</code> and the end of the original matrix.</p>
<p><P><br />
We used the <code>maxSumStartingAtRow</code> array to break out of the second <code>for</code> loop if the current maximal sum found in the current task was already bigger than the upper bound on the maximal sum (lines 9 to 12 in figure <A HREF="kadane">2</A>).</p>
<p><P></p>
<p><DIV ALIGN="CENTER"><A NAME="fig:kadane"></A><A NAME="102"></A><br />
<TABLE><br />
<CAPTION ALIGN="BOTTOM"><STRONG>Figure 2:</STRONG><br />
Part of our Kadane algorithm.</CAPTION><br />
<TR><TD><IMG WIDTH="583" HEIGHT="293" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img12.png" ALT="\begin{figure}\begin{lstlisting}[numbers=left, numberstyle=\footnotesize , stepn...<br />
...<br />
break;<br />
}<br />
...<br />
for(...){<br />
...<br />
}<br />
...<br />
}<br />
...<br />
}<br />
\end{lstlisting}<br />
\end{figure}"></TD></TR><br />
</TABLE><br />
</DIV></p>
<p><P><br />
For the generation of <code>maxSumStartingAtRow</code>, we first did a preprocessing operation which was parallelized. This operation started from the bottom of the <code>matrix</code>, ran several one-dimensional <SPAN CLASS="textit">Kadane</SPAN> computations and added the values from the bottom to the top of <code>maxSumStartingAtRow</code>, as illustrated in figure <A HREF="fill">3</A>. </p>
<p><P></p>
<p><DIV ALIGN="CENTER"><A NAME="fig:fill"></A><A NAME="111"></A><br />
<TABLE><br />
<CAPTION ALIGN="BOTTOM"><STRONG>Figure 3:</STRONG><br />
The maxSumStartingAtRow generation</CAPTION><br />
<TR><TD><br />
<DIV ALIGN="CENTER"><br />
<IMG WIDTH="382" HEIGHT="237" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/ssum.png" ALT="Image ssum"><br />
</DIV></TD></TR><br />
</TABLE><br />
</DIV></p>
<p><P><br />
With this solution, the problem was that <code>maxSumStartingAtRow</code> was a very poor estimate of the real maximum sum starting at a row. This is quite easy to understand with the example in figure <A HREF="bad">4</A>. </p>
<p><DIV ALIGN="CENTER"><A NAME="fig:bad"></A><A NAME="119"></A><br />
<TABLE><br />
<CAPTION ALIGN="BOTTOM"><STRONG>Figure 4:</STRONG><br />
The problem with maxSumStartingAtRow</CAPTION><br />
<TR><TD><br />
<DIV ALIGN="CENTER"><br />
<IMG WIDTH="422" HEIGHT="237" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/ssum2.png" ALT="Image ssum2"><br />
</DIV></TD></TR><br />
</TABLE><br />
</DIV></p>
<p><P><br />
A solution we found was to run the two-dimensional <SPAN CLASS="textit">Kadane</SPAN> algorithm at regular intervals. Indeed, on huge arrays, this algorithm is far more accurate. It was used to reduce the difference between the real and the computed values in <code>maxSumStartingAtRow</code>.</p>
<p><P><br />
The last problem we had was the time needed to compute this preprocessing operation. With the preprocessing, the solving time of the two-dimensional algorithm was reduced, but on big arrays (<em>10000 x 10000</em>) the total time only decreased by a few seconds (because the preprocessing itself takes a long time).<br />
<BR><br />
<P><br />
Finally, our last two-dimensional algorithm does not use preprocessing. The trick is that the classical two-dimensional <SPAN CLASS="textit">Kadane</SPAN> algorithm spends its time computing sums from one row to another. This allows us to use the solving part of our algorithm to fill <code>maxSumStartingAtRow</code> (the following piece of code is placed at line 19 in figure <A HREF="kadane">2</A>):</p>
<p><P><br />
<BR><br />
<IMG WIDTH="568" HEIGHT="102" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img15.png" ALT="\begin{lstlisting}<br />
...<br />
if (!pruningOccured) {<br />
..."><br />
<BR></p>
<p><P><br />
The <code>maxSumStartingAtRow</code> <code>vector</code> is initialized with the following line:<br />
<BR><br />
<IMG WIDTH="566" HEIGHT="25" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img16.png" ALT="\begin{lstlisting}<br />
for (int i = 0; i &lt; n; ++i) maxSumStartingAtRow[i] = LONG_MAX&gt;&gt;2;<br />
\end{lstlisting}"><br />
<BR><br />
As we use <code>maxSumStartingAtRow</code> in an addition, we want to avoid overflow. This is why we divide the <code>LONG_MAX</code> value by 4.<br />
<BR><br />
<P><br />
Finally, we add a variable <code>bestSoFar</code>, shared by all threads, which holds the best value found so far by all completed tasks. This value is used to initialize the <code>sum</code> variable (line 2 in figure <A HREF="kadane">2</A>: we replace <code>sum = 0</code> by <code>sum = bestSoFar</code>) so that parts of the matrix can be pruned based on the best values found by other tasks.<br />
<BR><br />
<P><br />
Note that having more tasks than available cores is important for our pruning method too, because threads can then communicate partial results to each other more often, and in doing so they help each other prune parts of the computation.<br />
<BR><br />
<P><br />
To conclude, the complexity of our <SPAN CLASS="textit">Kadane</SPAN> algorithm is still <em>O(n² x m)</em>, but thanks to the pruning in the second for loop, it runs faster most of the time.</p>
<p><P></p>
<p><H1><A NAME="SECTION00060000000000000000"><br />
The final algorithm</A><br />
</H1></p>
<p><P><br />
The final part of the algorithm is really simple. It is the <code>static void computeMaxSubArray(char* fileName)</code> method of the <code>MaxSubArrayPb</code> class.<br />
It only uses the different functions we wrote. Here are the different tasks executed by the algorithm:<br />
<DL><br />
<DT><STRONG>Load the file: </STRONG></DT><br />
<DD>This just loads the file into main memory.</p>
<p></DD><br />
<DT><STRONG>Get the matrix size: </STRONG></DT><br />
<DD>Here, the only goal is to get the dimensions of the input matrix and to fill a vector with the addresses of all the lines. </p>
<p></DD><br />
<DT><STRONG>Find the right orientation: </STRONG></DT><br />
<DD>As explained before, our algorithm is far more efficient with some particular arrangements of the initial matrix. </p>
<p></DD><br />
<DT><STRONG>Generate the vector : </STRONG></DT><br />
<DD>This operation turns the input file into a <SPAN CLASS="textit">C++</SPAN> two-dimensional vector.</p>
<p></DD><br />
<DT><STRONG>Launch the right algorithm : </STRONG></DT><br />
<DD>Depending on the number of rows of the input vector, we launch either the one-dimensional <SPAN CLASS="textit">Kadane</SPAN><br />
algorithm or our two-dimensional algorithm.</p>
<p></DD><br />
<DT><STRONG>Reverse the result if necessary : </STRONG></DT><br />
<DD>If the third step reversed the matrix, we need to rotate the result to get the correct output coordinates.</p>
<p></DD><br />
<DT><STRONG>Print the result : </STRONG></DT><br />
<DD>Probably no need of explanation here <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> .<br />
</DD><br />
</DL><br />
The main function, in the <code>main.cpp</code> file, only calls the <code>computeMaxSubArray</code> method on each file passed as a parameter.<br />
It also defines the number of threads the algorithms have to use.<br />
As it is really short, here is the code :<br />
<BR><br />
<IMG WIDTH="567" HEIGHT="153" ALIGN="BOTTOM" BORDER="0" SRC="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/img20.png" ALT="\begin{lstlisting}<br />
int main(int argc, char* argv[]){<br />
if(argc &lt; 3){<br />
cout&lt;&lt;&quot;Par...<br />
...){<br />
MaxSubArrayPb::computeMaxSubArray(argv[i]);<br />
}<br />
return 0;<br />
}<br />
\end{lstlisting}"><br />
<BR><br />
<H1><A NAME="SECTION00070000000000000000"><br />
Conclusion</A><br />
</H1><br />
This contest was really interesting in a lot of different ways. First of all, it involved teamwork between teacher and students, which was really rewarding.<br />
We all learned a lot about a topic we didn't know well.<br />
As computers have more and more cores, this kind of computation is probably going to become a very important issue in future application development.<br />
This contest was the occasion to discover existing technologies. It was also the occasion to practice on a 40-core computer, which is not possible every day.<br />
The topic of the contest, the maximum subarray problem, was an interesting problem to try to parallelize.<br />
It was quite simple to understand and it allowed us to use multiple ways to parallelize our program.</p>
<p><P><br />
The resources put at our disposal by <SPAN CLASS="textit">Intel</SPAN> were well suited to beginners in parallel computing. We enjoyed learning by watching the video tutorial.</p>
<p><P><br />
To conclude, it was a really rich experience and we want to thank <SPAN CLASS="textit">Intel</SPAN> for organizing this contest.</p>
<p><P><br />
Finally, you can download our full packages. </p>
<p>The first one, MTL_package.zip, contains the files we sent for the contest. The makefile is adapted for the <em>MTL</em>.<br />
The second one, Normal_package.zip, should run on your personal computer. You just need to have g++ 4.5.1 or later installed on your PC.</p>
<p><a href='http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/MTL_package.zip'>MTL_package.zip</a><br />
<a href='http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/Normal_package.zip'>Normal_package.zip</a></p>
<p>In both zip files, you will find the same explanations in the detailled_explannations.pdf file. You will also have access to the doxygen documentation.</p>
<p>We hope you enjoyed reading this article.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/11/28/the-maximum-subarray-problem-algorithmic-optimizations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Subarray Problem - A static NUMA-Aware approach</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/11/24/subarray-problem-a-static-numa-aware-approach/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/11/24/subarray-problem-a-static-numa-aware-approach/#comments</comments>
		<pubDate>Thu, 24 Nov 2011 12:19:44 +0000</pubDate>
		<dc:creator>krahnack</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[ISN France]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/11/24/subarray-problem-a-static-numa-aware-approach/</guid>
		<description><![CDATA[The subarray problem on a n*m matrix is sequentially solved using an algorithm known as the Kadane 2D algorithm. This algorithm has a O(n²m) complexity. The sequential algorithm is written using 3 loops : for i in (0..n) // &#60;- We parallelize that for j in (i..n) for k in (0..m) //do work with matrix[j][k] [...]]]></description>
			<content:encoded><![CDATA[
<div>The subarray problem on an n*m matrix is solved sequentially using an algorithm known as the Kadane 2D algorithm. This algorithm has O(n²m) complexity. The sequential algorithm is written using 3 loops:
      </div>
<pre>
         for i in (0..n)   // &lt;- We parallelize that
             for j in (i..n)
                 for k in (0..m)
                     //do work with matrix[j][k]
      </pre>
<div>Our solution does not try to optimize the work performed inside the inner loop, so we skip the details of what is actually done inside. We chose to parallelize only the outer loop (index <b>i</b>).</div>
<div>In order to parallelize the outer loop on K cores, we chose to split it into K tasks of equal duration. This approach has several advantages :</p>
<ul>
<li>The algorithm is very simple : there is no need to steal work or do complex load balancing between the K cores.</li>
<li>Each thread works on large contiguous portions of the matrix, which maximizes cache usage.</li>
<li>We know in advance what the threads are going to do and which data are going to be accessed so we can do smart NUMA optimizations.</li>
</ul></div>
<div>In this article, we explain how we split the work into K equal tasks and how we optimized the processing of these tasks.</div>
<h2>1-Creating K tasks of equal duration</h2>
<table>
<tr>
<td>
			 <img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/splitting.png" style="padding:15px"></img><br />
			 <label><b>Fig. 1</b> - <i>K=4 equal areas in a triangle</i></label>
		</td>
<td style="padding-left:30px">
<div>In order to split a <tt>for i (0..n)</tt> loop into K tasks, one often creates K tasks <tt>[i=0..n/K],[i=n/K..2*n/K]...[i=(K-1)*n/K..n]</tt>. However, this simple solution does not work well in our case because the second loop (index <b>j</b>) starts at index <b>i</b>. This means that when <tt>i==0</tt>, <tt>n</tt> iterations are done in the second loop and when <tt>i==n-1</tt> only <tt>1</tt> iteration is done in the second loop! The amount of work depending on <b>i</b> is represented in Figure 1. This figure represents an example of the work to be done on a 250*m matrix. When <tt>i==0</tt>, 250 iterations are done; when <tt>i==249</tt>, only 1 iteration is done. The total quantity of work to be done is equal to the area of the triangle.</div>
<div>Splitting the work into K equal tasks is equivalent to creating K equal areas inside the above mentioned triangle.</div>
<div>For example, in Figure 1, representing the work to be done on a 250*m matrix, a close-to-optimal partitioning is the following:</p>
<ul>
<li>Thread 0 doing i (0-34) = 7939 <b>j</b> iterations (area A1)</li>
<li>Thread 1 doing i (34-74) = 7860 <b>j</b> iterations (area A2)</li>
<li>Thread 2 doing i (74-125) = 7701 <b>j</b> iterations (area A3)</li>
<li>Thread 3 doing i (125-250) = 7875 <b>j</b> iterations (area A4)</li>
</ul>
<p>			With this partitioning, there is at most a 3% difference in the number of iterations performed by each thread.
		     </p></div>
</td>
</tr>
</table>
<div>
	In order to find the last index that a thread <b>idx</b> should process (e.g., 34 for thread 0 in the above example), we use the following loop:</p>
<pre>
    int last_index = 0;
    do {
	    last_index++;
    } while((last_index)*(n) - (last_index+1)*(last_index)/2 &lt; (idx+1) * n * (n - 1) / 2 / K);
	 </pre>
<p>Where <tt>n</tt> is the number of rows of the matrix, <tt>idx</tt> is the thread number and <tt>K</tt> the number of threads.
      </div>
<div>
	This loop increments <tt>last_index</tt> until the amount of work done between <tt>i=0</tt> and <tt>i=last_index</tt> reaches <tt>(idx+1)*(total-work-to-be-done/number-of-workers)</tt>. The calculation of "the amount of work done" is the calculation of the area of a trapezoid. (E.g., in figure 1 the area A1, the work done by thread 0, is the area of a trapezoid.)
    </div>
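<div>As a quick check, the loop above can be wrapped in a function and run on the 250-row example. This is only a sketch: the function name <code>last_index_for</code> and the use of <code>long long</code> are choices made for the illustration, not the code we submitted.</div>

```cpp
#include <cassert>

// Wraps the do/while loop shown above. n is the number of rows, idx the
// thread number and K the number of threads. Returns the first value of
// last_index whose cumulative amount of work reaches the share of threads
// 0..idx. The name last_index_for is hypothetical.
long long last_index_for(long long n, long long idx, long long K) {
    assert(n > 0 && K > 0 && idx >= 0 && idx < K);
    long long last_index = 0;
    do {
        last_index++;
    } while (last_index * n - (last_index + 1) * last_index / 2 <
             (idx + 1) * n * (n - 1) / 2 / K);
    return last_index;
}
```

<div>For <tt>n=250</tt> and <tt>K=4</tt> this returns 34, 74 and 125 for threads 0, 1 and 2. For the last thread it returns 249, so in this sketch the caller is assumed to extend the final range up to <tt>n</tt>.</div>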
<div>Actually this could also be calculated with the following formula:</p>
<pre>
    last_index = 2*n - (&radic;<span style="text-decoration:overline">(4*n*n-4*n+1)*K*K+((-4*<b>idx</b>-4)*n*n+(4*<b>idx</b>+4)*n)*K</span>+(2*n-1)*K)/(2*K);
	</pre>
<p>	... but it is actually slower than doing the loop! (We think that the compiler is doing really smart things and that the loop is actually optimized and transformed into a much more efficient formula.)</p>
<h2>2-NUMA optimizations</h2>
<div>As mentioned earlier, we also do NUMA optimizations. <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  In order to improve the locality of the memory accessed by the threads, we have:</p>
<ul>
<li>Created a thread pool per NUMA node in the system. Each thread pool is totally independent from the others. Each thread pool is controlled by a master thread scheduled on the same NUMA node as the pool it controls.</li>
<li>The creation of the K tasks is done in parallel by the master threads (actually each master thread creates K/4 tasks since there are 4 NUMA nodes on the MTL).</li>
<li>Before giving the tasks to its workers, each master thread <b>duplicates the matrix on the local NUMA node</b>. This ensures that, when the matrix does not fit in cache, the worker threads fetch data from their local memory. This optimization actually gives a <b>+25%</b> performance boost at 40 cores. Lesson learned: pay attention to data locality. <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </li>
<li>(Note for those who might think that it is an incredible waste of memory: a 10K*10K matrix occupies 380MB in RAM. The MTL machine has 64GB. So one copy per node = a "waste" of 1.5GB = 2.3% of the memory of the machine = really negligible compared to the gain.)</li>
</ul></div>
<h2>3-Other performance optimizations</h2>
<div>
<ul>
<li>Our approach falls back on the sequential algorithm when the parallel algorithm is considered too costly. (E.g. the cost of duplicating the matrix and managing the thread pool cannot be amortized.)</li>
<li>Since the subarray algorithm is of O(n²m) complexity, it is sometimes worth transposing the matrix before any computation, in order to have n&lt;m. Experiments showed that transposing becomes worthwhile as soon as the difference in complexity is above 5K operations.</li>
<li>Both reading and transposing the matrix are done in parallel using our thread pool. The input file is mapped in memory and each reader thread is responsible for parsing 800KB of the input file and creating a partial matrix corresponding to what it has read. All submatrices are then merged using a simple memcpy operation.</li>
</ul></div>
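<p>The transposition rule from the list above can be sketched as a simple cost comparison. This is only an illustration under stated assumptions: the function names are hypothetical, constant factors are ignored, and the "difference in complexity" is read as the difference between n²m and m²n, with the 5K threshold taken from the text.</p>

```cpp
#include <cstdint>

// Estimated cost of the O(n^2 * m) subarray algorithm on an n x m matrix
// (hypothetical helper, constant factors ignored).
std::int64_t estimated_cost(std::int64_t n, std::int64_t m) {
    return n * n * m;
}

// True when working on the transposed (m x n) matrix is estimated to save
// more than `threshold` operations. The default of 5000 is the "5K" from the post.
bool should_transpose(std::int64_t n, std::int64_t m, std::int64_t threshold = 5000) {
    return estimated_cost(n, m) - estimated_cost(m, n) > threshold;
}
```

<p>With this reading, a 100*10 matrix is transposed (100,000 versus 10,000 estimated operations), while a 10*9 matrix is left alone because the estimated gain of 90 operations is below the threshold.</p>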
<h2>4-Figure for nerds</h2>
<div>Time to present some results!</div>
<div>
	      <img src="http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/speedup.png" style="padding-bottom:15px"></img><br /><b>Fig 2</b> - <i>Speedup of our algorithm on a 10K*10K matrix</i><br />
              The algorithm has a near-optimal speedup between 10 and 40 cores (x3.94) and between 1 and 40 cores (x36.8). This means that, according to Amdahl's law, more than 99.77% of our code is parallel. For those interested, it takes 5.9s at 40 cores to parse a 10K*10K matrix.<br />
	      We think that the speedup seen by the Intel team might have been a little lower due to our static partitioning of data: on the final test 2 cores were fully loaded, which means that our partitioning was no longer optimal. Nevertheless our solution seems to have behaved quite nicely even when (intuitively) load balancing could have been required.
     </div>
<p></p>
<h2>5-Code</h2>
<div>Finally, here's a link to <a href='http://software.intel.com/fr-fr/blogs/wordpress/wp-content/uploads/solution.zip'>our code</a></div>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/11/24/subarray-problem-a-static-numa-aware-approach/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Methods for reading an input file</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/11/22/mthodes-de-lire-un-fichier-dentre/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/11/22/mthodes-de-lire-un-fichier-dentre/#comments</comments>
		<pubDate>Tue, 22 Nov 2011 11:10:47 +0000</pubDate>
		<dc:creator>wtx2338</dc:creator>
				<category><![CDATA[Acceler8]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/11/22/mthodes-de-lire-un-fichier-dentre/</guid>
		<description><![CDATA[This test is based on articles found on the internet. Its goal is to find the fastest possible way to read integers from a file, which is something our contest program has to do at startup. We ran some tests, and here are the results. The first [...]]]></description>
			<content:encoded><![CDATA[<p>This test is based on articles found on the internet. Its goal is to find the fastest possible way to read integers from a file, which is something our contest program has to do at startup. We ran some tests, and here are the results.</p>
<p>Our first idea was to use cin and cout from std. We knew this was not the fastest way, but we did not know how slow it actually was. We ran a test with 10,000,000 integers (all the other tests use the same 10,000,000 integers). The code is very simple:</p>
<p><code><br />
std::cin&gt;&gt;data[i];<br />
</code><br />
The result is bad, really bad:</p>
<p>time ./t1<br />
real	0m5.615s<br />
user	0m5.544s<br />
sys	0m0.068s</p>
<p>We still spent 5 to 6 seconds reading a file and, what is more, this part of our program cannot be parallelized. So, after looking for other methods on the internet, we started another test.<br />
The second method is to use scanf, which is said to be much faster than cin. We changed just one line of code:</p>
<p><code><br />
for (i = 0; i &lt; MAXN; i++)<br />
	scanf(&quot;%d&quot;,&amp;data[i]);<br />
</code></p>
<p>The result is much better this time:</p>
<p>time ./t2<br />
real	0m1.909s<br />
user	0m1.844s<br />
sys	0m0.060s</p>
<p>In fact, cin is very slow because, by default, the C++ streams stay synchronized with the C stdio streams on every operation. Someone on the internet pointed out that this synchronization can be turned off with this line of code:<br />
<code><br />
std::ios::sync_with_stdio(false);<br />
</code><br />
That way, cin can be as fast as scanf.</p>
<p>We thought we could do even better. Both methods may be slow because of the type checking they perform, so we wrote a method based on fgetc, which reads a stream character by character:</p>
<p><code><br />
FILE *fd=freopen("input.txt","rb",stdin);<br />
int i=0; /* number of spaces seen */<br />
int j=0; /* number of newlines seen */<br />
int p; /* int rather than char, so that EOF can be detected */<br />
while((p=fgetc(fd))!=EOF){<br />
	if (p == ' ') i++;<br />
	if(p=='\n') j++;<br />
}</code></p>
<p>This method can also report the number of rows and columns, and the result is better than the first two:</p>
<p>time ./t3<br />
real	0m1.412s<br />
user	0m1.376s<br />
sys	0m0.032s</p>
<p>We then found a method on the internet that uses fread, which can be even faster:</p>
<p><code><br />
FILE *fd=fopen("input.txt","r");<br />
int len = fread(buf,1,MAXS,fd); /* buf must hold at least MAXS+1 bytes */<br />
buf[len] = '\0';<br />
int i;<br />
numbers[i=0]=0;<br />
for (char *p=buf;*p &amp;&amp; p-buf&lt;len;p++){<br />
	if (*p == &#039; &#039; || *p == &#039;\n&#039;){ /* a space or a newline ends the current number */<br />
		numbers[++i]=0;}<br />
	else{<br />
		numbers[i] = numbers[i] * 10 + *p - &#039;0&#039;;}<br />
}</p>
<p></code></p>
<p>The idea is to read the whole file into memory and then work on the in-memory copy, and the result is much better than before:</p>
<p>time ./t2<br />
real	0m0.288s<br />
user	0m0.184s<br />
sys	0m0.108s</p>
<p>This method is roughly 6 to 7 times faster than the scanf version (0.288s versus 1.909s) and about 5 times faster than the fgetc version (0.288s versus 1.412s), so this is the one we ended up using in our program.</p>
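<p>For reference, here is a slightly more defensive sketch of the same idea: one fread call, then parsing in memory. It is not the code we benchmarked. The names <tt>parse_ints</tt> and <tt>read_ints_from_file</tt> are hypothetical, negative numbers and arbitrary whitespace are handled as an extra, and integer overflow is not checked.</p>

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Parses decimal integers (optional leading '-') separated by any other
// characters from a memory buffer of `len` bytes.
std::vector<int> parse_ints(const char* buf, std::size_t len) {
    std::vector<int> numbers;
    std::size_t pos = 0;
    while (pos < len) {
        // Skip everything that cannot start a number.
        while (pos < len && buf[pos] != '-' && (buf[pos] < '0' || buf[pos] > '9')) ++pos;
        if (pos == len) break;
        bool negative = false;
        if (buf[pos] == '-') { negative = true; ++pos; }
        if (pos == len || buf[pos] < '0' || buf[pos] > '9') continue;  // lone '-'
        int value = 0;
        while (pos < len && buf[pos] >= '0' && buf[pos] <= '9') {
            value = value * 10 + (buf[pos] - '0');
            ++pos;
        }
        numbers.push_back(negative ? -value : value);
    }
    return numbers;
}

// Reads the whole file with a single fread call, then parses it in memory.
// Returns an empty vector if the file cannot be opened.
std::vector<int> read_ints_from_file(const char* path) {
    std::vector<int> numbers;
    std::FILE* fd = std::fopen(path, "rb");
    if (!fd) return numbers;
    std::fseek(fd, 0, SEEK_END);
    long size = std::ftell(fd);
    std::fseek(fd, 0, SEEK_SET);
    if (size > 0) {
        std::vector<char> buf(static_cast<std::size_t>(size));
        std::size_t len = std::fread(buf.data(), 1, buf.size(), fd);
        numbers = parse_ints(buf.data(), len);
    }
    std::fclose(fd);
    return numbers;
}
```

<p>Keeping the parser separate from the file access makes it easy to try on a string, and it could also be handed one chunk of a larger buffer at a time.</p>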
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/11/22/mthodes-de-lire-un-fichier-dentre/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to TBB &quot;Ranges&quot;</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/11/22/introduction-aux-ranges-des-tbb/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/11/22/introduction-aux-ranges-des-tbb/#comments</comments>
		<pubDate>Tue, 22 Nov 2011 11:10:28 +0000</pubDate>
		<dc:creator>megra</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[programmation parallèle]]></category>
		<category><![CDATA[acceler8]]></category>
		<category><![CDATA[parallel_for]]></category>
		<category><![CDATA[range]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/11/22/introduction-aux-ranges-des-tbb/</guid>
		<description><![CDATA[Hello everyone, I am going to present a feature of the TBB library that I had the chance to discover during the Acceler8 contest. As a reminder, TBB, which stands for "Threading Building Blocks", is a library developed by Intel that aims to make parallelism easier. A reminder about TBB One of its most appreciated features is [...]]]></description>
			<content:encoded><![CDATA[<p>Hello everyone,</p>
<p>I am going to present a feature of the <a href="http://threadingbuildingblocks.org/">TBB</a> library that I had the chance to discover during the <a href="http://software.intel.com/fr-fr/articles/Acceler8France/">Acceler8</a> contest.<br />
As a reminder, <a href="http://threadingbuildingblocks.org/">TBB</a>, which stands for "Threading Building Blocks", is a library developed by Intel that aims to make parallelism easier.</p>
<h2>A reminder about TBB</h2>
<p>One of its most appreciated features is <a href="http://threadingbuildingblocks.org/files/documentation/a00233.html">parallel_for</a>, along with its relatives such as <a href="http://threadingbuildingblocks.org/files/documentation/a00233.html">parallel_reduce</a>.<br />
They make it very easy to parallelize a loop. Here is an example that parallelizes a simple printing function, which could be replaced by a long and costly computation.</p>
<p>The following code prints the integers from 1 to 41:<br />
<a href="http://software.intel.com/file/39812">Raw source code</a>.<br />
<a href="http://paste.pocoo.org/show/507007/">Source code with syntax highlighting</a>.</p>
<p>It can be parallelized as follows:<br />
<a href="http://software.intel.com/file/39813">Raw source code</a>.<br />
<a href="http://paste.pocoo.org/show/507009/">Source code with syntax highlighting</a>.</p>
<p>The printing is done in parallel, so the numbers do not appear in order.</p>
<p>The "range" used here is "tbb::blocked_range", which reproduces the iterations of the loop. You only need to specify the "range" and the class that does the work; parallel_for takes care of everything else. The simplicity of the code is immediately apparent.</p>
<p>Finally, it is important to note that the type of the range appears in the signature of the class method that overloads operator().</p>
<h2>Making our Compute class more modular</h2>
<p>Before going further, let us modify our Compute class so that it accepts a different type of "range" more easily. Nothing could be simpler: we just turn the class into a template, which gives:<br />
<a href="http://software.intel.com/file/39814">Raw source code</a>.<br />
<a href="http://paste.pocoo.org/show/507010/">Source code with syntax highlighting</a>.</p>
<p>With this template, we only need to change the 2 lines of main to switch to another "range".</p>
<h2>What ranges are for</h2>
<p>So, in the end, what are ranges for?<br />
It is simple: they split your problem into sub-problems and, once the split is done, let threads run on those sub-problems.</p>
<p>In our case, we want to parallelize a loop, and "tbb::blocked_range" is the perfect tool for that: it splits the requested interval (from 1 to 42, upper bound excluded) in two, and does so recursively for as long as it is deemed useful.</p>
<p>Of course, for more complex problems, such as problems involving recursive calls, you will not be able to use parallel_for naively, and you will have to look into custom "ranges" or "tasks". We will not cover tasks here, however.</p>
<h2>The methods to implement to write a "range" class</h2>
<p>Looking through <a href="http://threadingbuildingblocks.org/files/documentation/range_req.html">the documentation</a>, we see that our "range" class does not need to inherit from an abstract class, but must implement certain methods:</p>
<ul>
<li>A copy constructor: R( const R&amp; )</li>
<li>A constructor that splits the problem in two: R( R&amp; r, split )</li>
<li>A destructor: ~R()</li>
<li>A method that tells whether the "range" can be split in two: is_divisible</li>
<li>A method that tells whether the current "range" is empty: empty</li>
</ul>
<h2>Defining your own range</h2>
<p>In our case, this gives the following code:<br />
<a href="http://software.intel.com/file/39816">Raw source code</a>.<br />
<a href="http://paste.pocoo.org/show/507016/">Source code with syntax highlighting</a>.</p>
<p>The most important methods are:</p>
<ul>
<li>empty: prevents launching a computation on an empty piece of the problem (which can happen after splitting).</li>
<li>is_divisible: sets a minimum size for a problem and guarantees that a problem of that size is handled by a single thread. This avoids the overhead of launching too many threads. The "grain_size" parameter is used to ensure that the problem handed to a thread is at least 5 rows (for 11 rows, we split into 5 and 6 rows).</li>
<li>do_split: introduced in the code although it is not required; it performs the actual split of the problem into two sub-problems. This is where all the intelligence of the splitting should go.</li>
</ul>
<p>And plugging our custom "range" into our previous code gives:<br />
<a href="http://software.intel.com/file/39815">Raw source code</a>.<br />
<a href="http://paste.pocoo.org/show/507017/">Source code with syntax highlighting</a>.</p>
<p>Note that the constructor's initialization list sets end before begin. This is mandatory because of the splitting constructor and the "do_split" function, which modifies "r.end_".</p>
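<p>To make the points above concrete without following the links, here is a minimal sketch of such a class. It is not the linked code: <tt>RowRange</tt> and <tt>collect_pieces</tt> are hypothetical names, the local <tt>split</tt> struct is only a stand-in for <tt>tbb::split</tt> so that the sketch compiles without TBB, and the <tt>is_divisible</tt> rule (at least twice the grain size) is an assumption chosen to reproduce the "11 rows become 5 and 6" behaviour. Note that C++ initializes members in declaration order, so <tt>end_</tt> is declared before <tt>begin_</tt>.</p>

```cpp
#include <cstddef>
#include <vector>

// Stand-in for tbb::split; with TBB installed, use tbb::split instead.
struct split {};

class RowRange {
    // Declaration order matters: members are initialized in this order, so
    // end_ is copied from r before do_split(r) shrinks r.end_.
    std::size_t end_;
    std::size_t begin_;
    std::size_t grain_size_;

    // Cuts r in the middle: r keeps [begin, middle), the new range gets [middle, end).
    static std::size_t do_split(RowRange& r) {
        std::size_t middle = r.begin_ + (r.end_ - r.begin_) / 2;
        r.end_ = middle;
        return middle;
    }

public:
    // Assumes begin <= end and grain_size >= 1.
    RowRange(std::size_t begin, std::size_t end, std::size_t grain_size)
        : end_(end), begin_(begin), grain_size_(grain_size) {}

    // Splitting constructor; the implicit copy constructor and destructor suffice.
    RowRange(RowRange& r, split)
        : end_(r.end_), begin_(do_split(r)), grain_size_(r.grain_size_) {}

    bool empty() const { return begin_ >= end_; }
    bool is_divisible() const { return end_ - begin_ >= 2 * grain_size_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
};

// Sequential imitation of the recursive splitting (no threads involved):
// keeps splitting while the range is divisible, then records the pieces.
void collect_pieces(RowRange r, std::vector<RowRange>& out) {
    if (r.is_divisible()) {
        RowRange right(r, split{});
        collect_pieces(r, out);
        collect_pieces(right, out);
    } else if (!r.empty()) {
        out.push_back(r);
    }
}
```

<p>With a grain size of 5, a range of 11 rows ends up as two pieces, [0, 5) and [5, 11), and neither piece is split any further.</p>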
<h2>Performance remarks</h2>
<p>The splitting into sub-problems may be invoked a large number of times, so it is important that the splitting function does not take too long to run.<br />
As an anecdote, during the Acceler8 contest, trying to split a problem very evenly, I used the square root function (sqrt), which drastically increased the computation time. What I gained by spreading the work more evenly across the cores of the machine, I lost computing square roots. So be careful not to fall into that trap.</p>
<h2>Conclusion</h2>
<p>You have seen a brief overview of custom "ranges" here; it is up to you to adapt the code to your needs: add parameters to the constructor, and split intelligently and efficiently.</p>
<h2>Sources:</h2>
<ul>
<li><a href="http://jfkbits.blogspot.com/2007/12/tbbs-parallelfor.html">http://jfkbits.blogspot.com/2007/12/tbbs-parallelfor.html</a></li>
<li><a href="http://threadingbuildingblocks.org/files/documentation/range_req.html">http://threadingbuildingblocks.org/files/documentation/range_req.html</a></li>
<li><a href="http://threadingbuildingblocks.org/files/documentation/a00266.html">http://threadingbuildingblocks.org/files/documentation/a00266.html</a>: this is the implementation of blocked_range. It remains one of the best sources of documentation.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/11/22/introduction-aux-ranges-des-tbb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Acceler8 is over, what an experience!</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/08/01/acceler8-est-fini-quelle-exprience/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/08/01/acceler8-est-fini-quelle-exprience/#comments</comments>
		<pubDate>Mon, 01 Aug 2011 08:30:47 +0000</pubDate>
		<dc:creator>farcellier</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[ISN France]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/08/01/acceler8-est-fini-quelle-exprience/</guid>
		<description><![CDATA[Tuesday morning, what a pleasant surprise to receive the prizes from the acceler8 contest. They had just been shipped the day before. After 2 months of intensive work, a page is turning. The acceler8 contest is well and truly over. It was an intense and enriching event. When we set out on this adventure, we did not think [...]]]></description>
			<content:encoded><![CDATA[<p>Tuesday morning, what a pleasant surprise to receive the prizes from the acceler8 contest.<br />
They had just been shipped the day before.</p>
<p><img alt="" src="https://lh3.googleusercontent.com/-fHuvZokFgLg/Ti6l9zoCP9I/AAAAAAAAAHI/79t5jEV_z2c/2011-07-26+10.37.40.jpg" class="aligncenter" width="640" height="480" /></p>
<p>After 2 months of intensive work, a page is turning. The acceler8 contest is well and truly over. It was an intense and enriching event. When we set out on this adventure, we did not think it would take us this far.</p>
<p>Parallelism is on everyone's lips these days. However, we did not expect to discover such a rich world. The <a href="http://software.intel.com/fr-fr/articles/acceler8_recherche_nombres_premiers_particuliers_solution/">first problem</a>, beneath its apparent simplicity, turned out to be much tougher and spicier than we expected. Right up to the last half hour, we kept thinking about it and looking for ways to improve our program's execution time.</p>
<p>The <a href="http://software.intel.com/fr-fr/articles/acceler8_recherche_nombres_premiers_particuliers_solution/">second</a>, more difficult, made us sweat more than once. The experience gained from the first proved instructive, and it was only after long, sustained work that we managed to deliver an efficient program.</p>
<p>These 2 netbooks are not just a reward. They are a reminder of the efforts we made to keep improving. They are also a reminder of the efforts we still have to make to improve further.</p>
<p><img src="https://lh3.googleusercontent.com/-shvjwprfobY/Ti6l9zI2ntI/AAAAAAAAAHM/p0DhyNFnuKA/s640/2011-07-26+12.34.16.jpg" alt="Second Eee PC from the acceler8 contest" /></p>
<p>By inviting us on this journey along the road of parallelism, Intel allowed us to make some headway in this field. Throughout this challenge, they guided us along the way. The sharing, whether with the organizers or with the other contestants, made this experience unique.</p>
<p>I hope that other challenges in such specialized fields will be organized with the same passion and the same desire to let students discover fields that are sometimes left out of school curricula.</p>
<p>Fabien</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/08/01/acceler8-est-fini-quelle-exprience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

