<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Intel® Software Network (FR) &#187; neshone</title>
	<atom:link href="http://software.intel.com/fr-fr/blogs/author/neshone/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/fr-fr/blogs</link>
	<description></description>
	<lastBuildDate>Mon, 14 May 2012 06:49:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Maximum Subarray Problem - Simple Parallelization and Optimizations</title>
		<link>http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/</link>
		<comments>http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 13:26:53 +0000</pubDate>
		<dc:creator>neshone</dc:creator>
				<category><![CDATA[Acceler8]]></category>
		<category><![CDATA[programmation parallèle]]></category>

		<guid isPermaLink="false">http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/</guid>
		<description><![CDATA[University of Novi Sad Faculty of Technical Sciences, Department of Computing and Control authors: Predrag Ilkic, Nenad Jovanovic Date: October 15th 2011 - November 15th 2011 Introduction: This article is an explanation of our method for solving the maximum subarray problem during the Intel Acceler8 contest. The team consisted of one fourth year university student [...]]]></description>
			<content:encoded><![CDATA[<p><span lang="EN"></p>
<p align="center"><span lang="EN"><font size="2">University of Novi Sad</font>
</p>
<p align="center"><font size="2">Faculty of Technical Sciences, Department of Computing and Control</font></p>
<p align="center"><font size="2">authors: Predrag Ilkic, Nenad Jovanovic</font></p>
<p align="center"><font size="2">Date: October 15th 2011 - November 15th 2011</font></p>
<h2 class="MsoNormal"><span><font size="5">Introduction:</font></span><span><span lang="EN"></h2>
<p>This article explains our method for solving the maximum subarray problem during the Intel Acceler8 contest. The team consisted of one fourth-year university student and one fourth-grade high school student.</p>
<p>The problem was the following: in a given matrix, find the rectangular subarray, identified by its corner coordinates, that has the maximum sum of elements. Our solution used the well-known Kadane's algorithm adapted for two-dimensional arrays. We started the project with OpenMP, but at one point we switched to pthreads because we got much better results. The algorithm was not too complex to implement, so the transition went quite smoothly. The only downside of the pthreads solution was that we switched to it much too late, so we never had time to try out some of the ideas that had failed with OpenMP. The compiler used was gcc; no other compilers were tested.</p>
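<p>For readers who have not met it before, here is a minimal sketch of the one-dimensional Kadane's algorithm that the two-dimensional version builds on. It is an illustration only, not the contest code; the names <code>max_span</code> and <code>kadane_1d</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Result of a 1-D maximum subarray search: best sum and its inclusive bounds. */
typedef struct {
    long long sum;
    size_t start;
    size_t end;
} max_span;

/* Kadane's algorithm over a non-empty array; runs in O(n). */
max_span kadane_1d(const int *a, size_t n)
{
    assert(a != NULL && n > 0);

    max_span best = { a[0], 0, 0 };
    long long running = a[0];   /* best sum of a span ending at index i */
    size_t running_start = 0;

    for (size_t i = 1; i < n; i++) {
        if (running < 0) {      /* a negative prefix can only hurt: restart here */
            running = a[i];
            running_start = i;
        } else {
            running += a[i];
        }
        if (running > best.sum) {
            best.sum = running;
            best.start = running_start;
            best.end = i;
        }
    }
    return best;
}
```

<p>The two-dimensional adaptation fixes a pair of rows, collapses the rows between them into one array of column sums, and runs this scan over that array.</p>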
<p><span lang="EN"></p>
<h2>1 - matrix parsing</span></span></span></span></span></h2>
<p><span lang="EN"></p>
<p>Matrix parsing and loading were done sequentially. During the OpenMP phase of the project we tried to parallelize parsing and loading, but without success: at best the timings were about the same as the sequential version, so the idea was dropped. No parallel parsing was attempted with pthreads due to lack of time.</p>
<p>The file was read with the standard fread() function, one chunk at a time. We parsed it first by counting the spaces to get the column count and then by counting the line breaks to get the row count. Once the matrix size was known, we allocated the appropriate amount of memory. The data was then laid out in memory according to whether rows&gt;columns, because this affects the cost of the solution, which is O(rows*rows*columns). Choosing the layout at load time meant we did not have to waste time transposing the matrix later. The integers were read with our own integer-reading routine, which we found much faster than the standard fscanf(). The matrix values from the file were never actually stored in memory: each integer was used as it was read to generate the prefix sum matrix, and only the prefix sum matrix was processed in this solution.</p>
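<p>The loading step can be sketched roughly as follows. This is not the code from the archive: it assumes the text is already in one NUL-terminated buffer (rather than arriving in fread() chunks), that the prefix sums run down each column, and that the input is well formed. The names <code>read_int</code> and <code>build_column_prefix</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Read one optionally signed decimal integer at *p and advance *p past it.
   Assumes well-formed input: there is no overflow or error checking, which
   is part of why a hand-written reader can beat fscanf(). */
int read_int(const char **p)
{
    const char *s = *p;
    while (*s == ' ' || *s == '\n' || *s == '\r' || *s == '\t')
        s++;
    int negative = 0;
    if (*s == '-') {
        negative = 1;
        s++;
    }
    int value = 0;
    while (*s >= '0' && *s <= '9')
        value = value * 10 + (*s++ - '0');
    *p = s;
    return negative ? -value : value;
}

/* Fill prefix[(r + 1) * cols + c] with the sum of column c over rows 0..r.
   prefix must hold (rows + 1) * cols entries; its first row is set to zero.
   The raw matrix values are consumed as they are read and never stored. */
void build_column_prefix(const char *text, size_t rows, size_t cols,
                         long long *prefix)
{
    assert(text != NULL && prefix != NULL);
    for (size_t c = 0; c < cols; c++)
        prefix[c] = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            prefix[(r + 1) * cols + c] = prefix[r * cols + c] + read_int(&text);
}
```

<p>With this layout, the sum of column c over rows i..j is a single subtraction, prefix[(j + 1) * cols + c] - prefix[i * cols + c], which is what keeps the inner loop of the search cheap.</p>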
<p></span><span lang="EN"></p>
<h2>2 - parallelization and optimizations</h2>
<p><span lang="EN"></p>
<p>The whole parallelization was first done in OpenMP and later translated literally to pthreads. Since the two-dimensional Kadane algorithm consists of three nested for loops, we parallelized the outermost loop to decrease overhead and increase data locality and the cache hit rate. All iterations of the outer loop are completely independent, so parallelizing the algorithm this way scales very well. We tried different types of scheduling with OpenMP. Static scheduling was the least effective, while dynamic and guided scheduling gave very similar results, so we proceeded with dynamic scheduling: each iteration is picked up by one of the worker threads, which processes it and then takes the next one, until all iterations are processed. After the transition to pthreads, dynamic scheduling was kept, and no other scheduling types were tried due to the lack of time mentioned earlier. Every worker thread collects its own results, from which the final result is picked at the end. This was done to eliminate any thread synchronization that might slow down performance.</p>
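<p>The scheme described above might look roughly like this in OpenMP. It is a sketch under assumptions, not the code from the archive: it takes a column prefix-sum matrix with a leading row of zeros as input, and the names <code>max_rect</code> and <code>max_subarray_2d</code> are hypothetical. Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Best rectangle found: sum plus inclusive corner coordinates. */
typedef struct {
    long long sum;
    size_t top, left, bottom, right;
} max_rect;

/* prefix is the (rows + 1) x cols column prefix-sum matrix, i.e.
   prefix[(r + 1) * cols + c] = sum of column c over rows 0..r,
   with the first row of prefix all zeros. */
max_rect max_subarray_2d(const long long *prefix, size_t rows, size_t cols)
{
    assert(prefix != NULL && rows > 0 && cols > 0);

    /* Start from the single cell (0, 0) so all-negative input works. */
    const max_rect seed = { prefix[cols] - prefix[0], 0, 0, 0, 0 };
    max_rect best = seed;

    #pragma omp parallel
    {
        max_rect local = seed;          /* private per-thread result */

        /* One top row per iteration, handed out dynamically: iterations
           with a small top row have more bottom rows to scan. */
        #pragma omp for schedule(dynamic) nowait
        for (long top = 0; top < (long)rows; top++) {
            const long long *hi = prefix + (size_t)top * cols;
            for (size_t bottom = (size_t)top; bottom < rows; bottom++) {
                const long long *lo = prefix + (bottom + 1) * cols;
                long long running = 0;
                size_t left = 0;
                /* 1-D Kadane over the column sums of rows top..bottom. */
                for (size_t c = 0; c < cols; c++) {
                    long long v = lo[c] - hi[c];
                    if (running < 0) {
                        running = v;
                        left = c;
                    } else {
                        running += v;
                    }
                    if (running > local.sum)
                        local = (max_rect){ running, (size_t)top, left, bottom, c };
                }
            }
        }

        /* Single merge per thread, after all of its iterations are done. */
        #pragma omp critical
        if (local.sum > best.sum)
            best = local;
    }
    return best;
}
```

<p>Keeping a private result per thread means the threads only synchronize once each, when the final result is picked.</p>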
<p><span lang="EN"></p>
<h2>3 - results</h2>
<p><span lang="EN"></p>
<p>For starters, here are some matrix parsing and loading timings:</p>
<table border="1" cellspacing="0" cellpadding="3" width="20%" align="left">
<tr>
<td width="50%" align="center">matrix size</td>
<td align="center">timing</td>
</tr>
<tr>
<td width="50%" align="center">1000x1000</td>
<td align="center">39.35 ms</td>
</tr>
<tr>
<td width="50%" align="center">2000x2000</td>
<td align="center">153.2 ms</td>
</tr>
<tr>
<td width="50%" align="center">5000x5000</td>
<td align="center">984.52 ms</td>
</tr>
<tr>
<td width="50%" align="center">8000x8000</td>
<td align="center">2466.63 ms</td>
</tr>
</table>
<p></P></span></p>
<h2></span>&nbsp;</h2>
<h2></span></span>&nbsp;</h2>
<p><span lang="EN"></p>
<p>And here are the timings of the whole algorithm:</p>
<p></span></p>
<table border="1" cellspacing="0" cellpadding="3" width="30%" align="left">
<tr>
<td align="center">matrix size</td>
<td width="33%" align="center">number of cores</td>
<td width="33%" align="center">timing</td>
</tr>
<tr>
<td align="center">1000x1000</td>
<td width="33%" align="center">1</td>
<td width="33%" align="center">1.523 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">8</td>
<td width="33%" align="center">0.223 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">16</td>
<td width="33%" align="center">0.132 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">24</td>
<td width="33%" align="center">0.104 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">32</td>
<td width="33%" align="center">0.089 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">0.082 s</td>
</tr>
<tr>
<td align="center">2000x2000</td>
<td width="33%" align="center">1</td>
<td width="33%" align="center">12.093 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">8</td>
<td width="33%" align="center">1.683 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">16</td>
<td width="33%" align="center">0.891 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">24</td>
<td width="33%" align="center">0.646 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">32</td>
<td width="33%" align="center">0.525 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">0.465 s</td>
</tr>
<tr>
<td align="center">5000x5000</td>
<td width="33%" align="center">20</td>
<td width="33%" align="center">11.886 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">30</td>
<td width="33%" align="center">7.584 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">6.041 s</td>
</tr>
<tr>
<td align="center">8000x8000</td>
<td width="33%" align="center">20</td>
<td width="33%" align="center">52.050 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">30</td>
<td width="33%" align="center">34.685 s</td>
</tr>
<tr>
<td align="center">&nbsp;</td>
<td width="33%" align="center">40</td>
<td width="33%" align="center">26.234 s</td>
</tr>
</table>
<p><span lang="EN"></p>
<p>No fancy graphs or pictures <img src='http://software.intel.com/fr-fr/blogs/wordpress/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><span lang="EN"></p>
<h2>4 - conclusion</h2>
<p><span lang="EN"></p>
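<p>As the results show, speedup drops considerably for smaller matrices, because the sequential part of the code takes up a large share of the execution time: on 40 cores the 1000x1000 matrix runs about 18.6 times faster than on one core, and the 2000x2000 matrix about 26 times faster. As the matrix gets larger, the speedup becomes almost linear: going from 20 to 40 cores cuts the time by a factor of about 1.97 for 5000x5000 and 1.98 for 8000x8000.</p>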
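<p>One way to put a number on the sequential part is Amdahl's law. The helper below is a small sketch, not part of the contest code, and the function names are hypothetical; it assumes the whole-algorithm timings are wall-clock times that include the sequential parsing.</p>

```c
#include <assert.h>

/* Speedup of a run on several cores relative to the single-core time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p),
   solved for f given a measured speedup S on p cores (p > 1). */
double amdahl_serial_fraction(double s, int cores)
{
    assert(s > 0.0 && cores > 1);
    return ((double)cores / s - 1.0) / ((double)cores - 1.0);
}
```

<p>Fed the 1-core and 40-core timings from the table, this gives an implied serial fraction of roughly 3.0% for 1000x1000 and 1.4% for 2000x2000. That is close to the share of the single-core run taken by the parsing timings above (about 2.6% and 1.3%).</p>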
<p>The code, makefile and readme file from the contest are attached. The password for the archive is "secret". We tried to make the code well commented, so it should be easy to understand. You can download all of it <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2011/12/solution.zip">here</a>.</p>
<p>Finally, we'd like to say that the contest was a pleasure, a great experience and a great chance to learn something new and useful, as well as to try it out properly (the 40-core MTL was great :)).</p>
<p></span></span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/fr-fr/blogs/2011/12/22/maximum-subarray-problem-simple-parallelization-and-optimizations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

