Latest contributions
https://software.intel.com/fr-fr/recent/540788
After the deadline
https://software.intel.com/fr-fr/forums/topic/278233
<p>I'm wondering how long we will have access to the benchmark service *after the deadline*.<br />First of all, I am NOT trying to change the submitted code afterwards or to cheat in any way.</p>
<p>But since the last benchmark was added (and due to our poor results with it), I started to code another algorithm. I'd like to know if I'll ever get the chance to compare it with other solutions on a 40-core machine after the contest is over?</p>
Tue, 15 May 12 17:51:28 -0700krahnack278233Invalid benchmark - AE12CB-13105960671904317035
https://software.intel.com/fr-fr/forums/topic/278278
<p>Hi,</p>
<p>We have a small issue with the last benchmark (the 3rd 40-core benchmark that was added a few minutes/hours ago).</p>
<pre>error on a 40-cores HT machine :
invalid benchmark
on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-13105960671904317035:
invalid benchmark
</pre>
<p>Usually we get a more precise error (different output, etc.). Is it possible to have more details? :)<br />(A friend of mine is having the same error.)</p>
Mon, 14 May 12 08:36:19 -0700yodasbrain27827825 points for blog/forum etc. / 25 points for description of the work
https://software.intel.com/fr-fr/forums/topic/278300
<p>Greetings!</p>
<p>The contest is over soon, and I've just read that there are 25 points for helpful posts ;) or other good information.</p>
<p>Well, how can I know if I have enough? More than that, do you read messages here, or on other forums too?</p>
<p>What is the scoring like? What do we get points for?</p>
<p>Another thing: 25 points for the work description. I think I'm in the European contest; should I write my description in English or French?</p>
<p>Thank you =)</p>
Sun, 13 May 12 11:40:41 -0700Fuchs278300New Acceler8 contest
https://software.intel.com/fr-fr/forums/topic/278786
<p>Hi! As the contest should be about to start in the next minutes/hours, it's maybe time to open a new topic to discuss the problem with the other participants. Have some of the candidates already participated before? I did in the last contest and it was great! Good luck to everyone :)</p>
Mon, 16 Apr 12 01:10:16 -0700candreolli278786Subarray Problem - A static NUMA-Aware approach
https://software.intel.com/fr-fr/blogs/2011/11/24/subarray-problem-a-static-numa-aware-approach
<br />
Subarray<br /><br /><div>The subarray problem on an n*m matrix is solved sequentially using an algorithm known as the 2D Kadane algorithm, which has O(n²m) complexity. The sequential algorithm is written using 3 nested loops:<br /></div><br /><br /><pre><br />
for i in (0..n) // <- We parallelize that<br />
for j in (i..n)<br />
for k in (0..m)<br />
//do work with matrix[j][k]<br /></pre><br /><div>Our solution does not try to optimize the work performed inside the inner loop, so we skip the details of what is actually done there. We chose to parallelize only the outer loop (index <strong>i</strong>).</div><br /><div>In order to parallelize the outer loop on K cores, we chose to split it into K tasks of equal duration. This approach has several advantages:<br /><ul><br /><li>The algorithm is very simple: there is no need to steal work or do complex load balancing between the K cores.</li><br /><br /><li>Each thread works on large contiguous portions of the matrix, which maximizes cache usage.</li><br /><li>We know in advance what the threads are going to do and which data are going to be accessed, so we can do smart NUMA optimizations.</li><br /></ul><br /></div><br /><br /><div>In this article, we explain how we split the work into K equal tasks and how we optimized the processing of these tasks.</div><br /><br /><br /><h2>1-Creating K tasks of equal duration</h2><br /><br /><table><br /><tr><br /><td><br /><img src="/sites/default/files/m/b/0/a/splitting.png" /><br /><strong>Fig. 1</strong> - <em>K=4 equal areas in a triangle</em><br /></td><br /><td style="padding-left:30px"><br /><br /><div>In order to split a for i (0..n) loop into K tasks, one often creates K tasks [i=0..n/K], [i=n/K..2*n/K], ..., [i=(K-1)*n/K..n]. However, this simple solution does not work well in our case because the second loop (index <strong>j</strong>) starts at index <strong>i</strong>. This means that when i==0, n iterations are done in the second loop, and when i==n-1 only 1 iteration is done! The amount of work as a function of <strong>i</strong> is represented in Figure 1, which shows an example of the work to be done on a 250*m matrix. When i==0, 250 iterations are done; when i==249, only 1 iteration is done.
The total quantity of work to be done is equal to the area of the triangle.</div><br /><br /><div>Splitting the work into K equal tasks is equivalent to creating K equal areas inside the above-mentioned triangle.</div><br /><div>For example, in Figure 1, representing the work to be done on a 250*m matrix, a close-to-optimal partitioning is the following:<br /><ul><br /><li>Thread 0 doing i (0-34) = 7939 <strong>j</strong> iterations (area A1)</li><br /><li>Thread 1 doing i (34-74) = 7860 <strong>j</strong> iterations (area A2)</li><br /><li>Thread 2 doing i (74-125) = 7701 <strong>j</strong> iterations (area A3)</li><br /><li>Thread 3 doing i (125-250) = 7875 <strong>j</strong> iterations (area A4)</li><br /></ul><br />
With this partitioning, there is at most a 3% difference in the number of iterations performed by the threads.<br /></div><br /><br /></td><br /></tr><br /></table><br /><br /><br /><br /><div><br />
In order to find the end index (exclusive) of the rows that a thread <strong>idx</strong> should process (e.g., 34 for thread 0 in the above example), we use the following loop:<br /><pre><br />
int last_index = 0;<br />
do {<br />
last_index++;<br />
} while((last_index)*(n) - (last_index+1)*(last_index)/2 < (idx+1) * n * (n - 1) / 2 / K);<br /></pre><br /><br /><p>where n is the number of rows of the matrix, idx is the thread number, and K the number of threads.<br /></p></div><br /><div><br />
This loop increments last_index until the amount of work done between i=0 and i=last_index reaches (idx+1)*(total-work-to-be-done/number-of-workers). Computing "the amount of work done" amounts to computing the area of a trapezoid. (E.g., in Figure 1 the area A1, the work done by thread 0, is the area of a trapezoid.)<br /></div><br /><br /><div>Actually this could also be calculated with the following closed-form formula:<br /><pre><br />
last_index = 2*n - (sqrt((4*n*n-4*n+1)*K*K+((-4*<strong>idx</strong>-4)*n*n+(4*<strong>idx</strong>+4)*n)*K)+(2*n-1)*K)/(2*K);<br /></pre><br />
... but it is actually slower than doing the loop! (We think that the compiler is doing really smart things and that the loop is optimized into a much more efficient form.)<br /><br /><br /><h2>2-NUMA optimizations</h2><br /><br /><div>As mentioned earlier, we also do NUMA optimizations. :) In order to improve the locality of the memory accessed by the threads, we have:<br /><ul><br /><li>Created a thread pool per NUMA node in the system. Each thread pool is totally independent of the others, and each is controlled by a master thread scheduled on the same NUMA node as the pool it controls.</li><br /><li>Made the master threads create the K tasks in parallel (actually each master thread creates K/4 tasks, since there are 4 NUMA nodes on the MTL).</li><br /><li>Before giving the tasks to its workers, each master thread <strong>duplicates the matrix on the local NUMA node</strong>. This ensures that, when the matrix does not fit in cache, the worker threads fetch data from their local memory. This optimization actually gives a <strong>+25%</strong> performance boost at 40 cores. Lesson learned: pay attention to data locality. ;)</li><br /><br /><li>(Note for those who might think that this is an incredible waste of memory: a 10K*10K matrix occupies 380MB in RAM. The MTL machine has 64GB. So one copy per node = a "waste" of 1.5GB = 2.3% of the memory of the machine = really negligible compared to the gain.)</li><br /></ul><br /></div><br /><h2>3-Other performance optimizations</h2><br /><div><br /><ul><br /><li>Our approach falls back on the sequential algorithm when the parallel algorithm is considered too costly (e.g., when the cost of duplicating the matrix and managing the thread pool cannot be amortized).</li><br /><br /><li>Since the subarray algorithm has O(n²m) complexity, it is sometimes worthwhile to transpose the matrix before any computation, in order to have n&lt;m.
Experiments showed that transposing becomes worthwhile as soon as the difference in complexity is above 5K operations.</li><br /><li>Both reading and transposing the matrix are done in parallel using our thread pool. The input file is memory-mapped, and each reader thread is responsible for parsing 800KB of the input file, creating a partial matrix corresponding to what it has read. All submatrices are then merged using a simple memcpy operation.</li><br /></ul><br /></div><br /><h2>4-Figure for nerds</h2><br /><div>Time to present some results!</div><br /><br /><div><br /><img src="/sites/default/files/m/7/2/5/speedup.png" /><br /><strong>Fig 2</strong> - <em>Speedup of our algorithm on a 10K*10K matrix</em><br /><br />
The algorithm has a near-optimal speedup between 10 and 40 cores (x3.94) and between 1 and 40 cores (x36.8). This means that, according to Amdahl's law, more than 99.77% of our code is parallel. For those interested, it takes 5.9s at 40 cores to parse a 10K*10K matrix.<br /><br />
We think that the speedup seen by the Intel team might have been a little lower due to our static partitioning of the data: on the final test, 2 cores were fully loaded, which means that our partitioning was no longer optimal. Nevertheless, our solution seems to have behaved quite nicely even when (intuitively) load balancing could have been required.<br /></div><br /><br /><br /><h2>5-Code</h2><br /><div>Finally, here's a link to <a href="/sites/default/files/m/0/b/0/solution.zip" rel="nofollow">our code</a>.</div><br /><br /><br /></div>Thu, 24 Nov 11 15:19:44 -0800krahnack175171