<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sun, 21 Mar 2010 00:51:23 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/feed" rel="self" type="application/rss+xml" />
    <title>Intel Software Network - <![CDATA[ Performance Results ]]> feed</title>
    <link>http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/430289">BradleyKuszmaul</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>Here are some basic performance results on a 1.6 GHz dual-core Pentium laptop:<br /> 1024x1024x1024 matrix<br /> MM (3 loops):  30s<br /> Strassen:         5.5s<br /> Intel MKL:         0.75s<br /> MKL 2threads: 0.45s<br /><br />MM (3 loops) is Clay's loop code.<br />Strassen is Clay's Strassen code.<br />Intel MKL is the Intel Math Kernel Library running DGEMM on one core.<br />MKL 2threads is the same DGEMM on 2 threads.<br /><br />The MKL is not using Strassen.  So if absolute performance is what we are interested in, this code needs a lot of work.<br /><br />It's been my experience that Strassen is at a disadvantage until the matrices get pretty large.  We may need to measure at least 4096x4096x4096 or larger to see an advantage over a good O(n^3) implementation that uses blocked matrices to minimize cache misses.<br /><br />-Bradley<br /><br /></em></div>
</div>
</div>
<br />What compiler are you using? I got similarly poor performance from the standard loop using MSVC 2008. However, with the Intel Compiler 11.1 on SUSE 11.1 x86 with default optimizations (no vectorization or threading), I got the following for n=1024, m=1024, p=1024:<br /><br />Standard matrix function done in 1.82 secs<br />Strassen matrix function done in 2.69 secs<br /><br />Machine spec: Dell XPS M1330, 4 GB RAM, Core 2 Duo T7500 @ 2.20 GHz<br /><br />I have many more measurements to do, but I think your guesses at the threshold at which Strassen becomes effective are pessimistic. Research has suggested that a matrix size as small as 16x16 is enough to show an improvement for optimized implementations (see the paper referred to in the Resources thread).<br /><br />Good luck with your solution,<br />Andrew.<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Tue, 01 Sep 2009 14:40:45 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/440852">planetmarshall</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>
<div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/430289">BradleyKuszmaul</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>Here are some basic performance results on a 1.6GHz Dual-core Pentium Laptop:<br /> 1024x1024x1024 matrix<br /> MM (3 loops):  30s<br /> Strassen:         5.5s<br /> Intel MKL:         0.75s<br /> MKL 2threads: 0.45s<br /><br /></em></div>
</div>
</div>
<br />What compiler are you using? I got similarly poor performance from the standard loop using MSVC 2008. However, with the Intel Compiler 11.1 on SUSE 11.1 x86 with default optimizations (no vectorization or threading), I got the following for n=1024, m=1024, p=1024:<br /><br />Standard matrix function done in 1.82 secs<br />Strassen matrix function done in 2.69 secs<br /><br />Machine spec: Dell XPS M1330, 4 GB RAM, Core 2 Duo T7500 @ 2.20 GHz<br /><br />I have many more measurements to do, but I think your guesses at the threshold at which Strassen becomes effective are pessimistic. Research has suggested that a matrix size as small as 16x16 is enough to show an improvement for optimized implementations (see the paper referred to in the Resources thread).<br /><br />Good luck with your solution,<br />Andrew.<br /></em></div>
</div>
</div>
I'm using gcc 4.3.2. I guess I'm not surprised that the Intel compiler did better.  In particular, I implemented a divide-and-conquer version of the O(n^3) algorithm, and it didn't do much better than the simple triply nested loop.  I suspect that gcc is doing a bad job of compiling<br /> C[i][j] += A[i][k]*B[k][j]<br />and that it would do better on a single-dimensional array with explicit index calculations.  I'll also try the Intel compiler if I get a chance.<br /><br />You may be right that I'm pessimistic about Strassen.  We'll see what the answers look like; it has been a few years since I measured Strassen against O(n^3).  I don't think we can draw much of a conclusion from a 1990 study on a Cray Y-MP and apply it to a technology 20 years newer (Intel i7).  One big difference is that the Cray machines didn't have caches, and their memory pipeline could keep the CPU busy.  Today, however, cache misses are the dominant performance problem.<br /> ]]></description>
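<!--
The "single-dimensional array with explicit index calculations" layout mentioned in the post above can be sketched as follows. This is only an illustration of the row-major index arithmetic, written in Python for brevity. It says nothing about how gcc would compile the equivalent C, which remains the poster's untested suspicion. The function name matmul_flat and the convention that A is n x m and B is m x p are assumptions made for this sketch.

```python
def matmul_flat(A, B, n, m, p):
    """Multiply an n x m matrix A by an m x p matrix B.

    Both inputs and the result are flat, row-major lists, so element
    [row][col] of a matrix with `cols` columns lives at row * cols + col.
    """
    C = [0.0] * (n * p)
    for i in range(n):
        for j in range(p):
            for k in range(m):
                # Flat-array form of: C[i][j] += A[i][k] * B[k][j]
                C[i * p + j] += A[i * m + k] * B[k * p + j]
    return C


A = [1.0, 2.0, 3.0, 4.0]  # [[1, 2], [3, 4]]
B = [5.0, 6.0, 7.0, 8.0]  # [[5, 6], [7, 8]]
print(matmul_flat(A, B, 2, 2, 2))  # [19.0, 22.0, 43.0, 50.0]
```

In C the same idea means allocating one contiguous block per matrix and computing the offsets by hand, instead of indexing through a two-level array.
-->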
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Tue, 01 Sep 2009 15:08:04 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/440852">planetmarshall</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em> <br />What compiler are you using? I got similarly poor performance from the standard loop using MSVC 2008. However, with the Intel Compiler 11.1 on SUSE 11.1 x86 with default optimizations (no vectorization or threading), I got the following for n=1024, m=1024, p=1024:<br /><br />Standard matrix function done in 1.82 secs<br />Strassen matrix function done in 2.69 secs<br /><br />Machine spec: Dell XPS M1330, 4 GB RAM, Core 2 Duo T7500 @ 2.20 GHz<br /><br />I have many more measurements to do, but I think your guesses at the threshold at which Strassen becomes effective are pessimistic. Research has suggested that a matrix size as small as 16x16 is enough to show an improvement for optimized implementations (see the paper referred to in the Resources thread).<br /><br />Good luck with your solution,<br />Andrew.<br /></em></div>
</div>
</div>
<br />I admit Intel compilers generate faster code than gcc or VC++. I got results similar to BradleyKuszmaul's on my Core 2 Duo @ 2 GHz / Ubuntu 8.x / Intel C++ 11.0x.<br /><br />BTW, by "Standard matrix function" do you mean the triple for-loop function? I don't get how the standard matrix function can be so much faster than Strassen when m=n=p=1024, which is large enough to show Strassen's advantage but not so large as to run short of memory.<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Tue, 01 Sep 2009 15:35:06 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ CPU: 1.8 GHz dual-core, 1 GB memory<br /><br />1024x1024x1024 matrix<br />MM (3 loops):   35.40s<br />Strassen:         5.58s<br />My code, 2 threads: 0.80s<br /><br />There is still a gap compared with MKL. ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Wed, 02 Sep 2009 06:36:32 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/285737">邓辉</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>CPU: 1.8 GHz dual-core, 1 GB memory<br /><br />1024x1024x1024 matrix<br />MM (3 loops):   35.40s<br />Strassen:         5.58s<br />My code, 2 threads: 0.80s<br /><br />There is still a gap compared with MKL.</em></div>
</div>
</div>
<br />Wow, that's quite fast. I'm now getting a similar result, though you may have improved your code further over the last week.<br /><br />CPU: 1 GHz dual-core, 2 GB memory<br /><br />1024x1024x1024 matrix<br />2 threads: 1.54 sec<br /><br />Yeah, it's a 2 GHz machine, but somehow it stays at 1 GHz when running Linux. It's like driving on the highway with the transmission stuck in first gear. 8-)<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Thu, 10 Sep 2009 09:53:45 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ On a two-socket Nehalem, I'm seeing about 75 GFLOPS on the cubic algorithm, and I've seen an "effective" 98 GFLOPS on Strassen.  That's 11.2s for an 8192x8192x8192 matrix multiply.<br /><br />I'm seeing about 0.036s for a 1024x1024x1024 matrix multiply (that's only 59 GFLOPS effective), and the cubic algorithm is faster there (70 GFLOPS).<br /><br />I cannot do 16Kx16Kx16K matrices since I have only 12 GB of RAM.  So Strassen is barely better for large matrices.<br /><br />-Bradley<br /><br /> ]]></description>
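<!--
The "effective" GFLOPS figures in the post above appear to follow the common convention of crediting every n x n x n multiply with 2*n^3 floating-point operations, even though Strassen performs fewer. The post does not state its formula, so this is an assumption, but it reproduces the quoted numbers from the quoted timings. The helper name effective_gflops is hypothetical.

```python
def effective_gflops(n, seconds):
    """Rate credited to an n x n x n multiply, counted as 2 * n**3 flops."""
    flops = 2 * n ** 3  # one multiply and one add per inner-loop step
    return flops / seconds / 1e9


# Timings quoted in the post above.
print(round(effective_gflops(8192, 11.2), 1))   # 98.2
print(round(effective_gflops(1024, 0.036), 1))  # 59.7
```

Because the operation count is fixed at 2*n^3, an "effective" rate can exceed what the hardware sustains on the cubic algorithm. This makes it a convenient way to compare run times across algorithms, not a measure of the arithmetic actually executed.
-->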
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Thu, 10 Sep 2009 10:44:34 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/413257">iarchitect</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>
<div style="margin:0px;"></div>
<br />Wow, that's quite fast. I'm now getting a similar result, though you may have improved your code further over the last week.<br /><br />CPU: 1 GHz dual-core, 2 GB memory<br /><br />1024x1024x1024 matrix<br />2 threads: 1.54 sec<br /><br />Yeah, it's a 2 GHz machine, but somehow it stays at 1 GHz when running Linux. It's like driving on the highway with the transmission stuck in first gear. 8-)<br /></em></div>
</div>
</div>
I wonder whether code that uses only 2 threads will benefit at all from the test environment's 4- and 8-core systems. I used more than 2 threads in the hope that it would scale better on the test machines.  0.8 sec and 1.54 sec sure beat my 2.00 sec time for that size array. Can't wait to see how you all did that.  But I might catch you by scaling. :-)<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Thu, 10 Sep 2009 10:47:37 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">Typical console output of the code running on a Dell T3400 with an Intel Core2 Quad CPU Q6700 @ 2.66 GHz (32 bit Vista) <br /></div>
<br /><br />Strassen_TBB.exe 1024 1024 1024<br />M: 1024, N: 1024, P: 1024<br /><br />Execute Standard matmult<br />Done in: 17.83 secs<br /><br />Execute Parallel Standard matmult.<br />Done in: 8.10 secs<br /><br />Results OKAY<br /><br />Execute Strassen matrix function as supplied by Intel.<br />Done in: 1.56<br />Results OKAY<br /><br /><br />Execute Parallel Strassen matrix function.<br />Done in: 0.50<br />Results OKAY ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Fri, 11 Sep 2009 12:35:31 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ 4 Cores test output<br /><br />Processor (CPU):   Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz<br /> Speed:  2,399.92 MHz<br /> Cores:  4<br /> Memory Information<br />Total memory (RAM):  2.0 GB<br />openSUSE 11.1 (x86_64)<br /><br /><br />./strassenomp.run 4096 4096 4096<br />Executing Standard matrix multiply      <br />Standard matrix multiply       done in   44.58 secs (equivalent to 3082.99 MFLOPS)<br />Executing Strassen matrix multiply      <br /><br />===== Detected 4 available threads<br />Strassen matrix multiply       done in   15.74 secs (equivalent to 8731.11 MFLOPS)<br /><br />OKAY<br /><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Fri, 11 Sep 2009 16:24:20 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: Performance Results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="margin-top: 5px; width: 100%;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/285737">邓辉</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>CPU: 1.8 GHz dual-core, 1 GB memory<br /><br />1024x1024x1024 matrix<br />MM (3 loops):   35.40s<br />Strassen:         5.58s<br />My code, 2 threads: 0.80s<br /><br />There is still a gap compared with MKL.</em></div>
</div>
</div>
<br />Very fast!! My results:<br />MM              16.5s<br />Strassen        2.9s<br />My code (only 2 threads)  0.77s<br /><br /><br />CPU: 2.4 GHz Intel quad-core<br />1024x1024x1024<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/showthread.php?t=68045</link>
      <pubDate>Fri, 11 Sep 2009 16:29:35 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/strassens-algorithm/topic/68045/</guid>
      <category>ISN General</category>
    </item>
  </channel></rss>