https://software.intel.com/en-us/forums/topic/394248/feed
https://software.intel.com/en-us/comment/1745459#comment-1745459
<a id="comment-1745459"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>I just ran my SB/IV dgemm and I measured 98.7 GFLOPs @ 3.4 GHz. If you scale it to 4.0 GHz, I get the same performance you quoted: 116 GFLOPs. Just another data point, Sergey.</p>
<p>Perfwise</p>
</div></div></div>Mon, 29 Jul 2013 12:46:48 +0000 | perfwise | comment 1745459 at https://software.intel.com
https://software.intel.com/en-us/comment/1745274#comment-1745274
<a id="comment-1745274"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p><strong>Quote:</strong></p><blockquote><em>Sergey Kostrov</em> wrote:
<p>>>>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.<br /> >><br /> >>Igor, I've used Linpack and these numbers are more consistent with Intel's numbers</p>
<p>Igor, did you get the <strong>116 GFlops</strong> number from some website ( 1st ) or after real testing on a Haswell system ( 2nd )? In the 2nd case, how many cores were used during the test?</p>
<p></p></blockquote>
<p>No, I did not get the result off the web; I ran the test myself using LinX AVX.</p>
<p>Number of cores is 4 (Haswell 4770K with HTT disabled).</p>
</div></div></div>Fri, 26 Jul 2013 19:31:44 +0000 | IgorLevicki | comment 1745274 at https://software.intel.com
https://software.intel.com/en-us/comment/1742931#comment-1742931
<a id="comment-1742931"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>?GEMM does more multiplications and additions by design:<br />
>><br />
>>C = alpha*A*B + beta*C</p>
<p>Would I consider that a generic case? No. Have we reached the bottom of the ocean? Yes.</p>
</div></div></div>Wed, 10 Jul 2013 13:52:40 +0000 | SergeyKostrov@hotmail.com | comment 1742931 at https://software.intel.com
https://software.intel.com/en-us/comment/1742927#comment-1742927
<a id="comment-1742927"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Sergey... the Kronecker algorithm you point to says one of the matrices needs to be represented as a Kronecker product of two smaller matrices. While that may be applicable in some cases, it is not generally applicable.</p>
</div></div></div>Wed, 10 Jul 2013 13:36:07 +0000 | perfwise | comment 1742927 at https://software.intel.com
https://software.intel.com/en-us/comment/1742883#comment-1742883
<a id="comment-1742883"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Let's finalize our discussion about matrix multiplication algorithms.</p>
<p>>>...DGEMM does the matrix operation of <strong>C = C + A * B</strong>...</p>
<p>?GEMM does more multiplications and additions by design:</p>
<p><strong>C = alpha*A*B + beta*C</strong></p>
<p>However, this is ?GEMM specific, and I'm talking about a generic case, like <strong>C = A * B</strong>, and nothing else. I don't know of any ISO-like standard accepted in industry for measuring the performance of software, so everybody has their own solution(s). ( In reality I know how ISO 8001 works for X-Ray imaging software... Very, very strict... )</p>
<p>>>...I don't know what Kronecker Based DGEMM you're running or if you're quoting the timing for a Kronecker Product...</p>
<p>This is <strong>not</strong> a regular <strong>Kronecker Product</strong>; that algorithm is described in a weblink I gave earlier ( see one of my previous posts ). The <strong>Kronecker Based algorithm for matrix multiplication</strong> is a really high-performance algorithm implemented in Fortran by another software developer ( <strong>Vineet Y</strong> - <a href="http://software.intel.com/en-us/user/798062">http://software.intel.com/en-us/user/798062</a> ).</p>
<p>>>... I suspect you're not doing a traditional matrix multiplication...</p>
<p>Once again, take a look at the document posted on the webpage I mentioned; a description of the algorithm is available there.</p>
</div></div></div>Wed, 10 Jul 2013 04:21:00 +0000 | SergeyKostrov@hotmail.com | comment 1742883 at https://software.intel.com
https://software.intel.com/en-us/comment/1742873#comment-1742873
<a id="comment-1742873"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Sergey,</p>
<p> The title of this post was Haswell GFLOPs. My interest is in the standardized BLAS routines which drive LAPACK and many other high-performance applications. DGEMM does the matrix operation C = C + A * B. When you update C, you have M * N addition operations, which yields the formula I gave you earlier: 2 * M * K * N in the generic sense. That's the FLOP count for a traditional matrix multiplication algorithm, and it's how the industry measures FLOPs.</p>
<p>Running SGEMM is simply not comparable to running DGEMM when comparing the time to do arithmetic. I don't know what Kronecker-based DGEMM you're running, or if you're quoting the timing for a Kronecker product, which isn't DGEMM. DGEMM drives HPL, which is what the scientific community uses to measure GFLOPs. So my recommendation to you is to standardize the problem you're running: are all these results at the same precision and the same operation? MATMUL is doing what DGEMM does (close enough), and so is MKL (if you were running DGEMM rather than SGEMM). The other results you quote, if they're not DGEMM, are not comparable to my results; it's just common sense. If your Kronecker operation is DGEMM, then you've got something interesting, but I suspect you're not doing a traditional matrix multiplication, so there's no 1:1 correspondence and it's less interesting to me.</p>
<p>Perfwise</p>
</div></div></div>Wed, 10 Jul 2013 01:59:00 +0000 | perfwise | comment 1742873 at https://software.intel.com
https://software.intel.com/en-us/comment/1742865#comment-1742865
<a id="comment-1742865"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Sergey.. in DGEMM.. you are performing the matrix computation C = C + A x B. You didn't include the addition into C. BLAS exists to standardize Linear Algebra operations, and that is my focus. So if you measured DGEMM in MKL, it would be about 1/2 as fast as the SGEMM you ran. HPL uses DGEMM, and the title of this thread is Haswell GFLOPs. Is your Kronecker routine doing what DGEMM does, explicitly? I googled it, but found that the Kronecker product is not DGEMM:</p>
<p><a href="https://en.wikipedia.org/wiki/Kronecker_product">https://en.wikipedia.org/wiki/Kronecker_product</a></p>
<p>I must confess I don't understand what you are trying to achieve in your study. My focus is purely on understanding the DGEMM performance of Haswell, with whatever algorithm you use, so long as it does the matrix computation C = C + A x B.</p>
</div></div></div>Wed, 10 Jul 2013 01:05:34 +0000 | perfwise | comment 1742865 at https://software.intel.com
https://software.intel.com/en-us/comment/1742855#comment-1742855
<a id="comment-1742855"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Attached is a txt-file with test results. Thanks.</p>
</div></div></div><section class="field field-name-field-attachments field-type-file field-label-above"><h2 class="field-label">Attachments: </h2><div class="field-items"><div class="field-item even"><table class="sticky-enabled">
<thead><tr><th>Attachment</th><th>Size</th> </tr></thead>
<tbody>
<tr class="odd"><td><span class="file"><a href="https://software.intel.com/sites/default/files/comment/1742855/matmultestresults.txt" class="button-cta">Download</a> <img class="file-icon" alt="" title="text/plain" src="/sites/all/themes/isn3/css/images/attachment_icon.png" /> <a href="https://software.intel.com/sites/default/files/comment/1742855/matmultestresults.txt" type="text/plain; length=5685">matmultestresults.txt</a></span></td><td>5.55 KB</td> </tr>
</tbody>
</table>
</div></div></section>Tue, 09 Jul 2013 23:28:31 +0000 | SergeyKostrov@hotmail.com | comment 1742855 at https://software.intel.com
https://software.intel.com/en-us/comment/1742854#comment-1742854
<a id="comment-1742854"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>...my results are for double precision... while you appear to running single precision...</p>
<p>I used <strong>cblas_sgemm</strong> because I needed to compare performance for the 4Kx4K and 8Kx8K cases on a Pentium 4 system with just 1GB of physical memory. A 16Kx16K case exceeds the 2GB limitation of a 32-bit system.</p>
<p>I'll do a quick comparison of performance for <strong>cblas_sgemm</strong> and <strong>cblas_dgemm</strong> later; however, it is not my top priority. Speaking of these 5 algorithms ( Classic, Strassen HBC, MKL's cblas_sgemm, Fortran's MATMUL and Kronecker Based ), I've finally done the comparison I had wanted to make for a long time.</p>
<p>By the way, Fortran's MATMUL and Kronecker Based cases are using double precision floating point data types.</p>
</div></div></div>Tue, 09 Jul 2013 23:21:51 +0000 | SergeyKostrov@hotmail.com | comment 1742854 at https://software.intel.com
https://software.intel.com/en-us/comment/1742850#comment-1742850
<a id="comment-1742850"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>...<br />
>>From my experience working with people in the industry, I define matrix multiplication FLOPs as that from traditional Linear Algebra,<br />
>>which is 2 * N^3, or to be precise <strong>2 * M * K * N</strong>...<br />
>>...</p>
<p>Take an <strong>A(4x4)</strong> * <strong>B(4x4)</strong> case and count on paper the number of additions and multiplications. You should get <strong>112</strong> Floating Point Operations ( FPO ). Then calculate using your formula: 2 * 4 * 4 * 4 = <strong>128</strong>, which doesn't match.</p>
<p>This is why:</p>
<p>Let's say we have two matrices A[ MxN ] and B[ RxK ] ( with N = R so the product is defined ). The product is C[ MxK ].</p>
<p>[ <strong>M</strong>xN ] * [ Rx<strong>K</strong> ] = [ <strong>MxK</strong> ]</p>
<p>If <strong>M=N=R=K</strong>, that is, both matrices are square, then the Total number of Floating Point Operations ( TFPO ) is calculated as follows:</p>
<p><strong>TFPO = N^2 * ( 2*N - 1 )</strong></p>
<p>For example,</p>
<p>TFPO( 2x2 ) = 2^2 * ( 2*2 - 1 ) = 12<br />
TFPO( 3x3 ) = 3^2 * ( 2*3 - 1 ) = 45<br />
TFPO( 4x4 ) = 4^2 * ( 2*4 - 1 ) = 112<br />
TFPO( 5x5 ) = 5^2 * ( 2*5 - 1 ) = 225</p>
<p>and so on.</p>
</div></div></div>Tue, 09 Jul 2013 23:06:09 +0000 | SergeyKostrov@hotmail.com | comment 1742850 at https://software.intel.com