Haswell GFLOPS

Sergey Kostrov's picture

>>...I measured the following today:
>>
>>SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)
>>
>>4000, 50.3, 2.54, 172.8, 0.74
>>
>>8000, 50.4, 20.3, 186.6, 5.49
>>
>>16000, 51.3, 159.7, 192.7, 42.5

This is exactly what I wanted to see, and I understand that the tests were done on a Haswell system. I will post results for Ivy Bridge later today.

>>...These other methods you mention entail lowered numerical accuracies, greater memory usage or difficulties in implementation
>>which give rise to fewer FLOPs, but lower IPC and lower performance...

There is a lot of speculative talk on various Internet forums about matrix multiplication algorithms such as Coppersmith-Winograd and Strassen, especially from people who have never tried to implement them. I've implemented four different versions of the Strassen algorithm, and the additional memory usage is by design of that recursive algorithm: it needs to partition the source matrices down to some threshold size. In theory this is 2x2, and in practice it is about N / 8. For example, for 4096x4096 matrices the threshold is 512x512 ( 4096 / 8 = 512 ).
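To make the partitioning concrete, here is a minimal single-threaded sketch of that recursion in C. It is an illustration only, not the Strassen HBC implementation discussed in this thread: the threshold constant, the power-of-two size assumption and the helper names are mine, and error handling is omitted.

/* Minimal Strassen sketch: recurse on quadrants until the threshold, then fall
   back to the classic triple loop. Assumes n is a power of two. */
#include <stdlib.h>

#define STRASSEN_THRESHOLD 512          /* e.g. 4096 / 8, as described above */

static double *mat_new(int n) { return calloc((size_t)n * n, sizeof(double)); }

static void mat_add(int n, const double *X, const double *Y, double *Z)
{ for (int i = 0; i < n * n; i++) Z[i] = X[i] + Y[i]; }

static void mat_sub(int n, const double *X, const double *Y, double *Z)
{ for (int i = 0; i < n * n; i++) Z[i] = X[i] - Y[i]; }

/* Classic O(n^3) product C = A * B, used at and below the threshold. */
static void classic_mul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

/* Copy quadrant (qi, qj) of the n x n matrix X into the contiguous h x h matrix Q. */
static void get_quad(int n, const double *X, int qi, int qj, double *Q)
{
    int h = n / 2;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            Q[i*h + j] = X[(qi*h + i)*n + (qj*h + j)];
}

/* Store the h x h matrix Q into quadrant (qi, qj) of the n x n matrix X. */
static void put_quad(int n, double *X, int qi, int qj, const double *Q)
{
    int h = n / 2;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            X[(qi*h + i)*n + (qj*h + j)] = Q[i*h + j];
}

void strassen_mul(int n, const double *A, const double *B, double *C)
{
    if (n <= STRASSEN_THRESHOLD) { classic_mul(n, A, B, C); return; }

    int h = n / 2;
    /* The extra memory by design: quadrant copies, two temporaries, seven products. */
    double *a11 = mat_new(h), *a12 = mat_new(h), *a21 = mat_new(h), *a22 = mat_new(h);
    double *b11 = mat_new(h), *b12 = mat_new(h), *b21 = mat_new(h), *b22 = mat_new(h);
    double *t1 = mat_new(h), *t2 = mat_new(h), *c = mat_new(h), *m[7];
    for (int i = 0; i < 7; i++) m[i] = mat_new(h);

    get_quad(n, A, 0, 0, a11); get_quad(n, A, 0, 1, a12);
    get_quad(n, A, 1, 0, a21); get_quad(n, A, 1, 1, a22);
    get_quad(n, B, 0, 0, b11); get_quad(n, B, 0, 1, b12);
    get_quad(n, B, 1, 0, b21); get_quad(n, B, 1, 1, b22);

    mat_add(h, a11, a22, t1); mat_add(h, b11, b22, t2); strassen_mul(h, t1, t2, m[0]);  /* M1 = (A11+A22)(B11+B22) */
    mat_add(h, a21, a22, t1); strassen_mul(h, t1, b11, m[1]);                           /* M2 = (A21+A22) B11      */
    mat_sub(h, b12, b22, t2); strassen_mul(h, a11, t2, m[2]);                           /* M3 = A11 (B12-B22)      */
    mat_sub(h, b21, b11, t2); strassen_mul(h, a22, t2, m[3]);                           /* M4 = A22 (B21-B11)      */
    mat_add(h, a11, a12, t1); strassen_mul(h, t1, b22, m[4]);                           /* M5 = (A11+A12) B22      */
    mat_sub(h, a21, a11, t1); mat_add(h, b11, b12, t2); strassen_mul(h, t1, t2, m[5]);  /* M6 = (A21-A11)(B11+B12) */
    mat_sub(h, a12, a22, t1); mat_add(h, b21, b22, t2); strassen_mul(h, t1, t2, m[6]);  /* M7 = (A12-A22)(B21+B22) */

    mat_add(h, m[0], m[3], c); mat_sub(h, c, m[4], c); mat_add(h, c, m[6], c);
    put_quad(n, C, 0, 0, c);                                   /* C11 = M1+M4-M5+M7 */
    mat_add(h, m[2], m[4], c); put_quad(n, C, 0, 1, c);        /* C12 = M3+M5       */
    mat_add(h, m[1], m[3], c); put_quad(n, C, 1, 0, c);        /* C21 = M2+M4       */
    mat_sub(h, m[0], m[1], c); mat_add(h, c, m[2], c); mat_add(h, c, m[5], c);
    put_quad(n, C, 1, 1, c);                                   /* C22 = M1-M2+M3+M6 */

    free(a11); free(a12); free(a21); free(a22);
    free(b11); free(b12); free(b21); free(b22);
    free(t1); free(t2); free(c);
    for (int i = 0; i < 7; i++) free(m[i]);
}

With a 512x512 threshold a 4096x4096 input recurses three levels deep ( 4096 -> 2048 -> 1024 -> 512 ), and the quadrant copies plus the seven M products are exactly the additional memory mentioned above.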

iliyapolak's picture

Quote:

Sergey Kostrov wrote:

[ Iliya Polak wrote ]
>>...Actually, in Turbo Mode at 3.9 GHz the theoretical peak performance expressed in DP GFLOPS is ~249 GFLOPS.

Iliya,

Where / how did you get that number?

Please explain, because it is more than twice the best number from the 2nd test by bronxzv ( 104.2632 GFLOPS ). Igor's number ( 116 GFLOPS ) is quite close to bronxzv's number ( ~10% difference ).

Sorry for the late answer ( never-ending problems with my backup laptop ).

Those numbers are the theoretical peak, as @bronxzv explained in his answer.
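For reference, that figure presumably comes from the usual peak calculation for a 4-core Haswell: two 256-bit FMA units per core give 2 x 4 doubles x 2 FLOP = 16 DP FLOP per cycle per core, so 16 x 3.9 GHz x 4 cores ≈ 249.6 GFLOPS ( assuming all four cores hold the 3.9 GHz turbo bin ).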

iliyapolak's picture

It could be interesting to run that benchmark under VTune. I am interested in seeing the clockticks per instructions retired ( CPI ) ratio.

Sergey Kostrov's picture

>>...It could be interesting to run that benchmark under VTune...

Good luck with that.

Sergey Kostrov's picture

[ Tests Set #1 - Part A ]

*** Ivy Bridge CPU 2.50 GHz 1-core ***

[ 4096x4096 ]

Kronecker Based 1.93 seconds
MKL 3.68 seconds ( cblas_sgemm )
Strassen HBC 11.62 seconds
Fortran 20.67 seconds ( MATMUL )
Classic 31.36 seconds

[ 8192x8192 ]

Kronecker Based 11.26 seconds
MKL 29.34 seconds ( cblas_sgemm )
Strassen HBC 82.03 seconds
Fortran 138.57 seconds ( MATMUL )
Classic 252.05 seconds

[ 16384x16384 ]

Kronecker Based 81.52 seconds
MKL 237.76 seconds ( cblas_sgemm )
Strassen HBC 1160.80 seconds
Fortran 1685.09 seconds ( MATMUL )
Classic 2049.87 seconds

*** Haswell CPU 3.50 GHz 1-core ***

[ 4000x4000 ]

Perfwise 2.54 seconds

[ 8000x8000 ]

Perfwise 20.30 seconds

[ 16000x16000 ]

Perfwise 159.70 seconds

Sergey Kostrov's picture

[ Tests Set #1 - Part B - All Results Combined ]

[ 4096x4096 ]

Kronecker Based 1.93 seconds (*)
Perfwise 2.54 seconds ( 4000x4000 ) (**)
MKL 3.68 seconds ( cblas_sgemm ) (*)
Strassen HBC 11.62 seconds (*)
Fortran 20.67 seconds ( MATMUL ) (*)
Classic 31.36 seconds (*)

[ 8192x8192 ]

Kronecker Based 11.26 seconds (*)
Perfwise 20.30 seconds ( 8000x8000 ) (**)
MKL 29.34 seconds ( cblas_sgemm ) (*)
Strassen HBC 82.03 seconds (*)
Fortran 138.57 seconds ( MATMUL ) (*)
Classic 252.05 seconds (*)

[ 16384x16384 ]

Kronecker Based 81.52 seconds (*)
Perfwise 159.70 seconds ( 16000x16000 ) (**)
MKL 237.76 seconds ( cblas_sgemm ) (*)
Strassen HBC 1160.80 seconds (*)
Fortran 1685.09 seconds ( MATMUL ) (*)
Classic 2049.87 seconds (*)

Note:

(*) Ivy Bridge CPU 2.50 GHz 1-core
(**) Haswell CPU 3.50 GHz 1-core

Sergey Kostrov's picture

[ Tests Set #2 - Part A ]

*** Ivy Bridge CPU 2.50 GHz 4-core ***

[ 4096x4096 ]

Kronecker Based 0.41 seconds
MKL 1.21 seconds ( cblas_sgemm )
Fortran 3.95 seconds ( MATMUL )
Classic 7.48 seconds
Strassen HBC N/A seconds

[ 8192x8192 ]

Kronecker Based 1.49 seconds ( 8100x8100 )
MKL 8.34 seconds ( cblas_sgemm )
Fortran 29.49 seconds ( MATMUL )
Classic 60.73 seconds
Strassen HBC N/A seconds

[ 16384x16384 ]

Kronecker Based 10.27 seconds
MKL 66.58 seconds ( cblas_sgemm )
Fortran 246.28 seconds ( MATMUL )
Classic 534.65 seconds
Strassen HBC N/A seconds

*** Haswell CPU 3.50 GHz 4-core ***

[ 4000x4000 ]

Perfwise 0.74 seconds

[ 8000x8000 ]

Perfwise 5.49 seconds

[ 16000x16000 ]

Perfwise 42.50 seconds

Sergey Kostrov's picture

[ Tests Set #2 - Part B - All Results Combined ]

[ 4096x4096 ]

Kronecker Based 0.41 seconds (*)
Perfwise 0.74 seconds ( 4000x4000 ) (**)
MKL 1.21 seconds ( cblas_sgemm ) (*)
Fortran 3.95 seconds ( MATMUL ) (*)
Classic 7.48 seconds (*)
Strassen HBC N/A seconds (***)

[ 8192x8192 ]

Kronecker Based 1.49 seconds ( 8100x8100 ) (*)
Perfwise 5.49 seconds ( 8000x8000 ) (**)
MKL 8.34 seconds ( cblas_sgemm ) (*)
Fortran 29.49 seconds ( MATMUL ) (*)
Classic 60.73 seconds (*)
Strassen HBC N/A seconds (***)

[ 16384x16384 ]

Kronecker Based 10.27 seconds (*)
Perfwise 42.50 seconds ( 16000x16000 ) (**)
MKL 66.58 seconds ( cblas_sgemm ) (*)
Fortran 246.28 seconds ( MATMUL ) (*)
Classic 534.65 seconds (*)
Strassen HBC N/A seconds (***)

Note:

(*) Ivy Bridge CPU 2.50 GHz 4-core
(**) Haswell CPU 3.50 GHz 4-core
(***) There is no Multi-threaded version

Sergey Kostrov's picture

Just for comparison, here are results for a Pentium 4...

[ Tests Set #3 ]

*** Pentium 4 CPU 1.60 GHz 1-core - Windows XP Professional 32-bit ***

[ 4096x4096 ]

MKL 31.23 seconds ( cblas_sgemm )
Strassen HBC 143.69 seconds (*)
Classic 183.66 seconds
Fortran N/A seconds ( MATMUL )
Kronecker Based N/A seconds

[ 8192x8192 ]

MKL 254.54 seconds ( cblas_sgemm )
Classic 1498.43 seconds
Strassen HBC N/A seconds
Fortran N/A seconds ( MATMUL )
Kronecker Based N/A seconds

[ 16384x16384 ]

Classic N/A seconds
MKL N/A seconds ( cblas_sgemm )
Strassen HBC N/A seconds
Fortran N/A seconds ( MATMUL )
Kronecker Based N/A seconds

Note:

(*) Excessive usage of Virtual Memory and significant negative performance impact

perfwise's picture

Sergey... my results are for double precision, while you appear to be running single precision, at least in MKL, since you are timing sgemm rather than dgemm. You should have an apples to apples comparison. If your timings are in single precision, then you should halve my times for comparison, since my GFLOPS would double in SP.

Perfwise

perfwise's picture

Also... freeze your frequency at 2.5 GHz to avoid including turbo boosting. I always do that to discern the real architectural IPC performance.

Sergey Kostrov's picture

>>...
>>From my experience working with people in the industry, I define matrix multiplication FLOPs as that from traditional Linear Algebra,
>>which is 2 * N^3, or to be precise 2 * M * K * N...
>>...

Take the A(4x4) * B(4x4) case and count, on paper, the number of additions and multiplications. You should get 112 Floating Point Operations ( FPO ). Now calculate with your formula: 2 * 4 * 4 * 4 = 128, and that doesn't look right.

This is why:

Let's say we have two matrices A[ MxN ] and B[ RxK ], with N = R. The product is C[ MxK ].

[ MxN ] * [ RxK ] = [ MxK ]

If M=N=R=K, that is, both matrices are square, then the Total number of Floating Point Operations ( TFPO ) is calculated as follows:

TFPO = N^2 * ( 2*N - 1 )

For example,

TFPO( 2x2 ) = 2^2 * ( 2*2 - 1 ) = 12
TFPO( 3x3 ) = 3^2 * ( 2*3 - 1 ) = 45
TFPO( 4x4 ) = 4^2 * ( 2*4 - 1 ) = 112
TFPO( 5x5 ) = 5^2 * ( 2*5 - 1 ) = 225

and so on.
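
( For comparison with the 2 * M * K * N convention mentioned above: TFPO counts the plain product C = A * B, i.e. N multiplications and N - 1 additions per output element. The BLAS-style update C = C + A * B adds one more addition per element, i.e. N^2 extra operations, which turns 112 into 128 = 2 * 4^3 for the 4x4 case. )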

Sergey Kostrov's picture

>>...my results are for double precision... while you appear to running single precision...

I used cblas_sgemm because I needed to compare performance for the 4Kx4K and 8Kx8K cases on a Pentium 4 system with just 1GB of physical memory. The 16Kx16K case exceeds the 2GB limit of a 32-bit system.

I'll do a quick comparison of cblas_sgemm and cblas_dgemm performance later; however, it is not my top priority. As for these five algorithms ( Classic, Strassen HBC, MKL's cblas_sgemm, Fortran's MATMUL and Kronecker Based ), I've finally done the comparison I had wanted to do for a long time.

By the way, Fortran's MATMUL and Kronecker Based cases are using double precision floating point data types.
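
For that later comparison, a minimal timing sketch could look like the following. It is illustrative only: it assumes an MKL/CBLAS header is available, uses a POSIX timer ( substitute your platform's timer on Windows ), and reports GFLOPS with the 2 * N^3 convention discussed above.

/* Time one cblas_dgemm call and report GFLOPS. Build against MKL or any CBLAS. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>               /* clock_gettime(), POSIX */
#include <mkl_cblas.h>          /* or <cblas.h> for a generic CBLAS */

int main(void)
{
    const int n = 4096;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* C = 1.0 * A * B + 0.0 * C, i.e. the plain product */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%dx%d dgemm: %.2f s, %.1f GFLOPS\n", n, n, sec, 2.0 * n * n * n / sec / 1e9);

    free(A); free(B); free(C);
    return 0;
}

Switching the call to cblas_sgemm ( with float buffers ) gives the single precision counterpart for the same comparison.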

Sergey Kostrov's picture

Attached is a txt file with the test results. Thanks.

Attachment: matmultestresults.txt ( 5.55 KB )
perfwise's picture

Sergey.. in DGEMM you are performing the matrix computation C = C + A x B. You didn't include the addition of C. BLAS exists for a purpose, to standardize Linear Algebra operations, and that is my focus. So if you measured DGEMM in MKL, it would be about 1/2 as fast as SGEMM. HPL uses DGEMM, and the title of this thread is Haswell GFLOPS. Is your Kronecker routine doing what DGEMM does, explicitly? I googled it and found that the Kronecker product is not dgemm:

http://www.google.com/url?sa=t&source=web&cd=1&ved=0CCgQFjAA&url=http%3A...

I must confess I don't understand what you are achieving in your study. My focus is purely on understanding what the DGEMM performance of Haswell is, with whatever algorithm you use, so long as it does the matrix computation C = C + A x B.

perfwise's picture

Sergey,

    The title of this post is Haswell GFLOPS. My interest is in "standardized BLAS routines", which drive LAPACK and many other high-performance applications. DGEMM does the matrix operation C = C + A * B. When you update C, you have M * N additional addition operations, which yields the formula I gave you earlier: 2 * M * K * N in the generic case. That is the FLOP count for a traditional matrix multiplication algorithm, and it is how the industry measures FLOPs. Running SGEMM is simply not comparable to running DGEMM when comparing the time to do the arithmetic. I don't know what Kronecker Based DGEMM you're running, or whether you're quoting the timing for a Kronecker product, which isn't DGEMM. DGEMM drives HPL, which is what the scientific community uses to measure GFLOPS. So my recommendation is to standardize the problem you're running: are all these results at the same precision and for the same operation? MATMUL is doing what DGEMM does (close enough), and so is MKL (if you run DGEMM rather than SGEMM). The other results you quote, if they're not DGEMM, are not comparable to my results. It's just common sense. If your Kronecker operation is DGEMM, then you've got something interesting, but I suspect you're not doing a traditional matrix multiplication, and thus it's not a 1:1 correspondence and is less interesting to me.

Perfwise

Sergey Kostrov's picture

Let's finalize our discussion about matrix multiplication algorithms.

>>...DGEMM does the matrix operation of C = C + A * B...

?GEMM does more multiplications and additions by design:

C = alpha*A*B + beta*C

However, this is ?GEMM specific, and I'm talking about the generic case, C = A * B, and nothing else. I don't know of any ISO-like standard accepted in the industry for measuring the performance of software; everybody has their own solution(s). ( In reality, I do know how ISO 8001 works for X-Ray imaging software... very, very strict... )
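
( Note that setting alpha = 1 and beta = 0 reduces the ?GEMM call to the plain product C = A * B, as in the dgemm timing sketch earlier in this thread, so the same routine can be timed under either convention. )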

>>...I don't know what Kronecker Based DGEMM you're running or if you're quoting the timing for a Kronecker product...

This is not a regular Kronecker product; the algorithm is described at the weblink I gave you earlier ( see one of my previous posts ). The Kronecker Based algorithm for matrix multiplication is a genuinely high performance algorithm implemented in Fortran by another software developer ( Vineet Y - http://software.intel.com/en-us/user/798062 ).

>>... I suspect you're not doing a traditional matrix multiplication...

Once again, take a look at the document posted on the webpage I mentioned; a description of the algorithm is available there.

perfwise's picture

Sergey... the Kronecker algorithm you point to requires one of the matrices to be representable as a Kronecker product of two smaller matrices. While that may be applicable in some cases, it is not generally applicable.

Sergey Kostrov's picture

>>?GEMM does more multiplications and additions by design:
>>
>>C = alpha*A*B + beta*C

Would I consider that a generic case? No. Have we reached the bottom of the ocean? Yes.

Igor Levicki's picture

Quote:

Sergey Kostrov wrote:

>>>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.
>>
>>Igor, I've used Linpack and these numbers are more consistent with Intel's numbers

Igor, did you get the 116 GFLOPS number from some website ( 1st ) or from real testing on a Haswell system ( 2nd )? In the 2nd case, how many cores were used during the test?

No, I did not get the result off the web; I ran the test myself using LinX AVX.

Number of cores is 4 (Haswell 4770K with HTT disabled).

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
perfwise's picture

I just ran my SB/IV dgemm and measured 98.7 GFLOPS @ 3.4 GHz. If you scale that to 4.0 GHz ( 98.7 x 4.0 / 3.4 ≈ 116.1 ), I get the same performance you quoted: 116 GFLOPS. Just another data point, Sergey.

Perfwise
