# Haswell GFLOPS

For more complete information about compiler optimizations, see our Optimization Notice.

>>...I measured the following today:

>>SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)

>>4000, 50.3, 2.54, 172.8, 0.74

>>8000, 50.4, 20.3, 186.6, 5.49

>>16000, 51.3, 159.7, 192.7, 42.5

Exactly what I wanted to see, and I understand that the tests were done on a Haswell system. I will post results for Ivy Bridge later today.
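As a rough cross-check of the quoted table, the sketch below recomputes GFLOPs from the matrix size and the elapsed time. It assumes the 2 * N^3 operation count that Perfwise describes elsewhere in this thread; the function name `gflops` is hypothetical and only used for illustration.

```python
def gflops(n, seconds):
    """GFLOPs for an n x n matrix multiplication, assuming 2 * n^3 operations."""
    return 2.0 * n ** 3 / seconds / 1e9

# (size, 1-core time in seconds, 4-core time in seconds) from the quoted table
measurements = [(4000, 2.54, 0.74), (8000, 20.3, 5.49), (16000, 159.7, 42.5)]

for n, t1, t4 in measurements:
    print(f"{n}: 1-core {gflops(n, t1):.1f} GFLOPs, "
          f"4-core {gflops(n, t4):.1f} GFLOPs, "
          f"speedup {t1 / t4:.2f}x")
```

The recomputed values land within rounding distance of the quoted GFLOPs columns, which suggests the table was derived this way.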

>>...These other methods you mention entail lowered numerical accuracy, greater memory usage or difficulties in implementation

>>which give rise to fewer flops, but lower ipc and lower performance...

There is a lot of speculation on Internet forums about matrix multiplication algorithms such as Coppersmith-Winograd and Strassen, especially from people who have never tried to implement them. I've implemented four different versions of the Strassen algorithm, and the additional memory usage is inherent in the design of that recursive algorithm, because it needs to partition the source matrices down to some threshold size. In theory the threshold is 2x2; in practice it is N / 8. For example, for 4096x4096 matrices it is 512x512 ( 4096 / 8 = 512 ).
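To make the threshold idea concrete, here is a minimal sketch of a recursive Strassen multiplication that falls back to the classic algorithm once the blocks reach a given size. It is not the poster's implementation: it uses plain Python lists, assumes square matrices whose size is a power of two, and the names `strassen`, `classic_mul` and `threshold` are chosen only for illustration. The temporary sums and the seven partial products allocated at every level are where the extra memory goes.

```python
def classic_mul(a, b):
    """Classic triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

def sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]

def split(m):
    """Split a square matrix into four quadrants (each one is a new copy)."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen(a, b, threshold):
    """Strassen product; blocks of size <= threshold use the classic product."""
    n = len(a)
    if n <= threshold or n % 2:
        return classic_mul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of eight
    m1 = strassen(add(a11, a22), add(b11, b22), threshold)
    m2 = strassen(add(a21, a22), b11, threshold)
    m3 = strassen(a11, sub(b12, b22), threshold)
    m4 = strassen(a22, sub(b21, b11), threshold)
    m5 = strassen(add(a11, a12), b22, threshold)
    m6 = strassen(sub(a21, a11), add(b11, b12), threshold)
    m7 = strassen(sub(a12, a22), add(b21, b22), threshold)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the quadrants back together
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom

# Small usage example with the N / 8 rule: a 16x16 product stops recursing at 2x2
n = 16
a = [[(i * 7 + j * 3) % 5 for j in range(n)] for i in range(n)]
b = [[(i + 2 * j) % 7 - 3 for j in range(n)] for i in range(n)]
c = strassen(a, b, threshold=n // 8)
print(c == classic_mul(a, b))
```

Integer test data is used so that the comparison with the classic product is exact; with floating point inputs the two results would typically differ by rounding error.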

Quote: Sorry for the late answer (never-ending problems with my backup laptop).

Those numbers are theoretical peak bandwidth as @bronxzv explained in his answer.

It could be interesting to run that benchmark under VTune. I am interested in seeing the ratio of clockticks per instruction retired.

>>...It could be interesting to run that benchmark under VTune...

Good luck with that.

[ Tests Set #1 - Part A ]

*** Ivy Bridge CPU 2.50 GHz 1-core ***

[ 4096x4096 ]

Kronecker Based 1.93 seconds

MKL 3.68 seconds ( cblas_sgemm )

Strassen HBC 11.62 seconds

Fortran 20.67 seconds ( MATMUL )

Classic 31.36 seconds

[ 8192x8192 ]

Kronecker Based 11.26 seconds

MKL 29.34 seconds ( cblas_sgemm )

Strassen HBC 82.03 seconds

Fortran 138.57 seconds ( MATMUL )

Classic 252.05 seconds

[ 16384x16384 ]

Kronecker Based 81.52 seconds

MKL 237.76 seconds ( cblas_sgemm )

Strassen HBC 1160.80 seconds

Fortran 1685.09 seconds ( MATMUL )

Classic 2049.87 seconds

*** Haswell CPU 3.50 GHz 1-core ***

[ 4000x4000 ]

Perfwise 2.54 seconds

[ 8000x8000 ]

Perfwise 20.30 seconds

[ 16000x16000 ]

Perfwise 159.70 seconds

[ Tests Set #1 - Part B - All Results Combined ]

[ 4096x4096 ]

Kronecker Based 1.93 seconds (*)

Perfwise 2.54 seconds ( 4000x4000 ) (**)

MKL 3.68 seconds ( cblas_sgemm ) (*)

Strassen HBC 11.62 seconds (*)

Fortran 20.67 seconds ( MATMUL ) (*)

Classic 31.36 seconds (*)

[ 8192x8192 ]

Kronecker Based 11.26 seconds (*)

Perfwise 20.30 seconds ( 8000x8000 ) (**)

MKL 29.34 seconds ( cblas_sgemm ) (*)

Strassen HBC 82.03 seconds (*)

Fortran 138.57 seconds ( MATMUL ) (*)

Classic 252.05 seconds (*)

[ 16384x16384 ]

Kronecker Based 81.52 seconds (*)

Perfwise 159.70 seconds ( 16000x16000 ) (**)

MKL 237.76 seconds ( cblas_sgemm ) (*)

Strassen HBC 1160.80 seconds (*)

Fortran 1685.09 seconds ( MATMUL ) (*)

Classic 2049.87 seconds (*)

Note:

(*) Ivy Bridge CPU 2.50 GHz 1-core

(**) Haswell CPU 3.50 GHz 1-core

[ Tests Set #2 - Part A ]

*** Ivy Bridge CPU 2.50 GHz 4-core ***

[ 4096x4096 ]

Kronecker Based 0.41 seconds

MKL 1.21 seconds ( cblas_sgemm )

Fortran 3.95 seconds ( MATMUL )

Classic 7.48 seconds

Strassen HBC N/A seconds

[ 8192x8192 ]

Kronecker Based 1.49 seconds ( 8100x8100 )

MKL 8.34 seconds ( cblas_sgemm )

Fortran 29.49 seconds ( MATMUL )

Classic 60.73 seconds

Strassen HBC N/A seconds

[ 16384x16384 ]

Kronecker Based 10.27 seconds

MKL 66.58 seconds ( cblas_sgemm )

Fortran 246.28 seconds ( MATMUL )

Classic 534.65 seconds

Strassen HBC N/A seconds

*** Haswell CPU 3.50 GHz 4-core ***

[ 4000x4000 ]

Perfwise 0.74 seconds

[ 8000x8000 ]

Perfwise 5.49 seconds

[ 16000x16000 ]

Perfwise 42.50 seconds

[ Tests Set #2 - Part B - All Results Combined ]

[ 4096x4096 ]

Kronecker Based 0.41 seconds (*)

Perfwise 0.74 seconds ( 4000x4000 ) (**)

MKL 1.21 seconds ( cblas_sgemm ) (*)

Fortran 3.95 seconds ( MATMUL ) (*)

Classic 7.48 seconds (*)

Strassen HBC N/A seconds (***)

[ 8192x8192 ]

Kronecker Based 1.49 seconds ( 8100x8100 ) (*)

Perfwise 5.49 seconds ( 8000x8000 ) (**)

MKL 8.34 seconds ( cblas_sgemm ) (*)

Fortran 29.49 seconds ( MATMUL ) (*)

Classic 60.73 seconds (*)

Strassen HBC N/A seconds (***)

[ 16384x16384 ]

Kronecker Based 10.27 seconds (*)

Perfwise 42.50 seconds ( 16000x16000 ) (**)

MKL 66.58 seconds ( cblas_sgemm ) (*)

Fortran 246.28 seconds ( MATMUL ) (*)

Classic 534.65 seconds (*)

Strassen HBC N/A seconds (***)

Note:

(*) Ivy Bridge CPU 2.50 GHz 4-core

(**) Haswell CPU 3.50 GHz 4-core

(***) There is no multi-threaded version

Just for comparison these are results for Pentium 4...

[ Tests Set #3 ]

*** Pentium 4 CPU 1.60 GHz 1-core - Windows XP Professional 32-bit ***

[ 4096x4096 ]

MKL 31.23 seconds ( cblas_sgemm )

Strassen HBC 143.69 seconds (*)

Classic 183.66 seconds

Fortran N/A seconds ( MATMUL )

Kronecker Based N/A seconds

[ 8192x8192 ]

MKL 254.54 seconds ( cblas_sgemm )

Classic 1498.43 seconds

Strassen HBC N/A seconds

Fortran N/A seconds ( MATMUL )

Kronecker Based N/A seconds

[ 16384x16384 ]

Classic N/A seconds

MKL N/A seconds ( cblas_sgemm )

Strassen HBC N/A seconds

Fortran N/A seconds ( MATMUL )

Kronecker Based N/A seconds

Note:

(*) Excessive usage of Virtual Memory and significant negative performance impact

Sergey... my results are for double precision, while you appear to be running single precision, at least in MKL, since you are timing sgemm rather than dgemm. You should have an apples-to-apples comparison. If your timings are in single precision then you should halve my times, since my GFLOPs would double.

Perfwise

Also... lock your frequency at 2.5 GHz to avoid including turbo boost. I always do that to discern the real architectural IPC performance.

>>...From my experience working with people in the industry, I define matrix multiplication FLOPs as that from traditional Linear Algebra,

>>which is 2 * N^3, or to be precise 2 * M * K * N...

Take an A(4x4) * B(4x4) case and count on paper the number of additions and multiplications. You should get 112 Floating Point Operations ( FPO ). Then calculate using your formula and you will get 2 * 4 * 4 * 4 = 128, and this doesn't look right. This is why:

Let's say we have two matrices A[ MxN ] and B[ RxK ], with N = R. The product is C[ MxK ].

[ MxN ] * [ RxK ] = [ MxK ]

If M=N=R=K, that is, both matrices are square, then the Total number of Floating Point Operations ( TFPO ) should be calculated as follows:

TFPO = N^2 * ( 2*N - 1 )

For example,

TFPO( 2x2 ) = 2^2 * ( 2*2 - 1 ) = 12

TFPO( 3x3 ) = 3^2 * ( 2*3 - 1 ) = 45

TFPO( 4x4 ) = 4^2 * ( 2*4 - 1 ) = 112

TFPO( 5x5 ) = 5^2 * ( 2*5 - 1 ) = 225

and so on.
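The paper count can be reproduced with a small sketch that performs the classic product C = A * B and tallies every multiplication and addition as it goes. The function name `count_plain_product_flops` is hypothetical, and the matrices are filled with dummy values because only the operation count matters here.

```python
def count_plain_product_flops(n):
    """Run the classic n x n product C = A * B and count its operations."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(n):
            # First term of the dot product: one multiplication, no addition
            acc = a[i][0] * b[0][j]
            flops += 1
            # Remaining n - 1 terms: one multiplication and one addition each
            for k in range(1, n):
                acc += a[i][k] * b[k][j]
                flops += 2
    return flops

for n in (2, 3, 4, 5):
    counted = count_plain_product_flops(n)
    print(n, counted, n * n * (2 * n - 1), 2 * n ** 3)
```

The counted value matches N^2 * ( 2*N - 1 ), and it is smaller than 2 * N^3 by exactly N^2, that is, one addition per element of C. That extra addition appears when each result is accumulated into an existing value of C instead of being stored directly.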

>>...my results are for double precision... while you appear to be running single precision...

I used cblas_sgemm because I needed to compare performance for the 4Kx4K and 8Kx8K cases on a Pentium 4 system with just 1GB of physical memory. The 16Kx16K case exceeds the 2GB limit of a 32-bit system.

I'll do a quick comparison of performance for cblas_sgemm and cblas_dgemm later; however, it is not my top priority. Speaking of these 5 algorithms ( Classic, Strassen HBC, MKL's cblas_sgemm, Fortran's MATMUL and Kronecker Based ), I've finally done the comparison I had wanted to do for a long time.

By the way, the Fortran MATMUL and Kronecker Based cases use double-precision floating point data types.
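The memory argument can be checked with a back-of-the-envelope sketch. It assumes that only the three matrices A, B and C are resident and that no extra workspace is allocated, which is a simplification; the helper name `gemm_footprint_bytes` is hypothetical.

```python
GIB = 1024 ** 3

def gemm_footprint_bytes(n, bytes_per_element, matrices=3):
    """Bytes needed to hold `matrices` dense n x n matrices."""
    return matrices * n * n * bytes_per_element

for n in (4096, 8192, 16384):
    single = gemm_footprint_bytes(n, 4)  # 4-byte float, as used by cblas_sgemm
    double = gemm_footprint_bytes(n, 8)  # 8-byte double, as used by cblas_dgemm
    print(f"{n}x{n}: single {single / GIB:.2f} GiB, double {double / GIB:.2f} GiB")
```

Under these assumptions the 16Kx16K case needs 3 GiB even in single precision, which is beyond the 2GB limit mentioned above, while the 8Kx8K case needs 0.75 GiB in single precision and twice that in double precision.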

Attached is a txt-file with test results. Thanks.

Sergey... in DGEMM you are performing the matrix computation C = C + A x B. You didn't include the addition of C. BLAS exists for a purpose, to standardize Linear Algebra operations, and that is my focus. So if you measured DGEMM in MKL it will be 1/2 as fast as SGEMM. HPL uses DGEMM, and the title of this thread is Haswell GFLOPs. Is your Kronecker routine doing explicitly what DGEMM does? I googled it but found that the Kronecker product is not dgemm

http://www.google.com/url?sa=t&source=web&cd=1&ved=0CCgQFjAA&url=http%3A...

I must confess I don't understand what you are trying to achieve in your study. My focus is purely on understanding what the DGEMM performance of Haswell is, with whatever algorithm you use, so long as it does the matrix computation C = C + A x B.
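For reference, the operation described here can be sketched as a plain triple loop that accumulates into C and counts its own floating-point operations. This is only an illustration of the C = C + A x B semantics, not MKL's DGEMM; the function name `gemm_accumulate` is hypothetical.

```python
def gemm_accumulate(a, b, c):
    """In-place C = C + A * B for A (m x k), B (k x n), C (m x n).

    Returns the number of floating-point operations performed.
    """
    m, k, n = len(a), len(b), len(b[0])
    flops = 0
    for i in range(m):
        for j in range(n):
            acc = c[i][j]  # start from the existing value of C
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiplication, one addition
                flops += 2
            c[i][j] = acc
    return flops

a = [[1, 2, 3], [4, 5, 6]]                       # 2 x 3
b = [[1, 0, 2, 0], [0, 1, 0, 2], [1, 1, 1, 1]]   # 3 x 4
c = [[1] * 4 for _ in range(2)]                  # 2 x 4, pre-filled with ones
print(gemm_accumulate(a, b, c), c)
```

Because every term of every dot product is added to a running value that starts at C, each of the M * N results costs 2 * K operations, which gives 2 * M * K * N in total.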

Sergey,

The title of this post was Haswell GFLOPs. My interest is in "standardized BLAS routines" which drive LAPACK and many other high-performance applications. DGEMM does the matrix operation C = C + A * B. When you update C, you have M * N addition operations, which yields the formula I told you earlier, 2 * M * K * N in the generic sense. That's the FLOP count for a traditional matrix multiplication algorithm, and it's how the industry measures FLOPs. Running SGEMM is simply not comparable to running DGEMM when comparing the time to do arithmetic. I don't know what Kronecker Based DGEMM you're running, or whether you're quoting the timing for a Kronecker Product, which isn't DGEMM. HPL, which is what the scientific community uses to measure GFLOPs, runs on DGEMM. So my recommendation to you is to standardize the problem you're running. Are all these results at the same precision and for the same operation? MATMUL is doing what DGEMM does (close enough) and so is MKL (if you were running DGEMM rather than SGEMM). The other results you quote, if they're not DGEMM, are not comparable to my results. It's just common sense. If your Kronecker operation is DGEMM, then you've got something interesting, but I suspect you're not doing a traditional matrix multiplication, so it's not a 1:1 correspondence and it's less interesting to me.

Perfwise

Let's finalize our discussion about matrix multiplication algorithms.

>>...DGEMM does the matrix operation of C = C + A * B...

?GEMM does more multiplications and additions by design:

C = alpha*A*B + beta*C

However, this is ?GEMM specific and I'm talking about a generic case, like C = A * B, and nothing else. I don't know of any ISO-like standard accepted in the industry for measuring software performance, and everybody has their own solution(s). ( In reality I know how ISO 8001 works for X-Ray imaging software... Very-very strict... )

>>...I don't know what Kroneker Based DGEMM you're running or if you're quoting the timing for a Kroneker Product...

This is Not a regular Kronecker Product. The algorithm is described in a document and I gave you a weblink earlier ( see one of my previous posts ). The Kronecker Based algorithm for matrix multiplication is a really high performance algorithm implemented in Fortran by another software developer ( Vineet Y - http://software.intel.com/en-us/user/798062 ).

>>... I suspect you're not doing a traditional matrix mulitplication...

Once again, take a look at the document posted on the webpage I've mentioned; a description of the algorithm is available there.

Sergey... the Kronecker algorithm you point to says one of the matrices needs to be represented as a Kronecker product of 2 smaller matrices. While that may be applicable in some cases, it is not generally applicable.

>>?GEMM does more multiplications and additions by design:

>>C = alpha*A*B + beta*C

Would I consider that as a generic case? No. Have we reached the bottom of the ocean? Yes.

Quote: No, I did not get the result off the web; I ran the test myself using LinX AVX.

Number of cores is 4 (Haswell 4770K with HTT disabled).

I just ran my SB/IV dgemm and measured 98.7 GFLOPs @ 3.4 GHz. If you scale it to 4.0 GHz then I get the same performance you quoted, 116 GFLOPs. Just another data point, Sergey.

Perfwise
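The scaling in the post above is a simple proportion. The sketch below assumes that compute-bound DGEMM throughput grows linearly with core clock, which is an idealization; the function name `scale_gflops` is hypothetical.

```python
def scale_gflops(measured_gflops, measured_ghz, target_ghz):
    """Linearly rescale a measured GFLOPs figure to a different clock speed."""
    return measured_gflops * target_ghz / measured_ghz

# 98.7 GFLOPs measured at 3.4 GHz, rescaled to 4.0 GHz
print(f"{scale_gflops(98.7, 3.4, 4.0):.1f} GFLOPs")
```

The result rounds to the 116 GFLOPs figure quoted in the post.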
