Haswell GFLOPS

Hi Intel Experts:

    I cannot find the latest Intel Haswell CPU GFlops, could you please let me know that?

    I want to understand the performance difference between Haswell and Ivy Bridge, for example, the i7-4700HQ and the i7-3630QM. From the Intel website I can see that the i7-3630QM's GFLOPS figure is 76.8 (base). Could you please let me know the figure for the i7-4700HQ?

    I get some information from internet that: 

        Intel Sandy Bridge and Ivy Bridge have the following floating-point performance: 16 SP FLOPS/cycle --> one 8-wide AVX addition and one 8-wide AVX multiplication per cycle.

        Intel Haswell has the following floating-point performance: 32 SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions per cycle.

    I have two questions here:

    1. Take the i7-3630QM as an example: 16 (SP FLOPS/cycle) x 4 (quad-core) x 2.4 GHz (clock) = 153.6 GFLOPS = 76.8 x 2. Does that mean one operation is counted as a combined addition and multiplication?

    2. Does Haswell have TWO FMA? 

    Thank you very much for any comments.

Best Regards,

Sun Cao


>>...Intel SandyBridge and Ivy-Bridge have the following floating-point performance: 16-SP FLOPS/cycle --> 8-wide
>>AVX addition and 8-wide AVX multiplication...

If you have Haswell and Ivy Bridge systems you can easily evaluate their real performance; use the Vec_samples.zip sample from Intel Parallel Studio XE 2013.

Hi Sergey:

    I do not have Haswell systems now.

    Even if I had one, it would still be very helpful if Intel could provide more information.

Best Regards,

Sun Cao

>>...Does Haswell have TWO FMA?..

There are 6 different groups of FMA instructions ( 60 instructions in total ); please take a look at:

software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available

The Haswell execution engine has two ports ( Port 0 and Port 1 ) that can each execute one FMA instruction per cycle, so the FLOPS/cycle bandwidth is doubled.

On Haswell, one FMA operation combines a multiplication and an addition; on previous architectures the same work would occupy two ports ( the multiply port and the add port ) when executing at the same time.
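
As an illustration ( a minimal sketch, not taken from any Intel sample ), this is what a single FMA looks like at the C intrinsics level; with an AVX2/FMA target the compiler should emit one FMA instruction for the _mm256_fmadd_ps call:

/* A single fused multiply-add at the intrinsics level: c = a * b + c in one
 * instruction, 8 SP lanes -> 16 FLOPs. Build with an FMA-capable target. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    c = _mm256_fmadd_ps(a, b, c);   /* typically compiled to one vfmadd...ps */

    float out[8];
    _mm256_storeu_ps(out, c);
    printf("%f\n", out[0]);         /* 7.000000 */
    return 0;
}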

>>I do not have Haswell systems now.
>>
>>Even I have it, it will be very helpful if Intel could provide me more information...

I agree with that. As soon as you have a Haswell system you can do a very quick evaluation of performance with Vec_samples.zip from the ..\Composer XE\Samples\en_US\C++ folder ( on a Windows platform ).

Here are some additional technical details:

Compiler options: /O3 /Qstd=c99 /Qrestrict /Qipo

...
#define ALIGNED
#define NOALIAS
#define NOFUNCCALL // Note: Inlining
...
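
For context, here is a minimal sketch of the kind of matrix-times-vector kernel such a vectorization sample times ( this is not the actual Vec_samples source; it only illustrates where restrict, alignment and inlining, controlled by macros like the ones above, come into play ):

/* A minimal sketch of a matrix-times-vector kernel of the kind such a
 * vectorization sample times. This is NOT the actual Vec_samples source;
 * ROW/COL and the use of restrict are assumptions that only mirror the
 * macros listed above. Build e.g. with /O3 /Qstd=c99 /Qrestrict. */
#include <stdio.h>
#include <stdlib.h>

#define ROW 256
#define COL 256

static void matvec(int m, int n,
                   const double *restrict a,   /* m x n matrix */
                   const double *restrict x,   /* n-vector     */
                   double *restrict y)         /* m-vector     */
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)            /* the loop the compiler vectorizes */
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

int main(void)
{
    double *a = malloc(sizeof(double) * ROW * COL);
    double *x = malloc(sizeof(double) * COL);
    double *y = malloc(sizeof(double) * ROW);

    for (int i = 0; i < ROW * COL; i++) a[i] = 1.0;
    for (int j = 0; j < COL; j++)       x[j] = 1.0;

    matvec(ROW, COL, a, x, y);
    printf("y[0] = %f\n", y[0]);               /* 256.000000 */

    free(a); free(x); free(y);
    return 0;
}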

[ Test 1 - No Vectorization & No Inlining & No IPO & /O2 are used - Release ]

ROW:256 COL: 256
Execution time is 12.750 seconds
GigaFlops = 0.673720
Sum of result = 1279224.000000

[ Test 2 - Vectorization & Alignment & Inlining & IPO & /O3 are used - Release ]

ROW:256 COL: 256
Execution time is 4.734 seconds
GigaFlops = 1.814519
Sum of result = 1279224.000000

As you can see, Test 2 is ~2.7 times faster than Test 1.

>>>>...i7-3630QM's GFlops is 76.8 (Base)...
>>
>>GigaFlops = 1.814519

By the way, the two numbers I gave you are for a Pentium 4, and you can see that the i7-3630QM is ~42x faster when processing is done using all cores.

Let me know if you're interested in seeing numbers for an Ivy Bridge system.

>>>By the way, two numbers I gave you are for Pentium 4 and you can see that i7-3630QM is ~42x faster when processing is done using all cores.>>>

Are those results obtained from testing Vec_samples?

AFAIK the Pentium 4 cannot execute fadd and fmul at the same time. A Haswell core can schedule an FMA instruction ( two FP operations ) for execution per thread; it is a tremendous improvement in raw processing power compared to the Pentium 4.

>>Are those results obtained from testing Vec_samples?

Yes, and you can take a look at it yourself because the project is in the Samples folder.

Thanks

>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...

Sun Cao,

I couldn't find information about GFLOPS on ark.intel.com; my question is, where did you find that number?

Actually, on Ivy Bridge you have one wide fadd/cycle and one wide fmul/cycle; each can be either SP ( 8 FLOPS ) or DP ( 4 FLOPS ). Multiplied by 4 cores and by the 2.4 GHz clock frequency, the DP figure is 8 FLOPS/cycle x 4 cores x 2.4 GHz = 76.8 GFLOPS.
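
To make the arithmetic explicit, here is a tiny helper ( purely illustrative, not an Intel tool ) that reproduces these theoretical peak numbers from the per-cycle FLOP count, core count and clock:

#include <stdio.h>

/* Theoretical peak GFLOPS = FLOPS per cycle per core * cores * clock ( GHz ).
 * The FLOPS/cycle values below are the figures discussed in this thread,
 * not official Intel specifications. */
static double peak_gflops(double flops_per_cycle, int cores, double ghz)
{
    return flops_per_cycle * cores * ghz;
}

int main(void)
{
    /* Ivy Bridge i7-3630QM, DP: 4-wide add + 4-wide mul = 8 FLOPS/cycle */
    printf("IVB DP: %.1f GFLOPS\n", peak_gflops(8.0, 4, 2.4));    /* 76.8  */

    /* Ivy Bridge, SP: 8-wide add + 8-wide mul = 16 FLOPS/cycle */
    printf("IVB SP: %.1f GFLOPS\n", peak_gflops(16.0, 4, 2.4));   /* 153.6 */

    /* Haswell, SP: two 8-wide FMAs = 32 FLOPS/cycle ( at 2.4 GHz for comparison ) */
    printf("HSW SP: %.1f GFLOPS\n", peak_gflops(32.0, 4, 2.4));   /* 307.2 */
    return 0;
}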

>>>>...From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base)...
>>
>>Sun Cao,
>>
>>I couldn't find information about GFlops on ark.intel.com and my question is where did you find that number?

This is how it looks in reality:

[ Test 1 on a system with Pentium 4 ]
[ SSE2 - 32-bit Intel C++ compiler options - 1 CPU used ]

Note: For all test cases /O3 /QaxSSE2 /Qstd=c99 options are used

GigaFlops = 1.808407 -
GigaFlops = 1.814136 - /Qrestrict /Qansi-alias
GigaFlops = 1.844917 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 1.851279 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 1.889559 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 2.147484 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 1.814519 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 1.929022 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4
GigaFlops = 0.628287 - /Qrestrict /Qansi-alias /Qparallel
GigaFlops = 0.628333 - /Qrestrict /Qansi-alias /Qipo /Qparallel

(*) - Best result

[ Test 2 on a system with Ivy Bridge ]
[ AVX - 64-bit Intel C++ compiler options - 1 CPU used ]

Note: For all test cases /O3 /QaxAVX /Qstd=c99 options are used

GigaFlops = 11.228673 -
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo
GigaFlops = 9.326748 - /Qrestrict /Qansi-alias /Qipo /Qunroll=4
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8
GigaFlops = 11.243370 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 (*)
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3
GigaFlops = 11.228673 - /Qrestrict /Qansi-alias /Qipo /Qunroll=8 /Qopt-block-factor:3 /Qopt-mem-layout-trans:3 /Qopt-prefetch:4

[ AVX - 64-bit Intel C++ compiler options - 8 CPUs used ]

GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qparallel (*)
GigaFlops = 60.333168 - /Qrestrict /Qansi-alias /Qipo /Qparallel (*)

Note: 60.333168 = 7.541646 * 8

(*) - Best result

As you can see, my number is ~21% lower than Intel's number, and this is because our test cases are different. I don't think we will know how the 76.8 number was measured unless Intel releases the source code, or tells everybody that some open-source test was used.


Hi Sergey:

    You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm

>>...You can find CPU GFlops at: http://www.intel.com/support/processors/sb/CS-017346.htm...

Hi, Thank you and I'll take a look.

>>>As you can see my number is ~21% lower that Intel's number and this is because our test cases are different. I don't think we will know how 76.8 number was measured unless Intel releases source codes, or informs everybody that some Open Source test was used.>>>

It could be the theoretical peak performance. A real application can reduce this result by introducing memory stalls or instruction interdependencies.

Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.

Regards,
Igor Levicki

>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL...

Thanks for the tip regarding Linpack. I did a verification using an older version of Linpack, and the numbers for the Pentium 4 are 4x (!) lower:
...
Mflops
580.59
532.56
578.32
587.83
532.69
Average 562.40
...
That is 0.562 GFLOPS, and it was just a quick verification of my numbers.

>>...Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL..>>>

Haswell can pose a challenge for low-end GPUs in terms of DP GFLOPS.

>>...Haswell can pose a challenge for low end GPUs in terms of DP Gflops...

What challenge? And why should it be a concern regarding GPUs? I really didn't understand what you wanted to say.

Personally I'm trying to evaluate performance differences between 3 major lines of CPUs: Pentium 4, Ivy Bridge and Haswell.

>>>What challenge? And why should it be a concern regarding GPUs? I really didn't understand what you wanted to say.>>>

It was only a general comment.

I meant that in terms of raw DP GFLOPS processing power the Haswell microarchitecture is closing the gap with lower-end GPUs, so in the foreseeable future it could be used to perform software rendering.

Actually, in Turbo mode at 3.9 GHz the theoretical peak performance expressed in DP GFLOPS is ~249 GFLOPS.

>>It was only general comment.
>>
>>I meant in terms of raw DP Gflops processing power Haswell microarchitecture is closing gap with lower
>>end GPU's so in foreseable future it can be used to perform software rendering...

That sounds really interesting. Who is defining that foreseeable future, and who is going to use lower-end GPUs with Haswell systems?

>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.

Igor, I've used Linpack and these numbers are more consistent with Intel's numbers:

...
Ivy Bridge - Performance Summary (GFlops) Average = 71.9007
...
Pentium 4 - Performance Summary (GFlops) Average = 1.9561
...

Ivy Bridge performance also closely matches what Caosun posted, that is 76.8 GFLOPS ( as far as I understood, this is Intel's number ).

>>>That sounds really interesting and who is defining that foreseable future and who is going to use lower end GPUs with Haswell systems?>>>

I am talking about a raw performance comparison between Haswell and some lower- to mid-range GPUs. Usage of the CPU for software rendering is already a reality.

http://www.inartis.com/products/kribi%203D%20Engine/Default.aspx

>>>who is defining that foreseable future>>>

Probably Intel, by releasing architecturally wider execution engine designs.

>>>Ivy Bridge - Performance Summary (GFlops) Average = 71.9007>>>

Close to theoretical peak.


>>>>Speed for Haswell running at 4GHz here is ~116GFlops in Intel optimized linpack from MKL.
>>
>>Igor, I've used Linpack and these numbers are more consistent with Intel's numbers

Igor, Did you get 116 GFlops number from some website ( 1st ) or after real testing on a Haswell system ( 2nd )? In the 2nd case How many cores were used during the test?

Quote:

Sergey Kostrov wrote:
Igor, Did you get 116 GFlops number from some website ( 1st ) or after real testing on a Haswell system ( 2nd )? In the 2nd case How many cores were used during the test?

In case you are interested, I published a result of mine here: http://www.realworldtech.com/forum/?threadid=134512&curpostid=134594
I measured better than 93% efficiency with a working set entirely in the L1D and a compute:load:store ratio of 11:1:1.

 

>>>>...Igor, Did you get 116 GFlops number..
>>
>>...I measured 104.407 Gflops ( 112 Gflops peak )...

Results are consistent and the difference is ~3.45% ( which is acceptable ). My question is the same: how many cores were used during the test?

>>...I published a result of mine here: http://www.realworldtech.com/forum/?threadid=134512&curpostid=134594

One more thing regarding a thread:

Rumor mill: 512-bit AVX3 in Skylake

I don't consider it a rumor. A header file with some 512-bit stuff can be found in Intel Parallel Studio XE 2013 ( ..\Compiler\Include folder ), and I have known about it since December 2012.

>>...A header file with some 512-bit-stuff could be found in Intel Parallel Studio XE 2013...

zmmintrin.h

Quote:

Sergey Kostrov wrote:
Results are consistent and the difference is ~3.45% ( which is acceptable ). My question is the same: how many cores were used during the test?

It's neither the same test nor the same test platform ( even the CPU frequencies look different ), so IMHO there is no point in comparing the results.

Quote:

Sergey Kostrov wrote:
My question is the same: How many cores were used during the test?

The test I reported above was with a single thread on a single core and only vfmadd213ps as the compute instruction; I can't comment on the other test though.

Quote:

Sergey Kostrov wrote:
I don't consider it a rumor. A header file with some 512-bit stuff can be found in Intel Parallel Studio XE 2013 ( ..\Compiler\Include folder ), and I have known about it since December 2012.

this header is for Xeon Phi targets

 

>>...it's neither the same test nor the same test platform (even CPU frequencies look different) so IMHO there is
>>no point to compare the results...

What was the point of mentioning or posting these results?

A simple test based on just one instruction, vfmadd213ps, cannot be considered a valid one, or a test that really evaluates the performance of a system. Additions are always faster than multiplications, and everybody knows that.

If you have a Haswell system then, as Igor recommended, the Linpack benchmark from MKL can be used ( I've verified it on P4 and IB systems and it gives the right numbers ).

>>...this header is for Xeon Phi targets

It doesn't say anything at the beginning, and some time ago I asked Intel software engineers what it is for. Unfortunately, my question was not answered.

Quote:

Sergey Kostrov wrote:
What was the point of mentioning or posting these results?

well, this thread is named "Haswell GFLOPS" and this test of mine measures Haswell GFLOPS so I suppose it is at least somewhat relevant

Quote:

Sergey Kostrov wrote:
A simple test based on just one instruction vfmadd213ps can not be considered as a valid one

I don't get what you mean, any test wanting to max out GFLOPS on Haswell will use only FMA instructions for computations

>>...I don't get what you mean, any test wanting to max out GFLOPS on Haswell...

Run Linpack benchmark utility from MKL installation to verify your numbers. Post results as soon as it is done.

Quote:

Sergey Kostrov wrote:

>>...I don't get what you mean, any test wanting to max out GFLOPS on Haswell...

Run Linpack benchmark utility from MKL installation to verify your numbers. Post results as soon as it is done.

As already explained, the tests aren't comparable, so one can't be used to verify the other. Mine has a higher compute:load/store ratio than LINPACK; I use an unrealistically high compute:load:store ratio of 11:1:1, as mentioned in my post at RWT. The goal was to come close to the 2x FMA vs ADD+MUL theoretical speedup.

What Haswell system do you have?

>>...as already explained the tests aren't comparable so one can't be used to verify the other...

I understand, and I don't want to compare; I would simply be glad to see some numbers from the Linpack utility. If I had a Haswell system I would run the test without any problems.

When testing my Ivy Bridge system with two different Linpack benchmark utilities ( one from Intel and one from another, non-Intel source ), only Intel's utility gave very consistent results. Once again, why wouldn't you try to run it? If you don't have MKL I could upload all the content of the ..\mkl\benchmarks folder.

Once again, I don't want to compare the Linpack number with a number from your own test. I want to compare your Haswell Linpack number with my Ivy Bridge Linpack number and with my Pentium 4 Linpack number.

This is the content of the ..\mkl\benchmarks folder:

help.lpk
lininput_xeon32
lininput_xeon64
linpack_xeon32.exe
linpack_xeon64.exe
runme_xeon32.bat
runme_xeon64.bat
xhelp.lpk

( all files are about 6MB in total )

Quote:

Sergey Kostrov wrote:
What Haswell system do you have?

4770K / 16 GB DDR3-2400 memory / Corsair H110 cooler / ASUS Z87-Pro mobo

Quote:

Sergey Kostrov wrote:
I understand it and I don't want to compare and I simply would be glad to see some numbers from Linpack utility.

If these are easy to run I can have a try. I'm downloading Studio XE 2013 for Windows Update 4 right now ( 1.11 GB, ETA 1 hr 43 min! ), so I'll have the latest MKL ( the one in C++ Composer XE 2013 Update 5 ). Is that the same version you are interested in?

Quote:

Sergey Kostrov wrote:

linpack_xeon32.exe
linpack_xeon64.exe
runme_xeon32.bat
runme_xeon64.bat

I've just finished running these two tests ( MKL released with Composer XE 2013 Update 5 / default MKL bench .bat files / Windows 8 Pro 64-bit / CPU @ 4 GHz / realtime process priority ). xeon64 takes incredibly long to run, and it is pretty boring since there isn't any feedback about its progress. Anyway, you'll see the result files attached; I hope they will be helpful for your purpose.

Attachments:

win-xeon32.txt ( 2.73 KB )
win-xeon64.txt ( 3.86 KB )

>>...I'm just finished running these two tests...

Thank you very much! I'll also post results for systems with Ivy Bridge and Pentium 4 for comparison.

[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9Ghz theoretical peak performance expressed in DP Glops is ~249 Gflops.

Iliya,

Where / how did you get that number?

Please explain, because it is more than twice the best number in the 2nd test by bronxzv ( 104.2632 GFLOPS ). Igor's number ( 116 GFLOPS ) is very close to bronxzv's number ( ~10% difference ).

Quote:

Sergey Kostrov wrote:

[ Iliya Polak wrote ]
>>...Actually in Turbo Mode at 3.9Ghz theoretical peak performance expressed in DP Glops is ~249 Gflops.

Iliya,

Where / how did you get that number?

He simply mentions the DP theoretical peak at 3.9 GHz and 4 cores ( 8 DP FLOPs per FMA instruction, 2 FMAs per clock ), i.e. 3.9 * 4 * 8 * 2 = 249.6 GFLOPS.

Note that in my own report I mentioned the SP theoretical peak at 3.5 GHz and 1 core ( 16 SP FLOPs per FMA instruction, 2 FMAs per clock ), i.e. 3.5 * 1 * 16 * 2 = 112 GFLOPS.

with my configuration MKL LINPACK efficiency = 104.3/249.6 = ~41.8 %

my own FMA microbenchmark efficiency =  104.407/112 = ~93.2 % 

As explained, this is because my own test has very low load/store activity relative to the FMA computations, and most of these loads/stores are from/to the L1D cache.
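
For anyone who wants to try something along the same lines, below is a minimal sketch of such an FMA throughput loop ( an illustration only, not the actual benchmark: it uses FMA intrinsics and keeps everything in registers, so it is even more compute-biased than the 11:1:1 test described above ):

/* Minimal single-thread FMA throughput sketch ( illustrative only ). Eight
 * independent accumulators hide the FMA latency so both FMA ports can stay
 * busy; there is almost no load/store traffic at all. Build with an
 * FMA-capable target ( e.g. /QxCORE-AVX2 or -march=core-avx2 ). */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000LL   /* iterations of the 8-FMA body below */

int main(void)
{
    __m256 acc0 = _mm256_set1_ps(0.0f), acc1 = _mm256_set1_ps(0.1f);
    __m256 acc2 = _mm256_set1_ps(0.2f), acc3 = _mm256_set1_ps(0.3f);
    __m256 acc4 = _mm256_set1_ps(0.4f), acc5 = _mm256_set1_ps(0.5f);
    __m256 acc6 = _mm256_set1_ps(0.6f), acc7 = _mm256_set1_ps(0.7f);
    const __m256 a = _mm256_set1_ps(1.0000001f);
    const __m256 b = _mm256_set1_ps(0.9999999f);

    clock_t t0 = clock();
    for (long long i = 0; i < ITERS; i++) {
        acc0 = _mm256_fmadd_ps(a, acc0, b);
        acc1 = _mm256_fmadd_ps(a, acc1, b);
        acc2 = _mm256_fmadd_ps(a, acc2, b);
        acc3 = _mm256_fmadd_ps(a, acc3, b);
        acc4 = _mm256_fmadd_ps(a, acc4, b);
        acc5 = _mm256_fmadd_ps(a, acc5, b);
        acc6 = _mm256_fmadd_ps(a, acc6, b);
        acc7 = _mm256_fmadd_ps(a, acc7, b);
    }
    clock_t t1 = clock();

    double sec = (double)(t1 - t0) / CLOCKS_PER_SEC;
    /* 8 FMAs per iteration * 8 SP lanes * 2 FLOPs per FMA */
    double gflops = (double)ITERS * 8 * 8 * 2 / sec * 1e-9;

    /* Fold the accumulators into the output so they are not optimized away. */
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    s = _mm256_add_ps(s, _mm256_add_ps(_mm256_add_ps(acc4, acc5), _mm256_add_ps(acc6, acc7)));
    float out[8];
    _mm256_storeu_ps(out, s);
    printf("checksum %f, ~%.1f GFLOPS on one thread\n", out[0], gflops);
    return 0;
}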

Attached are several files with results of performance tests using Intel LINPACK Benchmark for 32-bit and 64-bit systems with Haswell, Ivy Bridge and Pentium 4 CPUs.


All,

    I have not seen any results from MKL on Haswell. I've ported my DGEMM to use FMA3, which Haswell supports. On 1 core, with a fixed frequency @ 3.4 GHz, I am achieving 90% efficiency in DGEMM. On 4 cores, I'm achieving 82% efficiency, or 179 GFLOPS. Those HPL efficiencies, Sergey, are quite poor on HW. For HPL on SB/IB, I think 90% efficiency is a good number (my DGEMM on those archs is 95-96% efficient and you lose 5-7% from the DGEMM efficiency in a well-tuned HPL). Later I may just post the DGEMM test in case any interested parties want to run it. Just thought I'd let others know you can get ~50 GFLOPS on 1 core at 3.4 GHz. I'm not observing that the efficiency scales with multiple cores.. yet. Lastly, these are preliminary numbers.

    I thought I'd also post that I've not been able to get a read bandwidth from the L1 that saturates at 64 B per clk: at 3.4 GHz, I've achieved 58.5 B/clk of read bandwidth. Likewise, my efforts to maximize the copy bandwidth haven't been successful; I've achieved 58.2 B/clk of copy bandwidth. L2 bandwidth is nowhere near 64 B per clk.. around 246 B per clk is what I've achieved for read bandwidth. If you have any results on cache I/O on your hardware, let me know.
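
For anyone who wants to compare, here is a minimal sketch of the usual way such a read-bandwidth number is obtained ( an assumed illustration, not the actual test code ): stream aligned AVX loads over a buffer that fits in the 32 KB L1D and divide the bytes read by the elapsed time.

/* Minimal L1 read-bandwidth sketch. It streams aligned 32-byte AVX loads over
 * a 16 KB buffer ( resident in the 32 KB L1D ) and reports GB/s; divide by the
 * core frequency to get bytes/clock. Four independent accumulators keep the
 * add latency from throttling the loads. Build with an AVX target. */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

#define BUF_FLOATS (16 * 1024 / sizeof(float))   /* 16 KB buffer */
#define REPEATS    2000000                       /* passes over the buffer */

int main(void)
{
    static float buf[BUF_FLOATS] __attribute__((aligned(32)));
    for (size_t i = 0; i < BUF_FLOATS; i++) buf[i] = 1.0f;

    __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
    __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();

    clock_t t0 = clock();
    for (int r = 0; r < REPEATS; r++)
        for (size_t i = 0; i < BUF_FLOATS; i += 32) {
            s0 = _mm256_add_ps(s0, _mm256_load_ps(&buf[i +  0]));
            s1 = _mm256_add_ps(s1, _mm256_load_ps(&buf[i +  8]));
            s2 = _mm256_add_ps(s2, _mm256_load_ps(&buf[i + 16]));
            s3 = _mm256_add_ps(s3, _mm256_load_ps(&buf[i + 24]));
        }
    clock_t t1 = clock();

    double sec    = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double gbytes = (double)REPEATS * BUF_FLOATS * sizeof(float) / 1e9;

    float out[8];   /* keep the loads alive */
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3)));
    printf("checksum %f, read bandwidth ~%.1f GB/s\n", out[0], gbytes / sec);
    return 0;
}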

Perfwise

Ok, I've got my DGEMM at 91-92% efficiency on 1 core. That's ~50 GFLOPS on 1 HW core at 3.4 GHz. 4-core numbers on an 8000-cubed DGEMM are 186.6 GFLOPS, which is a hair over 85% efficiency. Power at idle is ~45 W; when running this code it's 140 W. Interesting. I might be able to get a bit more out of it. Just thought I'd update with what I've observed so far in terms of Haswell high-performance code efficiency.

>>... I have not seen any results from MKL on Haswell. I've ported my dgemm...

I could post performance results of MKL's sgemm and dgemm functions on Ivy Bridge for 4Kx4K, 8Kx8K and 16Kx16K matrices ( in seconds, not in GFLOPS ).

>>... I've got my dgemm at 91-92% efficiency...

What algorithm do you use?

Sergey,

    Building for IB is pointless: it doesn't use FMA3, which you need in order to max out the FLOPS. Also, I focus on DGEMM and then HPL. HPL is limited by DGEMM performance, but on a particular set of problems: if you consider a DGEMM of an MxK matrix upon a KxN matrix, M and N are large and K is small. For the Boeing sparse solver, N can also be small. K is somewhat tunable and is a blocking parameter. I just ran my DGEMM for SB/IB on an 8000 x 128 x 8192 [M x K x N] problem; it achieved 24.3 GFLOPS on 1 core at 3.4 GHz, which is ~90% efficiency for those 2 archs. For an 8000 x 8000 x 8000 problem I get over 100 GFLOPS on 4 IB cores. For HW, running a similar problem I'm getting 45.7 GFLOPS on 1 core at 3.4 GHz, which is 84% efficiency. Running with K=256 I get 46.5 GFLOPS (85.5% efficiency) and with K=384 I get 48.5 GFLOPS (89% efficiency). Asymptotic efficiency is 92.5%, about 3% below that of SB and IB, but it's somewhat expected: Amdahl's law is coming into play, and the overheads of doing this "real" computation are chewing a bit into the efficiency. I think I'll improve it as time goes on, but I just thought I'd throw out what I have measured/achieved, to see if anyone else has some real numbers. On a HW with 4 cores running at 3.4 GHz on a 16000 x 8192 x 8192 problem I just achieved 190.4 GFLOPS, or 87.5% efficiency. I'd expect Intel to do better than my 2 days' worth of tuning on a full DGEMM.

As far as the algorithm, I'd rather not divulge my techniques, but there's lots of documentation on this subject and a long history of a few good people doing it in the past, much fewer in the present. It's my own code and just a hobby, but you or others should try doing it yourself. You'll learn a lot about performance which isn't documented or discussed, and you'll be a better tweaker for it.

Perfwise

>>...Building for IB is pointless.. it doesn't use FMA3...

I don't get it; Ivy Bridge systems will be around for a long time. There is nothing wrong with comparing results for the major lines of Intel microarchitectures even if some microarchitecture doesn't support a given instruction set!

>>...Assymptotic efficiency is 92.5%...

What time complexity do you assume as a base? Theory is very strict here, and there is nothing classified about matrix multiplication algorithms regarding time complexity. For example, here is a table:

Time Complexity for Matrix Multiplication Algorithms:

Virginia Vassilevska Williams O( n^2.3727 )
Coppersmith-Winograd O( n^2.3760 )
Strassen O( n^2.8070 ) <= O( n^log2(7) )
Strassen-Winograd O( n^2.8070 ) <= O( n^log2(7) )
Classic O( n^3.0000 )
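
For reference, the Classic O( n^3.0000 ) entry corresponds to the textbook triple loop below ( shown purely as a baseline illustration, not one of the optimized implementations being compared ):

#include <stdio.h>

/* Textbook O(n^3) matrix multiplication, C = A * B for n x n matrices.
 * This is the baseline the faster algorithms in the table above improve on;
 * it performs 2*n FLOPs per output element, 2*n^3 in total. */
static void matmul_classic(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

int main(void)
{
    const double a[4] = { 1, 2, 3, 4 }, b[4] = { 5, 6, 7, 8 };
    double c[4];
    matmul_classic(2, a, b, c);
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   /* 19 22 43 50 */
    return 0;
}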

The fastest algorithm I've ever used / tested is a Kronecker-based matrix multiplication implemented by an IDZ user in Fortran. Details can be found here: http://www.geosci-model-dev-discuss.net/5/3325/2012/gmdd-5-3325-2012.html

>>...As far as algorithm.. I'd rather not divuldge my techniques but there's lots of documentation on this subject.. and a
>>long history of a few but good people doing it in the past, much fewer in the present. It's my own code and just a hobby...

Just post results for a couple of cases in seconds, since it will be easier to compare. I will post results for an Ivy Bridge system for 4Kx4K, 8Kx8K and 16Kx16K matrices ( in seconds ) using MKL's dgemm, the Kronecker-based and the Strassen HBC matrix multiplication algorithms.

The reason I post results in seconds is that I need to know whether the product of two matrices can be calculated within some limited period of time. Results in GFLOPS are useless in many cases, because I constantly hear questions like: how long does it take to compute the product of two matrices with dimensions NxN?

Note: Strassen HBC stands for Strassen Heap Based Complete, and it is optimized for use in embedded environments.

Sergey,

    From my experience working with people in the industry, I define matrix multiplication FLOPs as in traditional linear algebra, which is 2 * N^3, or to be precise 2 * M * K * N. The other methods you mention entail lower numerical accuracy, greater memory usage or difficulties in implementation; they give rise to fewer FLOPs, but lower IPC and lower performance. Maybe I'll try them someday, but from my experience, and that of the people I've worked with over the past 20 years, I've not found them widely applied. So now you know how I'm measuring the FLOP count, and you know I'm running at 3.4 GHz (btw, I've yet to mention that I only post results with a frozen frequency rather than include those with Turbo Boost), so you can determine the number of clocks or seconds it takes on HW to do a DGEMM computation. I measured the following today:

SIZE^3, 1-core GFLOPs, 1-core TIME(s), 4-core GFLOPs, 4-core TIME(s)
4000, 50.3, 2.54, 172.8, 0.74
8000, 50.4, 20.3, 186.6, 5.49
16000, 51.3, 159.7, 192.7, 42.5

I think it's important to note that square problems are not very useful in DGEMM; you need to focus on the other sizes I mentioned in the previous posts for "practical" solvers.
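
To connect the GFLOPS and TIME columns, here is a tiny helper ( purely illustrative, a hypothetical function rather than the benchmarked code ) that applies the 2 * M * K * N convention described above:

#include <stdio.h>

/* GFLOPS from the traditional linear-algebra FLOP count 2*M*K*N and a
 * measured wall-clock time in seconds. Purely illustrative. */
static double dgemm_gflops(double m, double k, double n, double seconds)
{
    return 2.0 * m * k * n / seconds * 1e-9;
}

int main(void)
{
    /* Cross-check against the 4-core rows of the table above. */
    printf("%.1f GFLOPS\n", dgemm_gflops(8000, 8000, 8000, 5.49));     /* ~186.5 */
    printf("%.1f GFLOPS\n", dgemm_gflops(16000, 16000, 16000, 42.5));  /* ~192.8 */
    return 0;
}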

Perfwise
