Loading...
You are not logged-in Login/Register





  • Posts   Search Threads
  • BradleyKuszmaulSeptember 1, 2009 1:18 PM PDT   
    Performance Results

    Here are some basic performance results on a 1.6GHz Dual-core Pentium Laptop:
    1024x1024x1024 matrix
    MM (3 loops):  30s
    Strassen:         5.5s
    Intel MKL:         0.75s
    MKL 2threads: 0.45s

    MM (3 loops) is Clay's loop code.
    Strassen is Clay's Strassen code.
    Intel MKL: Is the Intel Math Kernel Library running DGEMM, on one core.
    MKL 2threads: Is 2 threads

    The MKL is not using Strassen.  So if absolute performance is what we are interested in, this code needs a lot of work.

    It's been my experience that Strassen is at a disadvantage until the matrices get pretty large.  We may need to measure at least 4096x4096x4096 or larger to see an advantage over a good O(n^3)  implementation using blocked matrixes to minimize cache misses.

    -Bradley



    planetmarshallSeptember 1, 2009 3:40 PM PDT
    Rate
     
    Re: Performance Results

    Quoting - BradleyKuszmaul
    Here are some basic performance results on a 1.6GHz Dual-core Pentium Laptop:
    1024x1024x1024 matrix
    MM (3 loops):  30s
    Strassen:         5.5s
    Intel MKL:         0.75s
    MKL 2threads: 0.45s

    MM (3 loops) is Clay's loop code.
    Strassen is Clay's Strassen code.
    Intel MKL: Is the Intel Math Kernel Library running DGEMM, on one core.
    MKL 2threads: Is 2 threads

    The MKL is not using Strassen.  So if absolute performance is what we are interested in, this code needs a lot of work.

    It's been my experience that Strassen is at a disadvantage until the matrices get pretty large.  We may need to measure at least 4096x4096x4096 or larger to see an advantage over a good O(n^3)  implementation using blocked matrixes to minimize cache misses.

    -Bradley


    What compiler are you using? I got similar poor performance of the standard loop using MSVC 2008, however with the Intel Compiler 11.1 on Suse 11.1 x86 with default optimizations ( no vectorization or threading ) I got the following for n=1024 m=1024 p=1024

    Standard matrix function done in 1.82 secs
    Strassen matrix function done in 2.69 secs

    Machine spec Dell M1330 XPS 4Gb RAM, Core 2 Duo@2.20 Mhz T7500

    I have many more metrics to do, but I think your guesses at the threshold of the effectiveness of Strassen are pessimistic. Research has suggested that a matrix size as small as 16x16 is enough to show an improvement for optimized implementations ( see the paper referred to in the Resources thread ).

    Good luck with your solution,
    Andrew.


    --
    Andrew Marshall
    http://www.planetmarshall.co.uk

    BradleyKuszmaulSeptember 1, 2009 4:08 PM PDT
    Rate
     
    Re: Performance Results

    Quoting - planetmarshall
    Quoting - BradleyKuszmaul
    Here are some basic performance results on a 1.6GHz Dual-core Pentium Laptop:
    1024x1024x1024 matrix
    MM (3 loops):  30s
    Strassen:         5.5s
    Intel MKL:         0.75s
    MKL 2threads: 0.45s


    What compiler are you using? I got similar poor performance of the standard loop using MSVC 2008, however with the Intel Compiler 11.1 on Suse 11.1 x86 with default optimizations ( no vectorization or threading ) I got the following for n=1024 m=1024 p=1024

    Standard matrix function done in 1.82 secs
    Strassen matrix function done in 2.69 secs

    Machine spec Dell M1330 XPS 4Gb RAM, Core 2 Duo@2.20 Mhz T7500

    I have many more metrics to do, but I think your guesses at the threshold of the effectiveness of Strassen are pessimistic. Research has suggested that a matrix size as small as 16x16 is enough to show an improvement for optimized implementations ( see the paper referred to in the Resources thread ).

    Good luck with your solution,
    Andrew.
    I'm using gcc 4.3.2  I guess I'm not surprised that the intel compiler did better.  In particular, I implemented a divide-and-conquer version of the O(n^3) algorithm, and it didn't do much better than the simple 3-nested loop.  I suspect that gcc is doing a bad job at compiling
    C[i][j] += A[i][k]*B[k][j]
    I suspect that gcc would do better on a single-dimensional array with explicit index calculations.  I'll also try the intel compiler, if I get a chance.

    You may be right that I'm pessimistic about Strassen.  We'll see what the answers look like, and it has been a few years since I measured Strassen vs O(n^3).  I don't think that we can draw much conclusion from a 1990 study on a Cray YMP to apply to a 20-year-newer technology (Intel i7).  One big difference is that the Cray machines didn't have cache and their memory pipeline could keep the CPU busy.  Today, however, cache misses are the dominant performance problem.


    iarchitectSeptember 1, 2009 4:35 PM PDT
    Rate
     
    Re: Performance Results

    Quoting - planetmarshall

    What compiler are you using? I got similar poor performance of the standard loop using MSVC 2008, however with the Intel Compiler 11.1 on Suse 11.1 x86 with default optimizations ( no vectorization or threading ) I got the following for n=1024 m=1024 p=1024

    Standard matrix function done in 1.82 secs
    Strassen matrix function done in 2.69 secs

    Machine spec Dell M1330 XPS 4Gb RAM, Core 2 Duo@2.20 Mhz T7500

    I have many more metrics to do, but I think your guesses at the threshold of the effectiveness of Strassen are pessimistic. Research has suggested that a matrix size as small as 16x16 is enough to show an improvement for optimized implementations ( see the paper referred to in the Resources thread ).

    Good luck with your solution,
    Andrew.

    I admit Intel compilers are faster than gcc or VC++. I got similar results as BradleyKuszmaul's on my Core 2 Duo@2 GHz / Ubuntu 8.x / Intel C++ 11.0x

    BTW, Do you mean the tripple for-loop function by "Standard matrix function?" I don't get how standard matrix function is way faster than Strassen when m=n=p=1024, which is large enough to show strassen's advantage, but is not so large to cause memory deficiencies.



    邓辉September 2, 2009 7:36 AM PDT
    Rate
     
    Re: Performance Results

    CPU 1.8GHz Dual-Core  1G Memory

    1024x1024x1024 Matrix
    MM (3 loops):   35.40s
    Strassen:         5.58s
    My Code 2threads: 0.80s

    There are gaps with MKL

    写字楼里写字间,写字间里程序员
    程序人员写程序,又拿程序换酒钱
    酒醒只在网上坐,酒醉还来网下眠
    酒醉酒醒日复日,网上网下年复年

    iarchitectSeptember 10, 2009 10:53 AM PDT
    Rate
     
    Re: Performance Results

    Quoting - 邓辉
    CPU 1.8GHz Dual-Core  1G Memory

    1024x1024x1024 Matrix
    MM (3 loops):   35.40s
    Strassen:         5.58s
    My Code 2threads: 0.80s

    There are gaps with MKL

    Wow that's quite fast. Now I am getting similar result. You might have more improved your code during the last week though.

    CPU: 1 GHz dual-core 2GB Mem.

    1024x1024x1024 Matrix
    2 threads: 1.54 sec

    Yeah, It's a 2GHz machine, but somehow it sticks to 1GHz when running Linux. It's like driving highway with transmission stuck in the first gear. 8-)


    BradleyKuszmaulSeptember 10, 2009 11:44 AM PDT
    Rate
     
    Re: Performance Results

    On a two-socket Nehalem, I'm seeing about 75GFLOPS on the cubic algorithm, I've seen an "effective" 98 GFLOPS on Strassen.  Thats 11.2s for an 8192x8192x8192 matrix multiply.

    I'm seeing about 0.036s for 1024x1024x1024 matrix multiply (that's only 59 GFLOPS effective), and the cubic algorithm is faster (70 GFLOPS).

    I cannot do 16Kx16Kx16K matrices since I have only 12GB of RAM.  So Strassen is barely better for large matrices.

    -Bradley



    rikmorganSeptember 10, 2009 11:47 AM PDT
    Rate
     
    Re: Performance Results

    Quoting - iarchitect

    Wow that's quite fast. Now I am getting similar result. You might have more improved your code during the last week though.

    CPU: 1 GHz dual-core 2GB Mem.

    1024x1024x1024 Matrix
    2 threads: 1.54 sec

    Yeah, It's a 2GHz machine, but somehow it sticks to 1GHz when running Linux. It's like driving highway with transmission stuck in the first gear. 8-)
    I wonder if using only 2 threads will benefit at all from the test environments 4 and 8 core systems. I used more threads than 2 just in the hope that it would scale better on the test machines.  .8 sec and 1.54 sec sure beat my 2.00 sec times for that size array. Can't wait to see how you all did that.  But I might catch you by scaling. :-)


    maroySeptember 11, 2009 1:35 PM PDT
    Rate
     
    Re: Performance Results

    Typical console output of the code running on a Dell T3400 with an Intel Core2 Quad CPU Q6700 @ 2.66 GHz (32 bit Vista)


    Strassen_TBB.exe 1024 1024 1024
    M: 1024, N: 1024, P: 1024

    Execute Standard matmult
    Done in: 17.83 secs

    Execute Parallel Standard matmult.
    Done in: 8.10 secs

    Results OKAY

    Execute Strassen matrix function as supplied by Intel.
    Done in: 1.56
    Results OKAY


    Execute Parallel Strassen matrix function.
    Done in: 0.50
    Results OKAY

    gk4v07September 11, 2009 5:24 PM PDT
    Rate
     
    Re: Performance Results

    4 Cores test output

    Processor (CPU): Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
    Speed: 2,399.92 MHz
    Cores: 4
    Memory Information
    Total memory (RAM): 2.0 GB
    openSUSE 11.1 (x86_64)


    ./strassenomp.run 4096 4096 4096
    Executing Standard matrix multiply
    Standard matrix multiply done in 44.58 secs (equivalent to 3082.99 MFLOPS)
    Executing Strassen matrix multiply

    ===== Detected 4 available threads
    Strassen matrix multiply done in 15.74 secs (equivalent to 8731.11 MFLOPS)

    OKAY




    nickbesSeptember 11, 2009 5:29 PM PDT
    Rate
     
    Re: Performance Results

    Quoting - 邓辉
    CPU 1.8GHz Dual-Core  1G Memory

    1024x1024x1024 Matrix
    MM (3 loops):   35.40s
    Strassen:         5.58s
    My Code 2threads: 0.80s

    There are gaps with MKL

    Very fast!!
    MM              16.5s
    Strassen        2.9s
    My code (only 2 threads)  0.77s


    CPU 2.4 Intel GHz Quad Core
    1024x1024x1024



Forum jump:  

Intel Software Network Forums Statistics

16,365 users have contributed to 46,338 threads and 163,916 posts to date.

In the past 24 hours, we have 29 new thread(s) 146 new posts(s), and 86 new user(s).

In the past 3 days, the most popular thread for everyone has been Formula for the intersection of straight lines The most posts were made to Take a look at John Burkhard&# The post with the most views is \"-check none\" generates error

Please welcome our newest member dozo1971


For more complete information about compiler optimizations, see our Optimization Notice.