  • svyatoslav.korneev, July 6, 2009 4:24 AM PDT
    Cluster 2D FFT very Slow, Why?

    Hello.

    I have a problem.

    The Intel Cluster FFT example (/opt/intel/Compiler/11.0/083/mkl/examples/cdftf) runs very slowly on my cluster, and when I increase the number of processes, the execution time increases instead of decreasing. Below are the execution times for "STATUS = DftiComputeForwardDM(DESC,LOCAL)" on a 512*512 field (first column: MPI rank; second column: execution time in seconds):

    DFTI_FORWARD_DOMAIN = DFTI_COMPLEX
    DFTI_PRECISION = DFTI_DOUBLE
    DFTI_DIMENSION = 2
    DFTI_LENGTHS = (512,512)
    DFTI_FORWARD_SCALE = 1.0
    DFTI_BACKWARD_SCALE = 1.0/(M*N)

    CREATE= 0

    8 processes:

    0 0.2209660
    7 0.2209670
    1 0.2229670
    6 0.2209670
    3 0.2229670
    4 0.2229670
    2 0.2229670
    5 0.2219670

    16 processes:

    0 0.2129680
    3 0.2129680
    1 0.2129680
    6 0.2129680
    4 0.2129680
    5 0.2129670
    2 0.2129680
    7 0.2129670
    13 0.2389640
    9 0.2389640
    15 0.2389640
    11 0.2389630
    12 0.2389640
    14 0.2389630
    8 0.2389630
    10 0.2389640

    32 processes:

    0  0.5439169   
    5  0.5519149   
    1  0.5519161   
    7  0.5519171   
    3  0.5519159   
    4  0.5529160   
    28  0.3739430   
    13  0.5509160   
    18  0.2789580   
    6  0.5019231   
    2  0.5539160   
    9  0.5529160   
    12  0.5499170   
    8  0.5529151   
    15  0.5509162   
    11  0.5509150   
    14  0.5509160   
    10  0.5509150   
    20  0.2789570   
    16  0.2789580   
    21  0.2789570   
    17  0.2789590   
    22  0.2789570   
    19  0.2789580   
    23  0.2789580   
    24  0.3739420   
    27  0.3739440   
    31  0.3739430   
    25  0.3739430   
    29  0.3739440   
    30  0.3739430   
    26  0.3739430

    64 processes:

    30 1.019846
    49 0.3459470
    45 0.3499470
    5 1.026844
    0 0.3339500
    2 1.021845
    6 1.031843
    1 1.024845
    4 1.027844
    3 1.022845
    7 1.024844
    58 0.3379490
    21 1.008847
    13 1.020845
    33 0.6359040
    31 1.023844
    27 1.030843
    29 1.026844
    25 1.027844
    28 1.016845
    24 1.027843
    26 1.031843
    52 0.3439469
    48 0.3469470
    53 0.3429482
    51 0.3449471
    55 0.3409491
    54 0.3419471
    50 0.3459470
    32 1.012846
    38 0.3569450
    37 0.3579450
    36 0.4479311
    35 0.4559300
    39 0.3559461
    34 0.4559309
    44 0.3499467
    41 0.3529470
    40 0.3539469
    46 0.3489470
    47 0.3479462
    43 0.3509469
    42 0.3519461
    59 0.3379490
    57 0.3369482
    62 0.3349490
    63 0.3339500
    61 0.3359480
    60 0.3379490
    56 0.2829571
    10 1.019846
    9 1.027843
    15 1.024845
    14 1.018845
    8 1.020845
    11 1.020844
    12 1.018845
    17 1.013845
    18 1.014845
    22 1.010846
    20 1.015846
    19 1.010846
    23 1.010846
    16 1.016845

    Configuration of one cluster module:

    processor    : 0
    vendor_id    : GenuineIntel
    cpu family    : 6
    model        : 15
    model name    : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
    stepping    : 6
    cpu MHz        : 2333.423
    cache size    : 4096 KB
    physical id    : 0
    siblings    : 2
    core id        : 0
    cpu cores    : 2
    fpu        : yes
    fpu_exception    : yes
    cpuid level    : 10
    wp        : yes
    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    bogomips    : 4670.17
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 36 bits physical, 48 bits virtual
    power management:

    processor    : 1
    vendor_id    : GenuineIntel
    cpu family    : 6
    model        : 15
    model name    : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
    stepping    : 6
    cpu MHz        : 2333.423
    cache size    : 4096 KB
    physical id    : 3
    siblings    : 2
    core id        : 0
    cpu cores    : 2
    fpu        : yes
    fpu_exception    : yes
    cpuid level    : 10
    wp        : yes
    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    bogomips    : 4666.87
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 36 bits physical, 48 bits virtual
    power management:

    processor    : 2
    vendor_id    : GenuineIntel
    cpu family    : 6
    model        : 15
    model name    : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
    stepping    : 6
    cpu MHz        : 2333.423
    cache size    : 4096 KB
    physical id    : 0
    siblings    : 2
    core id        : 1
    cpu cores    : 2
    fpu        : yes
    fpu_exception    : yes
    cpuid level    : 10
    wp        : yes
    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    bogomips    : 4666.79
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 36 bits physical, 48 bits virtual
    power management:

    processor    : 3
    vendor_id    : GenuineIntel
    cpu family    : 6
    model        : 15
    model name    : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
    stepping    : 6
    cpu MHz        : 2333.423
    cache size    : 4096 KB
    physical id    : 3
    siblings    : 2
    core id        : 1
    cpu cores    : 2
    fpu        : yes
    fpu_exception    : yes
    cpuid level    : 10
    wp        : yes
    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    bogomips    : 4666.78
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 36 bits physical, 48 bits virtual
    power management:

    The cluster has 990 such modules.

    The cluster starts one process per core.

    I built the example with:

    make libem64t mpi=mpich interface=ilp64


    Please help me understand why it is so slow.

    Svyatoslav



    Vladimir Petrov (Intel), July 6, 2009 8:17 AM PDT
    Re: Cluster 2D FFT very Slow, Why?

    Svyatoslav,

    First of all, looking at your data, one can conclude that your cluster seems to have some problems: for 64 processes, the times differ by a factor of 3!
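
    The spread can be checked directly from the posted numbers. The sketch below is only an illustration: it copies a handful of ranks from the 64-process run above (not the full list) and compares the fastest and slowest of them. The dictionary name is made up for this example.

```python
# Per-rank times in seconds, copied from the 64-process run in the question.
# Only a few representative ranks are listed, not all 64.
times_64 = {
    56: 0.2829571,
    0: 0.3339500,
    36: 0.4479311,
    33: 0.6359040,
    30: 1.019846,
    6: 1.031843,
}

fastest = min(times_64.values())
slowest = max(times_64.values())
spread = slowest / fastest

# The job is only finished when the slowest rank is finished,
# so the slowest rank sets the effective run time.
print(f"fastest {fastest:.3f} s, slowest {slowest:.3f} s, ratio {spread:.2f}")
```

    For these ranks the ratio comes out a little above 3, which matches the factor mentioned above.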

    Second, if the problem size is rather small and stays fixed as the number of processes grows, the computation time will increase with the number of nodes. Each process then sends smaller pieces of data to the others, so communication latency makes up a growing share of the run time.
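
    A rough back-of-the-envelope calculation shows how quickly the pieces shrink. This is a sketch under an assumption that is not stated in this thread: the data is split evenly across the processes and the transform needs one all-to-all exchange in which every process sends an equal share of its local data to every process. The function name is hypothetical; 16 bytes per element corresponds to double-precision complex data.

```python
def alltoall_message_bytes(n_rows, n_cols, n_procs, bytes_per_element=16):
    """Return (total, per-process, per-peer) data sizes in bytes.

    Assumes an even split across processes and one all-to-all exchange
    where each process sends an equal share to every process.
    """
    total = n_rows * n_cols * bytes_per_element
    local = total // n_procs      # data held by one process
    per_peer = local // n_procs   # piece sent to each peer
    return total, local, per_peer


for p in (8, 16, 32, 64):
    total, local, per_peer = alltoall_message_bytes(512, 512, p)
    print(f"{p:3d} processes: {local:7d} bytes local, {per_peer:6d} bytes per message")
```

    Under these assumptions the whole 512*512 field is only 4 MiB, and going from 8 to 64 processes shrinks each message from 64 KiB to 1 KiB, so per-message latency dominates the transfer time.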

    To use the full computing power of your cluster, you need to give it a big enough transform. In general, the best performance (in terms of gigaflops) is achieved for transforms that use all the memory available on each node. Keep in mind, however, that because additional buffers are allocated, the local part of the data being transformed can occupy only about 25% of the local memory.
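
    As a rough guide to what "big enough" means, the sketch below estimates the largest square transform that keeps the local data within about 25% of the memory available to each process. The memory figure (2 GiB per process) is a made-up value for illustration, since the thread does not say how much memory each module has, and the function name is hypothetical.

```python
import math


def largest_square_size(mem_per_process_bytes, n_procs,
                        fraction=0.25, bytes_per_element=16):
    """Largest N such that an N x N double-complex array, split evenly
    over n_procs processes, uses at most `fraction` of each process's memory."""
    local_budget = int(mem_per_process_bytes * fraction)
    total_elements = local_budget * n_procs // bytes_per_element
    return math.isqrt(total_elements)


# Hypothetical example: 64 processes with 2 GiB of memory each.
n = largest_square_size(2 * 1024**3, 64)
print(f"largest square transform: {n} x {n}")
```

    With those assumed numbers the result is a transform tens of thousands of points on a side, far larger than the 512*512 field used in the timings above.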

    Best regards,
    Vladimir

