  • braithr, October 26, 2010 12:20 PM PDT
    Characterizing Manycore Memory Access

    (cross-posted from my blog post on the Intel Software Blog):

    Memory access characteristics in manycore NUMA systems are not always obvious to the programmer. A process may see widely varying latency and bandwidth for memory accesses, depending on which CPU it is running on and which memory node holds the data.

    My initial results show that the Intel MTL machines exhibit nearly constant memory access latency, with bandwidth varying by up to 2 GB/s, compared to another architecture whose latencies vary by up to 200 ns and whose memory bandwidth varies by up to 4 GB/s. This speaks well of the Intel design, as latency-sensitive applications may not notice the effects of the NUMA architecture when running on MTL machines. For bandwidth-sensitive applications, however, NUMA still presents a significant programming design challenge regardless of the architecture.

    Further tests are underway to determine how individual cores' memory access might vary, followed by higher-level application benchmarking in order to better understand the effects of manycore NUMA designs on high-performance computing applications.

    

    Jeff Gallagher (Intel), October 26, 2010 12:50 PM PDT

    Nice posting, braithr!  I and MANY OTHERS are eager to hear the results of your further tests!

    Cheers,
    jdg



    jimdempseyatthecove, October 26, 2010 3:01 PM PDT

    Braithr,

    Welcome to the ISN forums.

    When running your latency tests, it is important to know the physical and BIOS setup of the system under test. The MTL log-on and batch systems are configured differently. Also, the particular version of Linux and libnuma may affect your latency tests. What is unknown from an MTL user's viewpoint is:

    Is/are the systems configured to prefetch the next cache line?
    Is/are all memory sockets populated?
    Is/are all RAM sticks the same, and with or without ECC?
    Is/are the motherboards the same?
    Is/are the miscellaneous BIOS settings the same?
      (the log-on system has HT enabled, the batch systems do not)
    Is the Linux system configured with a first-touch policy?
    If so, does your app touch all of its allocations from a thread on the node of interest prior to running the test?
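
    Some of the questions above can be partly answered from user space. The following is a minimal sketch, assuming a Linux system that exposes its NUMA layout under /sys/devices/system/node; the helper names (parse_cpulist, numa_topology, pin_to_node) are hypothetical and not part of libnuma or any MTL tooling. It lists which CPUs belong to which memory node and pins the current process to one node's CPUs before a test is run.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs, or {} if the directory is absent."""
    topology = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return topology
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            topology[node_id] = parse_cpulist(cpulist.read_text())
    return topology


def pin_to_node(node_id, topology):
    """Restrict the calling process to the CPUs of one node (Linux only)."""
    os.sched_setaffinity(0, topology[node_id])


if __name__ == "__main__":
    topo = numa_topology()
    for node_id, cpus in sorted(topo.items()):
        print(f"node {node_id}: CPUs {cpus}")
```

    Note that pinning only controls where the process runs, not where its pages live. Under the default Linux local-allocation (first-touch) behavior, pages touched after pinning will normally be placed on the local node, which is why the order of pinning and touching matters for a latency test.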

    A narrow gap in the latency and bandwidth may be indicative of the configuration working against your programming intention.

    One of my colleagues provided me with access to two dual-processor Xeon 5570 systems, virtually identical except for the version of Linux. My test app was measuring throughput. To our surprise, there was a significant difference in throughput between these two "identical" systems. Unfortunately, we only had remote access to these systems, so we couldn't inspect them as closely as we wanted to.

    It might be helpful for your purposes if the MTL would document the configurations in detail, and update this document as changes are made to the system.

    BTW, the MTL systems are quite impressive.

    Jim Dempsey


    Blog: The Parallel Void
    www.quickthreadprogramming.com

    dinwal, October 28, 2010 5:26 PM PDT

    That is an interesting observation. I would like to know what kind of application was used to observe this behavior. I am assuming it was a memory-intensive application such as matrix transpose or multiplication of large matrices, but a confirmation would help. I would also like to know the maximum memory bandwidth achieved. Also, if you ran benchmarks such as STREAM, comparing the maximum bandwidth reported by the benchmark with that of your application would be informative. If you did not do the benchmarking, I would like to run STREAM and possibly a few other tests, if you can post the maximum bandwidth achieved by your application along with its memory requirements. Thank you for sharing your observation with us.
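
    For a quick sanity check before running the real STREAM benchmark, a rough copy-bandwidth estimate can be sketched as below. This is not STREAM itself: the function name copy_bandwidth_gbs and its default sizes are assumptions for illustration, and the number it reports is only a single-threaded lower bound that includes interpreter overhead. Like STREAM's Copy kernel, it counts both the bytes read and the bytes written.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate copy bandwidth in GB/s using the best of several timed copies."""
    # Fill the source with non-zero data so its pages are really backed by memory.
    src = bytearray(b"\x01") * size_bytes
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-size slice assignment copies the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    # Each copy reads size_bytes and writes size_bytes.
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"approximate copy bandwidth: {copy_bandwidth_gbs():.2f} GB/s")
```

    Taking the best of several repeats hides the cost of the first pass, in which the destination pages are touched for the first time. The buffer should be much larger than the last-level cache, otherwise the result reflects cache rather than memory bandwidth.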


    Dinesh Agarwal

    jimdempseyatthecove, November 5, 2010 12:39 PM PDT

    dinwal,

    Not having the two supposedly identical systems at my disposal, I cannot determine why the performance was measurably different.

    I suspect, but cannot confirm, that the Linux memory manager on one system enforced a first-touch policy, while on the other, placement followed whichever thread performed the allocation.

    There are performance problems with using a "first touch" policy as opposed to placement by the allocating thread:

    1) Forces the allocating thread to go out and touch the memory prior to (say) passing the memory block pointer over to an I/O thread to read-in the data.

    2) Experiences page faults for each page as you walk the memory pages after the allocation but prior to first use by work in your application.

    3) May require two additional thread context switches to ensure the buffer is properly "first touched".

    ... and a few others.

    "first touch" has its place, but it is not always the best choice.
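
    The page-fault cost in point 2 can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not a rigorous benchmark: the function names are hypothetical, the 64 MB default size is arbitrary, and Python's loop overhead inflates both timings. It maps anonymous memory, writes one byte per page (the first touch), and then repeats the same walk over pages that are already backed.

```python
import mmap
import time


def touch_pages(buf, page_size=mmap.PAGESIZE):
    """Write one byte in every page of buf; return the number of pages touched."""
    count = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1
        count += 1
    return count


def first_touch_timing(size_bytes=64 * 1024 * 1024):
    """Return (pages, first_pass_seconds, second_pass_seconds) for an anonymous mapping."""
    buf = mmap.mmap(-1, size_bytes)  # anonymous mapping, pages not yet touched
    try:
        start = time.perf_counter()
        pages = touch_pages(buf)  # first touch: may fault on each page
        first = time.perf_counter() - start

        start = time.perf_counter()
        touch_pages(buf)  # pages are already mapped
        second = time.perf_counter() - start
    finally:
        buf.close()
    return pages, first, second


if __name__ == "__main__":
    pages, first, second = first_touch_timing()
    print(f"{pages} pages: first pass {first:.4f} s, second pass {second:.4f} s")
```

    On a first-touch system, the thread that runs the first pass also determines which node each page lands on, so the walk would have to be done by the thread (or on the node) that will later use the buffer.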

    Jim Dempsey



    Blog: The Parallel Void
    www.quickthreadprogramming.com

    magicfoot, November 7, 2010 10:09 AM PST

    Jim,

    With a system of any two Nehalem processors coupled via QPI, how are large sections of memory allocated in the first place? Surely multiple users or jobs running on the pair will have an effect on memory location and produce different results for different jobs. How are large amounts of memory distributed between the two chips?

    I find that for jobs that use large amounts of memory, the processing time actually increases when I use more than 16 threads on the MTL config.

    Does the assumption that this increase is due to the larger time penalty of remote memory fetches for some threads sound correct? I.e., if all the fetches were only from local memory, the jobs should not run longer. This phenomenon does not occur for small memory sizes (< 100Mb).

    I would upload a JPG with a graph, but I cannot figure out how to do so directly on this page.

    Regards, magicfoot




    dinwal, November 7, 2010 8:55 PM PST

    Dear MagicFoot,

    According to Mike in this post (http://software.intel.com/en-us/forums/showthread.php?t=77872&p=1#133744), the memory is fully populated. I am not sure whether the memory access is interleaved or NUMA, but in my experience the kind of gap in memory performance that we are talking about here is not due to that. One does not explicitly control the placement of data in physical memory, but the OS usually does a good job.

    To upload a file, you will first have to upload the picture to an image server and get a URL, which you can then use here. I guess that is the only way to do it. BTW, the memory modules are 4 GB each, so I wonder what you refer to as local memory. Is there any particular reason for the number 100Mb? Thanks.

     

    Regards,

    Dinesh Agarwal




    magicfoot, November 9, 2010 10:25 AM PST

    Hi Dinesh,

    I would like to know how memory management is handled with QPI for applications with very large memory requirements. Assuming the architecture diagram included here reflects what is at the MTL, how does the system handle memory allocation? Should I even care?

    The second diagram, a graph, indicates that the processes I am running with 32 threads take longer to finish than processes running with only 16 threads. This goes back to my initial question about remote memory access causing this (see the architecture diagram).

    It may be of interest that the processing time increase past the Amdahl minimum also happens with OpenMP on Sun SMP platforms (and others). It does not happen with MPI, though.

    I am afraid that I will be analysing each OpenMP thread next to see what is happening, and that seems like hard work. Does anyone want to guess why the processing time increases for the larger number of threads, as shown in the lower diagram?
    [Figure: Architecture at MTL]

    [Figure: FDTD processing time vs threads]

    dinwal, November 10, 2010 11:26 AM PST

    Hi,
    The communication diagram is fine, but I guess that on the MTL, access to remote data (in memory attached to another processor's channels) happens through the QPI link. However, I am still of the opinion that no matter what kind of data-parallel application you have, throwing more cores at it will show some benefit.
    It looks like excessive synchronization is killing the performance with 32 threads. I would suggest using a performance analysis tool such as VTune to see which part of your program is the bottleneck. My gut feeling is still that synchronization is the overhead.
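
    One way to see how synchronization could produce a run time that rises again past some thread count is a toy model: Amdahl's law plus an overhead term that grows with the number of threads. The linear overhead term and all the numbers below are assumptions chosen for illustration (they are picked so that the minimum falls at 16 threads); they are not fitted to the FDTD measurements in this thread.

```python
import math


def modeled_time(threads, serial, parallel, sync_cost_per_thread):
    """Amdahl-style run time with an added overhead that grows linearly with threads."""
    return serial + parallel / threads + sync_cost_per_thread * threads


def best_thread_count(parallel, sync_cost_per_thread):
    """Thread count that minimizes modeled_time (where the derivative is zero)."""
    return math.sqrt(parallel / sync_cost_per_thread)


if __name__ == "__main__":
    # Illustrative values only: 1 s serial, 64 s parallel work, 0.25 s overhead per thread.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, modeled_time(n, serial=1.0, parallel=64.0, sync_cost_per_thread=0.25))
    print("model minimum at", best_thread_count(64.0, 0.25), "threads")
```

    In this model, the run time falls until the per-thread overhead outweighs the shrinking share of parallel work, and then climbs. A profiler would still be needed to tell whether the real overhead is synchronization, remote memory traffic, or something else.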
    Regards,


    Dinesh Agarwal
