Characterizing Manycore Memory Access

(cross-posted from my blog post on the Intel Software Blog):

Memory access characteristics in manycore NUMA systems are not always obvious to the programmer. A process may see widely varying latency and bandwidth for memory accesses depending on which CPU the process is running on and which memory node holds the data.
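(For anyone who wants to reproduce this kind of measurement, the sketch below shows one way to do it with libnuma: pin execution to one node, allocate the test buffer on each node in turn, and time a dependent pointer chase. This is an illustrative sketch rather than the exact harness behind the numbers below; the node numbers and buffer size are placeholders.)

/* latency.c - rough per-node latency probe (sketch, not a rigorous benchmark)
   build: gcc -std=gnu99 -O2 latency.c -lnuma -o latency */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(size_t))   /* 64 MB buffer, larger than any cache */

static double chase(size_t *buf, size_t steps)
{
    struct timespec t0, t1;
    size_t i = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = buf[i];                              /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (i == (size_t)-1) printf("impossible\n"); /* keep 'i' live so the loop isn't removed */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    numa_run_on_node(0);                         /* execute on node 0 */

    for (int node = 0; node <= numa_max_node(); node++) {
        size_t *buf = numa_alloc_onnode(N * sizeof(size_t), node);
        if (!buf) continue;
        /* build a random single-cycle permutation (Sattolo) so the chase defeats
           caching and hardware prefetch; rand() is adequate for this purpose */
        srand(1);
        for (size_t i = 0; i < N; i++) buf[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
        }
        printf("node %d: ~%.1f ns per dependent load\n", node, chase(buf, 10 * 1000 * 1000));
        numa_free(buf, N * sizeof(size_t));
    }
    return 0;
}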

My initial results show that the Intel MTL machines exhibit nearly constant memory access latency, with bandwidth varying by up to 2 GB/s, compared to another architecture which exhibits latencies varying by up to 200 ns and memory bandwidth varying by up to 4 GB/s. This speaks well of the Intel design: latency-sensitive applications may not notice the effects of the NUMA architecture when running on MTL machines. For bandwidth-sensitive applications, however, NUMA still presents a significant programming design challenge regardless of the architecture.

Further tests are underway to determine how individual cores' memory access might vary, followed by higher-level application benchmarking in order to better understand the effects of manycore NUMA designs on high-performance computing applications.

jeffrey-gallagher (Intel):

Nice posting, braithr! I and MANY OTHERS are eager to hear the results of your further tests!

Cheers,
jdg

jimdempseyatthecove:

Braithr,

Welcome to the ISN forums.

When running your latency tests it is important to know the physical and BIOS setup of the system under test. The MTL log-on and batch systems are configured differently. Also, the particular version of Linux and libnuma may affect your latency tests. What is unknown from an MTL user's viewpoint is:

Is/are the systems configured to prefetch the next cache line?
Is/are all memory sockets populated?
Are all the RAM sticks the same, and do they have ECC or not?
Is/are the motherboards the same?
Are the miscellaneous BIOS settings the same?
(the log-on system has HT enabled, the batch systems do not)
Is the Linux system configured with a first-touch policy?
If so, does your app touch all of its allocations from the thread on the node of interest prior to running the test? (A verification sketch follows this list.)
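One way to answer that last question from inside the program (a sketch, assuming Linux with libnuma and the numaif.h move_pages() interface; the buffer size and node number are placeholders) is to first-touch the buffer from the pinned thread and then ask the kernel which node each page actually landed on:

/* where_are_my_pages.c - verify first-touch placement with move_pages() (sketch)
   build: gcc -std=gnu99 -O2 where_are_my_pages.c -lnuma */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t bytes = 64 * 1024 * 1024;             /* 64 MB example buffer */
    size_t npages = bytes / page;

    numa_run_on_node(0);                         /* the "thread on the node of interest" */
    char *buf = malloc(bytes);
    memset(buf, 0, bytes);                       /* the first touch happens here */

    /* nodes == NULL means "do not move anything, just report where each page is" */
    void **addrs = malloc(npages * sizeof(void *));
    int *status = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++)
        addrs[i] = buf + i * page;
    if (move_pages(0, npages, addrs, NULL, status, 0) != 0) { perror("move_pages"); return 1; }

    long counts[64] = {0};                       /* more than enough nodes for this sketch */
    for (size_t i = 0; i < npages; i++)
        if (status[i] >= 0 && status[i] < 64) counts[status[i]]++;
    for (int n = 0; n <= numa_max_node(); n++)
        printf("node %d: %ld pages\n", n, counts[n]);
    return 0;
}

If the pages are not where you expect them, the test is measuring the configuration rather than the hardware.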

A narrow gap in the latency and bandwidth may be indicative of the configuration working against your programming intention.

One of my colleagues provided me with access to two dual-processor Xeon 5570 systems, virtually identical except for the version of Linux. My test app was measuring throughput. To our surprise, there was a significant difference in throughput between these two "identical" systems. Unfortunately we only had remote access to them, so we couldn't inspect the systems as closely as we wanted to.

It might be helpful for your purposes if the MTL would document the configurations in detail, and update this document as changes are made to the system.

BTW, the MTL systems are quite impressive.

Jim Dempsey

www.quickthreadprogramming.com
dinwal:

That is an interesting observation. I would like to know what kind of application was used to observe this behavior. I am assuming it was a memory-intensive application such as a matrix transpose or multiplication of large matrices, but a confirmation would help. I would also like to know the maximum memory bandwidth achieved. Also, if you ran a benchmark such as STREAM, comparing the maximum bandwidth reported by the benchmark with your application would be informative. If you did not do the benchmarking, I would like to run STREAM and possibly a few other tests if you can post the maximum bandwidth achieved by your application along with its memory requirements. Thank you for sharing your observation with us.
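For reference, the core of what STREAM measures can be sketched in a few lines of C with OpenMP; this is a simplified triad for illustration, not the official benchmark, and the array size is a placeholder:

/* triad.c - simplified STREAM-like triad for a rough bandwidth number (sketch)
   build: gcc -std=gnu99 -O2 -fopenmp triad.c -o triad */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (80 * 1000 * 1000)                     /* ~1.9 GB total across the three arrays */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* initialize in parallel so first touch spreads the pages across the nodes */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 0.0;
    for (int rep = 0; rep < 10; rep++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];            /* triad: two loads and one store per element */
        t = omp_get_wtime() - t;
        double gbs = 3.0 * N * sizeof(double) / t / 1e9;
        if (gbs > best) best = gbs;
    }
    printf("best triad bandwidth: %.1f GB/s (a[0]=%f)\n", best, a[0]);
    free(a); free(b); free(c);
    return 0;
}

Comparing a number like this against what the application itself achieves usually tells you whether the application is bandwidth-bound.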

jimdempseyatthecove:

dinwal,

Not having the two supposedly identical systems at my disposal, I cannot make a determination as to why the performance was measurably different.

I suspect, but cannot confirm, that the Linux memory manager enforced a first-touch policy on one system as opposed to a which-thread-allocates policy on the other.

There are performance problems related to using a "first touch" policy as opposed to which-thread-allocates:

1) It forces the allocating thread to go out and touch the memory prior to (say) passing the memory block pointer over to an I/O thread to read in the data.

2) You take a page fault for each page as you walk the memory pages after the allocation but prior to first use by real work in your application.

3) It may require two additional thread context switches to ensure the buffer is properly "first touched".

... and a few others.

"first touch" has its place, but it is not always the best choice.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

With a system of two Nehalem processors coupled via QPI, how are large sections of memory allocated in the first place? Surely multiple users or jobs running on the pair will have an effect on memory location and produce different results for different jobs. How are large amounts of memory distributed between the two chips?

I find that for jobs that use large amounts of memory, the processing time actually increases when I use more than 16 threads on the MTL config.

Does the assumption that this increase is owing to the larger time penalty for remote memory fetches for some local threads sound correct? I.e., if all the fetches were only from local memory, the jobs should not run longer. This phenomenon does not occur for small memory sizes (< 100 MB).

I would upload a JPG with a graphic but cannot figure out how to do so directly on this page.

Regards, magicfoot

dinwal:

Quoting magicfoot (the post above):


Dear MagicFoot,

According to Mike in this post (http://software.intel.com/en-us/forums/showthread.php?t=77872&p=1#133744) the memory is fully populated. I am not sure whether the memory access is interleaved or NUMA, but in my experience the kind of gap in memory performance we are talking about here is not due to that. One does not normally control the placement of data in physical memory explicitly, but the OS usually does a good job.
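If it helps, libnuma can at least report how the machine's memory is split across the nodes and how much is free on each, which answers the "how is it distributed between the two chips" part of the question. A small sketch, assuming libnuma is installed on the MTL nodes:

/* nodes.c - print per-node memory totals with libnuma (sketch)
   build: gcc -std=gnu99 -O2 nodes.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) { printf("no NUMA support\n"); return 1; }
    for (int n = 0; n <= numa_max_node(); n++) {
        long long freemem = 0;
        long long total = numa_node_size64(n, &freemem);   /* bytes on this node */
        printf("node %d: %lld MB total, %lld MB free\n", n, total >> 20, freemem >> 20);
    }
    return 0;
}

The numactl --hardware command prints much the same information without writing any code.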

To upload a file you will have to first upload that pic to an image server and get a URL, which you can then use here. I guess that is the only way to do it. BTW, the memory modules are 4 GB each; I wonder what you refer to as local memory. Is there any particular reason for that number, 100 MB? Thanks.

Regards,

Dinesh Agarwal

Hi Dinesh,

I would like to know how the memory management is handled with QPI for applications with very large memory requirements. Looking at the architecture diagram included here (and I take it that it reflects what is at the MTL), how does the system handle the memory allocation? Should I even care?

The second diagram, a graph, indicates that the processes I am running with 32 threads take more time to finish than processes running with only 16 threads. This goes back to my initial question about remote memory access causing this (see the architecture diagram).

It may be of interest that the processing-time increase past the Amdahl minimum happens with OpenMP on Sun SMP platforms too (and others). It does not happen with MPI, though.

I am afraid that I will be analysing each OpenMP thread next to see what is happening, and that seems like hard work. Does anyone want to guess why the processing time increases for the larger number of threads, as shown in the lower diagram?
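As a rough illustration of the shape I am seeing, a toy model with a serial part, a parallel part divided by the thread count, and a small per-thread cost (synchronization, remote fetches) produces exactly this kind of minimum. The numbers below are invented for illustration, not data from my runs:

/* toy_scaling.c - toy model: T(n) = serial + parallel/n + overhead*n (invented numbers)
   build: gcc -std=gnu99 -O2 toy_scaling.c */
#include <stdio.h>

int main(void)
{
    double serial   = 2.0;    /* seconds that never parallelize     */
    double parallel = 64.0;   /* seconds of perfectly parallel work */
    double overhead = 0.15;   /* seconds added per extra thread     */

    for (int n = 1; n <= 64; n *= 2) {
        double t = serial + parallel / n + overhead * n;
        printf("%2d threads: %6.2f s\n", n, t);
    }
    return 0;
}

With these made-up numbers the minimum sits at 16 threads and the time rises again at 32, which is the behaviour I am describing.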

dinwal:

Hi,

The communication diagram is fine, but I guess for the MTL the access of remote data (from memory connected to another processor's channel) happens through the QPI link. However, I am still of the opinion that no matter what kind of data-parallel application you have, throwing more cores at it will show some benefit. It looks like there is excessive synchronization that is killing the performance for 32 threads. I would suggest trying a performance-analysis tool such as VTune to see what part of your program is the bottleneck. My gut feeling is still that synchronization is the overhead.

Regards,
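P.S. For example (a generic sketch, not your code): an accumulation guarded by a critical section serializes all 32 threads on a lock, whereas a reduction lets them run independently and combines the partial results only at the end:

/* sync_overhead.c - per-iteration critical section vs. a reduction (sketch)
   build: gcc -std=gnu99 -O2 -fopenmp sync_overhead.c */
#include <omp.h>
#include <stdio.h>

#define N (10 * 1000 * 1000)

int main(void)
{
    double sum = 0.0, t;

    /* heavy synchronization: every iteration takes a lock */
    t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        #pragma omp critical
        sum += 1.0 / (i + 1.0);
    }
    printf("critical:  %.2f s (sum=%f)\n", omp_get_wtime() - t, sum);

    /* same answer (up to rounding), almost no synchronization */
    sum = 0.0;
    t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += 1.0 / (i + 1.0);
    printf("reduction: %.2f s (sum=%f)\n", omp_get_wtime() - t, sum);
    return 0;
}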
