Anomalous performance on batch nodes

Hello,

I've been running experiments on MTL over the last week, and I'm seeing some strange results. I ran similar experiments on MTL last summer and was able to obtain consistent performance measurements for operations on four concurrent data structures. This time, taking performance measurements on six concurrent data structures, I'm not having much luck obtaining good results (i.e., small standard deviation, few anomalies). I'm starting to wonder if something is happening systemically.

My experiments were split into batch jobs by the fixed number of cores each algorithm could run on (i.e., a 1-core job, a 4-core job, then 8, 12, ..., up to a 32-core job). I've run my entire experimental suite three times. The first time, the performance of each algorithm was good, scaling well except at 20, 24 and 32 cores, where it plummeted to around the level of single-core performance (see Figure 1 through Figure 3).

The second time, algorithms scaled well except at 16, 28 and 32 cores, where performance again plummeted to around single-core performance (see Figure 4).

Since the performance of the algorithms dropped to this uniformly poor level, the standard deviation of the runs associated with those "bad" data points was very small, so it gave me little numeric information to indicate which data were reliable. I attempted to address this by splitting the repeated trials for each experiment into two groups that were run at different times (so that, if one group were affected by this strange problem, it would be more likely to skew the standard deviation). I also split up the jobs (which were originally about 4 hours each) into many smaller jobs (each less than 1 minute) and re-submitted. This resulted in some well-behaved data (see Figure 5) and many graphs displaying inconsistent data (see Figure 6).

For reference, when I run the same experimental suite on a 16-core Sun machine, and on a 4-core Intel Q9450, I get very consistent performance (albeit with a different ranking of algorithms on the Sun machine).

Any thoughts? Thank you for your time and attention.
Trevor

Figure 1: Part of experimental suite 1, showing anomaly at 20 cores.

Figure 2: Part of experimental suite 1, showing anomaly at 24 cores.

Figure 3: Part of experimental suite 1, showing anomaly at 32 cores.

Figure 4: Part of experimental suite 2, showing anomalies at 16, 28 and 32 cores.

Figure 5: Part of experimental suite 3, showing good results.

Figure 6: Part of experimental suite 3, showing inconsistent results.

Are you using the qsub commands that request both the node and the full number of cores on the machine? If you are not, the default is to request 1/1, and someone who is requesting, say, 31 cores could be scheduled on the machine with you.

Apologies for not passing along the exact command-line switches -- my account just re-re-expired and I don't seem to have my submission scripts handy!

The MTL batch process is configured as exclusive, i.e., once a job is submitted to or running on a specific batch node, no other job can run on the same machine until the first job finishes.

This is really strange; we have not changed the physical architecture of the batch nodes since they were initially installed. The only changes that were made were to update the kernel (to 2.6.18-194.11.4.el5) and to configure the BIOS to fully support NUMA mode.

Even though the login node is shared, can you try running, say, a smaller data set (against all the cores) on the login node to see if you get similar results? Note that the login node has a running renicer.

Thanks for the clarification, Mike! I didn't know that.

The only other thing I can think of, given that most of the weirdness occurs at 16+ cores, is that you may be running into random NUMA effects. I believe the default NUMA policy on the machine right now is for pages to be allocated on one of the RAM risers associated with the processor that the allocating thread is executing on, and the scheduling policy attempts to keep threads from migrating. Does that work for your workload? Or are there some interleavings where all of the data structures could end up allocated local to a single processor and then accessed from all of them? The processors are fully connected via QPI (25.6 GB/s), but each riser only has a dual-channel DDR3 interface (at 17.1 GB/s), so if you have all of the threads on the other processors running full-out against a single processor's RAM, you will definitely run into trouble (we have a libnuma-aware version of the STREAM benchmark that we use to test and measure weird memory configurations such as this).
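
For illustration (assuming numactl is available on the node), that worst case can be reproduced deliberately by pinning the threads to one set of nodes while forcing every allocation onto another, for example:

numactl --cpunodebind=1,2,3 --membind=0 <your benchmark command>

If throughput collapses under that configuration but not under interleaving, remote-memory bandwidth is the likely suspect.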

If the memory is supposed to be evenly distributed between the nodes, you can skip libnuma and just do the following in your script to force round-robin memory page allocation:
numactl --interleave=all
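
For example, prefix whatever command your job script launches (the script name here is just a placeholder):

numactl --interleave=all ./run_experiments.sh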

If the memory is supposed to be local, you can also use libnuma and thread pinning to control allocation. This strategy is what we use in our runtime (Manticore, a parallel dialect of ML) and what works best for our performance, though we are mostly functional and pretty heavily tuned to keep accesses thread-local.

Hope this helps, and good luck with your debugging!

Thank you for your feedback. Sorry it took me so long to get back to this; I've been away for the last few days.

As was suggested, I ran some trials on the login node, but I was unable to replicate the issue. Unfortunately, I experience the issue roughly every 8 hours of experiments (not to say that it is periodic, but rather that it is somewhat rare), so it's hard to duplicate with small jobs.

I gave the numactl options a try (especially interleave=all), and I haven't seen the bizarre dips since then, but the instability (huge standard deviation like Figure 6 in my original post) is still a problem.

I ran numastat before and after running some tests, and took the difference, to see what the mem-access situation was like, and my output was:

                  node0    node1    node2    node3
numa_hit         271828   255194   254620   254876
numa_miss             0        0        0        0
numa_foreign          0        0        0        0
interleave_hit   254261   254180   253998   254226
local_node        77582    69608    65807    59180
other_node       194246   185586   188813   195696

It looks like memory is evenly distributed and accessed, with a bit of a bias on node0. (I believe the testing harness is running on this node, so some extra overhead makes sense.)

However, if, for instance, I take ConcurrentSkipListMap, a data structure in the standard Java class library (java.util.concurrent, roughly the equivalent of the C++ STL), and do random add()+remove() calls (50% each) for three seconds at a time, and take measurements, I get very specific performance anomalies that lead to the bad standard deviation.

If it were spotty bad performance, I wouldn't be quite as surprised, but it's spikes of unusually good performance. Each trial starts from the same seed, so the only explanation I can muster is that the interleaving of operations results in a fortuitous execution, but why then would we have such specific spikes (essentially two tiers of performance)? (Pay no heed to the first few trials; their poor performance is an artifact of the well-known "warm-up" period in which the Java VM's HotSpot compiler performs heavy run-time compilation optimizations.)

FYI the measured size of the data structure at the end of each trial varies by less than 1% over all trials, so it's not the case that the structure is smaller or larger in some trials. I limit the number of keys that can be used to the integers [1,10^6] and the structure fills up to 50% capacity (as expected from 50% add(), 50% remove()) fairly quickly. Every trial ends with size within 0.5% of 500,000 keys.

(For those who care, the methodology was: Each trial begins with the creation of a new instance of ConcurrentSkipListMap and the spawning of 128 threads that will each add and remove key/value pairs (50% probability of add, 50% probability of remove). Each thread is given a random seed that is generated by a standard RNG. The RNG that feeds all of these threads is given the same random seed at the beginning of each trial. The hope is that each trial should be reasonably close in behaviour to the previous one ([only] the interleaving of threads will alter the execution). Garbage collection is then manually triggered to clean up after the previous trial. All threads wait on a barrier (a high-level synchronization primitive in Java) and, once they are all waiting on the barrier, timing starts and all threads are released to begin adding and removing keys. Keys are drawn from a uniform random distribution over the integers [1,10^6]. Each thread checks the elapsed time after every 250 operations and, as soon as the elapsed time exceeds 3.0 seconds, the thread stops operating on the structure and waits on an exit barrier. Once all threads are waiting on the exit barrier, they are all released, the elapsed time is computed, the trial's throughput is computed by dividing the total operation count by the elapsed time, the size of the structure is computed, and output for the trial is produced.)
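
In code, a single trial looks roughly like the following (a simplified sketch rather than our actual harness, which wraps several data structures and varies the thread count; the class name, trial count and constants here are placeholders):

import java.util.Random;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLong;

public class SkipListTrial {
    static final int THREADS = 128;               // threads per trial
    static final int KEY_RANGE = 1000000;         // keys drawn uniformly from [1, 10^6]
    static final long TRIAL_NANOS = 3000000000L;  // each trial runs for about 3 seconds
    static final int CHECK_INTERVAL = 250;        // check the clock every 250 operations

    public static void main(String[] args) throws Exception {
        long masterSeed = 42L;                    // same master seed for every trial
        for (int trial = 0; trial < 10; trial++)  // number of trials is arbitrary here
            System.out.println("throughput (ops/s): " + runTrial(masterSeed));
    }

    static double runTrial(long masterSeed) throws Exception {
        final ConcurrentSkipListMap<Integer, Integer> map =
                new ConcurrentSkipListMap<Integer, Integer>();  // fresh structure per trial
        final AtomicLong totalOps = new AtomicLong();
        final CyclicBarrier start = new CyclicBarrier(THREADS + 1);
        final CyclicBarrier exit = new CyclicBarrier(THREADS + 1);

        Random seeder = new Random(masterSeed);   // feeds per-thread seeds; reset every trial
        for (int i = 0; i < THREADS; i++) {
            final Random rng = new Random(seeder.nextLong());
            new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await();            // wait until every thread is ready
                        long begin = System.nanoTime();
                        long ops = 0;
                        while (true) {
                            for (int j = 0; j < CHECK_INTERVAL; j++) {
                                int key = 1 + rng.nextInt(KEY_RANGE);
                                if (rng.nextBoolean()) map.put(key, key);  // 50% add
                                else map.remove(key);                      // 50% remove
                                ops++;
                            }
                            if (System.nanoTime() - begin >= TRIAL_NANOS) break;
                        }
                        totalOps.addAndGet(ops);
                        exit.await();             // wait on the exit barrier
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            }).start();
        }

        System.gc();                              // clean up garbage from the previous trial
        start.await();                            // all threads are waiting: release them
        long begin = System.nanoTime();           // timing starts
        exit.await();                             // wait for every thread to finish
        double elapsedSec = (System.nanoTime() - begin) / 1e9;
        System.out.println("final size: " + map.size());  // ends up near 500,000 keys
        return totalOps.get() / elapsedSec;
    }
}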

Any thoughts?

> I ran numastat before and after running some tests, and took the difference, to see what the mem-access situation was like, and my output was:

numastat does not report on your actual access patterns. It only tells you about memory page allocation requests. A numa_miss for node0 would mean that the interleaved pattern wanted to allocate on node0, but there were no free pages and the allocation had to be performed on another node.

To get more information about memory-access locality and its effect on performance, we typically use Linux hardware performance counters (L3 cache read/write misses, in particular). It might be worth doing this in your case. If you don't have to run cross-platform, there may even be some of the spiffy Intel performance tools (the kind that don't require you to change your code) installed on the machine, though I have to admit I haven't tried any of them.
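
As a starting point (and assuming a counter-capable toolchain is actually installed; on a 2.6.18-era kernel that is more likely to be PAPI or oprofile than perf), the generic last-level-cache events can be read with something like:

perf stat -e LLC-load-misses,LLC-store-misses <your benchmark command>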

Also, do you track whether or not GC is being triggered during execution? Or at least how much pause time you get due to GC in a run so you can see whether it's the GC time or the program time or both that is varying?
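
For example (assuming the stock HotSpot JVM; the jar name is just a placeholder), running with

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar benchmark.jar

would log any collections, and their pause times, that occur during a timed trial.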

Finally, have you tried varying the number of threads from 128? For memory-bound, high-locality work, we've experienced abysmal performance elbows adding even one or two threads beyond the number of available real cores (SMT threads have also been a loss).

Ah, okay. Thank you for the information. I'll look into querying that information.

As for GC tracking, the short answer is that I believe it is never triggered while a timed trial is running. The long answer is that each trial only consumes something like 1GB of memory, and the heap is initially 15GB. My understanding is that the GC will not trigger so long as memory pressure remains low and so, since garbage collection is manually performed before each trial, it shouldn't ever trigger. Anecdotally, before we started allocating a massive heap up front, GC would trigger, and it had a very visible (chaotic) impact on trial running times.
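
In other words (the class name is just a placeholder, and the exact flags are approximate), the launch is along the lines of:

java -Xms15g -Xmx15g SkipListTrial

i.e., the initial and maximum heap sizes are both set up front, and System.gc() is invoked manually between trials.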

Regarding the number of threads, the short answer is that I've tried 4, 16, 32, 48, ..., 128 threads, and problems have occurred at 64, 80, 96, 112, 128 threads. The long answer is that we spawn 4 threads per core as we vary the number of cores, so all of the data we've gathered is for a larger number of threads than cores. This computation is definitely memory-bound, but I'm not sure how much locality is an issue. All threads are accessing a tree structure that spans memory allocated by all of the threads, and the keys each thread modifies are equally likely to be located in any portion of the tree, so I'm fairly certain there will be a roughly uniform distribution of memory accesses across all of the threads' allocations in each trial. That being said, my feeling is that it's not a locality issue. I could definitely be wrong, but I just lost my access to MTL, so I don't think I'll have an opportunity to figure that out.

Quoting trbot
Ah, okay. Thank you for the information. I'll look into querying that information.

As for GC tracking, the short answer is that I believe it is never triggered while a timed trial is running. The long answer is that each trial only consumes something like 1GB of memory, and the heap is initially 15GB. My understanding is that the GC will not trigger so long as memory pressure remains low and so, since garbage collection is manually performed before each trial, it shouldn't ever trigger. Anecdotally, before we started allocating a massive heap up front, GC would trigger, and it had a very visible (chaotic) impact on trial running times.

Regarding the number of threads, the short answer is that I've tried 4, 16, 32, 48, ..., 128 threads, and problems have occurred at 64, 80, 96, 112, 128 threads. The long answer is that we spawn 4 threads per core as we vary the number of cores, so all of the data we've gathered is for a larger number of threads than cores. This computation is definitely memory-bound, but I'm not sure how much locality is an issue. All threads are accessing a tree structure that spans memory allocated by all of the threads, and the keys each thread modifies are equally likely to be located in any portion of the tree, so I'm fairly certain there will be a roughly uniform distribution of memory accesses across all of the threads' allocations in each trial. That being said, my feeling is that it's not a locality issue. I could definitely be wrong, but I just lost my access to MTL, so I don't think I'll have an opportunity to figure that out.

Pardon my curiosity, but I would like to know why you use more than one thread per core (two would be OK, but HT is off on MTL) if your application is memory-bound.

Moreover, since you are running multiple threads per core, I hope your application does not auto-vectorize your code or there will be race conditions for XMM registers. Additionally, XMM registers do not support any coherency protocols, but I believe your output is correct, so that is not much of an issue here.

I am sure you can ask for extended access to MTL and you will get it; Intel folks are surprisingly nice about these things. Although I would like to figure out the reason for this weird behavior, I would appreciate it if you could answer my question on using more than one thread per core first.

> ...there will be race conditions for XMM registers...

Citation needed.

Quoting Daniel Amelang
> ...there will be race conditions for XMM registers...

Citation needed.

I believe that, since there is only a fixed number of registers being used by multiple threads, not all of them can use these at the same time. The first one (hopefully) will be able to store its data, which will eventually be overwritten by someone else. This resource sharing, I believe, will lead to racing, but then I would like to hear what experts have to say. I am sure the registers do not follow any coherency protocol but again I haven't read all of the literature yet.

> I am sure the registers do not follow any coherency protocol but again I haven't read all of the literature yet.

I suggest starting with a computer architecture textbook, say "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson. That one has a nice section on SMT that explains why there is no "coherency protocol" for registers.

Quoting dinwal
Pardon my curiosity, but I would like to know why you use more than one thread per core (two would be OK, but HT is off on MTL) if your application is memory-bound.

Moreover, since you are running multiple threads per core, I hope your application does not auto-vectorize your code or there will be race conditions for XMM registers. Additionally, XMM registers do not support any coherency protocols, but I believe your output is correct, so that is not much of an issue here.

I am sure you can ask for extended access to MTL and you will get it; Intel folks are surprisingly nice about these things. Although I would like to figure out the reason for this weird behavior, I would appreciate it if you could answer my question on using more than one thread per core first.

Well, there were a couple of reasons why we wanted several threads per core. We wanted to see how things would work under high contention, especially when processes are put to sleep while holding locks. (Our algorithms are lock-free, but some competitors are lock-based.) Beyond that, if we have to compete with system threads for CPU time, we want to maximize our access and, coarsely speaking, more threads means more consistent access to all cores by "useful" processes. Additionally, we wanted to run the same number of threads as on another Sun system, where there are more hardware threads than here. (The Sun machine has 128; the same number I run here on 32 cores.)

As for auto vectorization, I doubt it. My understanding is that Java (our implementation platform) has limited support for auto vectorization and, besides, each thread is running through wildly branching instruction sequences over some 1500 lines of code.

Originally, I was really hoping some sort of obvious difference could be identified between the MTL's setup last summer and now, because the setup for our experiments last summer was virtually identical, and we didn't experience any issues. Somehow, the idea that using 4 threads/core is the problem doesn't seem to resonate with me.

Thanks for the encouragement; maybe after a bit more brainstorming and experimentation we'll request access to MTL again.
