Batch nodes: 32 cpus 5x faster than 33 cpus

This is the way I typically submit batch jobs:

qsub -l select=1:ncpus=40 rl-myjob

Ever since the Memorial Day weekend maintenance, jobs submitted this way have been running about 5 times slower than they do on the login node. I traced the problem to ncpus values greater than 32. For example, on a small test that uses 64 threads and normally runs in under 30 seconds:

qsub -l select=1:ncpus=32 rl-myjob
# Finishes in about 24 seconds

qsub -l select=1:ncpus=33 rl-myjob
# Takes over 120 seconds

Where rl-myjob looks something like this:

#PBS -N myjob
#PBS -j oe
#PBS -l walltime=0:15:00
cd ~/threading
./myprogram 64

For a clue about what might be going wrong, see Mike Pearce's March 15 announcement of the upgrade to 40 cores:

Quoting Mike Pearce (Intel)
On the Linux side, we have added a 40-core batch node to our existing cluster. To run jobs on this node, you should include the following arguments to your qsub command:

qsub -l select=1:ncpus=xx

Replace xx with the number of CPUs that you want to test with. If it is greater than 32, your job will be scheduled on the new 40-core batch node (acano04).

Note: all batch nodes are currently configured with Hyperthreading off.

So the problem could be specific to acano04. Choose 32 CPUs or fewer and you get a different batch node.

Also, has Hyperthreading really been off from the beginning? The MTL was advertised as a "40-core (80-thread) development environment". The C function sysconf(_SC_NPROCESSORS_CONF) now returns 40 on acano04. If I recall correctly, it returned 80 last month. No problem either way. I just want to choose an appropriate number of threads for the configuration.
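A quick way to check this from inside a batch job, without writing C, is getconf, which reports the same value as sysconf(_SC_NPROCESSORS_CONF) (a sketch, assuming the figures quoted above):

```shell
# Print the number of configured processors. This matches what the C
# call sysconf(_SC_NPROCESSORS_CONF) returns: 40 on acano04 with
# Hyperthreading off, and 80 if HT were re-enabled.
nprocs=$(getconf _NPROCESSORS_CONF)
echo "configured processors: $nprocs"
```

Dropping those two lines into a batch script makes it easy to confirm the configuration of whichever node the scheduler picks.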

- Rick


You are correct: any job submitted with ncpus > 32 (and < 41) will be scheduled on the 40-core batch node (acano04). And HT is off on all batch nodes. Also, there is an error in the advertised material: since we have HT off, there are no 80 threads.

Also, I'm sure you're aware there is no correlation between OMP_NUM_THREADS and the number of cores (ncpus) on the servers.
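In other words, the ncpus value in the select statement only affects where the job is scheduled; an OpenMP program still has to be told how many threads to use. A minimal sketch of a job script that sets the thread count explicitly (the job name and program name here are placeholders, not from the original post):

```shell
#PBS -N omp-test
#PBS -j oe
#PBS -l walltime=0:15:00
# The ncpus value in "qsub -l select=1:ncpus=..." only controls which
# node the job lands on; the OpenMP runtime does not read it.
# Set the thread count explicitly:
export OMP_NUM_THREADS=40
cd ~/threading
./my-openmp-program
```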

I will look into the anomaly between running on the 32-core and 40-core batch nodes.

You are right on that the 40-core batch node is performing 3-5 times slower on the same task compared to the 2 other batch nodes. Until I can determine the underlying issue, I have taken this 40-core batch node down. I believe all TC judging will now be performed using the 32-core systems until this anomaly is fixed.

Thanks Mike. I observed exactly the same results. Hopefully, the 40-core node will be back soon.

- Nam

The 40-core Linux batch node (acano04) is back online and appears to be performing as expected. Let us know if you still see issues at:

Thanks, Mike. It actually has been performing better than expected recently.

When you first took acano04 offline, I observed that acano03 was performing about 45% slower than acano02. This could explain why you said "the 40-core batch node is performing 3-5 times slower on the same task against the 2 other batch nodes". Perhaps it was 3 times slower than acano03 and 5 times slower than acano02. Anyway, I got in the habit of calling hostname inside batch scripts to be sure I was comparing apples to apples on benchmarks.
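That habit is cheap to adopt: one extra line in the batch script records which node ran the job. A sketch based on the rl-myjob script from the first post:

```shell
#PBS -N myjob
#PBS -j oe
#PBS -l walltime=0:15:00
# Log which batch node actually ran the job (acano02/03/04), so that
# timings from different submissions can be compared node-to-node.
hostname
cd ~/threading
./myprogram 64
```

The node name lands in the job's output file alongside the program's own output, since #PBS -j oe merges stdout and stderr.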

Now that acano04 is back online, all batch nodes seem to be performing at the same level. The curious thing is that hostname still says acano02 or acano03 even when 40 cores are requested (qsub -l select=1:ncpus=40). Is that normal? sysconf(_SC_NPROCESSORS_CONF) returns 40 so it would appear that the process is actually running on the 40-core batch node acano04.

- Rick

Your analysis is correct - all batch nodes now have 40 cores (acano02, acano03 & acano04). We expect to upgrade the login node (acano01) to 40 cores/80 threads shortly.

I notice the same behaviour now: when I run my OpenMP program with the number of threads less than or equal to 38, everything scales well. With 39 threads it performs at the same level as with 2-4 threads, and with 40 it is many times slower than with a single thread. I suspected a NUMA effect and excluded all shared memory access from the parallelized sections, but nothing changed.

UPD: It was a NUMA effect after all; I simply missed one place with shared memory access.
