Well, this is embarrassing. My program runs fine on the Manycore Testing Lab as long as it's single-threaded. Whenever I enable multithreading, performance gets worse on the same input, and starting more threads only makes it slower.
At first I thought it was just poor design on my part - too much contention for mutexes or something. However, I'm now testing a simplified design with NO synchronization between threads, and it's still happening. The performance is what I'd expect from a single-core processor.
This is on the login node (acana01). Does it have access to the 40 cores? Do I have to do anything special to enable all of them?
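One quick check is to compare the number of CPUs the machine reports with the number the process is actually allowed to run on. Here's a minimal sketch in Python, assuming the login node runs Linux (`os.sched_getaffinity` is Linux-only, and the helper name `visible_cpus` is just something I made up for illustration):

```python
import os


def visible_cpus():
    """Return (logical CPUs reported by the OS, CPUs this process may run on)."""
    # os.cpu_count() can return None if the count is undetermined; treat that as 1.
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPU ids the current process (pid 0 = self) is allowed to use.
        allowed = len(os.sched_getaffinity(0))
    else:
        # Other platforms: no affinity query here, so fall back to the total.
        allowed = total
    return total, allowed


if __name__ == "__main__":
    total, allowed = visible_cpus()
    print(f"logical CPUs reported by the OS: {total}")
    print(f"CPUs this process may run on:    {allowed}")
```

If the second number comes back much smaller than the first (say 1 instead of 40), that would be consistent with the process being pinned to a subset of cores by an affinity mask or a similar limit on the login node, rather than with a problem in the threading code itself.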