I read a report in which the authors parallelized a memory-intensive application with multithreading (using pthreads). They tested the performance on a compute node with two 6-core Xeon X5650 processors. They found that performance kept increasing with the number of threads, even when the number of threads (e.g., up to 2000) was far greater than the number of physical cores (i.e., 12). The authors attributed this to "memory level parallelism supported by the latest multi-core processors, i.e., Intel Nehalem enables more than 10 outstanding memory requests".
I do not understand this explanation. I know CPUs have load/store buffers to support out-of-order execution, but as far as I know those buffers serve the instruction stream of the single thread currently running on the core. From this report, it looks as if a memory request can remain in flight across a thread switch. Is that true? If not, what explains this kind of result, where performance keeps going up even when the number of threads exceeds the number of cores?