The cores and vector processors on modern multi-core processors are voracious consumers of data. If you do not organize your application to supply data to the cores at a high enough bandwidth, any work you do to vectorize or parallelize your code will be wasted, because the cores will stall while waiting for data. The system will look busy, but it will not be fast.
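To make the stall argument concrete, here is a minimal roofline-style sketch. It is an illustration under stated assumptions, not a measurement: the peak compute rate and memory bandwidth are hypothetical figures, and `attainable_gflops` is a helper name introduced only for this example. It bounds a kernel's throughput by the lower of the compute peak and the memory bandwidth multiplied by the kernel's arithmetic intensity (floating-point operations per byte moved).

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style bound: the lower of the compute limit and the bandwidth limit."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical machine numbers, chosen only for illustration.
PEAK_GFLOPS = 1000.0    # assumed peak floating-point rate, GFLOP/s
BANDWIDTH_GBS = 100.0   # assumed sustained memory bandwidth, GB/s

# Triad kernel a[i] = b[i] + s * c[i] on 8-byte doubles:
# 2 flops per element, 24 bytes moved (read b, read c, write a),
# ignoring any extra write-allocate traffic.
triad_intensity = 2.0 / 24.0

bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity)
fraction = bound / PEAK_GFLOPS
print(f"triad bound: {bound:.2f} GFLOP/s ({fraction:.1%} of assumed peak)")
```

With these assumed numbers, the streaming kernel is limited by memory bandwidth to a small fraction of the compute peak, so adding vector lanes or threads would not help until the data traffic per operation is reduced.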
The following chart, which reflects the numbers for a hypothetical 4-processor system with 18 cores per processor, shows the:
This article provides a recipe for obtaining, compiling, and running an optimized version of the Parallel Ocean Program (POP) 2.0.1 with the bench01 (0.1-degree, high-resolution) workload on Intel® Xeon® processors and Intel® Xeon Phi™ processors.
The source for this version of POP 2.0.1 as well as the bench01 workload can be obtained by contacting Prof. Zhenya Song at email@example.com.