On Nehalem processors, "first touch" memory allocation is important for thread scalability. Do MKL routines apply first touch?
Optimization of first touch is important for multiple socket NUMA CPUs. It will not be important on single socket CPUs such as Core i7. On multiple socket platforms, it requires a combination of events to work well:
1. NUMA option selected in BIOS
2. RAM channels populated equally on all CPUs
3. OS with appropriate scheduling (current Linux, or Windows 7 or the new server beta version)
4. appropriate affinity setting, e.g. KMP_AFFINITY=compact, HT disabled, all cores used, or OMP_NUM_THREADS set to number of cores used and GOMP_CPU_AFFINITY set to 1 thread per core
5. data "first touched" (initialized, usually in your own program code) by same CPU which will do the work on it, in static scheduled OpenMP parallel region
6. MKL functions used also employ static scheduling, using full number of threads
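To illustrate item 5, here is a minimal sketch in C with OpenMP. It is not taken from MKL or its documentation; the function names `alloc_first_touch` and `axpy_static` are made up for this example. It assumes an OS that places each page on the NUMA node of the thread that first writes to it (as current Linux does by default), and it assumes the threads are pinned as in item 4. The point is only that the initialization loop and the compute loop use the same static schedule, so each thread works on the pages it touched first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and initialize them in a
 * static-scheduled parallel loop. On a first-touch OS, each page should
 * end up on the NUMA node of the thread that writes it here. */
double *alloc_first_touch(size_t n, double value)
{
    /* malloc reserves address space; for large blocks the pages are
     * typically not placed until they are first written. */
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = value;

    return a;
}

/* Hypothetical compute kernel: y = y + alpha * x, using the same static
 * schedule (and the same trip count) as the initialization loop, so each
 * thread gets the same index range it initialized. */
void axpy_static(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        y[i] += alpha * x[i];
}
```

Build with your compiler's OpenMP option (for example `-fopenmp` with gcc), and run with the affinity settings from item 4. If the arrays were instead initialized in a serial loop, all pages would be first touched by one thread and would tend to land on a single socket's memory.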
It would be useful if the MKL notes revealed which MKL functions ought to benefit from first touch scheduling. I don't think that documentation exists, so it is up to you to run performance tests. Certain MKL functions use a number of threads that depends on the data set size, so you would have to know that number and take it into account in your own program and environment variable settings.

Note that the platforms come with the BIOS set to non-NUMA memory organization (cache lines alternating among memory banks, so that no strategy should result in more than 50% local references).

If MPI is used with hybrid affinity settings to make each process local to one socket, it should take care of memory locality, implicitly providing local first touch, when the NUMA BIOS setting is in effect.