About the memory bandwidth

Hi all,

Regarding memory bandwidth: I have noticed that the STREAM benchmark can achieve > 100 GB/s. However, after looking at the code and running some experiments myself, I found that it is very tricky to reach such high memory bandwidth. I think there are two major tricks:

1. It uses static global arrays rather than dynamically allocated arrays; when I use dynamically allocated arrays, the bandwidth is much lower.
2. The data is touched once before the real measurement. I know this is meant to remove some overhead, but if the data is not touched, the bandwidth of the first scan is very low.

Finally, it uses OpenMP. When I use pthreads to run the same experiment, the bandwidth is really low (~3 GB/s). I attach the source code of the OpenMP and pthread versions. Correct me if I am wrong about this experiment. Thanks very much!

Attachment: bandwidth.cpp (2.16 KB)

Hi Mian L., Following the advice in http://software.intel.com/en-us/forums/topic/382760, one should be able to reach ~150 GB/s or more. Your benchmark differs only in that it runs natively on the Phi instead of in an offload section from the Xeon. Thanks, Evgueni.

Hi Evgueni,

Thanks. Since I am using pthreads, my code is different from that article's. I have tried the method in the article, and it achieves the normal bandwidth, but not in my pthread program.

pthreads should be affinitized using sched_setaffinity, declared in sched.h.

Touching the memory before first use ensures it is mapped when you need it.  This mapping can take quite a while.  Since you are benchmarking memory bandwidth, I propose that it is fair to make sure all the pages are mapped and available before you try to benchmark copying data between them.  :-)   Statically allocated arrays are probably mapped at program startup.  Dynamic arrays will need to be touched after allocation to be mapped, either by an initialization loop or by first program use (remember, Linux doesn't actually map a page until you request it... which is why it isn't safe to check whether your allocations succeeded by the return status of malloc()).

Other complicating factors in your code:  the memory initialization loop in the OpenMP code also serves to start up the OpenMP thread pool, so you don't benchmark the cost of thread creation and OpenMP runtime startup when you run test_omp.  Your pthread code, however, is timing thread creation in addition to the work done by the workers.  Maybe start the threads and have them wait on a lightweight sync object until they are all created, then start timing and give them a "go" signal?

Hi Charles, thanks for the explanation. It is clearer to me now. However, I wonder what "mapping" means here exactly? Does it mean mapping virtual addresses to physical addresses? Thanks very much.

Quote:

Charles Congdon (Intel) wrote:

Touching the memory before first use ensures it is mapped when you need it.  This mapping can take quite a while.  Since you are benchmarking memory bandwidth, I propose that it is fair to make sure all the pages are mapped and available before you try to benchmark copying data between them.  :-)   Statically allocated arrays are probably mapped at program startup.  Dynamic arrays will need to be touched after allocation to be mapped, either by an initialization loop or by first program use (remember, Linux doesn't actually map a page until you request it... which is why it isn't safe to check whether your allocations succeeded by the return status of malloc()).

Other complicating factors in your code:  the memory initialization loop in the OpenMP code also serves to start up the OpenMP thread pool, so you don't benchmark the cost of thread creation and OpenMP runtime startup when you run test_omp.  Your pthread code, however, is timing thread creation in addition to the work done by the workers.  Maybe start the threads and have them wait on a lightweight sync object until they are all created, then start timing and give them a "go" signal?

The Intel Xeon Phi coprocessor and its host both use "virtual memory": an address space unique to a process, whose addresses are "mapped" to physical pages in memory when those pages exist.  Thus, malloc can return a range of virtual addresses that currently have no physical memory attached to them.  Accessing those addresses on Linux forces the page mapping, if a free page is available, but this can take some time.
