Regarding memory bandwidth: I have noticed that the STREAM benchmark can achieve > 100 GB/sec. However, after reading the code and running some experiments myself, I found that it is quite tricky to reach such high memory bandwidth. I think there are two major tricks:
1. It uses static global arrays rather than dynamically allocated arrays. When I use dynamically allocated arrays, the bandwidth is much lower.
2. The data is touched once before the real measurement. I know this is meant to remove some overhead, but if the data is not touched first, the bandwidth of the first scan is very low.
Finally, it uses OpenMP. When I use pthreads to run the same experiment, the bandwidth is really low (~3 GB/sec). I attach the source code of the OpenMP and pthread versions. Please correct me if I am wrong about this experiment. Thanks very much!