This post is about benchmarking algorithms on Intel Xeon processors.
I have been trying to reproduce the benchmarks from the code provided in the article above, specifically mmatest1.c from the zip file attached to the article. I have observed a considerable warm-up time, which adds a large overhead to the first algorithm being benchmarked (in this case, the cblas_sgemm function).
A loop count of 16 is often not enough to offset the thread 'warm-up' time. I am not sure what the correct term for this is.
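To make clear what I mean by a warm-up, here is a minimal sketch of the timing pattern I have in mind. It is not the code from mmatest1.c. The names kernel_fn, time_kernel, now_seconds and count_calls are hypothetical, and the dummy kernel only stands in for a real call such as cblas_sgemm. It assumes a C11 compiler that provides timespec_get.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the routine under test; a real benchmark
   would wrap its cblas_sgemm (or other) call in a function like this. */
typedef void (*kernel_fn)(void *ctx);

/* Wall-clock time in seconds (C11 timespec_get; not a monotonic clock). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `warmup` untimed calls first, then return the mean wall-clock time
   of `loops` timed calls. The untimed calls are meant to absorb one-off
   costs such as thread creation, first-touch page faults and cold caches. */
static double time_kernel(kernel_fn kernel, void *ctx, int warmup, int loops)
{
    assert(kernel != NULL && warmup >= 0 && loops > 0);

    for (int i = 0; i < warmup; ++i)
        kernel(ctx);

    double start = now_seconds();
    for (int i = 0; i < loops; ++i)
        kernel(ctx);
    return (now_seconds() - start) / loops;
}

/* Dummy kernel for illustration only: counts how often it was called. */
static void count_calls(void *ctx)
{
    ++*(int *)ctx;
}
```

With this pattern, something like time_kernel(count_calls, &calls, 3, 16) would run 3 untimed calls followed by the 16 timed ones, so the first benchmarked algorithm no longer pays the start-up cost alone.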
- Can anyone confirm this? When benchmarking, is it better to give a 'warm-up' kernel to the threads?
- Where can I read more about this?
- Can anyone also suggest the best way (algorithm or function) to access sub-matrices of size M x M within a larger matrix?
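To make the last question concrete, here is a minimal sketch of the one approach I am aware of, assuming the large matrix is stored row-major in a single contiguous array. The block is addressed by a pointer to its top-left element plus the row stride (leading dimension) of the parent matrix, so nothing is copied. The helper names submatrix and block_sum are hypothetical and are not taken from my repository or from the article.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer to the top-left element of the block that starts at (row, col)
   in a row-major matrix whose consecutive rows are `ld` floats apart. */
static const float *submatrix(const float *a, size_t ld, size_t row, size_t col)
{
    assert(a != NULL && col < ld);
    return a + row * ld + col;
}

/* Example consumer: sum an m x m block without copying it. Each row of
   the block is still `ld` floats apart, because it lives in the parent. */
static float block_sum(const float *block, size_t ld, size_t m)
{
    float sum = 0.0f;
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < m; ++j)
            sum += block[i * ld + j];
    return sum;
}
```

As far as I understand, this is also what the lda, ldb and ldc arguments of cblas_sgemm are for: the pointer to the block is passed together with the leading dimension of the parent matrix. Whether this is the best option for small M is exactly what I am unsure about.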
To review my code, kindly refer to: GitHub - akhauriyash/XNOR-Nets: An OpenMP parallelized implementation of XNOR kernels.