I am running some multithreaded benchmark programs on MIC. Some programs don't scale beyond 32 threads and others don't scale beyond 64. I am trying to find out why they stop scaling past a certain number of threads. The poor scaling is definitely not caused by a lack of computing resources (i.e., we can run 244 hardware threads without running into context switching).
I am trying to analyze this using VTune, but I am still not sure how to study the issue.
1. Locks? VTune's Locks and Waits analysis doesn't work on KNC (MIC), so how can I find out whether locks are the issue?
2. Bandwidth? As more threads are spawned, if they use a lot of shared data, cache-coherence traffic could eat up the bandwidth. This could be studied with core/uncore bandwidth measurements in VTune.
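One check I can do without VTune is to compute the Karp-Flatt metric (the experimentally determined serial fraction) from wall-clock timings at each thread count. Below is a minimal sketch of that calculation. The timings in it are hypothetical placeholders, not my measurements, and the function names are my own. As far as I understand, a roughly constant value across thread counts points to an inherently serial portion, while a value that grows with the thread count points to parallel overhead such as lock contention, coherence traffic, or bandwidth saturation. It does not say which of those it is.

```python
def speedup(t1, tp):
    """Speedup of a run taking tp seconds relative to the 1-thread time t1."""
    return t1 / tp


def karp_flatt(s, p):
    """Karp-Flatt metric: e = (1/s - 1/p) / (1 - 1/p) for speedup s on p threads."""
    if p <= 1:
        raise ValueError("the metric is only defined for p > 1")
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p)


# HYPOTHETICAL placeholder timings in seconds, keyed by thread count.
# Replace these with real measurements from the benchmark runs.
timings = {1: 100.0, 16: 7.2, 32: 4.1, 64: 3.6, 128: 3.9}

t1 = timings[1]
for p in sorted(timings):
    if p == 1:
        continue
    s = speedup(t1, timings[p])
    e = karp_flatt(s, p)
    print(f"threads={p:4d}  speedup={s:7.2f}  serial_fraction={e:.4f}")
```

If the metric climbs steeply right where scaling stops (32 or 64 threads), that would at least tell me to look at synchronization or memory traffic rather than at the serial part of the code.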
I am not sure what else might contribute to the poor scaling. I would appreciate your suggestions on how to approach this study.