I have been benchmarking a cluster with two MIC cards per node and noticed unusual behavior. Performance has always varied between nodes for whatever reason, but the second MIC card has never achieved the expected 760 GFLOPS on the DGEMM benchmark. All runs were done in native mode and separately for each card. I have attached a plot that shows the average performance for a subset of nodes. According to the system administrator, all nodes have the same configuration and settings. Can anybody explain this behavior?