I've got a very long-running simulation with a very simple OpenMP structure: some serial setup, then a bunch of threads start computing, interacting only when they write to a standard output array that is managed by thread 0. It's Fortran on Windows; the OpenMP part is offloaded, and very little data comes back (and the amount of data does not vary with the number of threads).
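For concreteness, the shape of the code is roughly the following. This is a minimal sketch, not the actual code: the array size, the `output` name, and the `work` function are placeholders standing in for the real simulation.

```fortran
program sim_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000   ! placeholder problem size
  real(8) :: output(n)             ! the shared output array
  integer :: i

  output = 0.0d0                   ! serial setup

  ! in the real code this region is offloaded; threads compute
  ! independently and only touch shared state via the output array
  !$omp parallel do
  do i = 1, n
     output(i) = work(i)           ! each iteration writes its own slot
  end do
  !$omp end parallel do

  print *, sum(output)

contains

  pure function work(i) result(r)
    integer, intent(in) :: i
    real(8) :: r
    r = dble(i) * 2.0d0            ! stands in for the actual computation
  end function work

end program sim_sketch
```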
I attach a plot which shows the throughput of the same code on three machines: a pair of E5s, an i7, and a Phi. They all access an 80 MB data set for testing (the full set is 6 GB).
As you can see from the plot, the i7 is better than the E5s but has the same slope. In the E5 testing up to 32 threads (all it has), the scaling is well behaved (the bump at 12 threads is real, but we don't care about it). So we were surprised by several things about the MIC behavior. First, we expected the curve to be straight out to 56 threads (for this plot we chose to allocate one thread per core up through 56, then added the extra threads linearly). Letting the Phi choose the affinity itself did no good.
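In case the placement itself matters, here is roughly how we have been setting affinity for the offloaded region, via the Intel runtime's environment variables. This is a sketch: bash-style syntax is shown (on our Windows box these are `set VAR=value`), and the `MIC_` prefix assumes the default `MIC_ENV_PREFIX` forwarding to the coprocessor.

```shell
# Forward MIC_-prefixed variables to the coprocessor-side runtime.
export MIC_ENV_PREFIX=MIC

# "balanced" is the MIC-specific policy: fill one thread per core first,
# then add the 2nd/3rd/4th hardware thread per core -- the ordering we
# used by hand for the plot.
export MIC_KMP_AFFINITY=balanced
export MIC_OMP_NUM_THREADS=56      # one thread per core on a 56-core part

# Alternatives worth comparing:
#   export MIC_KMP_AFFINITY=scatter   # round-robin across cores
#   export MIC_KMP_AFFINITY=compact   # fill all 4 threads of a core first
```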
According to VTune, all the threads are running full bore. What is especially odd is that the slope becomes good again once we start assigning the third thread per core, but by then the damage has been done. The terrible bump starts at about 8 threads, each on its own core. Is this some issue with the cache structure? All the threads are constantly accessing the same shared data set. If so, why doesn't the slope simply stay bad? Or, again, has all the damage already been done by that point?
Any ideas where to look, folks?