Customer asked to close it.

What specifically is your question?  You are running a benchmark where the bottleneck is expected to be memory bandwidth.  With a non-precise event, you can't control whether VTune attributes the counts to the instruction responsible or to a later one which is waiting for that instruction to complete.

Are you trying to compare event rates?  If so, you need to set the sample-after values to convenient ratios, so that the sample counts for different events are easy to compare.  If you are comparing runs with and without mfence, it may help to ensure that the same sample-after values are used in each case.
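
To show why the sample-after values matter when comparing, here is a rough sketch in Python. It assumes the usual sampling arithmetic, where each sample stands for roughly one sample-after value's worth of events. The function names and the numbers in the usage lines are made up for illustration; they are not VTune output or part of any VTune API.

```python
def estimated_events(samples: int, sample_after: int) -> int:
    """Approximate event count: each sample stands for `sample_after` events."""
    return samples * sample_after


def event_ratio(samples_a: int, sav_a: int, samples_b: int, sav_b: int) -> float:
    """Ratio of two estimated event counts, correcting for different sample-after values."""
    return estimated_events(samples_a, sav_a) / estimated_events(samples_b, sav_b)


# Hypothetical numbers, for illustration only.
# Event A: 150 samples at a sample-after value of 2,000,000.
# Event B: 300 samples at a sample-after value of 100,000.
print(estimated_events(150, 2_000_000))
print(event_ratio(150, 2_000_000, 300, 100_000))
```

If both runs (with and without mfence) use the same sample-after value for an event, the raw sample counts can be compared directly and this scaling drops out.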

