I'm writing to see if someone could help me understand an issue in our solver that recently came up while profiling with VTune Amplifier. I'll try to describe it here:
Using VTune Amplifier, we see that the time spent in a function "mucal" goes up as the number of threads increases. On 8 threads, mucal is at the top of the list.
mucal is a function that calculates viscosity. It is called in the following manner:
CFD mesh first cell index: 1
CFD mesh last cell index: iend
The OpenMP threads split the ijk index range (1 to iend) between them.
Inside the mucal function we use 2 modules and include 6 common blocks.
The modules have arrays of size (1:iend). These are mostly 1D arrays that store velocity, pressure, etc. The common blocks hold mostly scalar variables, but a lot of them.
To fix this, we tried the following:
(1) Instead of using the module arrays inside mucal, pass the value at ijk to the mucal function (e.g. mu(ijk)=mucal(ijk,iopt,u(ijk))). This did not help.
(2) Instead of including the common blocks, pass those variables to the mucal function as arguments as well. This also did not help.
(3) Calculate and store mucal(ijk) in a separate new array and then re-use that array, thereby reducing the number of calls to mucal. This helped, and for 8 threads mucal was no longer at the top of the list.
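To make approach (3) concrete, here is a minimal sketch of the idea, assuming a simple per-cell loop. Only mucal, mu, u, ijk, iend and iopt come from the description above; the program name, the array flux, the sizes, and the body of mucal are placeholders for illustration, not the real solver code.

```fortran
! Hypothetical sketch of approach (3): compute the viscosity once per cell,
! store it in an array, and re-use the stored values in later loops.
program cache_viscosity
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: iend = 1000   ! last cell index (illustrative size)
  integer, parameter :: iopt = 1      ! viscosity option (illustrative value)
  real(dp) :: u(iend), mu(iend), flux(iend)
  integer :: ijk

  u = 1.0_dp

  ! Pass 1: one call to mucal per cell; the threads split the ijk range.
  !$omp parallel do private(ijk)
  do ijk = 1, iend
     mu(ijk) = mucal(ijk, iopt, u(ijk))
  end do
  !$omp end parallel do

  ! Later passes read mu(ijk) instead of calling mucal again.
  !$omp parallel do private(ijk)
  do ijk = 1, iend
     flux(ijk) = mu(ijk) * u(ijk)
  end do
  !$omp end parallel do

  print *, sum(flux)

contains

  ! Stand-in for the real mucal; the actual viscosity model is not shown.
  pure function mucal(ijk, iopt, uval) result(visc)
    integer, intent(in) :: ijk, iopt
    real(dp), intent(in) :: uval
    real(dp) :: visc
    visc = 1.0e-3_dp * (1.0_dp + 0.1_dp * real(iopt, dp) * abs(uval))
  end function mucal

end program cache_viscosity
```

With gfortran this should build with something like "gfortran -fopenmp cache_viscosity.f90"; without the OpenMP flag the directives are treated as comments and the loops run serially.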
My question is: why does the time spent in mucal increase with the number of threads? Is it a combination of using common blocks and modules, or something else? What's the best approach to prevent issues like this?