Through rigorous testing of a parallel CFD code, I have noticed that the fastest code is generated with the following options:
-i4 -r8 -O2 -fp-model precise
Anything else added, such as -xHost or -O3, results in slower code. (-O0 and -O1 are four times slower than the chosen flags.)
The -fp-model precise option is a must for this type of software. Additionally, there are numerous intrinsic matmul calls in the code (maximum size 70x70), and even using Intel Math Kernel Library calls (DGEMM or others) instead results in slower code. The code is also unstructured, which means that there is a lot of indirect memory access.
The same pattern has been observed since Intel Fortran compiler version 7, when this code was developed, and with every version since.
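To make the two access patterns described above concrete, here is a minimal sketch. It is not taken from the actual solver: the program name, array names, cell count, and neighbour count are all made up for illustration, and only the 70x70 matrix size comes from the description above. It shows a gather through a connectivity array (the indirect memory access typical of unstructured codes) and a small dense product using the matmul intrinsic.

```fortran
! Illustrative sketch only: all names and sizes except n = 70 are
! hypothetical and are not taken from the real CFD code.
program indirect_matmul_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 70        ! largest local matrix size mentioned above
  integer, parameter :: ncell = 1000  ! hypothetical number of cells
  integer, parameter :: nnb = 4       ! hypothetical number of neighbours per cell
  integer  :: nbr(nnb, ncell)         ! connectivity: neighbour indices of each cell
  real(dp) :: u(ncell), r(ncell)
  real(dp) :: a(n, n), b(n, n), c(n, n)
  integer  :: i, k

  ! Build an arbitrary scattered connectivity; indices stay in 1..ncell.
  do i = 1, ncell
    do k = 1, nnb
      nbr(k, i) = 1 + mod(i*37 + k*101, ncell)
    end do
    u(i) = real(i, dp)
  end do

  ! Indirect (gather) access: u is read through the index array nbr,
  ! so consecutive iterations touch non-contiguous memory.
  do i = 1, ncell
    r(i) = 0.0_dp
    do k = 1, nnb
      r(i) = r(i) + u(nbr(k, i))
    end do
  end do

  ! Small dense product with the matmul intrinsic.
  call random_number(a)
  call random_number(b)
  c = matmul(a, b)

  ! Print something that depends on the results so the work is not optimized away.
  print *, sum(r), c(1, 1)
end program indirect_matmul_sketch
```

A stand-alone kernel like this is far smaller than the real code, so it may not reproduce the same flag sensitivity; it is only meant to show which loops are being discussed.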
Thank you in advance for any comments or suggestions.