I have a very simple code that I use as a test input for a binary analysis tool. The code performs a naive matrix multiplication. I am compiling it on an Intel Xeon E5-2690 machine with the options -O3 -g -xHost.
The Intel compiler 13.x with the above options performs loop interchange, moving the recurrence from the innermost loop to an outer loop. Version 14.0.0 of the compiler does not perform this transformation. Both versions unroll one of the loops 16 times, filling up 4 AVX vectors. However, version 14.0 also generates many more address arithmetic instructions in the innermost loop. The end result is that the code produced by version 14.0 takes 50% to 3x longer to execute for matrix sizes >= 40.
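Since the original source could not be attached, here is a minimal sketch of the kind of transformation in question. It is not the actual test code: it assumes a C kernel over square, row-major matrices of doubles, and the function names matmul_ijk and matmul_ikj are hypothetical. The first function is the naive form, where the innermost loop carries a recurrence on one element of c. The second is roughly what the loop nest looks like after interchange, with that recurrence moved out of the innermost loop.

```c
#include <stddef.h>

/* Naive i-j-k order (assumed layout: row-major, n x n).
 * The innermost k loop accumulates into the same c[i*n + j] on every
 * iteration, so it carries a recurrence, and b is read with stride n. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Interchanged i-k-j order: a hand-written approximation of what loop
 * interchange produces. Each innermost iteration now updates a different
 * element of c, and both b and c are accessed with unit stride, which
 * makes the j loop straightforward to vectorize. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k]; /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions accumulate into c, so the caller is expected to zero c first. The accumulation order for each element of c is the same (increasing k) in both versions, so they should give identical results for the same inputs.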
There is no value in this naive matrix multiply code itself, but I am trying to understand what changed in the new compiler so that it no longer interchanges the loops, and I wonder whether this change could affect other codes whose performance actually matters. Are there any command line flags that would enable the previous behavior?
PS: I am not allowed to attach files. Pasting the code in the message body or including a link to pastebin triggers the spam filter. What is the appropriate way to include sample code?