Out of order execution

Is there a simulator and/or a general procedure one can follow to predict what instructions will be executed in what order (assuming all data is in the L1 cache)? I'm having a hard time comprehending why a given instruction sequence executes much faster than another. I suspect it's due to the out of order execution and register renaming, but I've found no tangible reason yet. Any help would be appreciated.


Quoting - tthsqe
Is there a simulator and/or a general procedure one can follow to predict what instructions will be executed in what order (assuming all data is in the L1 cache)? I'm having a hard time comprehending why a given instruction sequence executes much faster than another. I suspect it's due to the out of order execution and register renaming, but I've found no tangible reason yet. Any help would be appreciated.

I do not know about such a simulator, though I would be interested in one myself.

Some more ideas where your speed difference can come from:

  • Misaligned target - the instruction fetch can load 16 (aligned) bytes in one cycle. If your branch target is misaligned, the number of instructions that can be decoded in that cycle decreases.
  • If your instructions are large, changing the ordering changes the instruction decoder throughput. I don't know the details, but I've seen a simple change from using xmm[0-7] to using xmm[8-15] degrade performance noticeably, because the latter needs a longer instruction encoding (an extra REX prefix byte).

On that topic I recommend looking at sections 2.1.2.2, 2.2.2, and 3.4 of the Intel Optimization Reference Manual.

Perhaps if you could provide the two snippets of assembly someone could analyse the differences. (Please provide a few leading and trailing instructions too, so we can see context).

Without the code it would be complete speculation on why one bit runs faster than another.

Quoting - craigj0
Perhaps if you could provide the two snippets of assembly someone could analyse the differences. (Please provide a few leading and trailing instructions too, so we can see context).

Without the code it would be complete speculation on why one bit runs faster than another.

You will recognize it as the Mandelbrot iteration. The second method trades a multiplication for an addition and uses one less instruction. However, the first method is 5-10% faster, depending on how much the loop is unrolled. I think it might be that the multiplications can be done together with the additions and moves from the previous block, but I am not sure about this. Any analysis would be appreciated.
method1:
align 16 ; 0 1 2 3
@@: ;--------------------------
movaps xmm3, [twos] ; X Y 2
mulpd xmm3, xmm1 ; X Y 2Y
mulpd xmm1, xmm1 ; X YY 2Y
mulpd xmm3, xmm0 ; X YY 2XY
mulpd xmm0, xmm0 ; XX YY 2XY
movaps xmm2, xmm0 ; XX YY XX 2XY
subpd xmm0, xmm1 ; XX-YY YY XX 2XY
addpd xmm1, xmm2 ; XX-YY XX+YY XX 2XY
addpd xmm0, [x] ; newX XX+YY XX 2XY
movaps [test1],xmm1 ; newX XX+YY XX 2XY
movaps xmm1, [y] ;newX y XX 2XY
addpd xmm1, xmm3 ;newX newY

movaps xmm3, [twos]
mulpd xmm3, xmm5
mulpd xmm5, xmm5
mulpd xmm3, xmm4
mulpd xmm4, xmm4
movaps xmm2, xmm4
subpd xmm4, xmm5
addpd xmm5, xmm2
addpd xmm4, [x]
movaps [test2],xmm5
movaps xmm5, [y]
addpd xmm5, xmm3

movaps xmm3, [twos]
mulpd xmm3, xmm7
mulpd xmm7, xmm7
mulpd xmm3, xmm6
mulpd xmm6, xmm6
movaps xmm2, xmm6
subpd xmm6, xmm7
addpd xmm7, xmm2
addpd xmm6, [x]
movaps [test3],xmm7
movaps xmm7, [y]
addpd xmm7, xmm3

method2:
align 16 ; 0 1 2 3
@@: ;-------------------------
movaps xmm3,xmm1 ; X Y Y
mulpd xmm3,xmm1 ; X Y YY
addpd xmm1,xmm1 ; X 2Y YY
mulpd xmm1,xmm0 ; X 2XY YY
addpd xmm1,[y] ; X newY YY
mulpd xmm0,xmm0 ; XX newY YY
movaps xmm2,xmm0 ; XX newY XX YY
addpd xmm2,xmm3 ; XX newY XX+YY YY
movaps [test1],xmm2 ; XX newY XX+YY YY
subpd xmm0,xmm3 ; XX-YY newY XX+YY YY
addpd xmm0,[x] ; newX newY XX+YY YY

movaps xmm3,xmm5
mulpd xmm3,xmm5
addpd xmm5,xmm5
mulpd xmm5,xmm4
addpd xmm5,[y]
mulpd xmm4,xmm4
movaps xmm2,xmm4
subpd xmm4,xmm3
addpd xmm2,xmm3
movaps [test2],xmm2
addpd xmm4,[x]

movaps xmm3,xmm7
mulpd xmm3,xmm7
addpd xmm7,xmm7
mulpd xmm7,xmm6
addpd xmm7,[y]
mulpd xmm6,xmm6
movaps xmm2,xmm6
subpd xmm6,xmm3
addpd xmm2,xmm3
movaps [test3],xmm2
addpd xmm6,[x]

Yes, on CPU models of the last 3 years, a multiply and an add can issue in the same cycle, so it is often worth reducing the number of multiply instructions when they outnumber the add instructions. However, your 1st method may have more opportunity for parallel expression evaluation within a single iteration, and shorter overall latency.
On the current (Core i7) CPUs, it may be worthwhile to keep the unrolling within the limits where Loop Stream Detection is effective.
Do you get better performance than standard code compiled by a vectorizing compiler?

In terms of a simulator, have you checked out the Intel Architecture Code Analyzer? It is intended to compare AVX to SSE kernels to estimate relative performance, but you can also compare 2 SSE kernels. The simulator assumes all data is in the L1 cache and ideal out-of-order engine conditions. It is a free tool you can find here:

http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/

After looking at your code I was curious, so I pasted it into a project and ran it through IACA. My initial test seemed to show that method 2 should be about 10% faster (17 cycles throughput instead of 19). I only spent about 5 minutes on this, so perhaps a few more tests would give some insight.

I've only started playing with IACA recently, so I'm still figuring it out myself. It doesn't simulate a specific processor, but it is intended to predict performance with Sandy Bridge, so it may not be as good for current generation processors. Specifically, there are 2 data load ports in the simulated hardware, and I think there may only be one in current processors.

Quoting - areid
In terms of a simulator, have you checked out the Intel Architecture Code Analyzer? It is intended to compare AVX to SSE kernels to estimate relative performance, but you can also compare 2 SSE kernels. The simulator assumes all data is in the L1 cache and ideal out-of-order engine conditions. It is a free tool you can find here:

http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/

After looking at your code I was curious, so I pasted it into a project and ran it through IACA. My initial test seemed to show that method 2 should be about 10% faster (17 cycles throughput instead of 19). I only spent about 5 minutes on this, so perhaps a few more tests would give some insight.

I've only started playing with IACA recently, so I'm still figuring it out myself. It doesn't simulate a specific processor, but it is intended to predict performance with Sandy Bridge, so it may not be as good for current generation processors. Specifically, there are 2 data load ports in the simulated hardware, and I think there may only be one in current processors.

Just to keep you updated, a new release of the Intel Architecture Code Analyzer (1.1) is available with Nehalem and Westmere support (in addition to the Intel AVX support of the previous versions).

Let me know if you need further assistance using Intel Architecture Code Analyzer.

Tal

Quoting - Tal Uliel (Intel)

Just to keep you updated, a new release of the Intel Architecture Code Analyzer (1.1) is available with Nehalem and Westmere support (in addition to the Intel AVX support of the previous versions).

Let me know if you need further assistance using Intel Architecture Code Analyzer.

Tal

Could you tell me how to use this code analyzer? I downloaded it, but all I got are a bunch of dll files, an application that does nothing when run, and a readme that is not much help. What steps do you go through to test a chunk of code?

Quoting - tthsqe

Could you tell me how to use this code analyzer? I downloaded it, but all I got are a bunch of dll files, an application that does nothing when run, and a readme that is not much help. What steps do you go through to test a chunk of code?

Hi,

To use the Intel Architecture Code Analyzer you need to compile your source with start and end marks and then run the analyzer tool on it.

For example:
Source main.c:
#include "iacaMarks.h"

int main(){
    IACA_START
    __asm vandps xmm0, xmm0, xmm1
    IACA_END
}

Compile the source using an AVX-supporting compiler (let's assume you've created a main.exe file).

Now run iaca -o main.iaca.txt main.exe and the expected output will be written to the file main.iaca.txt.

I suggest using the -o option instead of redirecting the output (>), as the tool truncates lines at 80 characters when writing to the screen.

For further details please refer to the Intel Architecture Code Analyzer - User Guide Rev 1.1 available on the download page.

Tal

Quoting - Tal Uliel (Intel)

Hi,

To use the Intel Architecture Code Analyzer you need to compile your source with start and end marks and then run the analyzer tool on it.

For example:
Source main.c:
#include "iacaMarks.h"

int main(){
    IACA_START
    __asm vandps xmm0, xmm0, xmm1
    IACA_END
}

Compile the source using an AVX-supporting compiler (let's assume you've created a main.exe file).

Now run iaca -o main.iaca.txt main.exe and the expected output will be written to the file main.iaca.txt.

I suggest using the -o option instead of redirecting the output (>), as the tool truncates lines at 80 characters when writing to the screen.

For further details please refer to the Intel Architecture Code Analyzer - User Guide Rev 1.1 available on the download page.

Tal

Tal, thanks - I just put mov ebx, 111 ... mov ebx, 222 around the instructions, and everything worked.
areid, could you tell me where you got the 17 and 19 from? The analyzer for the Nehalem architecture predicted a 15 cycle total throughput for method 2 and 12 for method 1.
