false dependencies

Hi,

What are the rules for determining false dependencies and
what are the penalties for such dependencies?

For example, in

mulsd (%eax),%xmm1
movhlps %xmm1,%xmm2

  1. Will 'movhlps' depend on 'mulsd'?
  2. Will the issue of 'movhlps' be delayed, and if so, for how long?

Note that there is no functional dependency between these two instructions.

Also, will there be any dependency in

mulsd (%eax),%xmm1
unpcklpd %xmm2,%xmm1

and when will 'unpcklpd' be issued?


Thank you,

David Livshin

http://www.dalsoft.com

The question occasioned some confusion, even among the experts.
1) I assume you are using the default Linux syntax; otherwise the question doesn't make sense.
2) I assume you are using "false dependency" in the sense where the dependency might ideally be resolved if the use of upper half of the register were not delayed pending modification of the lower half.
3) If your code is to run well on all hardware models, it must not depend on upper half register contents being available before completion of a partial register write. Your examples could perform well on certain obsolescent 32-bit models, but not on 64-bit capable CPUs (even in 32-bit mode).

The question occasioned some confusion, even among the experts.
1) I assume you are using the default Linux syntax; otherwise the question doesn't make sense.

Yes, AT&T syntax is used:

mulsd (%eax),%xmm1 is low(%xmm1) = low(%xmm1) * (%eax)
movhlps %xmm1,%xmm2 is low(%xmm2) = high(%xmm1)

Because mulsd sets only the low part of %xmm1 and movhlps uses only the high part of %xmm1, these two instructions are functionally independent. However, it is not clear whether mulsd will "lock" %xmm1 until its completion, thus preventing movhlps from being issued before that.
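The data-flow side of this can be checked with a small C sketch using SSE2 intrinsics. It only illustrates which halves each instruction reads and writes; it says nothing about timing or about whether a given CPU tracks the two halves separately. The function names are hypothetical, the values are arbitrary, and an x86 target with SSE2 enabled is assumed (for example -msse2 on 32-bit builds). The second function covers the unpcklpd case from the original question, where the low half produced by mulsd is actually consumed.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE:  _mm_movehl_ps, _mm_setzero_ps */
#include <emmintrin.h>   /* SSE2: _mm_mul_sd, _mm_unpacklo_pd, casts */

/* Hypothetical helper modelling: mulsd (mem),%xmm1 ; movhlps %xmm1,%xmm2
   Returns low(%xmm2) after the movhlps. */
double movhlps_after_mulsd(double lo, double hi, double mem)
{
    __m128d xmm1 = _mm_set_pd(hi, lo);            /* xmm1 = {high = hi, low = lo} */
    xmm1 = _mm_mul_sd(xmm1, _mm_load_sd(&mem));   /* low *= mem; high passes through */
    /* low 64 bits of the result = high 64 bits of xmm1 */
    __m128 xmm2 = _mm_movehl_ps(_mm_setzero_ps(), _mm_castpd_ps(xmm1));
    return _mm_cvtsd_f64(_mm_castps_pd(xmm2));
}

/* Hypothetical helper modelling: mulsd (mem),%xmm1 ; unpcklpd %xmm2,%xmm1
   Stores the final %xmm1 as out[0] = low, out[1] = high. */
void unpcklpd_after_mulsd(double lo, double hi, double mem,
                          double other_lo, double out[2])
{
    __m128d xmm1 = _mm_set_pd(hi, lo);
    __m128d xmm2 = _mm_set_sd(other_lo);          /* low = other_lo, high = 0 */
    xmm1 = _mm_mul_sd(xmm1, _mm_load_sd(&mem));   /* low *= mem */
    xmm1 = _mm_unpacklo_pd(xmm1, xmm2);           /* {low(xmm1), low(xmm2)} */
    _mm_storeu_pd(out, xmm1);
}

/* Self-check with arbitrary values. */
void check_examples(void)
{
    double out[2];
    /* movhlps sees the untouched high half: no data flows from the product. */
    assert(movhlps_after_mulsd(3.0, 7.0, 2.0) == 7.0);
    /* unpcklpd keeps the product in the low half: a true dependency on mulsd. */
    unpcklpd_after_mulsd(3.0, 7.0, 2.0, 5.0, out);
    assert(out[0] == 6.0 && out[1] == 5.0);
}
```

In other words, the movhlps result does not change with the multiplier, so any delay there would be a false dependency, whereas unpcklpd has to wait for the product regardless of how the hardware tracks register halves.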

2) I assume you are using "false dependency" in the sense where the
dependency might ideally be resolved if the use of upper half of the
register were not delayed pending modification of the lower half.

Exactly!

3) If your code is to run well on all hardware models, it must not
depend on upper half register contents being available before
completion of a partial register write. Your examples could perform
well on certain obsolescent 32-bit models, but not on 64-bit capable
CPUs (even in 32-bit mode).

Actually, I was made aware that "older" models fail to resolve this dependency (movhlps in the above example will be delayed until the completion of mulsd), while the newer processors are able to execute the above code in parallel. The goal of my inquiry was to verify that.

David Livshin

http://www.dalsoft.com

As I understood it, only the last few 32-bit-only laptop CPU models avoided the dependency. My laptop is one of them, but I don't see them advertised for sale "new." The latest Intel compilers warn that the code generation option for these models is "deprecated" (it may not be fully effective, and will disappear soon). The compilers never did respect the particular importance of minimizing the number of registers used on these models. If I were to speculate, the behavior seems connected with the fact that this particular family of CPUs implemented 128-bit operations as pairs of 64-bit operations issued 1 clock apart. So the price of potentially doubling the throughput with full 128-bit operations on all the latest CPUs includes the kind of dependency you have pointed out.
