branch prediction pattern

Hi all,

Say I have the code below

Program Main
use ModSubroutines
Implicit None
Integer :: A, Stat
call SubA(A,Stat)
if(Stat==0) call SubB(A,Stat)
if(Stat==0) call SubC(A,Stat)
if(Stat==0) Then
write(*,*) "Success"
Else
write(*,*) "Error"
End If
End Program

In order to avoid uncontrolled termination of the program, a status variable "Stat" is carried all along the way. If its value is different from zero, the program will give an informative error message and will close down gently. Since "Stat" changes from zero to one only if an error has occurred that leads to program termination, it will always be zero in a normal program run. Thus "Stat" has a very predictable pattern, and a severe slowdown from this pattern is not an issue because the program will then terminate anyway. So if the compiler predicts the pattern of "Stat" from "Stat" only, all these IF statements, and there can be thousands, will not slow down the run time of the program because the pattern will always be "FFFFFF....". However, if the compiler mixes the pattern of "Stat" with those of other branching variables such as input parameters, the "Stat" IF statements will slow down the program. I read somewhere on the web that the compiler will not store a pattern for every single branch and thus mixes branches and variables. If this is correct for Ifort too, the question is whether there is any possibility of telling the compiler to create a predictive pattern for the "Stat" branches only.

Thanks

Karl

Program Main
  use ModSubroutines
  Implicit None
  Integer :: A, Stat
  call SubA(A,Stat)
  if(Stat/=0) goto 999
  call SubB(A,Stat)
  if(Stat/=0) goto 999
  call SubC(A,Stat)
999 continue
  if(Stat==0) Then
    write(*,*) "Success"
  Else
    write(*,*) "Error"
  End If
End Program

You will have to check the generated code to see whether, in a release build, it produces a branch or a branch over a jump. If the branches are too far apart (on IA32), try

Program Main
  use ModSubroutines
  Implicit None
  Integer :: A, Stat
  goto 111
666 continue
    write(*,*) "Error"
  goto 999
111 continue
  call SubA(A,Stat)
  if(Stat/=0) goto 666
  call SubB(A,Stat)
  if(Stat/=0) goto 666
  call SubC(A,Stat)
  if(Stat/=0) goto 666
  write(*,*) "Success"
999 continue
End Program

Fortran (IVF) does not have a !DEC$ to aid in branch prediction (at least not that I am aware of).

Note, if both cases generate a branch over a jump, then you will be unable to attain what you seek.

Jim Dempsey

www.quickthreadprogramming.com

Leave branch prediction to the processor - it's very good at it. I suggest not spending any time worrying about nano-optimizations of this nature. Write the program in a clear manner and let the compiler and processor do the rest. Run the program through Intel Vtune Amplifier XE and look for hotspots if you're dissatisfied with performance. You can ask VTune to show you branch mispredicts, but I predict (!) you won't find much of interest here.

Steve - Intel Developer Support

Thanks Jim and Lionel.

Will have a look at the VTune output.

Quote:

Steve Lionel (Intel) wrote:

Leave branch prediction to the processor - it's very good at it. I suggest not spending any time worrying about nano-optimizations of this nature. Write the program in a clear manner and let the compiler and processor do the rest. Run the program through Intel Vtune Amplifier XE and look for hotspots if you're dissatisfied with performance. You can ask VTune to show you branch mispredicts, but I predict (!) you won't find much of interest here.

Steve,

Can you comment on the Fortran 2008 BLOCK construct that has been introduced with the 2015 Beta compiler version and whether you'd recommend it in situations like these?  That is, do you think performance will be just as good, better, or worse compared to the traditional styles of nested IFs, GOTOs, etc.?  I personally prefer the clarity offered by the BLOCK construct, see code snippet below, and have started using it in new code with your maxim in mind i.e., let the compiler do the rest.  It'll be great to know if your compiler team has looked at any efficiency aspects when it comes to BLOCK constructs and if there are any caveats you can share here.

Thanks,

   PROGRAM p
   
      REAL :: A
      REAL :: B
      REAL :: C
      INTEGER :: Istat
      
      A = 1.0
      B = -1.0
      C = 0.0
      CalcBlock: BLOCK
      
         !.. Call with A
         CALL Sub(A, Istat)
         WRITE(6,*) " After call with A: Istat = ", Istat
         IF (Istat /= 0) EXIT CalcBlock
         
         !.. Call with B
         CALL Sub(B, Istat)
         WRITE(6,*) " After call with B: Istat = ", Istat
         IF (Istat /= 0) EXIT CalcBlock
         
         !.. Call with C
         CALL Sub(C, Istat)
         WRITE(6,*) " After call with C: Istat = ", Istat
         IF (Istat /= 0) EXIT CalcBlock
   
      END BLOCK CalcBlock
      
      STOP
      
   CONTAINS
   
      SUBROUTINE Sub(X, ErrorCode)
      
         !.. Argument list
         REAL, INTENT(INOUT)    :: X
         INTEGER, INTENT(INOUT) :: ErrorCode
         
         !..
         ErrorCode = 0
         IF (X < 0.0) THEN
            ErrorCode = 1
         END IF
         
         !..
         RETURN
         
      END SUBROUTINE Sub
   
   END PROGRAM p

 

I don't see the connection between BLOCK and efficiency. BLOCK is a scoping construct, it has no relationship I can divine to IF-THEN, etc. In your example, you simply used BLOCK as a way to label a group of code so that you could exit it by name. You could have used DO for that, I suppose. The main purpose of BLOCK is to be able to declare variables that are local to the block - it's very helpful in DO CONCURRENT to avoid loop-carried dependencies.

Steve - Intel Developer Support

Steve,

Yes, I understand BLOCK is a scoping construct with the ability to create locally scoped variables.

But what I'm trying to show is that from a programmer's point of view, a BLOCK construct can ALSO be used to simplify code sections that involve a series of checks.  And yes, a single-pass DO loop can also achieve the same result.  But IMHO, the BLOCK construct appears more elegant and clearer.
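For what it's worth, the single-pass DO alternative mentioned above would look something like this for the series-of-checks case; a sketch reusing the OP's hypothetical SubA/SubB/SubC routines and Stat variable:

```fortran
! Sketch: a named single-pass DO used purely as a forward-exit target;
! the unconditional EXIT at the bottom ensures exactly one pass.
Checks: do
   call SubA(A, Stat)
   if (Stat /= 0) exit Checks
   call SubB(A, Stat)
   if (Stat /= 0) exit Checks
   call SubC(A, Stat)
   exit Checks
end do Checks
```

It behaves the same as the BLOCK version, but the trailing EXIT is needed to prevent looping, which is one reason the BLOCK form reads more cleanly.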

Now consider the code in OP and in Jim's suggestions in Quote #1 and what you'll find is this:

  • The code in OP is traditional style with a series of checks involving IF.. THEN.. ELSE statements which can lead to a lot of code indentation and the logic can become convoluted.
  • Jim's suggestions involve GOTOs and ALSO, smart placements of GOTOs to create a predictive pattern for the compiler and thereby, avoid branch overs and jumps (or far branches)

To which you responded simply write clear code and let the compiler take care of the rest.

Now with the BLOCK construct (or with a single-pass DO loop), the program can be made to behave the same as, say, the code in Quote #1 by Jim.  But when such code involving the BLOCK construct is compiled with optimizations turned on (/O2 or /O3), will it execute as well as the code given by Jim?  Or will the use of the BLOCK scoping construct lead to some slow down due to some internal aspects of how IFORT implements this feature?

That is, is there any penalty for the programming simplicity that can be gained by BLOCK construct for the "series of checks" case given by the OP?

There is no penalty I know of for using BLOCK in this manner. I agree it is clean to read.

Steve - Intel Developer Support

Oh well, the example I gave in Quote #5 is not yet functional in Intel Fortran because the EXIT statement does not work inside the BLOCK construct.  But note this example works correctly in gfortran 4.9.

Steve, can you indicate whether the EXIT statement aspect will make it into the compiler 2015 version along with BLOCK?

Here's an explanation of these two features by John Reid in "The New Features of Fortran 2008" document:

Attachments:

Block.png (277.62 KB)

We're aware of this feature - it's on our list, but not currently implemented and probably won't get done for 15.0. I have reminded the developers about it.

Steve - Intel Developer Support

Hack(barf)

outer: block
outer_block: do
  do i=1,num_in_set
    if(x==a(i)) exit outer_block
  end do
  call r
  exit outer_block
end do outer_block
end block outer

I am aware that an IF statement can be named. The user guide I have does not state if the EXIT name can be used to exit a named IF statement/block. You could try it if you want (it will look cleaner than the dummy DO loop and you won't need the extra exit at the bottom of the dummy loop).
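With F2008's EXIT-from-any-named-construct (which, per the replies in this thread, IVF did not yet support at the time), the hack above could be written without the dummy DO loop; a sketch reusing Jim's hypothetical names x, a, num_in_set and r:

```fortran
! Sketch: F2008 allows EXIT with the name of any named construct,
! including IF. The always-true named IF replaces the dummy single-pass
! DO, and no trailing "exit" is needed at the bottom.
outer: if (.true.) then
  do i = 1, num_in_set
    if (x == a(i)) exit outer
  end do
  call r
end if outer
```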

Jim Dempsey

 

www.quickthreadprogramming.com

We don't yet support the enhanced uses of EXIT as specified in F2008, including an IF block.

Steve - Intel Developer Support

I think BLOCK is awesome, but...

Quote:

Steve Lionel (Intel) wrote:
The main purpose of BLOCK is to be able to declare variables that are local to the block - it's very helpful in DO CONCURRENT to avoid loop-carried dependencies.

Drifting off topic, but don't you just exchange a loop-carried dependency for an undefined variable access?

It certainly can help prevent accessing a variable that is undefined after the DO CONCURRENT terminates.

Ian,

Here's an example from the standard that shows how BLOCK is useful in DO CONCURRENT:

! A variable that is effectively local to each iteration of a 
! DO CONCURRENT construct can be declared in
! a BLOCK construct within it. For example:

DO CONCURRENT (I = 1:N)
  BLOCK
    REAL :: T
    T = A(I) + B(I)
    C(I) = T + SQRT(T)
  END BLOCK
END DO

If you didn't have BLOCK to declare a "threadprivate" T, then all iterations would be sharing the same T.

Steve - Intel Developer Support

FWIW T is on the stack of the local thread and not in the threadprivate area of the thread. I am sure that is why Steve quoted "threadprivate"

Jim Dempsey

www.quickthreadprogramming.com

Yes, exactly.

Steve - Intel Developer Support

I don't think block makes any difference in that example, bar making the intent of the programmer absolutely clear (which is still important - is that what you meant?).  The alternative without block:

REAL :: T
DO CONCURRENT (I = 1:N)
    T = A(I) + B(I)
    C(I) = T + SQRT(T)
END DO

No loop carried dependency there - in each iteration T is defined before it is referenced.  If the compiler wants to execute the iterations in parallel then it needs to create a temporary for T for each iteration, but that's the compiler's problem, not the programmer's.

In the block case T doesn't exist after the construct, which is an easily diagnosed error (unless it also exists in the parent scope, in which case the programmer deserves whatever they get) - here it is undefined - so BLOCK does help with that.

Edit to add... here's an example that tries to show what I was thinking - as presented it is an erroneous DO CONCURRENT construct because, assuming N > 1,  in iteration N T is referenced without having been previously defined in that iteration and it is defined in another iteration (the loop dependency stuff banned by the requirements on things inside DO CONCURRENT constructs).  Uncomment the BLOCK bits and the local declaration - now T is undefined in iteration N.

Perhaps it is easier for the compiler to diagnose one or the other of the errors (I haven't checked) - but neither of the errors is called out by a constraint, so there's no requirement that they even be diagnosed. And if N == 1 then both forms are legal (I think?).

Edit**2 to add the actual example... oops.

REAL :: T
DO CONCURRENT (i = 1:N)
! BLOCK
!   REAL :: T
    A(i) = some_pure_function(B(i))
    
    IF (i == 1) THEN
      T = A(i)
    ELSE IF (i == N) THEN
      A(i) = A(i) + T
    END IF
! END BLOCK
END DO

Perhaps I got the subject of the "helpful" comment wrong - were you perhaps saying it is more helpful for the compiler (more so than more helpful for programmers) because the loop independent nature of the variable is obvious so it is more likely that their compiler will generate something that can and will be executed in parallel?  Fair enough in that case, though my recollection from playing with this some time back was that ifort had analysis capability well beyond that anyway.

That's not the way the language is defined. Without BLOCK, there is a single instance of T for the procedure. If DO CONCURRENT is executed in parallel, each of the iterations is referencing the same T and they step on each other. The compiler isn't free to invent its own iteration-local copies.

Steve - Intel Developer Support

Quote:

Steve Lionel (Intel) wrote:

That's not the way the language is defined. Without BLOCK, there is a single instance of T for the procedure. If DO CONCURRENT is executed in parallel, each of the iterations is referencing the same T and they step on each other. The compiler isn't free to invent its own iteration-local copies.

Steve,

While your comments in general about the BLOCK construct and its applicability in DO CONCURRENT are valid, I agree with Ian that the specific example you provide in Quote #14 above (from the Intel 2015 Beta documentation) is incorrect. In that specific example, as explained by Ian, there is no loop-carried dependency and the DO CONCURRENT loop should work in parallel even without the BLOCK construct; how a compiler would do it is, as Ian stated, the compiler's problem - per the Fortran 2008 standard, the programmer is not required to do anything to get the two instructions of

 T = A(I) + B(I) 
and
 C(I) = T + SQRT(T) 
to work in parallel.

A different, correct example for the BLOCK construct is called for; in addition, I think the Intel documentation should convey the broader applicability of BLOCK instead of the narrower view portrayed by the documentation in the 2015 beta. It'll be great if the team working on the 2015 release can follow up on these two suggestions.

By the way, an implementation of BLOCK without EXIT comes across as quite lame, IMHO.  I think it would be better to hold off on releasing BLOCK until the work on EXIT is completed.

With some trepidation - I don't think that's quite right.  (This isn't my view in isolation - there's been discussion about this elsewhere (perhaps c.l.f.) that informed my thinking, but I have botched a few concepts up in the last week or two since returning from a road trip during which I had to endure about 60 hours of two children under the age of five yelling at me from the back seat.)

I think it is a bit the other way around.  If a single storage location in memory is used for T (noting that's at the implementation level of detail) then the compiler must not arrange for the iterations to be executed in parallel, because then the observable behaviour wouldn't match what the language specifies.  DO CONCURRENT doesn't actually mean "the compiler must do it concurrently", but the restrictions on what can go inside make it easier (...relatively... something that is difficult is still easier than something that is nearly impossible) for the compiler to set things up such that it can do stuff concurrently.  The behaviour is specified as execution in "any order", which still implies that there is an order, just not one that you can know ahead of time, and bar an exception for sequential output, that's all the restrictions really support.

If the intended concept was concurrent-execution-possible-without-additional-compiler-analysis-and-transformation (such as introducing temporaries), then the restrictions on the program associated with DO CONCURRENT in 8.1.6.7 are inadequate and misdirected at the same time.  You would need to always ban the definition of a variable in more than one iteration... but the standard doesn't do this - the bit around variable definition has an "or".  The statement "a variable that is defined or becomes undefined by more than one iteration becomes undefined when the loop terminates" acknowledges the possibility of definition in multiple iterations and I think that requirement exists to allow for transformations such as temporaries.

 

No - BLOCK is awesome.  I want it now!  exit from block can wait!!

There is a nuance (implementation detail) with respect to the DO CONCURRENT

At what scope is the stack reservation for the block-contained T made? Note this is not the same as the scope where the block-contained T is visible/accessible. The additional stack space for T could be reserved once, outside the block (but only visible inside the block). If so, then the stack location of T is outside the scope of the block(s). On the other hand, if the stack reservation is made inside the scope of the block, and the block is inside the DO CONCURRENT, then the stack allocation is made on each pass through the block .AND. the stack placement is local to the thread (assuming DO CONCURRENT was made parallel).

Note, in the former case (T on stack outside scope of block) T is like a dummy reference (and shared), in the latter T is like a local variable (and private).

I would like to know if the behavior of where T is located is specified in the Fortran specification.

Jim Dempsey

www.quickthreadprogramming.com

The Fortran standard never talks about "where" a variable is allocated - instead it describes the effect of using a language feature.

"Except for the ASYNCHRONOUS and VOLATILE statements, specifications in a BLOCK construct declare construct entities whose scope is that of the BLOCK construct (16.4)."

"Actions on a variable local to a BLOCK construct do not affect any variable of the same name outside the construct."

It's just like declaring a variable in a {} section in C - variables declared there are local to the construct, come into existence when the BLOCK is entered (BLOCK is an executable construct) and become undefined when it exits.

It is absolutely true that DO CONCURRENT does not require parallelization - rather, it specifies semantics that allow for parallelization (or vectorization.)

Steve - Intel Developer Support

Consider

subroutine foo
  integer(4) :: local
  ...
  do i=1, bazillion
    ...
    block
      integer(4) :: X
      ...
    end block
    ...
    block
    integer(4) :: Y
    ...
    end block
    ...
  end do
  ...
end subroutine foo

The compiler could determine that foo requires 4 bytes for variable local plus 4 bytes for the union of the requirements of each block, and then perform the stack reservation once at entry to foo. This places X and Y at the same location on the stack, external to both blocks. And this also saves 4 x bazillion stack adjustments.

.OR. the compiler could, at each entry and exit of a block, reserve and release the 4 bytes for variable X or Y. IOW, incur 4 x bazillion sub/add-4 adjustments of the stack pointer.

The conscientious programmer might be tempted to implement the former over the latter.

Then consider

subroutine foo
  integer(4) :: local
  ...
  do i=1, bazillion
    ...
    do concurrent (j=1:n)  ! concurrent-header added; j and n are placeholders
      block
        integer(4) :: X
        ...
      end block
    end do
    ...
    do concurrent (j=1:n)
      block
        integer(4) :: Y
        ...
      end block
    end do
    ...
  end do
  ...
end subroutine foo

Now then, assuming parallelization is made, if the first stack allocation strategy is chosen X's of all threads in the first DO CONCURRENT are shared. Same with Y, but the Xs and Ys are not cross shared due to implicit barrier at end of do concurrent.

Now the nuance

"Actions on a variable local to a BLOCK construct do not affect any variable of the same name outside the construct."

In the first block, the X is declared inside the block and meets the above qualification, yet requires a means of privatization to produce code that runs deterministically.

If the 2nd stack allocation strategy is chosen (and the threads are not sharing the same stack space - normally not), then the 2nd stack allocation strategy satisfies both the enquoted qualification .AND. the " requires a means of privatization". However it also comes at the (minor) expense of the bazillion stack reservation/release.

I know this is "picky" but this is why we have standards.

From what you quoted in #23, it is left unspecified where the reservation is made.

Jim Dempsey

www.quickthreadprogramming.com

Steve, FortranFan and IanH,

I must admit that I was convinced that IanH and FortranFan were correct, and that a BLOCK construct was not required to create scalars that are not in danger of being stepped on by other iterations if the do concurrent construct is executed in parallel. I’m not sure if this is the final document defining the standard, but it seems to be the closest thing I can find for free on the internet: the J3 F2008 latest draft at: Fortran 2008 (latest draft)

The language referenced by both MFE and 8.1.6.7 seems quite convincing that no BLOCK construct is required; however, reading further down the page, Note 8.11 explicitly states:

A variable that is effectively local to each iteration of a DO CONCURRENT construct can be declared in a BLOCK construct within it. For example:

So, if the content and language of the final official F2008 specification is the same, and Note 8.11 is included, this leads me to believe that Steve is correct: BLOCK is required inside DO CONCURRENT to define a scalar variable that is defined in more than one loop iteration, even if no inter-iteration dependencies are carried.

I agree with IanH’s assessment that 8.1.6.7 does not do a good job explaining this:

If the intended concept was concurrent-execution-possible-without-additional-compiler-analysis-and-transformation (such as introducing temporaries), then the restrictions on the program associated with DO CONCURRENT in 8.1.6.7 are inadequate and misdirected at the same time.  You would need to always ban the definition of a variable in more than one iteration... but the standard doesn't do this - the bit around variable definition has an "or".  The statement "a variable that is defined or becomes undefined by more than one iteration becomes undefined when the loop terminates" acknowledges the possibility of definition in multiple iterations and I think that requirement exists to allow for transformations such as temporaries.

Note 8.11 seems to definitively imply that a BLOCK  construct is required to define iteration local scalar variables. I’m not sure what the official procedure is, but an official request to the standards committee for clarification and to revisit the wording of 8.1.6.7 could prove quite helpful to clear this up definitively.

-Zaak

This section of the standard has been examined extensively - it is the subject of three interpretations as of Corrigendum 2. (None in Corrigendum 3.) However, the changes have all been to strengthen or clarify 8.1.6.7 Restrictions on DO CONCURRENT.

Note 8.10 says "The restrictions on the statements in the range of a DO CONCURRENT construct are designed to ensure that there are no data dependencies between iterations of the loop. This permits code optimizations that might otherwise be difficult or impossible because they would depend on properties of the program not visible to the compiler."

I don't agree with Ian's interpretation of 8.1.6.7, though I can see where he is coming from. But I will discuss it with other members of the committee when I see them later this month.

Steve - Intel Developer Support

It'll be great if the standard, or some supporting document, or a new appendix/edition to the MFE book, could provide a better interpretation of the "execute in parallel" context of DO CONCURRENT - that's where I'm getting lost. That was the theme behind Steve's comment in Quote #18 that initiated this round of discussion and which I am struggling to understand.

Please don't misunderstand - the standard doesn't say anything about DO CONCURRENT executing in parallel. Rather, the standard specifies semantics that are conducive to parallelization. It's not like an OpenMP PARALLEL DO directive.
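For contrast, OpenMP's explicitly parallel form asserts both the parallel execution and the privatization; a sketch, assuming the same T, A, B, C and N as in the Note 8.11 example quoted earlier:

```fortran
! Sketch: an OpenMP PARALLEL DO *requests* parallel execution, and the
! programmer privatizes T explicitly. DO CONCURRENT makes no such
! request; it only promises semantics that permit parallelization.
!$omp parallel do private(T)
do i = 1, N
   T = A(i) + B(i)
   C(i) = T + sqrt(T)
end do
!$omp end parallel do
```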

Steve - Intel Developer Support

Steve,

Note that one can use a scalar without introducing a data dependency between loop iterations. It can be defined and then referenced within the same loop iteration; therefore, its value is unambiguous during that iteration. Note 8.10 misses the mark if its intent was to clarify this topic: there is no data dependency between iterations because the scalar variable never appears on the right-hand side of an assignment until it has been defined in that iteration. However, Note 8.11 does at least imply that a BLOCK construct is required to prevent issues when defining scalars inside DO CONCURRENT. I don't know where the right place to look for official documents is, but judging by http://www.nag.com/sc22wg5/ it looks as though the "do iteration-local scalars require a BLOCK construct?" question has not been addressed by any of these interpretations or corrigenda 1-3. A formal interpretation, and clearer language, would be nice.

-Zaak

Given that the standard includes an explicit example, which I quoted above, I think the standard's authors thought it clear.

Steve - Intel Developer Support

(For clarity - my post #20 was in reply to Steve's #18.  This might be the discussion elsewhere that I referred to... https://groups.google.com/d/topic/comp.lang.fortran/bHTjFSj-LJo/discussion but given I seem reasonably confident of my opinion there it is possible that there was one earlier than that.)

While it is correct (because it says "can", not "must") I think note 8.11 is a bit misleading.

One for the thinkers and philosophers... inside block-like-constructs (execution constructs that have an opening statement and a closing statement, with executable code in between - like IF...END IF or DO...END DO) why is it necessary to have a "duplicate" BLOCK construct to wrap the specification part and the executable part - why couldn't the language rules for block-like-constructs simply be changed to allow them to have a specification part before their execution part?

i.e. rather than requiring the first example, permit the second example.

DO i = 1, 10
  BLOCK
    ! Specification part.
    REAL :: T
    !***
    ! Execution part.
    T = A(i) + B(i)
    C(i) = T
  END BLOCK
END DO

DO i = 1, 10
  ! Specification part - currently not permitted here.
  REAL :: T
  !***
  ! Execution part.
  T = A(i) + B(i)
  C(i) = T
END DO

Perhaps it gets a bit hairy with things like the labelled form of DO construct?

Sorry guys, although I did the original post, I got lost.

Cheers

PS: have a look at my next post!

IanH,

I agree with your observation. FWIW, I'd have preferred they chose {}'s, which would have avoided assigning a keyword.

I suspect that doing so (IanH's suggestion) would have "broken" a conformity test program looking for specifications in the wrong place.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

may.ka wrote:

Sorry guys, although I did the original post, I got lost.

Cheers

PS: have a look at my next post!

Sorry Karl, it was I with postings on BLOCK construct in Quotes #5 and #9 that threw this topic off course.

How about we get back to your original question?  Especially since DO CONCURRENT in the context of concurrency and parallelism and the general aspects of BLOCK construct are worthy of separate, major discussion threads instead of getting buried here.

As I mentioned earlier, based on my investigation, the BLOCK construct introduced in the Fortran 2008 standard can be very helpful in simplifying code sections that involve a series of checks. And yes, a single-pass DO loop can also achieve the same result. But in my opinion, the BLOCK construct appears more elegant and clearer. Have you taken a look at the code snippet in Quote #5? How do you find it? Do you think that if you were to use such a construct, it would simplify your code and read more easily than what you have currently? Your comment will be useful feedback for me as I'm wondering whether to propose it as a "recommended programming practice" to my colleagues at work. By the way, you will not be able to test it out because the EXIT aspect has not yet been implemented in Intel Fortran (however, such a capability is available in gfortran 4.9, the open-source, "free" compiler).

 
