TBB for weakly consistent memory model

I'm currently trying to port the TBB library to the IBM AIX platform.
In the process I discovered some issues caused by the weakly consistent
memory model of the PowerPC architecture.

There seem to be a number of places in TBB where a critical
section is established by means of atomic operations.  For example,
this happens in task.cpp when entering and leaving an arena, and in
spin_rw_mutex.cpp.

The problems on AIX appear when a critical section is released by
just updating a shared variable:

    mutex->state = 0;

The PowerPC architecture may reorder store operations, so a store
from within the critical section might become visible only after the
critical section has been released.

I was able to fix the problem by using the __TBB_store_with_release
macro instead of a plain store.  I attach a diff file below to show
where I made the changes.  Those are the locations I updated to make
all the tests succeed, but there might be other places that I missed.

In any case I hope that these changes can be incorporated into TBB to
better support weakly consistent memory models.
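To illustrate the problem, here is a small sketch of the release-store pattern in C++11 atomics terms (which postdate this discussion; this is a hypothetical analogue, not TBB's actual code): the release store on the flag keeps the store inside the critical section from being reordered past the unlock.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo (not TBB code): 'data' stands for a store made inside
// a critical section, and 'flag' plays the role of mutex->state.  The
// release store on 'flag' guarantees the 'data' store is visible before
// the lock appears free; the paired acquire load lets the reader see it.
inline int run_release_demo() {
    int data = 0;
    std::atomic<int> flag{1};  // 1 = locked

    std::thread writer([&] {
        data = 42;                                 // store inside critical section
        flag.store(0, std::memory_order_release);  // unlock: release store
    });
    std::thread reader([&] {
        while (flag.load(std::memory_order_acquire) != 0) { }  // spin until unlocked
    });
    writer.join();
    reader.join();
    return data;
}
```

With a plain (relaxed) store to the flag, nothing would stop the hardware from making the unlock visible before the data store, which is exactly the failure observed on PowerPC.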




==== tbb/include/tbb/spin_mutex.h ====
***************
*** 112,118 ****
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
              internal_release();
  #else
!             my_mutex->flag = static_cast(my_unlock_value);
              my_mutex = NULL;
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
          }
--- 112,119 ----
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
              internal_release();
  #else
!             // my_mutex->flag = static_cast(my_unlock_value);
!             __TBB_store_with_release(my_mutex->flag, static_cast(my_unlock_value));
              my_mutex = NULL;
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
          }
***************
*** 123,129 ****
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
                  internal_release();
  #else
!                 my_mutex->flag = static_cast(my_unlock_value);
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
              }
          }
--- 124,131 ----
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
                  internal_release();
  #else
!                 // my_mutex->flag = static_cast(my_unlock_value);
!                 __TBB_store_with_release(my_mutex->flag, static_cast(my_unlock_value));
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
              }
          }



==== tbb/include/tbb/tbb_machine.h ====
***************
*** 101,106 ****
--- 101,109 ----
      //! This definition works for compilers that insert acquire fences for volatile loads, and load T atomically.
    template<typename T>
      inline T __TBB_load_with_acquire(T const volatile& location) {
+ #if _AIX
+         __tbb_flush ();
+ #endif
          return location;
      }
  #endif
***************
*** 109,114 ****
--- 112,120 ----
      //! This definition works only for compilers that insert release fences for volatile stores, and store T atomically.
    template<typename T, typename V>
      inline void __TBB_store_with_release(volatile T &location, V const& value) {
+ #if _AIX
+         __tbb_flush ();
+ #endif
          location = value; 
      }
  #endif



==== tbb/src/tbb/spin_mutex.cpp ====
***************
*** 46,52 ****
      __TBB_ASSERT( !(my_unlock_value&1), "corrupted scoped_lock?" );
  
      ITT_NOTIFY(sync_releasing, my_mutex);
!     my_mutex->flag = static_cast(my_unlock_value);
      my_mutex = NULL;
  }
  
--- 46,53 ----
      __TBB_ASSERT( !(my_unlock_value&1), "corrupted scoped_lock?" );
  
      ITT_NOTIFY(sync_releasing, my_mutex);
!     // my_mutex->flag = static_cast(my_unlock_value);
!     __TBB_store_with_release(my_mutex->flag, static_cast(my_unlock_value));
      my_mutex = NULL;
  }
  



==== tbb/src/tbb/spin_rw_mutex.cpp ====
***************
*** 68,74 ****
  void spin_rw_mutex::internal_release_writer(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state of a write lock" );
      ITT_NOTIFY(sync_releasing, mutex);
!     mutex->state = 0; 
  }
  
  //! Acquire lock on given mutex.
--- 68,75 ----
  void spin_rw_mutex::internal_release_writer(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state of a write lock" );
      ITT_NOTIFY(sync_releasing, mutex);
!     // mutex->state = 0; 
!     __TBB_store_with_release(mutex->state, 0);
  }
  
  //! Acquire lock on given mutex.
***************
*** 119,125 ****
  void spin_rw_mutex::internal_downgrade(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state before downgrade" );
      ITT_NOTIFY(sync_releasing, mutex);
!     mutex->state = 4; // Bit 2 - reader, 00..00100
      __TBB_ASSERT( mutex->state & ~state_t(3), "invalid state after downgrade: no readers" );
      __TBB_ASSERT( (mutex->state & 1)==0, "invalid state after dowgrade: active writer" );
  }
--- 120,127 ----
  void spin_rw_mutex::internal_downgrade(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state before downgrade" );
      ITT_NOTIFY(sync_releasing, mutex);
!     // mutex->state = 4; // Bit 2 - reader, 00..00100
!     __TBB_store_with_release(mutex->state, 4); // Bit 2 - reader, 00..00100
      __TBB_ASSERT( mutex->state & ~state_t(3), "invalid state after downgrade: no readers" );
      __TBB_ASSERT( (mutex->state & 1)==0, "invalid state after dowgrade: active writer" );
  }



==== tbb/src/tbb/task.cpp ====
***************
*** 1244,1250 ****
      __TBB_ASSERT( dummy_slot.task_pool->prefix().steal_beginsteal_end = 2*deepest;
  }
  
  #if BALANCED_EXCHANGE
--- 1244,1251 ----
      __TBB_ASSERT( dummy_slot.task_pool->prefix().steal_beginsteal_end = 2*deepest;
!     __TBB_store_with_release(arena_slot->steal_end, 2*deepest);
  }
  
  #if BALANCED_EXCHANGE
***************
*** 1445,1451 ****
          tp->prefix().steal_begin = i;
      // Release the task pool
      ITT_NOTIFY(sync_releasing, &arena_slot);
!     arena_slot.steal_end = steal_end;
  }
  done:
      return result;
--- 1446,1453 ----
          tp->prefix().steal_begin = i;
      // Release the task pool
      ITT_NOTIFY(sync_releasing, &arena_slot);
!     // arena_slot.steal_end = steal_end;
!     __TBB_store_with_release(arena_slot.steal_end, steal_end);
  }
  done:
      return result;
***************
*** 1694,1700 ****
              break;
          }
      }
!     arena_slot->steal_end = 2*deepest+1;
  }
  
  void GenericScheduler::leave_arena( bool compress ) {
--- 1696,1703 ----
              break;
          }
      }
!     // arena_slot->steal_end = 2*deepest+1;
!     __TBB_store_with_release(arena_slot->steal_end, 2*deepest+1);
  }
  
  void GenericScheduler::leave_arena( bool compress ) {
***************
*** 1724,1730 ****
          for(;;) {
            new_limit = arena->prefix().limit.compare_and_swap( k, k+1 );
              ITT_NOTIFY(sync_releasing, &arena->slot[k]);
!             arena->slot[k].steal_end = -4;
              if( new_limit!=k+1 ) break;
              if( --kslot[k].steal_end==-4 && __TBB_CompareAndSwapW( (volatile void *)&(arena->slot[k].steal_end), -4|1, -4 )==-4 ) {
--- 1727,1734 ----
          for(;;) {
              new_limit = arena->prefix().limit.compare_and_swap( k, k+1 );
              ITT_NOTIFY(sync_releasing, &arena->slot[k]);
!             // arena->slot[k].steal_end = -4;
!             __TBB_store_with_release(arena->slot[k].steal_end, -4);
              if( new_limit!=k+1 ) break;
              if( --kslot[k].steal_end==-4 && __TBB_CompareAndSwapW( (volatile void *)&(arena->slot[k].steal_end), -4|1, -4 )==-4 ) {
***************
*** 1735,1741 ****
          }
      } else {
          ITT_NOTIFY(sync_releasing, &arena->slot[k]);
!         arena->slot[k].steal_end = -4;
      }
  }
  
--- 1739,1746 ----
          }
      } else {
          ITT_NOTIFY(sync_releasing, &arena->slot[k]);
!         // arena->slot[k].steal_end = -4;
!         __TBB_store_with_release(arena->slot[k].steal_end, -4);
      }
  }

Dear Urs

Thank you very much for the wonderful patch.

Could we ask a big favor of you? Could you please submit the patch through the 'contributions' tab on the TBB web site? That way we would be able to track who submits which patches, and so on.

Thank you very much !!!

-wooyoung
TBB developer

It looks like you've caught some interesting cases of undeclared atomic uses. Thanks for the contribution, though if you could follow the process suggested by wooyoung, that would help us a lot.

I did a little digging on this issue and have a couple of questions. I see that the patch adds an AIX conditional that calls something named __tbb_flush. I don't see a definition of __tbb_flush, either in this patch or in my recent version of the source. I assume this is where the meat of the weakly consistent memory model meets the road. Do you have an implementation for this function?

Doing a bit of googling, I ran across the following excerpt from an IBM white paper:

The architects defined the PowerPC memory model to be weakly consistent, allowing memory accesses to complete out of order. The Enforce In-order Execution of I/O (eieio) instruction definition gives software an efficient means of controlling the order in which accesses to I/O devices complete.

Because POWER Architecture lacked an atomic memory operation, two new instructions provide a means of performing a memory operation that appears to read and then modify a memory location as one atomic operation. These instruction are similar to those proposed in a report from the Lawrence Livermore National Laboratory [6]. The Load Word and Reserve (lwarx) instruction copies the content of the target memory location, which can be regarded as a lock variable, into a register and then creates a reservation on that location. The program could use the loaded value to compute the new value to be stored in the lock variable. When the processor executes a Store Word Conditional (stwcx) instruction, the processor attempts to store the new value. If the lock variable has not been modified since the lwarx completed, the processor performs the store; otherwise it does not perform the store. In both cases, the processor records the status to indicate the result. If the stwcx fails to perform the store, the programmer can repeat the sequence.

Because these are new instructions that do not affect the binary compatibility of POWER applications, the architecture defined them to access only word-aligned memory locations. If these instructions attempt to access a location that is not word-aligned, the architecture does not define the result of the access. This definition permits simpler implementations, avoiding the need for hardware checking for a programming error that is easily avoided.

At first I thought this was a spoof site, reading about that Enforce In-order Execution of I/O (EIEIO) instruction (old MacDonald must be rolling in his grave ;-), but on further reflection, this lwarx/stwcx pair appears to be a means to handle atomics correctly in the weakly consistent memory model on PowerPC. Is this the approach you're using with __tbb_flush? It seems for this instruction pair to work that you'd need to use the volatile address and value in the call, so perhaps you have some other means of ensuring atomicity on PowerPC?

p.s., the latest TBB source release has a file, include/tbb/machine/mac_ppc.h, which implements the weakly consistent memory access recommended to mimic atomic accesses using lwarx/stwcx on PowerPC. Perhaps something similar can be used for AIX PPC?
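In later C++11 terms, the lwarx/stwcx retry pattern from the excerpt could be sketched as a compare-exchange loop (a hypothetical analogue for illustration, not TBB or AIX code):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the lwarx/stwcx. idiom from the quoted white paper: load the
// word, compute a new value, attempt a conditional store, and retry if
// another processor modified the word in between.  compare_exchange_weak
// may fail spuriously, just as stwcx. may, which is why it sits in a loop.
inline std::int32_t fetch_and_add(std::atomic<std::int32_t>& word,
                                  std::int32_t addend) {
    std::int32_t old_value = word.load(std::memory_order_relaxed);
    while (!word.compare_exchange_weak(old_value, old_value + addend,
                                       std::memory_order_seq_cst)) {
        // old_value has been refreshed with the current contents; retry.
    }
    return old_value;  // the value observed before the update
}
```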

Clarification: vonmatt's patch (thanks!) addresses memory fence issues, not atomicity. lwarx/stwcx addresses atomicity issues. Itanium has a weak consistency model too, but has a compiler convention where volatile loads imply an acquire fence and volatile stores imply a release fence. Compilers for the Power architecture apparently do not have this convention, hence we need to be careful in the future not to depend upon the Itanium convention. Furthermore, Itanium has totally ordered atomic operations, whereas my impression is that Power has a weaker model for the ordering of atomics. So we'll also have to be careful not to depend on totally ordered atomics unless the atomic operations are explicitly defined to be totally ordered (e.g. have a flush fence on Power).

- Arch Robison

I just submitted my patches through the
contributions tab.  It includes two additional
files that I introduced for the IBM:

    tbb/include/tbb/machine/ibm_aix51.h:
      definitions for IBM AIX

    tbb/src/tbb/ibm_aix51/tbb_atomic.c:
      support for atomic operations

For my porting effort I used the atomic
operation 'compare_and_swap' which is part of
the standard C runtime library on IBM (I did
not use the lwarx/stwcx PowerPC instructions
myself, but compare_and_swap is probably using
them).  According to the man-page, this
function only performs the atomic operation.
It does not provide any memory fences:

    The compare_and_swap subroutine performs an
    atomic operation which compares the
    contents of a single word variable with a
    stored old value. If the values are equal,
    a new value is stored in the single word
    variable and TRUE is returned; otherwise,
    the old value is set to the current value
    of the single word variable and FALSE is
    returned.

    The compare_and_swap subroutine is useful
    when a word value must be updated only if
    it has not been changed since it was last
    read. Note: The word containing the single
    word variable must be aligned on a full
    word boundary. Note: If compare_and_swap is
    used as a locking primitive, insert an
    isync at the start of any critical
    sections.

To add the memory fences I introduced my own
two functions __tbb_cas_32() and
__tbb_cas_64().  Because a compare-and-swap
could happen at the beginning or at the end of
a critical section, I always perform both the
release and the acquire operation ("sync" and
"isync" assembly codes).  I'm sure that these
functions could be optimized further by using
inline assembly code instead of a
compare_and_swap function call.  But at least
this code seems to be correct and has been
working reliably.
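A rough portable analogue of that wrapper, in later C++11 terms (the name __tbb_cas_32 comes from my patch; this sketch is illustrative, not the AIX code itself), would be a compare-and-swap ordered on both sides, so it is safe whether the CAS begins or ends a critical section:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical analogue of the post's __tbb_cas_32: a compare-and-swap
// with full ordering on both sides.  memory_order_seq_cst provides at
// least the ordering of the sync/isync pairing; AIX's bare
// compare_and_swap subroutine alone would not.
inline std::int32_t tbb_cas_32(std::atomic<std::int32_t>& location,
                               std::int32_t new_value,
                               std::int32_t comparand) {
    std::int32_t expected = comparand;
    location.compare_exchange_strong(expected, new_value,
                                     std::memory_order_seq_cst);
    return expected;  // the old value, per TBB's CAS convention
}
```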

Additionally, I introduced the function
__tbb_flush() which works as an explicit memory
fence.  It is only invoked in
__TBB_load_with_acquire() and
__TBB_store_with_release().

- Urs von Matt

The patches won't make it into this month's developer release, but I expect them to be in the next developer release (probably near the end of November).

We do not have an AIX machine here for testing, so I'm trying to revise the patch in a way so we detect most of the problems on our Itaniums. E.g., by removing volatile qualifiers from field declarations. That way, if we use a plain assignment instead of __TBB_load_with_acquire or __TBB_store_with_release, the fence will be missing on Itanium too, and our unit tests will let us know about it.

Thanks again for contributing the patches.

I believe the contributed patch slightly misplaces the __tbb_flush() in __TBB_load_with_acquire, because it allows another load to migrate above the load of location. Here's a revised version that prevents that upward migration.

template<typename T>
inline T __TBB_load_with_acquire(const volatile T& location) {
    T temp = location;
#if _AIX
    __TBB_flush ();
#endif /* _AIX */
    return temp;
}
#endif

I think you are right with your change. The correct sequence should be:

acquire lock
sync
protected operations
sync
release lock

This is also the recommendation that I see in the PowerPC architecture book, volume II, pp. 47-48 (http://www.ibm.com/developerworks/power/library/pa-archguidev2).

Urs
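In later C++11 terms, the corrected fence placement corresponds to a relaxed load followed by an acquire fence (a hypothetical sketch for illustration, not the AIX code itself):

```cpp
#include <atomic>

// Sketch of the corrected placement: read the location first, then fence,
// so no later load can migrate above the load of 'location' (mirroring the
// revised __TBB_load_with_acquire discussed above).
template <typename T>
inline T load_with_acquire(const std::atomic<T>& location) {
    T temp = location.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // fence AFTER the load
    return temp;
}
```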
