xbox port possible issue

xbox port possible issue

Hi,

I think that there might be an issue in the xbox360 port available in the stable version tbb30_20100406oss.

The xbox implementation of__TBB_load_with_acquire and __TBB_store_with_release use the tbb_machine.h generic implementation. It is based on volatile read and write, and a call to __TBB_release_consistency_helper.

According to the documentation about the xbox360 lockless programming http://msdn.microsoft.com/en-us/library/ee418650(VS.85).aspx, volatile read and write behave that way: "Reading and writing of volatile variables using Visual Studio C++ 2005 prevents CPU read/write reordering on Windows, but on Xbox 360, it only prevents compiler read/write reordering".

Also, __TBB_release_consistency_helper is defined as _ReadWriteBarrier in xbox360_ppc.h. According to the same documentation, this intrinsic prevents from compiler reordering.

So the acquire and release semantic of __TBB_load_with_acquire and __TBB_store_with_releaseare effective on the compiler side, thanks to the volatile keyword, but nothing prevents cpu reordering. According to the same documentation again, the __lwsync intrinsic should be used in that purpose.

So I think there is anerrorin __TBB_release_consistency_helper macro definition, where I would addor replace __lwsync.

Am I wrong?

Regards
Guilllaume

19 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi Guillaume,

Actually you are likely to be exactly right :)

TBB team does not have Xbox360 development license and hardware. So we do not "officially" maintain Xbox360 port. The port you see in the package was an external contribution by Roman Lut of Deep Shadows done about two years ago. Macro __TBB_release_consistency_helper was added later, and probably has never been tested.

Now that I'm looking at the code I think we may need __TBB_rel_acq_fence to be "sync" instead of "lwsync". BTW, what is the intrinsic for "sync" in XBOX compiler?

If you are willing to try TBB 3.0 on Xbox, then besides this fence stuff you will need to add missing export symbols into src/xbox360-tbb-export.def. They are for the functions added after the original port was done. Since Xbox compiler has a mangling scheme different from the oneused by MS on other architectures, we were unable to deduce them.

In case of any further questions you are welcome at this forum. And if you manage to make it work :) we would gladly accept your contribution :).

"Now that I'm looking at the code I think we may need __TBB_rel_acq_fence to be "sync" instead of "lwsync"."
Why? Shouldn't somebody who wants those semantics have to bring their own fence? It seems needlessly expensive to invoke a more powerful and probably (significantly?) more expensive instruction than exactly what is needed to do the job.

__TBB_rel_acq_fence is supposed to be a full fence. And "lwsync" allows load-before-store reordering. Such reordering breaks the core part of stealing arbitration protocol in the scheduler.

"__TBB_rel_acq_fence is supposed to be a full fence. And "lwsync" allows load-before-store reordering. Such reordering breaks the core part of stealing arbitration protocol in the scheduler."
Whoops, sorry, I just quickly peeked at http://www.ibm.com/developerworks/systems/articles/powerpc.html and jumped to conclusions based on all the adapter device driver talk, skipping the table immediately after it and forgetting that I myself had earlier used "sync" to prohibit store/load reordering. I don't have the latest sources here (and other stuff to do), but I hope that the code would still use "lwsync" instead of "sync" where appropriate?

Sure. Interlocked operations and macros for standalone acquire/release fences will continue using __lwsync.

BTW, we all have initially overlooked the fact that __lwsync does not actually enforces acq_rel ordering. I vaguely recollect that we were misguided by one of C++0x proposal docs that claimed that it does.

"BTW, we all have initially overlooked the fact that __lwsync does not actually enforces acq_rel ordering. I vaguely recollect that we were misguided by one of C++0x proposal docs that claimed that it does."

Hmm... that document says:

"lwsync applicability is limited to the following cases:
1. loads preceding and following.
2. stores preceding and following.
3. loads preceding and stores following.
Thus stores preceding and loads following are not applicable in the case of lwsync."

Hi,

Are "__TBB_release_consistency_helper" and "__TBB_rel_acq_fence" expected behaviors explained somewhere?

Indeed I'm trying to make tbb3.0working on xboxfor our engine needs.I'd be glad to contribute to tbb if I manage to make it work, but a lot of things (like intrinsics and compiler specificities)are under nda. I took care when I've created mythread on that forum that all the things I was speaking about were publicly available (http://msdn.microsoft.com/en-us/library/ee418650(VS.85).aspx), but I need to check with MS if I can go further.

Many thanks
Guillaume

"Sure. Interlocked operations and macros for standalone acquire/release fences will continue using __lwsync."
I'm having a look at tbb30_20100406oss right now, and I see that TBB still only provides fully fenced read-modify-write operations (the only kind possible on x86), which for processors in the POWER/PowerPC family means doubled up (full fence on both sides), and often even in a loop, whereas, unless I'm mistaken, these processors should normally be able to provide even relaxed read-modify-write, e.g., for smart pointers.

I'm also not too enthused by the assumption that volatile reads and writes are indivisible, but maybe that's just me.

(Correction) The text above was modified from its original version, which was quite wrong.

After internal discussion among the TBB team, I believe that necessary fixes for xbox360_ppc.h are:

  • __TBB_release_consistency_helper() should expand to a __lwsync instruction.
  • __TBB_rel_acq_fence() should expand to a heavy weight sync instruction.

In both cases, compiler fences should be added if necessary to make the compiler respect the same ordering rules as those enforced by the instruction.

It appears that we misnamed __TBB_rel_acq_fence(), because it can easily be confused with the C++0x atomic_thread_fence(memory_order_acq_rel).Where __TBB_rel_acq_fence() is used inside TBB, it is usually used for the idiom "write; fence; read" where the intent is to establish a consensus between two threads. For example, as in Dekker's algorithm. That requires a fence stronger than a C++0x memory_order_acq_rel fence (Dmitriy V'yukov recently enlightened me on this point).

I'll try the fixes you recommend, but I'd also like to understand __TBB_rel_acq_fence() a bit more (which seems to haveaconfusing name indeed).

Quoting Arch Robison (Intel)

Where __TBB_rel_acq_fence() is used inside TBB, it is usually used for the idiom "write; fence; read" where the intent is to establish a consensus between two threads. For example, as in Dekker's algorithm. That requires a fence stronger than a C++0x memory_order_acq_rel fence (Dmitriy V'yukov recently enlightened me on this point).

Could you explain the fence requirement of the idiom "write; fence; read" that uses tbb or the Dekker's algorithm?

Here is the idiom. There are two shared global variables X and Y that are initially zero. Two threads execute the following:

Thread 1                          Thread 2
--------                             --------
X = 1;                              Y = 1;
__TBB_rel_acq_fence();     __TBB_rel_acq_fence();
r1 = Y;                             r2 = X;

The implementation of __TBB_rel_acq_fence() needs to guarantee that the outcome r1=r2=0 cannot happen. For example, on IA-32 loads can be reordered with earlier stores, which allows the outcome r1=r2=0 if the fences are missing. The compiler similarly must be prevented from hoisting the loads above the stores.

On PowerPC, lwsync apparently is not strong enough to prevent a load from being reordered with an earlier store. Hence it seems that a heavy weight sync is required to prevent r1=r2=0.

When we named __TBB_rel_acq_fence(), we were thinking of as something that atomically executed a release fence and an acquire fence, and thus prevented earlier stores and later loads from being reordered across it. But the similary named C++0x fence functionality seems to have subtlely different semantics since http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2009.02.22a.htmlrecommends implementinga acquire-release fence with lwsync.

Hi,

you said: "In both cases, compiler fences should be added if necessary to make the compiler respect the same ordering rules as those enforced by the instruction."

Indeed it seems to be needed as Fibonacci test case fails without it.

I tried the following impelmentation of the acquire and release semantic:
-Acquire: compiler read fence followed by lwsync
-Release: compiler write fence followed by lwsync

But it doesn't work. It seems that in both acquire and release cases a "read and write" compiler fence is required. Can you understand?

Could you tell me which compiler fence (read or write) shouldbe used for both acquire and release semantics, and if it should be beforeor after the cpu fence.

Regards
Guillaume

The default load-with-acquire routine is:

    template
    inline T __TBB_load_with_acquire(const volatile T& location) {
        T temp = location;
        __TBB_release_consistency_helper();
        return temp;
    }

This routine must guarantee that the load of location happens before any subsequent (in program order) memory operation. For example, if you imagine the dynamic execution of the program, the sequence of relevant statements might be something like:

    T temp = location;
    lwsync;
    ... = some_other_location; // Statement in caller after call to __TBB_load_with_acquire

You must keep the compiler from moving the lwsync either up or down, because it must come between the other two statements. You need a read barrier on both sides of the lwsync to ensure it is properly sandwiched between the other statements.

Likewise the default store-with-release routine is:

    template
    inline void __TBB_store_with_release(volatile T& location, V value) {
        __TBB_release_consistency_helper();
        location = T(value);
    }

and the dynamic execution, including a prior store by the caller, should expand to:

    some_other_location = ...;  // Prior store by caller
    lwsync;
    location = T(value);

Once again, the lwsync must be kept properly sandwiched between the two stores. You need a write barrier on both sides of the lwsync to do this. My recommendation would be to make __TBB_release_consistency_helper() expand to a lwsync instruction with compiler read-write barriers on both sides of it. E.g., something similar to _ReadWriteBarrier mentioned in Lockless Programming Considerations for Xbox 360 and Microsoft Windows. That way the compiler will keep it sandwhiched between whatever programmatically comes before and after it.

One thing to check is that TBB works right when using only one thread; i.e. when initialized with "task_scheduler_init init(1)". If that's not working, then the problem is something other than memory barriers.

I tried to add 2 compiler read barriers on both side of the lwsync for acquire semantics, and 2 compiler write barriers on both side of the lwsync for release semantics. But it fails the same way. I've done a few tests:
- It works with 2 "read and write" compiler barriers on both sides of the lwsync for acquire and release semantics
- It works if I use 2 write compiler barriers for acquire and 2 read compiler barriers for the release, which is the opposite of what we would have expected.
Can you understand why?

Also I've notice that the test case that fails is SerialQueueFib which is running on a single thread I think. So according to what you're saying, this should be due tosomething other than memory barriers. So how would you explain the behavior I've noticed, why would adding a compiler barrier change something, and why would it only fail in optimised/release builds? If you can give me a hint, I'll investigate it.

Regards
Guillaume

It's interesting observations.

In fact both acquire and release fences must prevent reordering of both stores and loads. They just prohibit reordering in one direction each. Acquire fence prevents hoisting subsequent loads/stores above it, and release fence prevents sinking preceding loads/stores below the fence.

Thus if the compiler does not heed __lwsync() intrinsic itself, only _ReadWriteBarrier can ensure the appropriate semantics. The fact that less restricting barriers worked for you probably means that the problems were caused by reordering of the corresponding kinds of instruction in the specific test cases.

As for the SerialQueueFib failure, the only plausible idea I have is that it may be an optimizer bug :) I'd suggest to disable this particular test and see if other ones are running OK.

Does test_atomic.cpp report any problems? It has a test HammerLoadAndStoreFence that is supposed to detect problems with the implementation of load-with-acquire and store-with-release. It has code sequences specifically designed to trick a sloppyoptimizer into violating fence rules.

We do not have good unit tests so far forproper fencing of read-modify-write operations. If someone wants to contribute additions to test_atomic.cpp that check for proper fencing of read-modify-write operations, we'll be happy to take them.

Thanks very much for the explanations...

I've disabled SerialQueueFib, and the other tests (of Fibonnacci.cpp) are running ok. I'd like to find the reason for that (because in my xp its rarelyan optimizer bug :))...
Is there another TBB test case that stresses compiler and cpu fences a lot?

Regards
Guillaume

If you can find the SerialQueueFib problem, it would be much appreciated. Since the bug shows up when only a single thread is in use, finding the problem is no harder than debugging sequential code :-). I would find the smallest value of n for which it fails and then either single-step it or add print statements showing what it is popping and pushing on the queue.

From what I remember, concurrent_vector is fairly aggressive about pushing the limits of acquire/release semantics. So test_concurrent_vector.exe would be good to run. But the number of events where the acquire/release comes into play with class concurrent_vector is small, because they happen only when it has to grow its internal structures, which is O(lg N) for a concurrent_vector of length N.

Any of the test_parallel_* tests are good for flogging the task scheduler, which internally depends upon various fences working correctly.

Leave a Comment

Please sign in to add a comment. Not a member? Join today