Overhead of HLE acquire and release

Post by Rolf Andersson

Hi,

(also posted as a comment to a blog entry re tsx-tools by Andy)

I've just started playing around with the new TSX feature set.

I wrote a quick test that loops over lock;xchgl and movl, with and without HLE prefixes.
To my surprise, the version with HLE prefixes seems to be ~50% slower.
Is the test invalid or irrelevant for some reason?
Am I doing something wrong, or is this expected?

Thanks,
Rolf

---

The test was run on a MacBook Air with an i7-4650U 1.7 GHz (Haswell) CPU

tsx-tools reports:
Rolfs-MacBook-Air:tsx-tools ran$ ./has-tsx
RTM: Yes
HLE: Yes
Rolfs-MacBook-Air:tsx-tools ran$

The code enclosed below was compiled with:
Rolfs-MacBook-Air:ran ran$ clang -O4 -o tt tt.c -lc

Rolfs-MacBook-Air:ran ran$ time ./tt 1 100000000

real 0m1.616s
user 0m1.612s
sys 0m0.004s
Rolfs-MacBook-Air:ran ran$ time ./tt 2 100000000

real 0m1.063s
user 0m1.061s
sys 0m0.002s
Rolfs-MacBook-Air:ran ran$

Source code for tt.c is attached.

Attachment: tt.c (1.65 KB)
Post by Rolf Andersson

responding to my own post with some follow-up info:

I just ran pcm-tsx.x (from PCM 2.5.1) while executing "tt 1 1000000000" and there were no transactional cycles according to pcm-tsx.

Any assistance in explaining what is going on would be much appreciated.

Post by Roman Dementiev (Intel)

Hi Rolf,

The HLE mechanism is not easy to benchmark with such an unrealistic test. On the 4th generation Intel Core architecture, HLE/RTM/TSX should be used for critical sections that do a non-trivial amount of useful work (please use a real application with lock contention to evaluate TSX), usually with a small to moderate level of data contention. In contrast to small synthetic microbenchmarks with tight loops, in a real application the TSX overheads can mostly be hidden behind the out-of-order execution of the microarchitecture. Section 12.5, "TSX Performance Guidelines", of the optimization manual is worth consulting.

Roman

Post by Rolf Andersson

Hi Roman, thank you for your prompt reply.

I started out trying to use lock elision for one of our applications, but got inconsistent results, so I tried to simplify the code. I ended up with the purely synthetic case that I asked about above. I realize that it is unrealistic, but I'm still curious about the overhead of acquire and release: if I have understood correctly, the locking part of "lock;xchg" would be elided, so I expected some additional cost for HLE but, at the same time, a cost saving for the elided lock. Are latency and throughput numbers available (or planned) for TSX, and is there a way to measure or estimate the savings for the elided lock?

To my other question about pcm-tsx not showing any transactional cycles, is there some other way to discern that the acquire and release operations have actually been executed?

Again, thanks for providing feedback.

Best,
Rolf 

Post by Roman Dementiev (Intel)

Rolf,

one of the HLE/TSX goals is to let concurrent threads easily avoid unnecessary serialization of critical sections (see this blog) and serialization in the lock internals themselves (think of the hardware serialization on the atomic increment/decrement of the reader counter in typical read-write lock implementations: it can easily be avoided with an RTM wrapper around the RW-lock; see Chapter 12 for examples of this). You can apply Amdahl's-law reasoning when estimating the potential performance benefit of TSX for your application.

Please post the pcm-tsx output here.

Thanks,

Roman

Post by Rolf Andersson

Roman,
my hypothesis was that "lock;xchg" would incur coherence traffic and thus a number of cycles of latency (roughly an LLC access?), and that lock elision would somehow hide that latency (or part of it). I may have misunderstood how lock elision works.

Re concurrent threads: the naive test case I wrote essentially covers a situation with zero contention. I would have thought that the elision mechanism would work the same regardless of the zero contention, except that there would be no transaction aborts?

Output from pcm-tsx follows below.

Thanks,
Rolf 

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.71 236 M 335 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.31 15 M 51 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.59 166 M 282 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.25 13 M 51 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.60 432 M 721 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.38 99 M 262 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.25 12 M 51 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.38 90 M 241 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.26 10 M 41 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.36 213 M 596 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.56 142 M 255 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.52 47 M 90 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.15 276 M 1891 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.54 32 M 59 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.22 498 M 2297 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.39 92 M 236 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.61 118 M 192 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.10 617 M 6338 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.44 39 M 88 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.13 867 M 6855 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.17 122 M 721 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.87 193 M 223 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.10 574 M 5862 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.58 76 M 130 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.14 966 M 6939 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.17 173 M 1038 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.40 44 M 110 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.10 559 M 5622 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.83 62 M 75 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.12 840 M 6846 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.10 630 M 6499 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.21 2483 K 12 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.41 58 M 144 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.90 196 M 217 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.13 887 M 6873 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.10 630 M 6498 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.24 2905 K 12 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.33 45 M 135 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 1.00 219 M 218 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.13 897 M 6864 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.10 629 M 6463 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.49 33 M 67 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.63 136 M 217 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.74 151 M 204 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.14 950 M 6952 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.10 627 M 6473 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.34 16 M 48 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.61 89 M 147 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.71 72 M 101 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.12 806 M 6770 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.14 644 M 4543 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.70 204 M 291 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.57 201 M 352 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.43 144 M 336 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.22 1194 M 5525 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1502 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.62 144 M 232 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.27 12 M 45 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.60 123 M 204 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.38 18 M 48 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.56 298 M 530 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.54 146 M 268 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.56 40 M 72 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.54 132 M 246 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.34 17 M 51 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.53 336 M 638 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Time elapsed: 1501 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.46 97 M 210 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.25 8685 K 34 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.39 77 M 201 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.23 10 M 44 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.39 193 M 490 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

 

Post by Rolf Andersson

further to my post above, here is another run:

(tt.sh does ./tt 1 1000000000)

ring ran$ pcm-tsx.x ./tt.sh

Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility

Copyright (c) 2013 Intel Corporation

Num logical cores: 4
Num sockets: 1
Threads per core: 2
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3066666659 Hz

Detected Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz "Intel(r) microarchitecture codename unknown"
Update every 0 seconds

Executing "./tt.sh" command:

Exit code: 0

Time elapsed: 11983 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.12 3062 M 24 G 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
1 0.48 421 M 880 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
2 0.11 3117 M 28 G 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
3 0.40 523 M 1310 M 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.13 7124 M 55 G 0 ( 0.00%) 0 ( 0.00%) 0 0 N/A

Cleaning up
ring ran$

Post by jimdempseyatthecove

Rolf,

One of the beneficial characteristics of HLE is that it makes your code immune to the annoying (catastrophic?) problem of the lock-holder thread being preempted by the OS for an interrupt or context switch, which blocks other threads' entry for the duration of the interrupt/preemption. This cannot happen with HLE: the preemption undoes the transaction while permitting other threads to pass through the transactional section of code. In other words, this alleviates the need to write a wait-free algorithm when avoiding such preemption becomes necessary.

Jim Dempsey

 

www.quickthreadprogramming.com
Post by Rolf Andersson

Jim,
this is a very good point. While I originally got interested in RTM and HLE for the possibility of lower latency uncontended access, preemption is sometimes really bad if not catastrophic as you commented. Thanks for the insight.

I am still trying to get my head around the performance and latency aspects of HLE. The use case I'm looking at just now is the claim and publish steps of a multi-producer queue. Our current implementation uses CAS (cmpxchgq); the prototype HLE version uses a spin lock loosely based on comments in various blog entries and other places (snippets enclosed below; cut and pasted, so I hope I got everything right).

It feels like I'm missing some important point here (or possibly some really trivial point ;)
Do you have any ideas as to what can be expected?

Best,
Rolf


typedef unsigned long u64;

#define __v64(x) ((volatile u64*) (x))
#define __HLE_ACQUIRE ".byte 0xf2 ; "
#define __HLE_RELEASE ".byte 0xf3 ; "

static inline u64 __ia_cas64 (volatile void* data, u64 curr, u64 next)
{
    u64 prev;
    asm volatile ("lock;cmpxchgq %1,%2"
                  : "=a" (prev)                                  // output
                  : "r" (next), "m" (*__v64 (data)), "0" (curr)  // inputs
                  : "memory");
    return prev;
}

static inline void __hle_lock (volatile void* lock)
{
    u64 value = 1;
    asm volatile ("1: " __HLE_ACQUIRE "lock; xchgq %0,%1\n"
                  "   cmpq $0,%0\n"   // prev == 0 ?
                  "   jz 3f\n"
                  "2: pause\n"        // abort transaction
                  "   cmpq $1,%1\n"   // lock == 1 ?
                  "   jz 2b\n"
                  "   jmp 1b\n"
                  "3: \n"
                  : "+r" (value), "+m" (*__v64 (lock))
                  :: "memory");
}

static inline void __hle_unlock (volatile void* lock)
{
    asm volatile (__HLE_RELEASE "movq $0,%0"
                  : "+m" (*__v64 (lock)) :: "memory");
}

static inline u64 __hle_cas64 (volatile void* lock, volatile u64* data,
                               u64 curr, u64 next)
{
    __hle_lock (lock);
    u64 temp = *data;
    if (temp == curr)
        *data = next;
    __hle_unlock (lock);
    return temp;
}

Post by Roman Dementiev (Intel)

Quote:

my hypothesis was that "lock;xchg" would incur coherence traffic and thus a number of cycles of latency (~access to LLC?), and that the lock elision operation somehow would hide that latency (or part thereof). I may have misunderstood how lock elision works.

in your single-threaded (non-HLE) baseline test, the cache line with the lock is always kept in the local L1 cache, so there are no LLC accesses or other expensive cache misses.

But if you run the non-HLE baseline on many cores, the xchgl accesses to the lock will experience cache misses, since other cores will often have a more recent version of the cache line containing the lock word. This more recent copy needs to be transferred to your core with "write permissions" before the state of the lock can be modified. This is sometimes referred to as coherency cache misses, or lock cache-line transfer/shipping overhead. With HLE, the lock-word modification is elided (and not seen by other cores). The XACQUIRE xchgl operation does not issue the "write permission" request to other cores, so there are no coherency cache misses and no lock-shipping overhead.

Roman

Post by Roman Dementiev (Intel)

Quote:

I am still trying to get my head around the performance and latency aspects of HLE. The use case I'm looking at just now is the claim and publish steps of a multi-producer queue. Our current impl uses CAS (cmpxchgq), the prototype HLE version uses a spin lock loosely based on comments in various blog entries and other places (snippets enclosed below, cut and pasted, so I hope I got everything right).

I must say that emulating CAS 1-to-1 using TSX is not a good idea.

Essentially, Intel TSX is there to let developers avoid hard, error-prone thinking about how to express their higher-level algorithms and data-structure operations in terms of low-level CAS to get a high level of concurrency. Instead, developers can use normal implementations with ordinary memory loads and stores (as generated by any compiler by default) and pack their bigger, higher-level data operations into TSX critical sections.

Roman

Post by Roman Dementiev (Intel)

Quote:

To my other question about pcm-tsx not showing any transactional cycles, is there some other way to discern that the acquire and release operations have actually been executed?

please re-run pcm-tsx with the patch I had provided for MacBook Air. On my system with Intel(r) Core(tm) i7-4770 I see 100M HLE starts (#HLE column) and about 68% transactional cycles in pcm-tsx output for your microbenchmark.

pcm-tsx.x "./tt 1 100000000"

Thanks,

Roman

Post by Rolf Andersson

Roman,

I get similar results with the patch applied.
Thanks for your assistance.

/Rolf

Post by Rolf Andersson

Quote:

in your single threaded (non-HLE) baseline test the cache line with the lock is always kept in the L1 local cache, therefore there is no LLC accesses or other expensive cache misses.

This seems inconsistent with the numbers I get in the simple tests? The latency of the "xchg" instruction seems to be much higher than for other instructions with a memory target. I thought this was due to locking (an LLC cache lock-up?).

I added two more test cases to tt.c, loops with:

"addl $1,%0" where %0 is a local stack address

"lock;addl $1, %0" where %0 is also a local stack address

I get 1.9 ns per iteration for the case without the lock and 6.0 ns with it (the "xchg %0,%1" case yields ~10 ns).
This corresponds to deltas of ~4 ns and ~8 ns respectively, but I am currently unable to explain the difference ...

I assume that the lock operation has some extra memory access overhead, even though I'm not entirely clear what is going on.

I'd very much appreciate if someone can shed some light on this.

/Rolf 

Post by jimdempseyatthecove

Roman and Rolf,

I do not have a processor here for testing, but I can make an observation and a suggestion. Roman can counter the observation if it is wrong.

In Rolf's __hle_cas64 he calls __hle_lock, which uses __HLE_ACQUIRE on a lock; xchgq.
I believe that, within the __HLE_ACQUIRE region, the lock; xchg is unnecessarily bogging down the pipeline.

I think (Roman, please correct me if I am wrong) that the __HLE_ACQUIRE-protected region would be better served by using BTS _without_ LOCK.

Rolf, it should be easy enough for you to setup a diagnostic to verify this.

Jim Dempsey

www.quickthreadprogramming.com
Post by Rolf Andersson

Jim,

I did some more tests; the LOCK prefix seems (as expected from reading the documentation) to be required for HLE to kick in. This makes sense, as HLE retries without elision if the transaction aborts, and then the LOCK is needed.

BTS with a LOCK prefix gives pretty much identical execution times compared to the XCHG version in the uncontended case.

Let me know if you would like me to post a new version of the test rig.

Best,
Rolf

Post by Roman Dementiev (Intel)

Quote:

This would be inconsistent with the numbers I get in the simple tests? The latency of the "xchg" instruction seems to indicate that there is a much higher latency than for other insns with a memory target. I thought this was due to locking -> LLC cache lockup?

I added two more test cases to tt.c, loops with:

"addl $1,%0" where %0 is a local stack address

"lock;addl $1, %0" where %0 is also a local stack address

I get 1.9 ns per iteration for the case without lock and 6.0 ns with the lock (the "xchg %0,%1" case yields ~10 ns)
this corresponds to 4ns and 8ns delta respectively, but I am currently unable to explain this difference ... 

LOCK-prefixed instructions have an overhead (even when there are no cache misses) compared to those without LOCK. On the Haswell microarchitecture, LOCK-prefixed instructions take at least ~12 cycles because of it (Sandy Bridge took at least 16 cycles per LOCK).

Quote:

I assume that the lock operation has some extra memory access overhead, even though I'm not entirely clear what is going on.

You can count the L2 and LLC cache hits and misses using pcm.x and compare it with the number of iterations in your test. I think it will be a very low count.

Roman

Post by Rolf Andersson

Roman, thanks for the feedback. I will certainly have a look at using pcm for cache-traffic instrumentation.

You mentioned the overhead/latency of LOCK prefixing;
Is there any info available on the overhead/latency of HLE_ACQUIRE and HLE_RELEASE?

Thanks,
Rolf 

Post by Roman Dementiev (Intel)

Yes. Please see the discussion of XBEGIN/XEND/XACQUIRE/XRELEASE latencies and overheads in Section 12.5 of the Intel Architecture Optimization Manual.

Roman

Post by Rolf Andersson

It was that section that sparked my initial interest, specifically Tuning Suggestion 33:

Quote:

Tuning Suggestion 33. Intel TSX is designed for critical sections and thus the latency profiles of the XBEGIN/XEND instructions and XACQUIRE/XRELEASE prefixes are intended to match the LOCK prefixed instructions. These instructions should not be expected to have the latency of a regular load operation.

My initial tests were written to verify if XACQUIRE/XRELEASE would yield an overhead comparable to LOCK. Currently, the overhead doesn't seem to be comparable, but I will continue to run some more tests to see if I can find an explanation.

/Rolf

 

Post by Roman Dementiev (Intel)

The explanation is in the one or two paragraphs just below that suggestion:

Quote:

There is an additional implementation-specific overhead associated with executing a transactional region. This consists of a mostly fixed cost in addition to a variable dynamic component. The overhead is largely independent of the size and memory foot print of the critical section. The additional overhead is typically amortized and hidden behind the out-of-order execution of the microarchitecture. However, on the 4th generation Intel Core Processor implementation, certain sequences may appear to exacerbate the overhead. This is particularly true if the critical section is very small and appear in tight loops (for example something typically done in microbenchmarks). Realistic applications do not normally exhibit such behavior.

The overhead is amortized in larger critical sections but will be exposed in very small critical sections. One simple approach to reduce perceived overhead is to perform an access to the transactional cache lines early in the critical section.

Post by Rolf Andersson

The two paragraphs at the beginning of Section 12.5 unfortunately provide very little guidance as to when HLE could be useful from a performance perspective.

The overhead is "typically" amortized and hidden ... "certain sequences" may appear to exacerbate ... if the critical section is "very small and appear in tight loops" ... "realistic applications" do not "normally" ... The overhead is amortized in "larger" critical sections but will be exposed in "very small" critical sections.

I have been able to find a number of situations in real code where HLE doesn't seem to be applicable, as measured on the i7-4650U. I have yet to find a single situation where I can demonstrate a clear performance advantage for HLE. I suspect that some of my tests may yield different results on a 4-core CPU or a multi-socket machine, respectively.

Maybe my results so far are due to too little real parallelism?

Are there any relevant benchmarks that can be shared with the community?

Jim Dempsey's comment re preemption is obviously still valid. It may be that you would like to take some overhead in some situations to get protection against the preemption problem.

In the meantime, I will continue to look for elision opportunities in the code bases that I have access to.

Best,
Rolf 

Hi Rolf,

same surprise for me: when I use TSX instead of normal locking, my app runs ~7% slower.

I posted some details here:

http://software.intel.com/en-us/forums/topic/456300

Greetings,

Elmar

Post by Rolf Andersson

Elmar, thanks for the update.

I'm still trying to get my head around how TSX should be used. So far I've been unable to find a case that demonstrates a substantial improvement over the traditional locking schemes.

I'm also curious as to why there is so little specific info available re TSX performance, latency and intended use cases.

I would assume that Intel will release more info later.

Best,

Rolf

Post by jimdempseyatthecove

>>I'm still trying to get my head around how TSX should be used.

It may be beneficial to list situations where TSX should not be used, such as situations where a single locked instruction works (lock; xadd [loc],val) and where the result is NOT used as a simple mutex. This also includes CAS and DCAS.

Rolf, Elmar, I think it would be beneficial to set up a representative test situation where the protected transactions do a fair amount of computation (memory access) without pressing the limits of the TSX/HLE buffering system, and where the runtime between protected regions is short. Run with several competing threads. The test is to see, when multiple threads enter the same protected region and modify/read conflicting cache lines, whether:
a) one thread is winner (presumably first to exit)
b) all threads are losers

Hopefully the answer is a.

Jim Dempsey

www.quickthreadprogramming.com
