_tm_release() in STM Compiler/ABI

Dmitry Vyukov

I had some conversations with Serge Preis regarding "partial commits" and "partly overlapping transactions", which are intended to increase STM scalability for manipulations on linked lists, trees, etc. The following proposal is a reincarnation of those ideas. It's inspired by (directly borrowed from) AMD's ASF proposal (http://developer.amd.com/cpu/ASF/Pages/default.aspx), which proposes a nice way to avoid read-set explosion.
The idea is to add a _tm_release() function to the STM Compiler and/or an _ITM_Release() function to the STM ABI. Inclusion in the STM ABI is preferable, since it allows future support in STM compiler[s]. The function is intended solely for optimization, so it may be implemented as a no-op in the run-time and not used/exposed by the compiler.
The semantics of the function is that it [optionally] releases a single memory location from the read-set. If the indicated memory location is in the write-set, the function takes no action.
The idea behind it is that if some memory location is used only to locate and acquire another memory location, then the former memory location may be released from the read-set as soon as the latter memory location is acquired. Holding only the latter memory location in the read-set must be enough to ensure user-level consistency. Verification of the previous statement is on the user's conscience (incorrect usage of the function may sacrifice correctness).
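For concreteness, the two entry points might be declared roughly as follows. These signatures are only my guess at the shape; the descriptor parameter merely mimics the style of other _ITM_ entry points and is not part of any published ABI:

/* Hypothetical declarations -- a sketch, not a real header. */
void _tm_release(void* addr);             /* compiler-level intrinsic */
void _ITM_Release(void* td, void* addr);  /* ABI entry; 'td' is the
                                             transaction descriptor   */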
Here is a sketch of iteration over a linked list (when a node is removed from the list, its links must be zeroized):
node *cur, *next, *prev;
...
next = _tm_read(cur->next);
_tm_release(prev); /* prev was needed only to locate cur; drop it */
prev = cur;
cur = next;
...
So, regardless of the size of the list, we are always holding no more than three nodes in the read-set. The trick can be applied to trees as well. The ultimate goal is to keep the size of the read-set constant.
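To make the fragment above concrete, here is a fuller traversal loop in the same hypothetical _tm_read()/_tm_release() notation; find(), the node layout, and the key field are illustrative, not part of any real API:

typedef struct node { struct node* next; int key; } node;

/* Walk the list looking for 'key', keeping at most three nodes
   (prev, cur, next) in the read-set at any time. */
node* find(node** phead, int key)
{
    node* prev = 0;
    node* cur = _tm_read(*phead);
    while (cur != 0) {
        if (_tm_read(cur->key) == key)
            return cur;              /* cur itself stays acquired */
        node* next = _tm_read(cur->next);
        _tm_release(prev);           /* prev was needed only to reach cur */
        prev = cur;
        cur = next;
    }
    return 0;
}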

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
James Cownie (Intel)
Quoting - Dmitriy Vyukov: [...] So, regardless of the size of the list, we are always holding no more than three nodes in the read-set. The trick can be applied to trees as well. The ultimate goal is to keep the size of the read-set constant.

Unfortunately, although the idea is appealing, the implementation is extremely complex (read "slow"). To make locking fast we use imprecise locks, so multiple addresses can hash to the same lock. Without precise locking, or a reference count in each lock, it is impossible to drop a lock based on the locked address, since the lock may also need to be held to guard some other address. So, to be able to implement this you'd have to slow down the common cases significantly.
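To illustrate what "imprecise" means here, a minimal sketch of a hashed lock table of the usual sort (the size, shift, and hash are illustrative, not Intel's actual scheme):

#include <stdint.h>

#define NLOCKS (1u << 20)
static uint32_t lock_table[NLOCKS];

/* Many distinct addresses map to the same slot, so one lock may guard
   several read addresses at once; dropping it because one of them was
   released could unprotect the others. */
uint32_t* lock_for(const void* addr)
{
    uintptr_t a = (uintptr_t)addr;
    return &lock_table[(a >> 4) % NLOCKS];
}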

Dmitry Vyukov
Quoting - James Cownie (Intel): [...] Without precise locking, or a reference count in each lock, it is impossible to drop a lock based on the locked address, since the lock may also need to be held to guard some other address.

I guess you have to track the read- and write-sets anyway: the write-set is required as a redo/undo log, and the read-set is required for consistency verification. So maybe _tm_release() may just scan the read- and write-sets in order to capture all the needed information? In the important cases of linked-list and tree iteration, the write-set is empty and the read-set always contains just a few elements. If the read/write sets are too big, the implementation may just refuse the release operation.
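A minimal sketch of that scan-based release, assuming a run-time that keeps explicit read/write logs; all names and the size threshold are hypothetical:

#define MAX_SCAN 64  /* refuse the optimization on big sets; value is arbitrary */

typedef struct { void* addr; /* lock, old value, ... */ } log_entry;
typedef struct {
    log_entry read_log[1024];  int n_reads;
    log_entry write_log[1024]; int n_writes;
} tx_t;

void tm_release(tx_t* tx, void* addr)
{
    int i;
    for (i = 0; i != tx->n_writes; i++)
        if (tx->write_log[i].addr == addr)
            return;                   /* in the write-set: no action */
    if (tx->n_reads > MAX_SCAN)
        return;                       /* too big: refusing is still correct */
    for (i = 0; i != tx->n_reads; i++)
        if (tx->read_log[i].addr == addr) {
            tx->read_log[i] = tx->read_log[--tx->n_reads]; /* drop entry */
            return;
        }
}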

However, I remember reading a paper about the Intel STM implementation... there was something about 5 modes of execution... does the implementation track read/write sets in all modes? Sorry, I can't remember anything more, it's already 3AM here. I guess _tm_release() does not make sense in all the modes.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
James Cownie (Intel)
Quoting - Dmitriy Vyukov: [...] So maybe _tm_release() may just scan the read- and write-sets in order to capture all the needed information? [...] If the read/write sets are too big, the implementation may just refuse the release operation.

The issue is with the read-set. In the pessimistic runtime the read-set exists solely so that the transaction can drop the read-locks on completion (however that happens). Therefore entries are made in the read-set only for the first address that maps to a given lock; if the lock is seen to be already locked, no entry is made in the read-log. It is therefore impossible to determine the number of addresses we have accessed, each of which requires this read-lock to be held.
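Reduced to its essence, that read barrier looks something like this (names are mine; the point is that the log records locks, not addresses):

typedef struct tx_s tx_s;
typedef struct { tx_s* owner; } read_lock;
struct tx_s { read_lock* read_log[1024]; int n_reads; };

read_lock* lock_for(const void* addr);  /* hashed, imprecise (as above) */

void tm_read_barrier(tx_s* tx, const void* addr)
{
    read_lock* lk = lock_for(addr);
    if (lk->owner == tx)
        return;                 /* lock already held by us: NO log entry,
                                   so 'addr' leaves no trace that a later
                                   release() could find */
    /* (contention with other owners elided) */
    lk->owner = tx;
    tx->read_log[tx->n_reads++] = lk;  /* one entry per lock, not per address */
}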

Of course, you could change the locking scheme to include a per-thread reference count for each read-lock, but maintaining this is going to add cost solely to support this new function, which is unlikely to be a good trade-off...
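And the reference-count variant, to show where the extra cost lands: every transactional read now does bookkeeping, whether or not the transaction ever calls release(). Again a sketch with hypothetical names:

typedef struct { void* owner; } read_lock;
typedef struct {
    read_lock* lk;
    unsigned   refs;   /* how many of OUR read addresses this lock guards */
} counted_entry;

/* Called from every read barrier: extra work on the common path. */
void count_read(counted_entry* e) { e->refs++; }

/* Called from release(): only when the count hits zero is it safe to
   actually drop the read-lock. */
void count_release(counted_entry* e)
{
    if (--e->refs == 0)
        e->lk->owner = 0;   /* drop the lock (simplified) */
}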

jimdempseyatthecove

Dmitriy,

>>So, regardless of the size of the list, we are always holding no more than three nodes in the read-set. The trick can be applied to trees as well. The ultimate goal is to keep the size of the read-set constant.

When you are not worried about fairness, and are willing to take a perspective a little askew of traditional thinking, the locking technique can be reduced to holding a pointer and the node to which it points (i.e., one node).

Assume a singly linked list:

obj* head;

where obj contains "obj* next;" and the list terminates in NULL.

Suppose you choose the technique of setting the LSB of the pointer to 1 (provided it was 0) to acquire the lock, and back to 0 to release it (the 2nd LSB could be used for reservation, read-only marking, or whatnot).

Then, if you declare that the holder (the thread that successfully transitioned the LSB of the pointer from 0 to 1) holds both the pointer and the object to which it points (except for the pointer within that object), you have reduced the hold to one node (sans the pointer within the held node, and inclusive of the pointer to the held node stored in the prior node or head).

To modify a node: acquire the lock on (the LSB in) the pointer to the node, then modify anything within the node except its own next pointer, which may only be used to obtain a lock on the following node.

To insert a node: acquire the lock on the node in front of the insertion point. I.e., if you want to insert between A and B, with A containing the pointer to B, then you lock B using the pointer held in A. You now own the pointer in A and also own B (but without any intention of modifying B); therefore, while the lock is held, B cannot be deleted, and the (locked) pointer in A cannot point anywhere else but to B. The only additional rule is that you cannot delete a node without acquiring the lock by way of the pointer to that node.

To delete node B of A, B, C: acquire the lock on the pointer to C within B, then acquire the lock on the pointer to B within A. Now you own the pointer to B within A, you own B, you own the pointer to C within B, and you own C. B can now be removed safely. Note: when you are unable to get the second lock, you have to add code to avoid deadlock and/or a lengthy hold (i.e., release and retry, and/or use the next bit for reservation, etc.).

To append a node, lock the NULL pointer (or reduce it to a CAS of the next pointer with NULL). A sketch of the whole scheme follows.
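Here is a sketch of the scheme in C11 atomics. try_lock(), unlock_with(), and remove_next() are my names, and real code would add backoff, the reservation bit, and so on:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct obj obj;
struct obj { _Atomic(obj*) next; /* payload ... */ };

/* Flip the LSB of *link from 0 to 1. On success the caller owns both the
   link and the node it points to; *out receives the untagged pointer
   (possibly NULL at the end of the list). */
static bool try_lock(_Atomic(obj*)* link, obj** out)
{
    obj* p = atomic_load(link);
    if ((uintptr_t)p & 1)
        return false;                         /* already locked */
    obj* tagged = (obj*)((uintptr_t)p | 1);
    if (!atomic_compare_exchange_strong(link, &p, tagged))
        return false;                         /* lost the race */
    *out = p;
    return true;
}

/* Store a new untagged value into the link, releasing the lock. */
static void unlock_with(_Atomic(obj*)* link, obj* newval)
{
    atomic_store(link, newval);
}

/* Delete B from A -> B -> C in the order given above: first the pointer
   to C within B, then the pointer to B within A. Returns false on
   contention; the caller backs off and retries (deadlock avoidance). */
static bool remove_next(obj* A, obj* B)
{
    obj *C, *b_check;
    if (!try_lock(&B->next, &C))
        return false;
    if (!try_lock(&A->next, &b_check)) {
        unlock_with(&B->next, C);             /* back off and retry */
        return false;
    }
    if (b_check != B) {                       /* list changed under us */
        unlock_with(&A->next, b_check);
        unlock_with(&B->next, C);
        return false;
    }
    unlock_with(&A->next, C);    /* unlink B and release A's link at once */
    /* B is now unreachable; its still-locked next pointer no longer matters. */
    return true;
}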

Fairness can be added to some extent, whereby you attain fairness among the threads that failed to obtain the lock, but not with respect to threads that have not yet attempted a lock. On failure, you create/reuse a sequencing queue based on a hash of the location (value) of the pointer. This means added overhead, but the overhead can be restricted to only those threads on the failed-to-acquire list.

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove

I might add that adjacent to the head you can maintain a linked list of pending locks for nodes within the list.

struct pendingLock
{
    struct pendingLock* next;
    void* node;    /* the node this entry is waiting to lock */
};

obj* head = NULL;
pendingLock* lockList = NULL;

Then, when a thread first attempts a lock, it tries the interlocked operation to set the LSB (it could look first if you wish); on success, the thread goes about its business. On failure, a pendingLock struct is created on the stack, containing in node the address of the node it wishes to lock. This entry is then appended to the lockList (recursively using the same lock technique). The fairness spin technique scans the lockList for an entry pointing to the node the thread wants. If the first entry with the correct pointer is not yours, you _mm_pause() or SwitchToThread() or... If the first entry with the correct pointer is your pendingLock entry, you attempt the lock; if the lock attempt fails, another thread got the lock, so you wait and try again. If the lock attempt succeeds, you remove your pendingLock entry from the lockList (for that head).

The length of the lockList for a particular head will generally not exceed the number of threads. In lieu of a linked list you could use a fixed-size ring buffer containing, say, the address of the node intended to be locked plus a sequence number.
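A compact sketch of that waiting scheme. For brevity this version guards the pending list with a plain test-and-set flag instead of recursively reusing the LSB technique as described above, and all names are illustrative:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct pendingLock {
    struct pendingLock* next;
    void* node;                 /* the node this thread wants to lock */
} pendingLock;

static pendingLock* pending = 0;
static atomic_flag  guard = ATOMIC_FLAG_INIT;

static void list_lock(void)   { while (atomic_flag_test_and_set(&guard)) {} }
static void list_unlock(void) { atomic_flag_clear(&guard); }

/* Append a stack-allocated entry at the tail, preserving FIFO order. */
void enqueue(pendingLock* e)
{
    pendingLock** pp;
    e->next = 0;
    list_lock();
    for (pp = &pending; *pp; pp = &(*pp)->next) {}
    *pp = e;
    list_unlock();
}

/* Remove an entry once its owner has acquired the real lock. */
void dequeue(pendingLock* e)
{
    pendingLock** pp;
    list_lock();
    for (pp = &pending; *pp; pp = &(*pp)->next)
        if (*pp == e) { *pp = e->next; break; }
    list_unlock();
}

/* Is 'mine' the FIRST pending entry for its node? If so, it is this
   thread's turn to retry the LSB lock; otherwise pause and rescan
   (_mm_pause()/SwitchToThread() in the text above). */
bool my_turn(pendingLock* mine)
{
    pendingLock* p;
    bool first = false;
    list_lock();
    for (p = pending; p; p = p->next)
        if (p->node == mine->node) { first = (p == mine); break; }
    list_unlock();
    return first;
}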

www.quickthreadprogramming.com
Dmitry Vyukov
Quoting - James Cownie (Intel): [...] Of course, you could change the locking scheme to include a per-thread reference count for each read-lock, but maintaining this is going to add cost solely to support this new function, which is unlikely to be a good trade-off...

Well, I see that you are not very enthusiastic about this... Anyway, there is no run-time cost associated with adding this to the ABI: any run-time implementation is free not to implement release() at all. Plus, compiler static analysis may help by providing a "transaction does not use release" hint to a run-time that does decide to implement release().
Unfortunately, I am unable to do comprehensive benchmarking now, which must drive such decisions.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
