STM Compiler: Great Work!

I've read the paper about the implementation of STM in the Intel STM Compiler:
http://www.adamwelc.org/papers/oopsla08.pdf

Great Work!
Compiler support really differentiates your STM from other *library* solutions. All those manual calls to tm_read/tm_write are really annoying and error-prone.
I'm especially impressed by the barrier optimizations (readAR, readAW, writeAW, readFW), though I understand this is perhaps not the most difficult part, given that the compiler already has all the needed information. STM libraries can provide similar functions too, but that would be even more annoying and error-prone for the end user.

Also, it's interesting that the run-time can switch a transaction from optimistic mode to pessimistic mode in flight.

Btw, I'm curious why the compiler can't automatically determine that a function is pure. It's certainly impossible in all cases, but for trivial cases it should be relatively easy. E.g., if I have a function like:
int square(int x)
{
    return x * x;
}
As I understand it, without manual annotations the compiler will treat this function as tm_callable (if it can statically determine that the function is called from inside a transaction) and generate two versions of the function. Why can't the compiler determine that it is pure without a manual annotation?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

I have some ideas about how compiler-supported STM can be improved. I'm not sure how useful these improvements would be, or whether it's possible to implement them, so I would like to hear your opinion on them.

Assume we have the following transaction:
__tm_atomic
{
    // push node onto a stack
    node_t* n = some_unknown_function_maybe_IO_or_OS_memory_management();
    n->next = stack_head;
    stack_head = n;
    // pop node from another stack
    if (another_stack_head)
    {
        node_t* n2 = another_stack_head;
        another_stack_head = n2->next;
        another_unknown_function_maybe_IO_or_OS_memory_management(n2);
    }
}

The STM compiler could rewrite this code as:
// moved out of the transaction
node_t* n = some_unknown_function_maybe_IO_or_OS_memory_management();
node_t* n2 = 0;
__tm_atomic
{
    // push node onto a stack
    n->next = stack_head;
    stack_head = n;
    // pop node from another stack
    if (another_stack_head)
    {
        n2 = another_stack_head;
        another_stack_head = n2->next;
    }
}
// moved out of the transaction
if (n2)
    another_unknown_function_maybe_IO_or_OS_memory_management(n2);

So, basically, the compiler can move leading and trailing calls to tm_unknown functions out of the transaction, and thus we avoid the need to serialize the whole execution.

But I am not sure why transactions that call a tm_unknown function currently need to serialize. Is it because it's not possible to 'roll back' a tm_unknown function, or because STM must provide SLA for tm_unknown functions too? What is your intent? If the latter, then my proposal will break SLA for some tm_unknown functions. If the former, then my proposal can significantly improve the performance of some transactions.
For example, if another_unknown_function_maybe_IO_or_OS_memory_management() is WinAPI's VirtualAlloc(), then it's always safe to make such a transformation, because it's internally synchronized anyway.
And yes, I perfectly understand that the user can make such a transformation manually :)


Another improvement is something that I call 'transactional contexts'.
Assume we have the following class definition:

class concurrent_stack
{
private:
    node_t* m_head;

public:
    //...
    void push(void* data)
    {
        __tm_atomic
        {
            // work with m_head
        }
    }

    void* pop()
    {
        __tm_atomic
        {
            // work with m_head
        }
    }
};

Assume that the concurrent_stack::push() and concurrent_stack::pop() functions 'touch' only their parameters and the members of the concurrent_stack object. If the STM compiler is able to determine this, then it can make some considerable optimizations which can significantly improve the scalability of the system.

If the STM compiler determines that only the concurrent_stack::push() and concurrent_stack::pop() functions can possibly touch members of a concurrent_stack object, and that those functions don't touch other shared data, then the compiler can define those functions, along with the concurrent_stack object, as a 'transactional context'. A transactional context by definition can't conflict with the rest of the world, and the rest of the world can't conflict with individual transactional contexts. All transactions which don't relate to a particular transactional context, or which relate to several transactional contexts, form the 'global transactional context'.

If execution of a transaction in the global transactional context causes the STM run-time to switch to serial mode, then the individual transactional contexts need NOT be serialized. If execution of a transaction in an individual transactional context causes the STM run-time to switch to serial mode, then the remaining individual transactional contexts and the global transactional context need NOT be serialized.

If a transaction in an individual transactional context waits for quiescence, it has to wait ONLY for other transactions in the SAME transactional context. If a transaction in the global transactional context waits for quiescence, it has to wait only for other transactions in the global transactional context.

What I am trying to achieve is the efficiency of the following plain old mutex-based synchronization:

class concurrent_stack
{
private:
    node_t* m_head;
    mutex_t m_mutex;

public:
    //...
    void push(void* data)
    {
        m_mutex.lock();
        // work with m_head
        m_mutex.unlock();
    }

    void* pop()
    {
        m_mutex.lock();
        // work with m_head
        m_mutex.unlock();
    }
};

Assume I have 1000 such concurrent_stack objects. Each object has its own mutex, so operations on different objects do not interact. If I use STM in its current version, then operations on different objects will interact: they will serialize each other, wait for quiescence of each other, etc. This totally destroys the scalability of the system.
What do you think?


Dmitriy,

First of all, thank you for your kind words. We (the entire Intel STM team) are always glad to get such positive feedback.

Now to your questions and proposals:

1. The compiler is able to mark some functions as tm_pure, and it already does. I don't remember whether this feature was part of the WhatIf 2.0 release or was added later, but it will definitely be included in our next release, expected by the end of the year. Unfortunately, the main problem here is cross-file function calls: if a function in some file is found to be tm_pure by the compiler, calls to this function from other files will not know about this. Normally, for C/C++, each file is compiled separately and such a compiler-deduced mark is not part of the interface (unlike a mark in a header file). So the major benefit from deducing the tm_pure property will come when Inter-Procedural Optimization (link-time compilation) is used. Also, without manual annotation all functions are treated as 'tm_unknown' by default, but functions with seen bodies which are called from an atomic block are converted to tm_callable (the seen body is the limitation discussed above).

2. The __tm_atomic construct definitely implies SLA for the entire body of the construct, so we cannot pull any function calls out of the atomic block. In this regard you may see __tm_atomic{} as a critical section: if you put something into it, you mean to synchronize execution of this 'something', and pulling it out will violate synchronization, leading to race conditions. Even if f1 and f2 are internally synchronized, "lock {f1;}; lock {f2;};" is not equivalent to "lock {f1; f2;}", because some code may execute between the two locks in the first case but not in the second.

Thank you for your feedback once again; I will answer your last post later today.

Regards, Serge.

Thanks for your feedback. Regarding your suggestion on auto-detection of tm_pure functions, we had thought about it and implemented it a while back for our future compiler release.

Xinmin Tian (Intel)

MADspreis:

1. The compiler is able to mark some functions as tm_pure, and it already does. I don't remember whether this feature was part of the WhatIf 2.0 release or was added later, but it will definitely be included in our next release, expected by the end of the year.

Ok. I see. So I wasn't all that far out :)

MADspreis:

Unfortunately, the main problem here is cross-file function calls: if a function in some file is found to be tm_pure by the compiler, calls to this function from other files will not know about this. Normally, for C/C++, each file is compiled separately and such a compiler-deduced mark is not part of the interface (unlike a mark in a header file).

I think it's perfectly OK, because the same applies to all functions. If the user wants maximum performance, he must put the function into a header. That is natural for C/C++ developers.

MADspreis:

So the major benefit from deducing the tm_pure property will come when Inter-Procedural Optimization (link-time compilation) is used.

I think this is also natural in the context of C/C++.

MADspreis:

Also, without manual annotation all functions are treated as 'tm_unknown' by default, but functions with seen bodies which are called from an atomic block are converted to tm_callable (the seen body is the limitation discussed above).

But if I enable Inter-Procedural Optimization, will the compiler/linker be able to transform calls to 'external' functions from tm_unknown to tm_callable?

MADspreis:

2. The __tm_atomic construct definitely implies SLA for the entire body of the construct, so we cannot pull any function calls out of the atomic block. In this regard you may see __tm_atomic{} as a critical section: if you put something into it, you mean to synchronize execution of this 'something', and pulling it out will violate synchronization, leading to race conditions. Even if f1 and f2 are internally synchronized, "lock {f1;}; lock {f2;};" is not equivalent to "lock {f1; f2;}", because some code may execute between the two locks in the first case but not in the second.

Ok, I see. My proposal would break the semantics.
Well, I think the recommendation that the user should move, for example, calls to malloc()/free() out of a transaction is already in the documentation.


Dmitriy, thank you again for your thoughtful posts. First of all, some comments on your latest message.

Dmitriy V'jukov: But if I enable Inter-Procedural Optimization, will the compiler/linker be able to transform calls to 'external' functions from tm_unknown to tm_callable?

Yes, when inter-procedural optimization is enabled the compiler builds the entire application as if it were a single file, so all functions in the application are seen at compile time. Thus attributes assigned at one use of a routine (or at its definition) will be visible at all other uses and at the function definition. So if a function is called from an atomic block in one place (even if that file is different from its place of definition), it will be declared as tm_callable and compiled accordingly. Almost the same holds for tm_pure: if a function body is found to be tm_pure, all calls to this function will see and use this information.

Dmitriy V'jukov: Well, I think the recommendation that the user should move, for example, calls to malloc()/free() out of a transaction is already in the documentation.

In fact, malloc() and free() do not cause serial execution - their special transactional versions are called instead. But for other functions this rule of course holds, and we encourage users to follow it.

Transactional contexts are a good idea to improve serial and mixed serial/STM performance. However, until now we have cared more about TM performance in STM modes; maybe at the next stage of development we will shift our focus. In some cases serial execution wins over transactional execution because of smaller overhead. We even experimented with some optimizations forcing serial mode in certain corner cases. However, currently serialized execution has significant start-up costs (partially because of the total lock globality you're writing about, partially because STM modes are preferred), and thus the full potential of serial mode is yet undiscovered.

Another important note about your idea is that we're implementing a word-based STM system (in contrast to object-based ones), and so we're processing not classes, methods, objects and fields (non-existent after the C++ front-end), but memory locations. So it is not that easy to naturally extract transactional contexts from the low-level representation we work upon. However, maybe we will do something in this direction to improve our serial execution performance, or in some other project targeted primarily at the performance of usual locks.

Thank you for sharing your ideas with us,
Serge.

Quoting - Serge Preis (Intel)

Transactional contexts are a good idea to improve serial and mixed serial/STM performance. However, until now we have cared more about TM performance in STM modes; maybe at the next stage of development we will shift our focus. In some cases serial execution wins over transactional execution because of smaller overhead. We even experimented with some optimizations forcing serial mode in certain corner cases. However, currently serialized execution has significant start-up costs (partially because of the total lock globality you're writing about, partially because STM modes are preferred), and thus the full potential of serial mode is yet undiscovered.

Yes, if the 'transactional contexts' are small enough, then it will even be possible to just associate a mutex with every context and execute the program exactly like a mutex-based one. Deadlocks will be eliminated 'by design', because the compiler checks that transactional contexts don't overlap.


Quoting - Serge Preis (Intel): Another important note about your idea is that we're implementing a word-based STM system (in contrast to object-based ones), and so we're processing not classes, methods, objects and fields (non-existent after the C++ front-end), but memory locations. So it is not that easy to naturally extract transactional contexts from the low-level representation we work upon. However, maybe we will do something in this direction to improve our serial execution performance, or in some other project targeted primarily at the performance of usual locks.

Maybe the compiler can pass the transactional context identity down to the low word-based level. Something like:

template<typename T>
T tm_read(T* addr, size_t trx_context);

And then the compiler translates a read inside a transaction:

void my_class::my_func()
{
    x = *y;
}

to:

void my_class::my_func()
{
    x = tm_read(y, hash(this));
}

Or, better, the transactional context can be associated with the transaction descriptor. I.e., at transaction begin the compiler fills in the transactional context in the transaction descriptor; then transactional reads and writes can use it.

Note that it's OK to combine several transactional contexts into one (for example, those for which the hash function returns the same value).


Quoting - randomizer

Or, better, the transactional context can be associated with the transaction descriptor. I.e., at transaction begin the compiler fills in the transactional context in the transaction descriptor; then transactional reads and writes can use it.

The user can supply a special attribute:

__declspec(tm_transactional_context)
class queue
{
public:
    void push(void* data);
    void* pop();
};

Then the compiler can check that methods of that class don't access other objects, only 'this' and the method parameters. Also, the compiler must check that other functions (not methods of that class) don't access data members of the queue class... Hmm... that can be problematic if link-time code generation is turned off... But in any case the compiler can just ignore the tm_transactional_context attribute.


Quoting - Dmitriy Vyukov: Note that it's OK to combine several transactional contexts into one (for example, those for which the hash function returns the same value).

I.e. there are two possibilities.

First: the user object holds a personal transactional context descriptor:

class user_class
{
    // auto-emitted member, placed at offset "-1" relative to the 'this' pointer
    __tm_trx_context_desc __trx_context_desc;
    // other user members
};

Second: the STM run-time contains a fixed-size array of transactional context descriptors, and a user object is mapped into that array with a hash function:

// STM run-time
size_t const __trx_context_desc_count = 1024;
__tm_trx_context_desc __trx_context_desc[__trx_context_desc_count];
size_t map_object_to_trx_context(void* obj);

