CPU support for STM?

Deleted user

Are there any special hardware features that STM relies on to work or to be efficient? I am currently trying the STM compiler on my home desktop, which runs on an AMD Athlon, and I'm seeing a large overhead. Would migrating to an Intel CPU provide a substantial reduction in the overhead?

Btw, will Intel be adding support for TM acceleration in future CPUs?

Dmitry

James Cownie (Intel)

As the name suggests (STM == Software Transactional Memory), the implementation is entirely in software.


Since we haven't benchmarked on AMD processors, I have no way of knowing whether running your code on Intel processors would improve the performance or not. There are certainly no Intel specific instructions used in the implementation.


It is not hugely surprising that an STM has significant overhead, if you consider that most loads and stores to every potentially shared object have to be converted into subroutine calls.
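To make the source of that overhead concrete, here is a minimal single-threaded sketch of what instrumentation can look like. Everything in it (`ToyTx`, `read`, `write`, `commit`, `increment_instrumented`) is a hypothetical name invented for illustration and is not the interface of the Intel STM runtime; a real STM would also need conflict detection and validation, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical toy runtime for illustration only; NOT the Intel STM library API.
struct ToyTx {
    std::unordered_map<int*, int> write_log;  // buffered (speculative) writes
    std::size_t calls = 0;                    // number of runtime calls made

    // Each transactional load of a shared int turns into a call like this.
    int read(int* addr) {
        ++calls;
        auto it = write_log.find(addr);
        return it != write_log.end() ? it->second : *addr;
    }

    // Each transactional store turns into a call like this.
    void write(int* addr, int value) {
        ++calls;
        write_log[addr] = value;
    }

    // Publish the buffered writes. A real STM would validate for conflicts first.
    void commit() {
        for (auto& entry : write_log) *entry.first = entry.second;
        write_log.clear();
    }
};

// The programmer writes one statement inside a transaction:
//     counter = counter + 1;
// An instrumenting compiler might emit something along these lines instead.
std::size_t increment_instrumented(int& counter) {
    ToyTx tx;
    int tmp = tx.read(&counter);   // load  becomes a subroutine call
    tx.write(&counter, tmp + 1);   // store becomes a subroutine call
    tx.commit();
    return tx.calls;               // two runtime calls for one source statement
}
```

Even in this toy version, a single increment costs two calls plus a hash-table lookup and insert, which gives some intuition for why instrumented code runs noticeably slower than a plain load and store.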


Note, too, that the main focus of interest for this release is in the ease of programming, and the language features which support that, not in performance (though we do have some benchmarks which perform better than coarse grained locking when using four cores).


Btw, will Intel be adding support for TM acceleration in future CPUs?

Do you seriously expect an answer to that!? I'm certainly not going to give you one.

Deleted user

Thanks for the prompt and informative replies, Jim.

When I asked about hardware support, I had cache coherency in mind. I thought you might use some feature of that protocol to support violation detection.

Is there a way to mark variables as "transaction-local" so that their accesses don't go through library calls? Or is making variables local to tm_callable functions sufficient?

It would be great if you provided high-level implementation details of the STM compiler support and library to give us a better understanding of the feature. I think most people who are interested in STM are ultimately interested in _performance_ improvements. Knowing what's going on under the hood will allow us to use it more efficiently.


Dmitry

James Cownie (Intel)

When I asked about hardware support, I had cache coherency in mind. I thought you might use some feature of that protocol to support violation detection.


Indeed, there are many research papers which discuss how a processor vendor might do that, and you can even find papers by Intel authors which discuss how one might accelerate a TM implementation using hardware. However I am in no position to give you any information about what Intel might implement in future products.



Is there a way to mark variables as "transaction-local" so that their accesses don't go through library calls? Or is making variables local to tm_callable functions sufficient?


As you speculate, the compiler is smart enough to realize that local variables do not (in general) need to have their accesses instrumented (though, of course, in the transaction scope itself additional code is still required to save their values before a transaction so that they can be restored if the transaction has to abort and retry).
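As a rough illustration of that save-and-restore step, the sketch below simulates a transaction that aborts once and then succeeds. The function `run_with_retry` and the simulated conflict are assumptions made up for this example; the real compiler generates its own checkpointing code, which will not look like this.

```cpp
#include <cassert>

// Hypothetical illustration: a local variable is updated directly (no runtime
// calls), but its value is checkpointed at transaction entry so that an abort
// can roll it back before the retry.
int run_with_retry(int start, int* attempts_out) {
    int local = start;   // plain local: accesses are not instrumented
    int attempts = 0;

    for (;;) {
        const int saved = local;   // checkpoint taken when the transaction begins
        ++attempts;

        local += 10;               // direct, uninstrumented update inside the transaction

        // Simulated conflict: pretend the first attempt has to abort.
        const bool conflict = (attempts == 1);
        if (!conflict) break;      // commit: keep the new value

        local = saved;             // abort: restore the local, then retry
    }

    *attempts_out = attempts;
    return local;
}
```

Without the `local = saved;` line the retried transaction would start from the already-modified value and produce `start + 20` instead of `start + 10`, which is why the extra code is needed even though the accesses themselves are cheap.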


The precise details of which accesses need to be instrumented depend on the compiler optimizations in place, which are under continuous development, so it's not easy to give a definitive description. If you are really interested you can look at the assembly code, and see what's going on...


The interface between the compiler and the runtime is currently designed to be very general, so that it can support multiple STM algorithms and we can compare their performance without changing the compiler. If it turns out that there is one best algorithm, then we may be able to optimize it slightly.
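One way to picture such a general interface is sketched below: the "compiled" code talks only to an abstract runtime, and two different versioning strategies plug in behind it. All names here (`StmRuntime`, `WriteBufferStm`, `UndoLogStm`, `transfer`) are hypothetical and single-threaded; this is not the actual compiler/runtime ABI, and conflict detection is left out entirely.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical interface; illustrative names, not the real compiler/runtime ABI.
struct StmRuntime {
    virtual ~StmRuntime() {}
    virtual int  read(int* addr) = 0;
    virtual void write(int* addr, int value) = 0;
    virtual void commit() = 0;
    virtual void abort() = 0;
};

// Algorithm A: buffer writes and apply them at commit (lazy versioning).
struct WriteBufferStm : StmRuntime {
    std::unordered_map<int*, int> buffer;
    int read(int* addr) override {
        auto it = buffer.find(addr);
        return it != buffer.end() ? it->second : *addr;
    }
    void write(int* addr, int value) override { buffer[addr] = value; }
    void commit() override {
        for (auto& e : buffer) *e.first = e.second;
        buffer.clear();
    }
    void abort() override { buffer.clear(); }
};

// Algorithm B: write in place and keep an undo log (eager versioning).
struct UndoLogStm : StmRuntime {
    std::vector<std::pair<int*, int>> undo;
    int read(int* addr) override { return *addr; }
    void write(int* addr, int value) override {
        undo.push_back({addr, *addr});  // remember the old value
        *addr = value;
    }
    void commit() override { undo.clear(); }
    void abort() override {
        // Roll back in reverse order so the oldest saved value wins.
        for (auto it = undo.rbegin(); it != undo.rend(); ++it) *it->first = it->second;
        undo.clear();
    }
};

// Stand-in for compiler-generated code: identical whichever runtime is used.
void transfer(StmRuntime& tx, int* from, int* to, int amount, bool do_commit) {
    tx.write(from, tx.read(from) - amount);
    tx.write(to, tx.read(to) + amount);
    if (do_commit) tx.commit(); else tx.abort();
}
```

The indirection through virtual calls in this sketch also hints at the cost of generality: once a single algorithm is chosen, calls like these could in principle be specialized or inlined.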



I think most people who are interested in STM are ultimately interested in _performance_ improvements.


It seems unlikely to me that an STM can ever outperform a well-written, fine-grained locking implementation, so if you were prepared to spend the effort to implement fine-grained locking in your code, then STM has little to offer. The potential benefit of STM is that the programming model is easier to use than fine-grained locking (much more like using a single global lock), so you can write your code faster, and the performance of the STM may be better than that of the single global lock when you run on many processors.
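For readers who want to see the two locking styles being compared, here is a small sketch in standard C++. The `Account` type and both `transfer_*` functions are invented for this example and have nothing to do with the STM compiler itself.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical example type: each account carries its own lock.
struct Account {
    std::mutex m;
    int balance = 0;
};

std::mutex global_lock;  // one lock protecting every account

// Coarse-grained: trivial to write, but all transfers serialize on one lock.
void transfer_coarse(Account& from, Account& to, int amount) {
    std::lock_guard<std::mutex> guard(global_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Fine-grained: transfers on unrelated accounts can run in parallel, but the
// programmer must think about lock ordering to avoid deadlock.
void transfer_fine(Account& from, Account& to, int amount) {
    if (&from == &to) return;            // locking the same mutex twice would be an error
    std::lock(from.m, to.m);             // acquires both locks without deadlocking
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}
```

The appeal of a transactional version would be that the source reads like `transfer_coarse` (one atomic region, no lock ordering to reason about) while leaving the runtime free to let non-conflicting transfers proceed concurrently.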


As I have said in previous posts, the main aim of this What-If release is to try to find out whether the TM programming constructs are sufficient (or too painful with all the declspecs), and whether they do make writing code easier.


Performance was not a primary objective.


HTH

