Performance comparison using STM compiler

Hi,

I have installed the STM compiler and tried to run the two code snippets available on the website.

Command line that compiled correctly: icc -openmp intel_stmtest_hashtable.c -o hashtable

The command line shown on the website gave the warning "ignoring unknown option -Qtm_enabled": icc -Qtm_enabled -openmp intel_stmtest_hashtable.c -o hashtable

Do I need the -Qtm_enabled option to use the TM constructs? If so, how should I pass it?

Also, can I get some benchmarks (at least one) that show the performance of sequential code vs. coarse-grained/fine-grained lock-based OpenMP code vs. Intel TM code? I can take the TM constructs out of intel_stmtest_hashtable.c to compare sequential code against Intel TM, but writing the lock-based versions will take some effort.

I also want to mention that the performance comparison in the OOPSLA paper looks better than the Stanford group's work (which derived its TM from Sun's TL). I think getting TM performance comparable to coarse-grained or fine-grained locks is great :-) !

Thanks.

Deb

Hi,

I installed Intel TM on two different machines:
1. A: 2 dual-core Xeon processors (total logical cores = 4)
2. B: 4 dual-core Xeon processors (total logical cores = 8)

Here is my STM version of the code given on the website:

intel_stmtest_hashtable.c : STM version
Command line to compile : icc -openmp intel_stmtest_hashtable.c -o hashtable

I then took out the STM and OpenMP constructs to create a sequential version of the same code:
intel_stmtest_hashtable_seq.c : Sequential version
Command line to compile : icc intel_stmtest_hashtable_seq.c -o hashtable_seq

Here are the runtimes:

1. A: STM 3.081s, sequential 4.464s
2. B: STM 1.02s, sequential 2.72s

Do you think these numbers look good? B scales better than A, but the scaling is not linear. I am not convinced by the numbers. Maybe I did something wrong.

Thanks.

Deb

It is really strange that the compiler rejects the -Qtm_enabled option. Without this option TM is not used, so all of your test runs were made without STM enabled.

Do you have only one Intel compiler installed on your machine? If not, please make sure that the icc you are executing is the one from the Intel STM C/C++ compiler.

-Qtm_enabled must be passed to the compiler at least during the compilation phase (and during both the compilation and link phases if -ipo (Inter-Procedural Optimization) is used).

I can run the code with -Qtm_enabled on machine B, which has 8 cores.

icc -v shows Prototype Edition 2, so I am using the right icc version.

But now the results are even worse. The sequential version of the code runs in 2.70 seconds while the STM-enabled version runs in 2.84 seconds!

It would be great if you could give me a benchmark from your paper where I can see the runtime improvement (a simple one like kmeans, for example). My email id is ddas@northwestern.edu

Thanks.
Deb

Quoting - debasish83

I am just curious, what performance and scalability are you expecting?

Note that many benchmarks in the paper use coarse-grained work items with a substantial amount of local processing.

Btw, you can download the STAMP benchmarks that were used in the paper here:

http://stamp.stanford.edu/

(but you will have to port them manually to the Intel STM Compiler...)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

If you look at Intel's paper at OOPSLA, Figure 11 (b), with 8 threads o(ptimistic)STM and p(essimistic)STM get on average around a 6X improvement over a single thread running the same application. I am looking for similar performance on an 8-core system. I manually converted the hashtable example from the STM 2.0 website to sequential code and compared the performance of the STM code and the sequential code, but I don't see such an improvement. Maybe hashtable is not the right example for an STM performance comparison.

Perhaps I should write a simple code that has a substantial amount of local processing (matrix multiplication, for example) and see if I can reproduce a 6X improvement over the same code running with STM on a single processor. Writing those benchmarks by hand will take time, and since the Intel STM team has already done that, I was hoping they could release them somewhere.

Let me know your suggestions.

Thanks.

Deb

Ok. I see.

I think the problem is exactly in the benchmarks (more precisely, in the amount of local processing). In Fig. 11(b) you can see that even coarse-grained locking performs up to 7 times better on 8 cores than on 1 core. This strongly suggests that the benchmark has a *substantial* amount of local processing.

So I think you can do the following. Insert some amount of local processing into your hash map benchmark, for example just an empty loop. Tune the amount of local processing so that the coarse-grained locking version gets, say, a 6x speed-up over single-threaded execution. Then run the STM version with the same amount of local processing. I think you will then see a speed-up of around 5-7x.
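The tuning procedure above could be sketched roughly as follows. This is only a minimal illustration, not the code from the website: `buckets`, `local_work` and `run_benchmark` are hypothetical names, the "hash table" is reduced to an array of counters, and an OpenMP critical section stands in for the coarse-grained lock. The STM variant would replace that critical section with an atomic block.

```c
#define NBUCKETS 1024

/* Hypothetical stand-in for the hash table: one counter per bucket. */
static long buckets[NBUCKETS];

/* Busy loop standing in for per-item local processing.
   volatile keeps the compiler from optimizing the loop away. */
static void local_work(long iterations)
{
    volatile long sink = 0;
    for (long i = 0; i < iterations; i++)
        sink += i;
}

/* Insert n_items keys, running `work` iterations of local processing
   before each insert. Returns the total number of stored items. */
long run_benchmark(long n_items, long work)
{
    for (int b = 0; b < NBUCKETS; b++)
        buckets[b] = 0;

    #pragma omp parallel for
    for (long k = 0; k < n_items; k++) {
        /* Private work: this part scales with the number of threads. */
        local_work(work);

        /* Shared update under one global (coarse-grained) lock. */
        #pragma omp critical
        buckets[k % NBUCKETS]++;
    }

    long total = 0;
    for (int b = 0; b < NBUCKETS; b++)
        total += buckets[b];
    return total;
}
```

Calling `run_benchmark(n, work)` from `main` with increasing values of `work`, and timing each run with 1 thread and with 8 threads, shows how much local processing is needed before the locked version scales. The same `work` value can then be reused for the STM version.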

Hi,

I ran a matrix multiplication code using the switches -Qtm_enabled and -openmp, and I got the expected speedups. If I provide both switches, I use an OpenMP pragma to parallelize the for loop and the shared variables are protected by transactional memory, right?

For matrix multiplication there are no shared writes, so I didn't need __atomic in my program. But now I will experiment with the atomic keyword.

Here are some runtime numbers (runtime in seconds, CPU usage):

- matrix_dimension=1000 -> stm=(0:03s, 406%) vs. seq=(0:12s,100%)
- matrix_dimension=2000 -> stm=(0:20s, 568%) vs. seq=(1:38s,100%)

This machine has 2 quad-core CPUs, so I should get 7xx% CPU usage. Is there a compiler switch that lets me generate code that uses, for example, 8 processes (OpenMP allows me to do that)? Or, for the sake of scalability, does STM not let the user control finer details such as the number of processes in the code?

Thanks.

Deb

Quoting - debasish83

Now here are some runtimes number (runtime in seconds, CPU usage)

- matrix_dimension=1000 -> stm=(0:03s, 406%) vs. seq=(0:12s,100%)
- matrix_dimension=2000 -> stm=(0:20s, 568%) vs. seq=(1:38s,100%)

This machine is a 2 quad core CPU. So I should get 7xx% of CPU usage.

Don't confuse CPU usage with speedup. You can have 800% CPU usage but a speedup of only 0.5x (i.e. a slowdown). That is perfectly normal for a multi-threaded program on multicore hardware.
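To put numbers on the difference, here is a small sketch that computes speedup and parallel efficiency from wall-clock times. The function names are made up for illustration, and reading the quoted times as minutes:seconds (so 1:38 is 98 seconds) is an assumption.

```c
/* Speedup: sequential wall-clock time divided by parallel wall-clock time. */
double speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}

/* Parallel efficiency: speedup divided by the number of cores used.
   1.0 would mean perfect linear scaling. */
double efficiency(double t_seq, double t_par, int cores)
{
    return speedup(t_seq, t_par) / cores;
}
```

With the times quoted above, `speedup(12.0, 3.0)` is 4.0 for dimension 1000 and `speedup(98.0, 20.0)` is 4.9 for dimension 2000, while the reported CPU usage was 406% and 568%. CPU usage only says how busy the cores were; part of that time can go to waiting, synchronization or redundant work rather than useful progress.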

Quoting - debasish83

For matrix multiplication there are no shared writes and so I didn't need __atomic in my program.

Is there a switch to the compiler which allows me to generate code that for example use 8 processes

If you don't use STM in your program, how can switches to the STM compiler affect anything?

Btw, STM doesn't create any threads; it works only on behalf of your threads. So you can just create as many threads as you want yourself, for example 8, and STM will "use" 8 threads, so to say.
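Since the thread count comes from the threading layer, with OpenMP it can be requested directly on the parallel region. The following is a minimal sketch: `parallel_sum` is a hypothetical example function, and the critical section only marks the place where an STM atomic block would protect the shared update.

```c
/* Sum 1..n using the requested number of OpenMP threads.
   The caller, not the STM runtime, decides how many threads run. */
long parallel_sum(long n, int threads)
{
    long total = 0;

    #pragma omp parallel for num_threads(threads)
    for (long i = 1; i <= n; i++) {
        /* Shared write: a critical section here, an atomic
           (transactional) block in an STM version. */
        #pragma omp critical
        total += i;
    }
    return total;
}
```

For example, `parallel_sum(1000, 8)` asks OpenMP for 8 threads. The same effect is available without changing the code by setting the `OMP_NUM_THREADS` environment variable before running the program.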

Quoting - Dmitriy V'jukov

Don't confuse CPU usage with speedup. You can have 800% CPU usage, but only 50% speedup (i.e. slowdown). It's perfectly Ok for multi-threaded program on multicore hardware.

Hmmm.... what do you mean by CPU usage here? And how do you measure it?
