Understand Concurrency Analysis results correctly: Bottom-up and Summary

Intel® Parallel Amplifier provides Concurrency Analysis, which helps you find where processor utilization is poor. Concurrency results classify time into the following levels:

1) "Idle" time (gray): no thread is running. Spending too long here is bad.

2) "Ok" time (orange): serial code is running. Spending too long here is bad.

3) "Ideal" time (green): the number of active threads matches the number of available hardware cores.

4) "Over" time (blue): the number of active threads exceeds the number of available hardware cores.

We recommend that your code spend most of its time in "Ideal" so that the processors are fully utilized.

 

Here is a simple example named Primes, which counts the prime numbers from 1 to 100,000 using four threads in parallel.

bottom-sum1.JPG


In the Concurrency Analysis results above, Elapsed Time is 1.152s and CPU Time is 1.932s. Note that CPU Time is accumulated across all logical CPUs, over both serial and parallel code. CPU Time is NOT the application's running time.
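To make the difference between the two metrics concrete, here is a minimal sketch that divides the reported CPU Time by the reported Elapsed Time. The helper name average_busy_cpus is hypothetical (it is not part of Parallel Amplifier), and reading the ratio as "average number of busy logical CPUs" is only a rough interpretation that assumes CPU Time counts nothing but time actually spent running on a CPU.

```python
# Figures reported in the Concurrency Analysis results above.
elapsed_time_s = 1.152  # wall-clock running time of the application
cpu_time_s = 1.932      # CPU time accumulated over all logical CPUs


def average_busy_cpus(cpu_time, elapsed_time):
    """Hypothetical helper: rough average number of logical CPUs kept busy."""
    return cpu_time / elapsed_time


avg_busy = average_busy_cpus(cpu_time_s, elapsed_time_s)
print(f"average busy logical CPUs: {avg_busy:.2f}")
```

Because CPU Time is a sum over CPUs, it can exceed Elapsed Time whenever more than one logical CPU is busy at the same moment, which is why the two numbers should not be compared directly.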


The column expansion button (marked ">>") on the "CPU Time by Utilization" column header splits the concurrency levels into separate columns. The expanded data columns for concurrency levels 1, 2, 3, 4, etc. are shown below.
 

bottom-sum2.JPG

 

The Logical CPU Count is 2 in this case, so Elapsed Time can be estimated with the formula below:

Elapsed Time = 0.259s + 0.265s/2 + (0.437s/3)*2 + (0.971s/4)*2 = 0.259s + 0.132s + 0.291s + 0.485s = 1.167s

Please note: 

1. All time spent in serial code contributes 100% to elapsed time: T1 = 0.259s

2. Time spent in parallel code contributes to elapsed time as (T/threads) * factor, where factor = (threads/CPUs) rounded up to an integer

a) IDEAL (parallel in two threads): factor = 1, T2 = T/2

b) Parallel in three threads: factor = 2, T3 = (T/3)*2

c) Parallel in four threads: factor = 2, T4 = (T/4)*2
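The rules above can be sketched in a few lines of Python. This is only an illustration of the article's approximation, not how Parallel Amplifier itself computes Elapsed Time; the function name estimate_elapsed and the dictionary layout are assumptions made for this example. Serial code is treated as concurrency level 1, for which the factor is 1 and the whole time is counted.

```python
import math


def estimate_elapsed(cpu_time_by_level, logical_cpus):
    """Approximate elapsed time from CPU time accumulated per concurrency level.

    cpu_time_by_level maps the number of active threads to CPU time in seconds.
    """
    total = 0.0
    for threads, cpu_time in cpu_time_by_level.items():
        # factor = (threads / CPUs), rounded up to an integer
        factor = math.ceil(threads / logical_cpus)
        total += (cpu_time / threads) * factor
    return total


# Per-level CPU times taken from the formula in this article (2 logical CPUs).
cpu_time_by_level = {1: 0.259, 2: 0.265, 3: 0.437, 4: 0.971}
estimate = estimate_elapsed(cpu_time_by_level, logical_cpus=2)
print(f"estimated elapsed time: {estimate:.3f}s")
```

Without intermediate rounding the estimate is about 1.168s; the 1.167s in the formula comes from truncating each term to three decimals before adding them up.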

 

This result (1.167s) approximates the Elapsed Time reported in the "Summary" information.

4 comments

Peter Wang (Intel)

Dear Akio,

Consider the test case where 3 threads run in parallel on 2 CPUs. The schedule is decided by the OS; for example, it may be:
T1, T2
T2, T3
T1, T3
T2, T4
T1, T5
...

Keep in mind that time spent at Concurrency Level (CL) 3 means that three threads are active (ready) and scheduled by the OS. Those threads may or may not terminate during CL:3 time; they may terminate during CL:2 time instead.

Don't think of the CL:3 data as one contiguous span of the timeline; it is accumulated from all the CL:3 pieces of the timeline.

Regards, Peter

Akio Yasu (Intel)

Hello Peter,

Thank you for your answer.

I would like to imagine the case of 3 threads on 2 CPUs.

Here is what I thought originally.
CPU1: (Thread1)+(Thread3)/2
CPU2: (Thread2)+(Thread3)/2
In this case the factor is 3/2=1.5.

But should it instead work as below, for instance?
CPU1: (Thread1)+(Thread3)
CPU2: (Thread2)
In this case the factor is 3/2 rounded up to an integer = 2.

Peter Wang (Intel)

The formula gives approximate data: factor = (Threads/CPUs), rounded up to an integer.

Think of it this way:

If the CPU count is 2: 2 parallel threads take 1 round to fully utilize the CPUs; 3 parallel threads take 2 rounds; 4 parallel threads take 2 rounds; 5 threads take 3 rounds...

If the CPU count is 4: 2..4 parallel threads take 1 round to fully utilize the CPUs; 5..8 parallel threads take 2 rounds.

That is what I meant by "factor".

You can't simply use (T/CPUs) to compute the time spent; the result is inaccurate.

Akio Yasu (Intel)

I wonder: if the elapsed time is (T/threads) * factor, where factor = (threads/CPUs), then it could be written as below:

(T/threads) * factor = (T/threads) * (threads/CPUs) = (T/CPUs)

I wonder why the factor is rounded up to an integer...
