performance testing



Given a small matrix (100 x 100) and a large number of threads (80), the overhead of creating these threads is quite a lot compared to a single thread solving the problem. Any idea how we can handle this?

Is the Intel team interested in seeing how the problem scales with small data and a large number of threads? Or is their only interest a large dataset with a small or large number of threads?



As I understand from previous posts, they want a generic solution that runs in minimal time, meaning that for any input it should be as fast as possible (regardless of how many threads it uses). So it is wiser to use a single thread to solve a small problem rather than several.

well...yeah. but how can you define a small matrix?

if (cols * lines < 10000)
    // then 1 thread does the work... this sucks :D


The problem I am facing is that with 1 thread the run time is X, with 8 threads it is X / 8, but with 80 threads it is X / 4. What I don't understand is why the cores are at 100% usage while the run time is getting worse. What are they doing?

They run the exact same code, so I guess it's the overhead of these extra threads. But shouldn't that cause the processor to not be at 100% usage?

Man, they only have 40 cores available, so if you run your code with 80 threads, you'll get large overheads from thread context switching.

Actually, if you run "cat /proc/cpuinfo" you'll see that you have 80 cpus total (40 cores with hyperthreading). And if that hadn't been the case, the rule of thumb is usually to run a parallel program with the number of threads equal to double the amount of physical cores you have.

On topic: you have to do manual benchmarking and adjust those limits so you always get a minimal time. There are three major reasons why you don't get linear speedup:
1. The thread creation and management overhead (+ synchronization if you have any)
2. The fact that memory can't be accessed fully in parallel (the threads share the memory bandwidth)
3. But most important of all is Amdahl's law.
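
To put a rough number on point 3: Amdahl's law bounds the speedup by 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of threads. Here is a minimal sketch in C; the function name `amdahl_speedup` is only illustrative and does not come from the thread.

```c
/* Upper bound on speedup according to Amdahl's law.
   parallel_fraction: share of the serial run time that can run in parallel (0..1).
   threads: number of threads, assumed to be >= 1. */
double amdahl_speedup(double parallel_fraction, int threads)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / threads);
}
```

For example, with 95% of the work parallelizable the bound is roughly 5.9x at 8 threads and roughly 16x at 80 threads. Note that the formula ignores thread overhead and memory contention, so on its own it can only explain a speedup that flattens out, not a run that gets slower with more threads; that part comes from points 1 and 2.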


Thanks for the reply.

1) The threads are created once, so I think there is no overhead in creating them, and there are no critical/atomic regions. So the threads should run independently of one another.

3) I was looking only at the parallelized section of code. That is what I was benchmarking.

2) Most probably this is the one. But what I don't understand is why do the cores run at 100%? Since they are waiting for the memory to get them the data, shouldn't they be at a lower cpu % usage?

Can you tell me what file format you use for a matrix?


For very small datasets, you don't have to use 40 cores if you think it's not worth it.
The number of cores to use must be understood as the maximum number of worker threads to launch.
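
One way to treat the given thread count as an upper bound is sketched below. This is only an illustration: the helper `choose_threads`, the example workload `scale_cells`, and the figure of 4096 items per thread are all assumptions, not something prescribed in this thread, and the right figure would have to come from benchmarking. The `num_threads` clause itself is standard OpenMP.

```c
/* Hypothetical helper: pick a team size of at most max_threads such that
   every thread gets at least min_items_per_thread items of work. */
int choose_threads(int max_threads, long work_items, long min_items_per_thread)
{
    long wanted;

    if (max_threads < 1 || work_items < 1 || min_items_per_thread < 1)
        return 1;
    wanted = work_items / min_items_per_thread;
    if (wanted < 1)
        wanted = 1;
    if (wanted > max_threads)
        wanted = max_threads;
    return (int)wanted;
}

/* Placeholder workload: multiply every cell by a factor. */
void scale_cells(double *cells, long n, double factor, int max_threads)
{
    /* 4096 is an assumed, untuned minimum amount of work per thread. */
    int team = choose_threads(max_threads, n, 4096);
    long i;

    #pragma omp parallel for num_threads(team)
    for (i = 0; i < n; i++)
        cells[i] *= factor;
}
```

With these assumed numbers, a 100 x 100 matrix (10000 cells) and a limit of 40 would run on 2 threads, while a very large matrix would use all 40.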

Regards, paul


Thanks for replying.

@soft_parallel - what do you mean by file format? do you mean matrix size?

@Paul Guermonprez

I agree with that. But my problem is that doing something like:

if (matrix size is small)
    use a small number of threads
else
    use more threads


is not the best way to do it. It is too tied to the "core power" of one machine. If I run the app on a slower CPU, I may need 3 threads instead of 1 to do the job fast, so limiting the thread count by hand is not how I think this problem should be solved. That's why I am asking whether there is some #pragma omp directive that would take this decision for me. I couldn't find anything that might help.

And this issue did not appear only on a 5 x 5 matrix. It also appeared on a larger matrix, one I might even consider quite large.

Finding the optimal number of operations per thread on the MTL processor wouldn't be much of a problem: a night spent testing and we have the result we need. But that means only that one processor is guaranteed to run as fast as possible, and from how I see the problem, that's not a good solution.
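
For reference, OpenMP does have an `if` clause on parallel constructs: when its condition is false, the region runs on a single thread and no team is started. It does not choose the cutoff for you, so the sketch below still carries a hand-picked number. `PARALLEL_CUTOFF` and `sum_cells` are hypothetical names, and the summation is only a stand-in for the real per-cell work.

```c
/* Assumed, untuned cutoff: below this many cells the loop stays serial. */
#define PARALLEL_CUTOFF 10000

/* Placeholder workload: sum all cells of a rows x cols matrix stored row by row. */
long sum_cells(const int *cells, int rows, int cols)
{
    long total = 0;
    long n = (long)rows * cols;
    long i;

    /* The if clause makes the serial/parallel decision at run time. */
    #pragma omp parallel for reduction(+:total) if(n >= PARALLEL_CUTOFF)
    for (i = 0; i < n; i++)
        total += cells[i];

    return total;
}
```

The cutoff could be read from a configuration value or measured once at startup instead of being compiled in, which would address part of the portability concern, but that is a design choice outside what OpenMP itself provides.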


And my problem is the same...why are the cores at 100%?? what could cause this? only waiting for memory operations?

Hello again, here's a quick example, with no synchronization, no nothing:

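#include <stdio.h>
#include <stdlib.h>

int main(void){
        int i;
        int *conc_var = malloc(sizeof(int));
        *conc_var = 0; /* give the shared value a defined content */
        #pragma omp parallel
        printf("%d ",*conc_var); /* executed once by every thread */
        free(conc_var);
        return 0;
}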

If you run this with OMP_NUM_THREADS=1, it runs instantly, but with OMP_NUM_THREADS=100 it takes a long time on the MTL. Why? Because of the thread overhead and the concurrent access to the same address. In practice, the CPUs are busy-waiting, and that's why they're always at 100%.
Is this what you were after ?


So the input parameter you set is the maximum number of threads our program can start?

Also, is there a link to some manual where I can see how this system works (how it arranges threads over cores, in which order, where it allocates memory, etc.)?


OpenMP is easy, but you still have to follow a basic training to use it properly.

In your case you use "parallel", not "parallel for", which means everything will be executed several times.
If you launch with one thread, you'll have 1000000 iterations.
If you launch with 10 threads, you'll have 1000000 * 10 iterations, not 1000000 iterations spread over 10 threads as you might expect.

And I'm not even talking about shared variables ...
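
The difference can be checked with a small counting experiment. This is a sketch with made-up function names; the loop index is declared inside the parallel region in the first function so that each thread has its own copy, which avoids the shared-variable problem hinted at above.

```c
/* "parallel": every thread in the team runs the whole loop. */
long count_with_parallel(long n)
{
    long executed = 0;

    #pragma omp parallel
    {
        long i; /* declared inside the region, so private to each thread */
        for (i = 0; i < n; i++) {
            #pragma omp atomic
            executed++;
        }
    }
    return executed; /* n times the number of threads in the team */
}

/* "parallel for": the n iterations are divided among the threads. */
long count_with_parallel_for(long n)
{
    long executed = 0;
    long i;

    #pragma omp parallel for reduction(+:executed)
    for (i = 0; i < n; i++)
        executed++;
    return executed; /* always n */
}
```

Compiled with OpenMP enabled and run with OMP_NUM_THREADS=10, the first function should report ten times as many iterations as the second, assuming the runtime actually grants 10 threads.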

the videos are available here, including openmp
and the slides :

regards, paul
