I am comparing different shared-memory programming models on the Xeon Phi, and the results I get are a bit hard to justify.
I am running the same computation (say matrix multiplication) under each model, and:
1- For the approaches in Group1, I get a smaller CPI rate, but a somewhat larger elapsed time.
2- For the programming models in Group2, I get a higher CPI rate, but a slightly better elapsed time.
When I looked into the details, I realised that the number of Instructions Executed is another big difference between the two groups. So I interpret the results like this:
Group1 models try to utilise the caches. For that purpose they execute more instructions, and since those extra instructions mostly hit in cache, the average CPI rate improves a bit. But the larger instruction count can still lead to a larger elapsed time.
Group2 models execute fewer instructions but have a higher CPI rate (possibly as a result of more cache misses); even so, they can sometimes run the program faster than (or at least as fast as) the Group1 approaches.
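To sanity-check this interpretation, the classic performance equation time = instructions × CPI / clock rate shows how a smaller instruction count can outweigh a worse CPI. Here is a minimal sketch with purely hypothetical numbers (none of these counts are my actual measurements):

```python
# Iron law of performance: elapsed time = instructions * CPI / clock rate.
def elapsed_time(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz

clock_hz = 1.1e9  # assumed core clock, roughly in Xeon Phi territory

# Hypothetical numbers, for illustration only:
# Group1: more instructions, better (lower) CPI
t_group1 = elapsed_time(instructions=1.2e12, cpi=2.0, clock_hz=clock_hz)
# Group2: fewer instructions, worse (higher) CPI
t_group2 = elapsed_time(instructions=0.8e12, cpi=2.8, clock_hz=clock_hz)

print(f"Group1: {t_group1:.0f} s, Group2: {t_group2:.0f} s")
```

With these made-up figures, Group2 finishes sooner despite its higher CPI, which matches the pattern I am seeing: neither CPI nor instruction count alone predicts elapsed time.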
Do you think this is a valid conclusion? What else can cause a higher number of Instructions Executed?
Thanks in advance!