I am running a visualization program (it visualizes a large dataset) that can use either MPI or pthreads. On my desktop, which has an Intel i7-2600K (4 cores, 8 threads), I get better performance with pthreads (using many threads, e.g., 32) than with MPI, which I guess is normal.

But when I run the same code on a single node of a cluster, which has an Intel Xeon E5-2680 v2 (10 cores, 20 threads), the pthreads performance is worse than MPI: about 70 s with MPI versus 180 s with pthreads. Even worse, the pthreads performance on the Xeon E5-2680 v2 is lower than on the i7-2600K: around 100 s on the 2600K but 180 s on the E5-2680 v2 (same number of threads on both). I checked with the top command and all the cores are active while the program runs.
So my question is: why is this happening? Should I be compiling the code differently on the E5-2680 v2?
Are there environment variables I should set, like KMP_AFFINITY or something else?
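(As far as I understand, KMP_AFFINITY only affects Intel's OpenMP runtime, not raw pthreads, so maybe I should pin the threads explicitly instead. Below is a minimal sketch of what I mean, assuming a simple one-thread-per-logical-core layout; the worker function and thread count are placeholders, not my actual code.)

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <stdio.h>

#define NUM_THREADS 20  /* placeholder: one per hardware thread on the E5-2680 v2 */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* ... this thread's share of the visualization work would go here ... */
    (void)id;
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    for (long i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker, (void *)i);

        /* Pin thread i to logical CPU i % ncpus so the scheduler cannot migrate it. */
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET((int)(i % ncpus), &cpuset);
        pthread_setaffinity_np(threads[i], sizeof(cpu_set_t), &cpuset);
    }

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
```

Is something like this the right direction, or is there a better way to control thread placement on this machine?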
Any suggestions will be most welcome.