I have written a program that is fully vectorized with Intel intrinsics and parallelized with OpenMP pragmas. It reaches quite good performance (300 GFLOP/s). I have now tried to run the same intrinsics code with POSIX threads on all 240 hardware threads, but the performance is quite bad (15 GFLOP/s). I tried both pthread_setaffinity_np and sched_setaffinity, but neither gives good performance. A code sample is attached (for some reason my code gets blocked by the spam filter if I post it here). I have also attached a short VTune analysis, which looks quite strange (the kernel starts at 47 s and is called dslash; its threads are called p16_run)! The kernel has the same structure as in my OpenMP code. I just replaced all pointers with pointers taken from the input struct of each POSIX thread.
My questions are: Did I set up the CPU masks with the cpu_set_t macros correctly? And why is my performance so bad?
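
For reference while comparing against the attachment, here is a minimal sketch of one common way to build a per-thread CPU mask on Linux with glibc. It is not the attached code: the names pin_arg_t, pinned_worker and run_pinned are hypothetical, and the round-robin placement over the CPUs the process is allowed to use is only an assumption about the intended layout. Each thread builds its own single-CPU mask with CPU_ZERO and CPU_SET, applies it with pthread_setaffinity_np, and reads it back with pthread_getaffinity_np to confirm that the pinning took effect.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes the CPU_* macros and pthread_setaffinity_np */
#endif
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical per-thread argument, not taken from the attachment. */
typedef struct {
    int cpu; /* logical CPU this thread should be pinned to */
    int ok;  /* set to 1 if the mask read back matches the request */
} pin_arg_t;

void *pinned_worker(void *p)
{
    pin_arg_t *a = p;
    cpu_set_t want, got;

    CPU_ZERO(&want);        /* always clear the set before using it */
    CPU_SET(a->cpu, &want); /* exactly one logical CPU per thread */

    /* The size argument is the size of the set in bytes, not a CPU count. */
    if (pthread_setaffinity_np(pthread_self(), sizeof(want), &want) != 0)
        return NULL;
    if (pthread_getaffinity_np(pthread_self(), sizeof(got), &got) != 0)
        return NULL;

    a->ok = (CPU_COUNT(&got) == 1 && CPU_ISSET(a->cpu, &got));
    /* ... the compute kernel would run here ... */
    return NULL;
}

/* Starts nthreads threads, pinned round-robin over the allowed CPUs.
   Returns the number of threads whose pinning was verified, or -1 on
   a setup error. */
int run_pinned(int nthreads)
{
    cpu_set_t allowed;
    if (nthreads <= 0 || sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    int ncpu = CPU_COUNT(&allowed);
    if (ncpu <= 0)
        return -1;

    int *cpus = malloc((size_t)ncpu * sizeof *cpus);
    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    pin_arg_t *arg = malloc((size_t)nthreads * sizeof *arg);
    if (!cpus || !tid || !arg) {
        free(cpus); free(tid); free(arg);
        return -1;
    }

    /* Collect the IDs of the CPUs this process is allowed to run on. */
    int n = 0;
    for (int c = 0; c < CPU_SETSIZE && n < ncpu; ++c)
        if (CPU_ISSET(c, &allowed))
            cpus[n++] = c;

    int started = 0;
    for (int i = 0; i < nthreads; ++i) {
        arg[i].cpu = cpus[i % n];
        arg[i].ok = 0;
        if (pthread_create(&tid[i], NULL, pinned_worker, &arg[i]) != 0)
            break;
        ++started;
    }

    int verified = 0;
    for (int i = 0; i < started; ++i) {
        pthread_join(tid[i], NULL);
        verified += arg[i].ok;
    }

    free(cpus); free(tid); free(arg);
    return verified;
}
```

Calling run_pinned(240) from main and comparing the result with 240 shows whether every thread ended up with the mask it asked for; build with -pthread. A correct mask alone does not explain or rule out the performance gap, so this only helps to check the affinity part of the question.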