I am developing a neural network package using the Intel C++ compiler 9.0. The code is so parallel that using OpenMP is a no-brainer; the problem is knowing when not to use it.
Most of my code consists of vector operations (dot product, vector addition, scaling, etc.). What I am looking for is guidance on when parallelizing becomes detrimental - for example, it is probably worth parallelizing A.B if the dimensionality of vectors A and B is 10^6. But should I parallelize such loops when I expect the typical dimensionality to be 100, 1000, or 10000?
I would appreciate it if anyone can provide guidance on this.
Btw, I noticed (after spending 2 days tearing my hair out :) that the Intel optimizer (/O3 /Qip) does not perform scalar replacement in loops nested inside parallelized loops - this really slowed my application down.
Also, is it to be expected that parallelized loops are not vectorized? I would have thought it would be possible, at least with static scheduling.