In the last example posted in this thread, I can't imagine why parallel sections would be used rather than parallel do, nor why the inner loop would be the one chosen for OpenMP parallelization. If threaded parallelism is required without any thought given to optimization, /Qparallel (the compiler's auto-parallelization option) would be preferable, even though it is still not often effective.
As to the minimum problem size for effective OpenMP parallelism, I have an example which achieves excellent threaded scaling on Core 2 Duo even though the non-threaded version takes only 1 millisecond. Of course, this is an ideal case; the cache sharing is effective, as is the reuse of persistent threads left from a previous parallel region. The Intel OpenMP run-time does show lower overhead than the Microsoft and GNU libraries.
The basic point, that OpenMP parallelism will not have an advantage for a simple inner loop of length 1000, does apply to the posted case.
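To make the parallel do suggestion concrete, here is a minimal sketch. It is not the code posted earlier in the thread; the array names, the outer trip count, and the summation itself are made up for illustration. The only thing taken from the discussion is the inner loop length of 1000. The idea is to put a single parallel do on the outer loop and leave the short inner loop serial, so the threads are started once for the whole nest and the compiler is still free to vectorize the inner loop.

```fortran
program outer_loop_parallel_do
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_inner = 1000   ! inner loop length discussed in the thread
  integer, parameter :: n_outer = 4000   ! hypothetical outer trip count
  real(dp), allocatable :: a(:,:), col_sum(:)
  real(dp) :: s
  integer :: i, j

  allocate(a(n_inner, n_outer), col_sum(n_outer))
  call random_number(a)

  ! One parallel region for the whole loop nest: each thread takes a
  ! contiguous block of outer (j) iterations.
  !$omp parallel do private(i, s) schedule(static)
  do j = 1, n_outer
     s = 0.0_dp
     ! The inner loop stays serial; it walks down one column, which is
     ! contiguous in memory in Fortran.
     do i = 1, n_inner
        s = s + a(i, j)
     end do
     col_sum(j) = s
  end do
  !$omp end parallel do

  print *, 'total =', sum(col_sum)
end program outer_loop_parallel_do
```

The directives are ordinary comments unless OpenMP is enabled (for example /Qopenmp with Intel Fortran on Windows, or -fopenmp with gfortran), so the same source builds as a serial program for comparison. Whether the threaded build is actually faster at this size depends on the machine and run-time library, so it is worth timing both.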
Thanks for the feedback. I am new to the world of parallelization (obviously), so I will look at parallel do as well.