There is a lot of overhead in OpenMP. A parallelized loop will run slower than the serial version unless the loop takes a significant fraction of a second single-threaded: tens of milliseconds, minimum.
Also, if the task is memory-bandwidth bound rather than compute or cache bound, it will run no faster parallelized, regardless of the API used, at least on most single-socket hardware. You have to know your system architecture here.
Thanks for the feedback. Your comments provide me something to work on. I can easily construct a test case where I gradually increase the time of the inner loops so that I can evaluate the effect of the overhead. Knowing about the overhead will also provide a better base for evaluating other parts of our software which could be parallelized.
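A minimal sketch of the kind of test case I have in mind, in C++ with OpenMP. All names here (`kernel`, `loop_serial`, `loop_parallel`, `sweep_overhead`) and the arithmetic inside `kernel` are placeholders I made up for illustration, not code from our software. The idea is to grow the per-iteration work step by step and time the same loop with and without the OpenMP pragma, so the point where the parallel version starts to win becomes visible.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Placeholder workload: `work` controls how much arithmetic each element costs.
static double kernel(int i, int work) {
    double x = i * 1e-3;
    for (int k = 0; k < work; ++k) x = std::sqrt(x + 1.0);
    return x;
}

// Reference loop without any OpenMP directive.
double loop_serial(int n, int work) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += kernel(i, work);
    return sum;
}

// Same loop with a parallel-for reduction. If OpenMP is not enabled at
// compile time, the pragma is ignored and this runs serially.
double loop_parallel(int n, int work) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) sum += kernel(i, work);
    return sum;
}

// Wall-clock time of one call to f, in seconds.
template <class F>
double time_seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}

struct Sample { int work; double serial_s; double parallel_s; };

// Quadruple the inner-loop work each step and time both variants.
std::vector<Sample> sweep_overhead(int n, int max_work) {
    std::vector<Sample> out;
    loop_parallel(n, 1);  // warm-up so one-time thread creation is not timed
    for (int work = 1; work <= max_work; work *= 4) {
        double s = 0.0, p = 0.0;
        double ts = time_seconds([&] { s = loop_serial(n, work); });
        double tp = time_seconds([&] { p = loop_parallel(n, work); });
        // The reduction may add in a different order, so compare with a tolerance.
        assert(std::fabs(s - p) <= 1e-9 * std::fabs(s));
        out.push_back({work, ts, tp});
    }
    return out;
}

// Print one line per sample for eyeballing the crossover point.
void print_sweep(const std::vector<Sample>& samples) {
    for (const Sample& s : samples)
        std::printf("work=%d serial=%.6fs parallel=%.6fs\n",
                    s.work, s.serial_s, s.parallel_s);
}
```

To try it, call something like `print_sweep(sweep_overhead(1000, 4096))` from `main` and build with OpenMP enabled (for GCC, `-fopenmp`). A single timing per step is noisy, so in practice I would repeat each measurement several times and take the minimum or median.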