I am trying to parallelize an ITK application. The application has a very promiment loop with 50 iterations. I parallelized this loop using parallel_reduce() and ran the parallelized application on a 8-core machine. I noticed that a grain size of 1 is giving a better performance larger grain sizes. I am puzzled at this and wondering how this could be possible or under which condition this happens. Can anybody shed some light?