Parallel strands cannot run in parallel while they contend for the same lock. In some programs, lock contention can eliminate virtually all of the performance benefit of parallelism. In extreme cases, such programs can even run significantly slower than the corresponding single-processor serial program. Consider using a reducer instead of a lock where possible.
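The cost of a contended lock, and the idea behind a reducer, can be sketched in standard C++. This is an analogy only, not the Intel® Cilk™ Plus reducer API: it uses `std::thread` rather than Cilk keywords, and the helper names `chunk_bounds`, `sum_with_lock`, and `sum_with_partials` are hypothetical. The first function takes a shared lock on every element. The second gives each worker its own private accumulator and combines the results once at the end, which is roughly what a reducer's per-strand views do for you.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: half-open index range [begin, end) handled by worker w.
std::pair<std::size_t, std::size_t> chunk_bounds(std::size_t size, std::size_t num_workers, std::size_t w) {
    const std::size_t chunk = (size + num_workers - 1) / num_workers;
    const std::size_t begin = std::min(w * chunk, size);
    const std::size_t end = std::min(begin + chunk, size);
    return {begin, end};
}

// Every worker adds to one shared total, taking the lock for each element.
long long sum_with_lock(const std::vector<int>& data, std::size_t num_workers) {
    assert(num_workers > 0);
    long long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            const auto range = chunk_bounds(data.size(), num_workers, w);
            for (std::size_t i = range.first; i < range.second; ++i) {
                std::lock_guard<std::mutex> guard(total_mutex);  // contended on every element
                total += data[i];
            }
        });
    }
    for (auto& t : workers) t.join();
    return total;
}

// Each worker accumulates privately; partial sums are combined after all workers finish.
long long sum_with_partials(const std::vector<int>& data, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<long long> partials(num_workers, 0);
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            const auto range = chunk_bounds(data.size(), num_workers, w);
            long long local = 0;  // private to this worker, so no lock is needed
            for (std::size_t i = range.first; i < range.second; ++i) local += data[i];
            partials[w] = local;  // one write per worker, each to a distinct slot
        });
    }
    for (auto& t : workers) t.join();
    return std::accumulate(partials.begin(), partials.end(), 0LL);
}
```

Both functions return the same sum. The second acquires no lock at all, so the workers never wait on one another while accumulating. Whether that yields a measurable speedup on a given machine is something to confirm by timing both versions.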
Parallel programs have numerous additional performance considerations and opportunities for tuning and improvement.
In general, the Intel® Cilk™ Plus runtime uses processor resources efficiently through a scheduling algorithm called work stealing. The work-stealing algorithm is designed to minimize the number of times that work is moved from one processor to another.
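As a rough illustration of the general idea, the toy model below shows the double-ended queue that textbook descriptions of work stealing usually assume: the owning worker pushes and pops tasks at one end, and an idle worker steals from the opposite end. The type `WorkDeque` and its member names are hypothetical. The model is single-threaded, with no synchronization, and it is not the runtime's actual data structure.

```cpp
#include <cassert>
#include <deque>

// Toy, single-threaded model of one worker's task deque (hypothetical type).
struct WorkDeque {
    std::deque<int> tasks;  // front = oldest task, back = newest task

    // The owning worker adds new work at the back.
    void push(int task) { tasks.push_back(task); }

    // The owning worker takes the newest task from the back.
    bool pop_local(int& out) {
        if (tasks.empty()) return false;
        out = tasks.back();
        tasks.pop_back();
        return true;
    }

    // An idle worker (the thief) takes the oldest task from the front.
    bool steal(int& out) {
        if (tasks.empty()) return false;
        out = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```

In this model a worker moves work between processors only when it runs out of its own tasks. Because the thief takes the oldest entry, which in a recursive computation tends to represent the largest remaining piece of work, steals can stay infrequent.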
Measure performance to find and understand bottlenecks. Even small changes in a program can lead to large and sometimes surprising performance differences. The only reliable way to tune performance is to measure frequently, preferably on a mix of different systems. Use any tool or technique at your disposal, but only actual measurements can show whether your optimizations are effective.
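When no profiler is at hand, a small timing harness is one simple way to take such measurements. The sketch below uses only the standard `std::chrono::steady_clock`. The helper name `best_time_seconds` is hypothetical, and reporting the fastest of several runs is just one common convention (a median is another).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `work` `repetitions` times and returns the
// fastest observed wall-clock time in seconds.
template <typename Work>
double best_time_seconds(Work&& work, int repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;  // monotonic, so unaffected by system clock adjustments
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repetitions; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

To compare two versions of a routine, call `best_time_seconds` on each with the same input and repetition count. Repeat the comparison on every system you care about, since the ranking can differ from one machine to another.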