When optimizing your code for parallel hardware, consider using the following iterative approach:
Ignore the top two elements if you are not running on a cluster. There is no single recommended starting point, because the best first optimization varies from one application to another. Step back, survey all the potential optimizations, and identify where you can get the biggest gain for the least work. Start there.
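One way to make "biggest gain for the least work" concrete is to estimate, for each candidate optimization, how much total run time it would save and divide that by the effort involved. The sketch below is only an illustration: it uses Amdahl's law for the whole-program speedup, and the `Candidate` fields, the scoring formula, and every number in the sample list are hypothetical placeholders, not measurements or output from any Intel tool. In practice the time fractions would come from a profile of your own application.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str             # hypothetical label for an optimization
    time_fraction: float  # share of total run time spent in the affected code (0..1)
    local_speedup: float  # expected speedup of that code alone (> 1)
    effort_days: float    # rough estimate of the work required


def overall_speedup(c: Candidate) -> float:
    """Amdahl's law: whole-program speedup when only part of the run time improves."""
    return 1.0 / ((1.0 - c.time_fraction) + c.time_fraction / c.local_speedup)


def rank(candidates):
    """Sort candidates by fraction of run time saved per day of effort, best first."""
    def score(c: Candidate) -> float:
        saved = 1.0 - 1.0 / overall_speedup(c)
        return saved / c.effort_days
    return sorted(candidates, key=score, reverse=True)


# Illustrative, made-up estimates (assumptions, not real profile data).
candidates = [
    Candidate("vectorize inner loop", 0.50, 2.0, 2.0),
    Candidate("add threading", 0.80, 4.0, 10.0),
    Candidate("tune memory access", 0.30, 1.5, 1.0),
]

for c in rank(candidates):
    print(f"{c.name}: {overall_speedup(c):.2f}x overall, {c.effort_days:g} days")
```

With these placeholder numbers, threading gives the largest overall speedup but ranks last because of its cost, which is the kind of trade-off the advice above asks you to weigh. After each change, measure again and re-rank, since the time fractions shift as hotspots are removed.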
Use these Intel performance analysis tools for the performance optimization workflow:
Explore available performance analysis and tuning scenarios with