I recently had a question from a customer who had introduced a successful optimization to a hot function in his application, but did not see as much improvement in the overall application as he expected. This is a fairly common occurrence in the iterative process of performance tuning. Usually it happens for one of two reasons.
1. Introducing an improvement in one area resulted in inefficiencies somewhere else. This is par for the course with performance tuning, and part of the reason why the process is iterative. It can be hard to anticipate whether a code change in one function will decrease performance somewhere else, so landing in this situation from time to time is unavoidable. Although you cannot always prevent it, good documentation practices and a tool like Intel® VTune™ Amplifier XE, used to quantify performance changes, can help you see when it is happening.
2. Not enough of the application was optimized (or the optimization was not enough). Fortunately there is a way to predict whether this situation might occur, using Amdahl's Law. Amdahl's Law is widely used in performance work, especially for projecting the theoretical scalability of parallel applications on a given problem size. But it is also very helpful during the tuning process: it gives a formula for the overall improvement you can expect from speeding up a fraction of the application. The formula is:
Speedup = 1 / ((1 - P) + P/S), where P is the fraction of the application's total time spent in the code being improved, and S is the speedup of that portion
To use the formula you need to know what percentage of the application’s overall time is being devoted to the function you are improving. You can determine this using the "Hotspots" or "Lightweight Hotspots" analysis in VTune Amplifier XE. These analysis types give you a breakdown of the functions called in your application, and how much time each took. To figure out the percentage, look at the CPU_CLK_UNHALTED (clockticks) value for the function you want to improve, then divide it by the overall total (one way to get this is to highlight all the rows from your application, then look at the bottom to see the total CPU_CLK_UNHALTED value). Then estimate how much you think you can improve the performance of that function. If, for example, you are vectorizing a loop in the function, you can use the speedup you expect from vectorization, which you can compute from the size of your data elements and the size of the SIMD registers available. For instance, 32-bit floats in 256-bit registers give 8 elements per instruction, so the loop can speed up by at most 8x.
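The two inputs described above can be sketched in a few lines of Python. This is only an illustration: the helper names `time_fraction` and `simd_lanes` are made up for this example, the clocktick counts are hypothetical rather than taken from a real profile, and the lane count is an upper bound on the vectorization speedup, not a measured one.

```python
def time_fraction(function_clockticks: int, total_clockticks: int) -> float:
    """Fraction P of the application's total clockticks spent in one function."""
    if total_clockticks <= 0:
        raise ValueError("total_clockticks must be positive")
    return function_clockticks / total_clockticks


def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Elements processed per SIMD instruction (an upper bound on S)."""
    if element_bits <= 0:
        raise ValueError("element_bits must be positive")
    return register_bits // element_bits


# Hypothetical CPU_CLK_UNHALTED values read from a Hotspots-style report.
p = time_fraction(function_clockticks=2_000_000_000,
                  total_clockticks=10_000_000_000)

# Assumed setup: 32-bit floats processed in 256-bit SIMD registers.
s = simd_lanes(register_bits=256, element_bits=32)

print(f"P = {p:.2f}, S <= {s}x")  # P = 0.20, S <= 8x
```

In practice the achieved S is usually lower than the lane count, because of unaligned data, remainder iterations, and memory bandwidth limits, so treat it as the optimistic end of the estimate.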
So, for example, if you had a function taking 20% of the total application time, and you were planning to improve it by 8x through vectorization, you could use Amdahl’s formula to compute the maximum potential improvement you could achieve for the overall application:
1 / ((1 - 0.2) + 0.2/8) = 1 / 0.825 ≈ 1.21x theoretical maximum application speedup
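The same calculation can be written as a small Python function, which makes it easy to try other values of P and S. The function name `amdahl_speedup` is chosen for this sketch only. The last line shows the ceiling the formula approaches as S grows without bound, which is 1 / (1 - P).

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of run time is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# The example from the text: 20% of the run time, improved 8x.
print(f"{amdahl_speedup(0.2, 8):.2f}x")  # 1.21x

# Even an unbounded speedup of that 20% cannot exceed 1 / (1 - P).
print(f"{1.0 / (1.0 - 0.2):.2f}x")  # 1.25x
```

The second number is a useful sanity check: if the ceiling for a function is already small, no amount of work on that function alone will deliver a large overall gain.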
Using Amdahl's Law in this way can help you compare options and determine where to spend your tuning effort. Hotspots-based tuning follows directly from Amdahl's Law: the formula tells us that the payoff is going to be insignificant when you tune parts of your application that do not take a significant fraction of total CPU time. So always tune your hotspots first, and use Amdahl's formula to get the most out of your efforts!
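As a rough sketch of comparing options this way, the snippet below ranks a few tuning candidates by their projected overall speedup. The function names, time fractions, and expected speedups are all invented for illustration; in a real session they would come from your profile and your own estimates.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of run time is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


# (function name, fraction of total time P, expected local speedup S)
# All values are hypothetical.
candidates = [
    ("parse_input", 0.05, 10.0),
    ("compute_kernel", 0.60, 2.0),
    ("write_output", 0.20, 8.0),
]

# Sort by projected overall speedup, best option first.
ranked = sorted(candidates,
                key=lambda c: amdahl_speedup(c[1], c[2]),
                reverse=True)

for name, p, s in ranked:
    overall = amdahl_speedup(p, s)
    print(f"{name:15s} P={p:.2f} S={s:4.1f}x -> overall {overall:.2f}x")
```

With these made-up numbers, a modest 2x gain in the function that takes 60% of the time beats an 8x gain in one that takes 20%, and a 10x gain in a function that takes only 5% barely registers.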