English | 中文 | Русский | Français
2,595 Posts served
8,341 Conversations started
Tech Project #1: Utilizing Genetic Algorithms to Identify Potential Software Performance Opportunities
In these blogs, I would like to discuss some of our failed technical projects at Intel in order to share some of the lessons we have learned. I am a believer that you learn just as much from your failures as from successes…so in these projects we learned a lot. My software group encourages us to take technical projects because we believe that some will lead to technical break-throughs. The ones I will describe in this blog are not the technical projects which worked and have been applied anywhere but instead the projects which have already been finished for several years and we decided not to pursue for one reason or another. I have saved the lessons learned until the end of the blog so that readers must read through the entire blog to get the take-aways…or they could just scroll down to the bottom. :)
This tech project ended back in 2004 and was absolutely fascinating. A colleague of mine, Jantz Tran, and I decided that genetic algorithms might help us unlock some of the potential of the Itanium architecture. Some of the features of Itanium are more difficult for the compiler to uncover since there are many complex factors of influence involved. For those of you not familiar with genetic algorithms, they are fairly useful in these types of situations. Use of genetic algorithms involves using competition between randomly created populations over many generations to simulate learning. Similar to Darwin’s “survival of the fittest” concept, the binaries are meant to get better over multiple iterations. Our implementation involved applying random small changes to several binaries and then we had them compete in performance. Over multiple iterations, “mutations” which improve performance are found and integrated into the binary. Once the software stopped improving in performance, the gains were reverse engineered by a program developed by Jantz accomplishing a binary search so that we knew what changes generated the performance gains. The random changes we made can affect the performance of the binary but we applied them to not affect the functionality to ensure instability is not introduced in this process. We were making too many changes to introduce instability and still have binaries which were still viable.
The parent and newly created children binaries were then compared in a bout of performance. The binary which showed the best performance on a particular workload becomes the parent of the next iteration while all other binaries were destroyed. I have created a wonderful block picture to again demonstrate the process. Please notice the storm cloud representing conflict between the binaries.
One example of the random changes which was applied was a temporal locality hint which can be applied to loads or stores on the Itanium architecture. Temporal locality hints give a hint to the processor which processor cache levels to brings the loads into and whether or not the Least Recently Used (LRU) will be incremented or not. Not incrementing the LRU can decrease the residence time that the line containing the address remains within the on-die cache. This hint can make the cache more efficient if none of the data contained in the cache line is used in the near future. Temporal locality hints can only impact performance but cannot impact the functionality of the program. The Itanium load before (a) and after (b) a temporal locality hit is applied is demonstrated below:
a) ld8 [r23] // Before temporal locality hint is applied as a mutation
b) ld8.nta [r23] // After a temporal locality hint is applied as a mutation
If this change increases the performance of the binary then that binary will win in the next contest and become the new parent binary…moved to the next round. Other examples changes which we experimented with are listed below. Most are of these are just “hints” to the architecture:
1) Temporal locality hints on loads - none, nta, or nt1
2) Branch prefetch hints - .few or .many
3) Branch taken/not taken hints - tk, nt
4) Line prefetch hints randomly place in with lfetch “address”
5) Temporal locality hints on stores – none, nta, or nt1
6) Data/control speculation
To determine whether we were missing any opportunities on performance we tried it on a simple gzip workload utilizing fully optimized binaries by profile guided optimizations etc. The end result of this experiment was a gain of 3% in the application performance over ~140 generations with 4 child binaries created for each generation.
Our framework for accomplishing this task was affectionately name "EVA" for evolution. "EVA" was in reality eva.pl, a massive 10k line perl script which would run and optimize the binary with no human intervention. The program was designed to create all children in parallel. I forget the exact time required for convergence but it was around 2 full days. Increasing the number of children created each generation, caused the genetics algorithm to converge faster so you can easily see how this could be potentially massively parallelized. In our runs we were serialized by running each binary by itself on the system to test performance. This was not done in parallel so that all the binaries running would not perturb each others performance.
Below is a graph of the performance of the time it took to run the gzip workloads over ~140 generations. Lower time is of course better in this case. Notice that the performance of the gzip binary improves over generations of the generated binaries.
We then accomplished a binary search on the binary to remove our tracked changes in order to reverse engineer the gains of the final binary. Each increase in performance gain was attributed to one or more changes. In the diagram below you can see each increase in performance on the gzip workload labeled. The following table has the attributed the increases in performance to changes in the assembly.
The surprising find during this experiment was that temporal locality hints were responsible for all of the gains on the gzip binary. This is one of the “aha” moments that I described at the beginning of the blog. Temporal locality hints are fairly useful on the Itanium architecture, it is difficult to determine where to place them unless a savvy developer is familiar with how often data within his or her program is touched.
The final conclusions on this project were that temporal locality hints look like they can give a binary, which has already been fully optimized by a good compiler, a nice performance boost. One other thing to note in this project is that we could have used genetic algorithms in a better manner. Leaving out cross linking, for instance, was one major mistake. Leaving this functionality out of genetic algorithms made us susceptible to local minima instead of converging on the best possible performance.
We were also never able to get the program to properly learn from its successes and mistakes. Several attempts were made at accomplishing this through giving the program incentives to more frequently apply optimizations which gave performance gains and decrease situations which lost performance. All attempts at learning always caused us to converge slower rather than faster. I also believe that on a non-fully optimized binary we would have seen much more gain from the branch hints, but genetic algorithms is hardly a efficient way of finding opportunities for branch hints. We also did have a couple of experiments where random software prefetches gave gains in performance on other binaries and workloads. These prefetches were accomplished on real loads through stepping back ~100 instructions or so in the path of execution but the mass majority of gains was always from temporal locality hints.
We are still looking for more opportunities to solve problems utilizing genetic algorithms. Perhaps one day my job as a software performance engineer will become eclipsed since a program, such as the one described in this blog, will be able to accomplish complex optimizations on software with no human intervention. Until that day, I will continue to play with some of these cool and unique methodologies to find more creative ways of analyzing and increasing software performance.
