| Last Modified On : | July 26, 2007 11:06 AM PDT |
Rate |
|
by Chuck Desylva
Different methods for optimizing AI algorithms to take advantage of an Intel® Pentium® 4 Processor with Hyper-Threading Technology
The purpose of this paper is to highlight several key artificial intelligence (AI) software technologies and some simple changes that can be made to them to gain performance improvements on the Pentium® 4 and Intel® Xeon® processors.
There are many AI architectures in use. They commonly lend themselves to a variety of application uses: game AI, image compression, speech recognition, signal processing, finance algorithms (for speculative based processing), and data mining. The type of Artificial Intelligence algorithm addressed in this paper is called an Artificial Neural Network, or ANN for short.
There are many sources of information regarding ANNs and their use. My objective is to illustrate ANN optimization on Intel® hardware. Since ANN architectures typically have tremendous potential for parallelism, my focus in this article is enhancing performance of an ANN using Hyper-Threading Technology (HT technology).
Before we get into source code optimizations and their significance, it's important to understand some basics about this architecture.
What is a Neural Network? Simply stated, it is a network of very simple processors (where each processor may have a small amount of memory). These simple processors connect to each other by unidirectional communication paths and typically carry symbolic data. See, Figure 1, which shows binary data being used in an artificial neuron in ways analogous to the symbolic functioning of the biological brain. Though many learning methods exist, 'back-propagation of error,' the most common type of method, is used here. With this method, corrective adjustments on the neural net are mediated by back-propagating error signals from one neuron to those above it. In this way, the net moves closer to the correct result. Also, the topology used in these examples is that of a simple feed-forward net, without any recursive paths (thus creating small memories) in them.
Figure 1. Biological to Artificial Neural Mapping Concept
It is also important to note that while ANNs can typically compute any type of function, in practice they are especially useful:
ANNs today are not easily applied to problems concerning manipulation of symbols and memory. It is left to the reader to further research concepts of Back Propagation, Overfitting and other ANN-specific terms, including, for example, Bias Inputs.
For the most part, code used in these experiments was obtained either from http://www.ai-junkie.com* or the book "AI Techniques for Game Programming" by Mat Buckland and Andre' LaMothe.
The code was tested only on machines with Microsoft* Windows* XP Professional with Service Pack 1. This OS supports HT technology (with a special Kernel .Dll's installed). Performance was measured by using the processor's RDTSC instruction.
It is assumed that the reader is familiar with HT technology and how it operates within the processor. More information on this technology is available at Pentium® 4 Developer Center, Hyper-Threading Technology.
Knowledge of OpenMP* is desirable. For more information on OpenMP please refer to the reference material at www.openmp.org*. The Intel® C/C++ Compiler (version 7.0), which supports OpenMP, was used in examples below.
Opportunities to enhance performance using Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) are somewhat limited because ANNs rely on a simple Feed-forward algorithm computed for each node in the net.
Let's turn now to performance enhancements that can be gained by multithreading the algorithm for HT technology.
Regardless of programming language used, an ANN will have at its core an update routine; a routine that sums the outputs and passes that sum to an activation function which
Figure 2. Simple ANN Neural update function.
Initial testing of this algorithm was done for 10 million artificial neurons (approximately 1 ten-thousandth the number in the average human brain). Performance was increased 10% by using the Intel® lib_sse2_exp function (in the short vector math library svml_disp.lib) instead of the default exp function (in the default C math library1libc.lib), and compiling using the Intel® C/C++ Compiler with /G7 and /QaxW switches.
Handcoding the exp call to the Intel® Approximate Math Library AMATHS.LIB (http://www.intel.com/design/Pentium4/devtools/) yielded an additional performance gain of 3%. However, since values computed in the function are all 32-bit single precision floating point, some accuracy (5th position after the decimal typically) was sacrificed with this method. While accuracy in many cases is not a problem, individual performance/accuracy tradeoffs must be evaluated.
In most cases, our examples use the Microsoft Standard template Library (STL) that ships with Microsoft* Visual Studio*. The first application called 'Tiddlywinks' can be downloaded from http://ai-junkie.com* >> "Genetic Algorithms". There are several problems with this application, the first being that STL with Visual Studio 6.0* is thread-safe. In order to change this, it is recommended that 11 Headers and 1 source file be replaced with thread-safe versions available at http://www.dinkumware.com/vc_fixes.html*.
VTune® analyzer shows that changing these core components will provide a performance benefit of ~4.3X in kmp_wait and ~3X in kmp_x86_pause commands which OpenMP uses to synchronize threads (Figure 3.).
Figure 3. Synchronization Latencies after Dinkumware STL Patch.
With a patched STL, the code was modified with OpenMP pragmas (directives which tell the compiler to insert threading constructs in the source code). Starting from the coarsest level of granularity in the code, the first function for threading potential is WinMain. However, this cannot be threaded with OpenMP due to the single (non-thread shared) hWnd dependency between the Message Passing functions (PeekMessage, TranslateMessage and DispatchMessage) and the Rendering block. The next best choice is to go down a level and thread the Genetic Algorithm's Epoch function (called from WinMain). The method chosen was to use OpenMP sections (see Figure 4.).
Figure 4. OpenMP Threading of Genetic Algorithm "Epoch" Function.
If you're not familiar with OpenMP, you probably won't understand the significance of the "" statement. This statement tells the compiler to distribute the now split for-loop into two threads of execution.
The performance increase for the algorithm is based on the number of genetic updates completed per 50 generations (See Figure 5.). This resulted in approximately a 20% performance increase against the original unmodified source. Testing the OpenMP threaded applet was done on a 3.06 GHz Pentium® 4 processor platform with 512 MB RMBS memory (Reference platform). HT technology was initially enabled with the OpenMP thread paths in place and then was disabled by removing code paths for OpenMP threading.
Figure 5. Neural Update (Epoch's Completed) time speedup
This same method of optimization was used in another applet called "Smart Sweepers", also obtained from http://ai-junkie.com* >> "Genetic Algorithms". Once again, modify the Neural update function (called CController::Update() in this case) as before. Competitive scaling results of the OpenMP sectioned Neural Update function (see Figure 6.) against the same unmodified source yielded approx imately a 2X performance boost on the Intel® reference platform with HT technology enabled.
Figure 6. OpenMP Sectioned Neural Update Function for Smart Sweepers Applet.
Figure 7. Competitive Performance Scaling using OpenMP with HT technology.
1 Default library used was from Microsoft VC++ 6.0 Compiler (ver. 12.00.8804) with Service Pack 5 and Processor Pack installed.
Next let's look at optimizing the ANN by making changes to the back propagation algorithm itself, as opposed to adding additional functionality. Our focus is the sigmoid function because it gets called for every neuronal update. There are a number of sigmoid functions in use, but for the cases of performance and accuracy only a finite number will suffice. In this case, the best fit was to use linear approximation to avoid significant loss of learning characteristics to the network and at the same time provide a significant performance boost. Figure 8 lists the commented C/C++ source of the original function call, which uses the exponent transcendental and applies the linear function.
Figure 8. Linear Approximated (Fast) Sigmoid Function
Using this method in another applet called "Neural Ants" (where artificial ants with small nets run around consuming food) a 3.6X performance gain was obtained over the original sigmoid utilizing the exp function call. Further, utilizing this method not only improved Neural Update Quanta, but improved the fitness of the "best" nets (Ants) in the system.
In this article we touched on the basics of Artificial Neural Net algorithms (ANNs), their parallel nature and considered the advantages they offer in various software solutions. ANNs and Hyper-Threading Technology are indeed ideally suited for each other. By functionally decomposing ANN work, dividing it among logical processors and employing additional optimization methods, you can achieve a significant performance gains.
Literature
Websites
Related Themes
Developer Centers
Chuck Desylva is a Software Applications Engineering Manager in the Intel Software and Solutions Group. He and his team are responsible for the performance optimization of cutting edge consumer software titles running on Intel Desktop systems. Prior to working on application software optimization, Chuck worked as a driver developer for Intel Corporation. He was involved in developing/deploying the first device drivers for USB, AGP (GART) and Intel’s first graphics devices (i740/810(e)).
