Artificial Neural Nets and Hyper-Threading Technology

by Chuck Desylva


Different methods for optimizing AI algorithms to take advantage of an Intel® Pentium® 4 Processor with Hyper-Threading Technology

The purpose of this paper is to highlight several key artificial intelligence (AI) software technologies and some simple changes that can be made to them to gain performance improvements on the Pentium® 4 and Intel® Xeon® processors.

Neural Architectures in Software

There are many AI architectures in use. They commonly lend themselves to a variety of application uses: game AI, image compression, speech recognition, signal processing, finance algorithms (for speculative based processing), and data mining. The type of Artificial Intelligence algorithm addressed in this paper is called an Artificial Neural Network, or ANN for short.

There are many sources of information regarding ANNs and their use. My objective is to illustrate ANN optimization on Intel® hardware. Since ANN architectures typically have tremendous potential for parallelism, my focus in this article is enhancing performance of an ANN using Hyper-Threading Technology (HT technology).

Artificial Neural Nets (a brief overview)

Before we get into source code optimizations and their significance, it's important to understand some basics about this architecture.

What is a Neural Network? Simply stated, it is a network of very simple processors (where each processor may have a small amount of memory). These simple processors connect to each other by unidirectional communication paths and typically carry symbolic data. See, Figure 1, which shows binary data being used in an artificial neuron in ways analogous to the symbolic functioning of the biological brain. Though many learning methods exist, 'back-propagation of error,' the most common type of method, is used here. With this method, corrective adjustments on the neural net are mediated by back-propagating error signals from one neuron to those above it. In this way, the net moves closer to the correct result. Also, the topology used in these examples is that of a simple feed-forward net, without any recursive paths (thus creating small memories) in them.

Figure 1. Biological to Artificial Neural Mapping Concept

It is also important to note that while ANNs can typically compute any type of function, in practice they are especially useful:

  • For mapping problems that are tolerant of some error, and
  • For problems where lots of example data is available and hard and fast algorithmic rules are not easily applied.


ANNs today are not easily applied to problems concerning manipulation of symbols and memory. It is left to the reader to further research concepts of Back Propagation, Overfitting and other ANN-specific terms, including, for example, Bias Inputs.

ANN Code Optimization - Preliminaries

For the most part, code used in these experiments was obtained either from* or the book "AI Techniques for Game Programming" by Mat Buckland and Andre' LaMothe.

The code was tested only on machines with Microsoft* Windows* XP Professional with Service Pack 1. This OS supports HT technology (with a special Kernel .Dll's installed). Performance was measured by using the processor's RDTSC instruction.

It is assumed that the reader is familiar with HT technology and how it operates within the processor. More information on this technology is available at Pentium® 4 Developer Center, Hyper-Threading Technology.

Knowledge of OpenMP* is desirable. For more information on OpenMP please refer to the reference material at*. The Intel® C/C++ Compiler (version 7.0), which supports OpenMP, was used in examples below.

Opportunities to enhance performance using Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) are somewhat limited because ANNs rely on a simple Feed-forward algorithm computed for each node in the net.

Taking Advantage of HT technology

Let's turn now to performance enhancements that can be gained by multithreading the algorithm for HT technology.

Regardless of programming language used, an ANN will have at its core an update routine; a routine that sums the outputs and passes that sum to an activation function which

  • Propagates the output trigger for the neuron, and
  • Back propagates the computed error to the parent nodes (artificial neurons). This simple function is shown in figure #2 (in a C/C++ representation).


Figure 2. Simple ANN Neural update function.

Initial testing of this algorithm was done for 10 million artificial neurons (approximately 1 ten-thousandth the number in the average human brain). Performance was increased 10% by using the Intel® lib_sse2_exp function (in the short vector math library svml_disp.lib) instead of the default exp function (in the default C math library1libc.lib), and compiling using the Intel® C/C++ Compiler with /G7 and /QaxW switches.

Handcoding the exp call to the Intel® Approximate Math Library AMATHS.LIB (/sites/default/files/m/c/a/3/ yielded an additional performance gain of 3%. However, since values computed in the function are all 32-bit single precision floating point, some accuracy (5th position after the decimal typically) was sacrificed with this method. While accuracy in many cases is not a problem, individual performance/accuracy tradeoffs must be evaluated.

In most cases, our examples use the Microsoft Standard template Library (STL) that ships with Microsoft* Visual Studio*. The first application called 'Tiddlywinks' can be downloaded from* >> "Genetic Algorithms". There are several problems with this application, the first being that STL with Visual Studio 6.0* is thread-safe. In order to change this, it is recommended that 11 Headers and 1 source file be replaced with thread-safe versions available at*.

VTune® analyzer shows that changing these core components will provide a performance benefit of ~4.3X in kmp_wait and ~3X in kmp_x86_pause commands which OpenMP uses to synchronize threads (Figure 3.).

Figure 3. Synchronization Latencies after Dinkumware STL Patch.

With a patched STL, the code was modified with OpenMP pragmas (directives which tell the compiler to insert threading constructs in the source code). Starting from the coarsest level of granularity in the code, the first function for threading potential is WinMain. However, this cannot be threaded with OpenMP due to the single (non-thread shared) hWnd dependency between the Message Passing functions (PeekMessage, TranslateMessage and DispatchMessage) and the Rendering block. The next best choice is to go down a level and thread the Genetic Algorithm's Epoch function (called from WinMain). The method chosen was to use OpenMP sections (see Figure 4.).

Figure 4. OpenMP Threading of Genetic Algorithm "Epoch" Function.

If you're not familiar with OpenMP, you probably won't understand the significance of the "" statement. This statement tells the compiler to distribute the now split for-loop into two threads of execution.

The performance increase for the algorithm is based on the number of genetic updates completed per 50 generations (See Figure 5.). This resulted in approximately a 20% performance increase against the original unmodified source. Testing the OpenMP threaded applet was done on a 3.06 GHz Pentium® 4 processor platform with 512 MB RMBS memory (Reference platform). HT technology was initially enabled with the OpenMP thread paths in place and then was disabled by removing code paths for OpenMP threading.

Figure 5. Neural Update (Epoch's Completed) time speedup

This same method of optimization was used in another applet called "Smart Sweepers", also obtained from* >> "Genetic Algorithms". Once again, modify the Neural update function (called CController::Update() in this case) as before. Competitive scaling results of the OpenMP sectioned Neural Update function (see Figure 6.) against the same unmodified source yielded approx imately a 2X performance boost on the Intel® reference platform with HT technology enabled.

Figure 6. OpenMP Sectioned Neural Update Function for Smart Sweepers Applet.

Figure 7. Competitive Performance Scaling using OpenMP with HT technology.


1 Default library used was from Microsoft VC++ 6.0 Compiler (ver. 12.00.8804) with Service Pack 5 and Processor Pack installed.

Optimization of the Back Propagation Algorithm

Next let's look at optimizing the ANN by making changes to the back propagation algorithm itself, as opposed to adding additional functionality. Our focus is the sigmoid function because it gets called for every neuronal update. There are a number of sigmoid functions in use, but for the cases of performance and accuracy only a finite number will suffice. In this case, the best fit was to use linear approximation to avoid significant loss of learning characteristics to the network and at the same time provide a significant performance boost. Figure 8 lists the commented C/C++ source of the original function call, which uses the exponent transcendental and applies the linear function.

Figure 8. Linear Approximated (Fast) Sigmoid Function

Using this method in another applet called "Neural Ants" (where artificial ants with small nets run around consuming food) a 3.6X performance gain was obtained over the original sigmoid utilizing the exp function call. Further, utilizing this method not only improved Neural Update Quanta, but improved the fitness of the "best" nets (Ants) in the system.


In this article we touched on the basics of Artificial Neural Net algorithms (ANNs), their parallel nature and considered the advantages they offer in various software solutions. ANNs and Hyper-Threading Technology are indeed ideally suited for each other. By functionally decomposing ANN work, dividing it among logical processors and employing additional optimization methods, you can achieve a significant performance gains.

Related Resources




Developer Centers

About the Author

Chuck Desylva is a Software Applications Engineering Manager in the Intel Software and Solutions Group. He and his team are responsible for the performance optimization of cutting edge consumer software titles running on Intel Desktop systems. Prior to working on application software optimization, Chuck worked as a driver developer for Intel Corporation. He was involved in developing/deploying the first device drivers for USB, AGP (GART) and Intel’s first graphics devices (i740/810(e)).

For more complete information about compiler optimizations, see our Optimization Notice.