Simulation without a Simulator

Submit New Article

June 19, 2009 12:00 AM PDT


by Sergey N. Zheltov and Stanislav V. Bratanov


Introduction

This paper gives an overview of an automatic code modification and analysis technology designed to assist in architectural research.

Designing new processor architectures is a very time-consuming, labor-intensive, and expensive task. First, preliminary research is required to form a general concept of a new architecture long before simulators or pre-production samples are ready. Then, initial design concepts resulting from the preliminary research must be proved before making any decisions about what to simulate.

This underscores the importance of estimating essential characteristics when designing an architecture—before beginning the real work on simulation.

In addition, many architectural concepts may be determined by a particular problem to be solved, and the goal of a new architecture may be formulated in terms of making existing software or theoretical algorithms applicable to real-time computation.

To assist in solving the complex task of estimating various architectural parameters related to a particular algorithm, the technology of automatic code modification and analysis (Code Processor) was developed.

Further in this paper the technology is described in detail, with linked code samples, and illustrated by an example of a high definition video encoder.


Suggested Solution

The need to analyze separate applications dictated the general idea of the solution—the simulating system must isolate an application of interest from the rest of the system. That is why we decided to work with the source code of the application and try to intercept all cases where the application being analyzed referenced a resource being simulated.

The above considerations were implemented as a set of tools including an assembly language analyzer, several simulator modules, and corresponding instrumentation modules.

The assembly language analyzer, the core of the technology, can be configured to intercept any instruction (instruction group, assembly construct, etc.) that may be of interest to a higher-level simulation module.

Figure 1 provides an overview of the instrumentation process and a general idea of how the technology operates.

First, a source code is compiled into an assembly code, which is then automatically analyzed and instrumented in accordance with the needs of a simulating module. In the example, the instrumentation includes the following steps: the context of execution is saved before each memory access, a callback function (named according to the simulator module used) is called with the parameters describing the memory access operation, and the execution context is restored. Finally, the memory access in question is executed.

Figure 1. Technology Overview

Figure 1. Technology Overview

Once the assembly file has been instrumented, it is further compiled into an object file, and then an executable module is built.

The resulting ex ecutable, now ready for simulation, may be either statically linked with a simulator module to form a single simulating package, or may require the simulator as an external dynamic link library.

As depicted in Figure 1, during execution the callback function we specified earlier at the instrumentation stage is invoked, and receives all necessary parameters describing the simulated operation (e.g., address, operand type and size, access type–read or write). The callback function has an option of calling a simulator module directly, performing pre-simulation computations, or saving necessary parameters in a file for further analysis.

The results of the simulation are available any time during program execution from the simulator module in use. This enables dynamic control over the process of simulation.

To clarify why this technology is easier to use than a 'normal' simulator and why a high-level programmer should bother with assembly instructions, Figure 2 is furnished to show what it really looks like to instrument one’s code.

Figure 2. User Interface Additions to a Normal Build Procedure

Figure 2. User Interface Additions to a Normal Build Procedure

Just choose Code Processor from the menu as if you were selecting a compiler, check the ‘Use parser-passed compilation’ checkbox, and select a desired instrumentation module.


Simulator Modules

While currently there are only two simulator modules implemented, in the authors’ opinion these modules are of primary importance. We determined this by asking the question: What makes a processor run slower? And the ‘effective’ means to slow down a processor’s performance were identified:

  • memory accesses that missed a cache;
  • mispredicted branches which cause pipeline flushes and execution stalls;
  • incorrect synchronization or concurrent resource usage by multiple threads.

So, we have:

  • thread-aware cache-coherent memory subsystem simulator module, and
  • branch prediction simulator module.

 

A more detailed description of each module follows.


Figure 3. A Sample Cache Hierarchy

Figure 3. A Sample Cache Hierarchy


Cache/Memory Subsystem

Currently implemented cache simulator module may support an arbitrary cache/memory configuration.

It also computes:

  • traffic in each direction for a cache (read/write operations);
  • cache line hits/misses;
  • information on split or unaligned operations.

 

In order to maintain complex cache systems for multiprocessor machines a bus object is introduced to simulate:

  • links between caches;
  • links between cores in multi-core architectures;
  • links between different units in a processo r;
  • traffic in each direction for a bus object.

 

The cache simulator module is thread-safe and maintains cache coherency when simulating multi-core architectures.

An example of a cache system is depicted in Figure 3.

For an example of programming, please refer to the code sample.


Branch Prediction

The branch prediction module is capable of simulating different branch prediction schemes and algorithms.

There are two algorithms implemented so far:

  • static BTFNT – a Backward branch is predicted as Taken, Forward as Not Taken;
  • and dynamic SPD – Simple Pattern Detection.

 

The former algorithm is widely used in modern processors to reduce the number of mispredicted branches. As is shown later in this article, the static prediction algorithm works very well in most cases.

The latter method concerns remembering the history of taken and not taken branches and applying this knowledge to the prediction of the next branch.

The branch prediction module provides a caller with the following information:

  • the number of predicted and mispredicted branches;
  • the number of conditional and unconditional branches;
  • the number of forward and backward branches;
  • detailed branch prediction statistics for each module and function.

 

Please see the code sample to learn more on how to use it.


Advantages

Here we list the advantages of both an architectural simulator and the suggested preliminary simulation technology to let readers decide what best suits their needs.

Advantages of having an architectural simulator:

  • No need to build a hardware system
  • Accurate results
  • The entire execution environment is simulated

Disadvantages:

  • It is often necessary to port the target application to a new architecture
  • Sometimes simulators impose restrictions on applications, e.g., limit the amount of available memory, I/O operations, etc.
  • It usually takes a lot of time and resources to develop a complete set of utility programs along with a simulator (at least a compiler + a simulator)

Code Processor technology advantages:

  • Uses existing compilers or assemblers for current  architectures
  • Fast simulation process
  • Application-oriented
  • An application being analyzed has no additional limitations (e.g., I/O, memory, etc.)
  • Extensible design – new instructions or instruction sets, new registers and register types simulation modules may be added independently, as well as modules to simulate power consumption, bus utilization, or any other system characteristics.

 


Results Achieved

The goal of our analysis was to find and approximate such architectural parameters that might let a computer system encode MPEG-2 HDTV video in the most efficient manner.

We selected the following application for our experiments:

MPEG-2 Video Encoder, HDTV resolution (1920x1088), 1 second duration, 30 frames, employing Logarithmic Motion Estimation algorithm.

During the experiment the following architectural parameters were estimated:

Cache hierarchy parameters:

  • number of cache levels (L1 + L2);
  • each cache’s associativity (4 through 32);
  • each cache’s size (1K through 128M);
  • cache line size (64 bytes).

Register file:

  • number of registers (16, 32, 64);
  • register length (8, 16, 32 bytes).

Branch prediction unit:

  • branch prediction ratio for different modules of the encoder.

 


Caches

Using a cache minimizes the traffic from a processor to memory.

In Figure 4, there are two flat regions in the graph of memory traffic. They correspond to efficient sizes of L1 and L2 caches. Choosing any cache size value within 64K-4M and 32M-128M keeps memory traffic at a steady and predictable level.


Figure 4. Simlated Cache System Parameters

Figure 4. Simulated Cache System Parameters


Register File Simulation

In order to estimate register file parameters, information on all memory operations was saved and processed in a manner that would enable the efficient use of a certain set of registers. The register sets differed in register length and number. At each point of the processing algorithm a window of memory operations was analyzed, and the most frequently accessed memory addresses within the window were stored in registers. This approach would allow for lower memory traffic for more efficient register utilization. Hence, similar to the case of cache analysis, register file parameters that minimize the traffic should be considered the most efficient.


Figure 5. Simulated Register File Parameters

Figure 5. Simulated Register File Parameters

As shown in Figure 5, more registers are better for the MPEG encoder, though increasing the register size does not necessarily lead to positive results.


Branches

The goal of simulation was to find the most mispredictable part of the encoder and check whether new prediction algorithms may help.

Figure 6 provides the percentage of predicted branches. The static prediction algorithm works well, with roughly 10 per cent left for further improvement. This can be explained by the optimization performed by a compiler that generates branch instruction in accordance with the static prediction method.


Figure 6. Simulated Branch Prediction Results

Figure 6. Simulated Branch Prediction Results


Conclusion

Building a real simulator at the preliminary design stage, when most operational characteristics of a new device to simulate are not yet clear, is considered too expensive.

One possible and relatively inexpensive solution to conduct pre-simulation research would be an automatic source code processing technique combined with specific software simulator modules, easily adjustable and extensible to incorporate more parameters and reflect any changes in architectural specifications.

As one of the goals of architectural research is to estimate the effect a particular architectural feature may have on the performance of current software products, the approach to the architectural analysis introduced in the article may appear extremely beneficial  because of both the flexibility of simulated characteristics and the ability to perform the analysis on a per-application basis, thus isolating a program of interest from the rest of the operating environment and excluding unnecessary information.

The results of such a preliminary analysis are valuable for making final design decisions.


About the Authors

Sergey Zheltov is a project manager in Microprocessor Research, Intel Labs. His research interests include media compression and processing, software and platforms architecture, signal processing, high-order spectra. He received a Diploma in radio-physical engineering and MS degree in theoretical and mathematical physics from Nizhny Novgorod State University.






Stanislav Bratanov is a software engineer in Microprocessor Research, Intel Labs. His research interests include multi-processor software platforms, operating system environments, and platform-dependent media data coding. He graduated from Nizhny Novgorod State University, Russia.








References
  • http://www.sics.se/simics/publications.html
  • http://www.virtutech.com/
  • J. Marathe et al. Detailed Cache Coherence Characterization for OpenMP Benchmarks.
    In materials of ICS'04, June 26–July 1, 2004.
  • M. Rosenblum et al. Using the SimOS Machine Simulator to Study Complex Computer Systems.
    ACM Transactions on Modeling and Computer Simulation, Vol. 7, No. 1, January 1997, Pages 78–103.
  • L. DeRose et al. SIGMA: A Simulator Infrastructure to Guide Memory Analysis.