by Paul Klein and Zafer Kadi, Ph.D.
Introduction
SystemC*, a relatively new modeling language, provides an effective environment for simulation and modeling at different levels of abstraction. SystemC has become an integral part of many commercial EDA tools.
SystemC is based on the C++ language and has a number of language constructs that enable designers to model systems at various levels of accuracy. This paper focuses on improving simulation performance utilizing the Intel® compilers and performance tools. The measurements demonstrate the improvement in simulation performance. The analysis has shown that there are two primary areas on which to focus for optimal SystemC performance:
- Use the right methodology for modeling
- Take advantage of the Intel architecture using the Intel compiler.
If a modeler is concerned about simulation performance in EDA environments, the Intel compiler and related performance tools provide an immediate and significant increase in simulation speed.
SystemC has evolved as a promising candidate for system performance modeling. Its ability to model at different levels of abstraction enables the use of modeling and simulation in every step of the design process:
- Capture design intent
- Evaluate performance
- Virtual prototypes for software development
- RTL IP co-simulation
- HW/SW co-simulation and verification.
SystemC provides a number of new constructs and hardware-oriented data types. However, the modeler must strike a balance between performance and modeling abstraction level when determining which of these constructs should be used.
Previous work has shown that simulation performance scales directly (linearly) with clock frequency on Intel® Xeon® processors and Pentium 4 processors. This paper shows that an additional performance advantage can be realized by using Intel compilers.
The Environment and the Testbench
We performed the measurements on an Intel Xeon processor running at 2.6 GHz, with 8 MB of L3 cache and a Linux* 2.4.x kernel. The compilers compared were GNU* (GCC) 2.95.3 and 3.3 and Intel Compiler (ICC) 7.1.015.
The compiler flags were generic optimization flags, to be consistent with many EDA tool development environments. Architecture-specific optimizations would only further improve the ICC results.
Our testbench consisted of the test cases we use to test our SystemC models and to optimize our Transaction Level Modeling (TLM) capability. The test cases are based on a highly efficient, proprietary TLM and on traffic generators that exercise the modeled systems and components. The TLM is written in C/C++ utilizing the Open SystemC Initiative (OSCI) library. The OSCI library was also compiled using ICC.
The models were written using SystemC and are basic HW building components, such as traffic generators, CPU models, bridges, slaves, and memory components.
PERF_TEST1 Test Description
This model contains a single master and single slave component. The master is a random traffic generator, and the slave is a generic memory device. These two components are linked using an Intel-developed TLM. The explicit transactions are delayed and generated randomly, without waiting for the previous transaction to complete.
Figure 1. PERF_TEST1 Block Diagram
PERF_TEST2 Test Description
This model was developed based on the PERF_TEST1 test case and uses two much simpler master and slave components to emphasize the implicit timing capability of our TLM and the SystemC kernel. The implicit transactions are delayed and generated randomly, after the previous transaction is completed.
Figure 2. PERF_TEST2 Block Diagram
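The difference between issuing a transaction without waiting for the previous one (PERF_TEST1) and issuing it only after the previous one has completed (PERF_TEST2) can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the proprietary TLM used in the tests: the `Transaction` record, the `run_master` function, the fixed slave latency, and the 1 to 10 cycle random delay are all hypothetical, and no SystemC scheduler is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical record of one transaction; not part of the paper's TLM.
struct Transaction {
    std::uint64_t issued;     // simulated time at which the master issued it
    std::uint64_t completed;  // simulated time at which the slave finished it
};

// Simulates a master sending `count` transactions to a slave that serves one
// transaction at a time with a fixed latency.
//   wait_for_completion == false: the random delay is counted from the
//     previous issue, so requests can overlap (PERF_TEST1 style).
//   wait_for_completion == true: the random delay is counted from the
//     previous completion (PERF_TEST2 style).
std::vector<Transaction> run_master(bool wait_for_completion, int count,
                                    std::uint64_t slave_latency,
                                    unsigned seed) {
    assert(count >= 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint64_t> delay(1, 10);

    std::vector<Transaction> log;
    std::uint64_t last_issue = 0;
    std::uint64_t last_done = 0;  // also the time at which the slave is free
    for (int i = 0; i < count; ++i) {
        const std::uint64_t base = wait_for_completion ? last_done : last_issue;
        const std::uint64_t issue = base + delay(rng);
        // The slave starts once both the request and the slave are ready.
        const std::uint64_t start = std::max(issue, last_done);
        const std::uint64_t done = start + slave_latency;
        log.push_back({issue, done});
        last_issue = issue;
        last_done = done;
    }
    return log;
}
```

Calling `run_master(false, 100, 20, 42u)` and `run_master(true, 100, 20, 42u)` draws the same delay sequence in both modes. The overlapped run finishes earlier in simulated time because the master's delays are hidden behind the slave's latency.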
PERF_TEST3 Test Description
This model was also developed based on the PERF_TEST1 test case and uses two much simpler master and slave components to emphasize the explicit (more accurate) timing capability of the TLM and the SystemC scheduler.
Figure 3. PERF_TEST3 Block Diagram
PHASE0 Test Description
This model was used to boot Nucleus+* and WinCE* operating systems within a SystemC architecture model. This is a hierarchical model that contains a timed Intel XScale® technology instruction set simulator core, with its own non-SystemC simulation scheduler (built using the Intel compiler). The model also includes a timer, an interrupt controller, a generic memory, and three of our TLM buses.
Figure 4. PHASE0 Block Diagram
PHASE1 Test Description
This model was used to boot Nucleus+ and WinCE operating systems within a SystemC architecture model similar to the PHASE0 test case. This is a hierarchical model that contains the timed Intel XScale® technology instruction set simulator core, a timer, an interrupt controller, a 1x1 bridge (a bridge with one input and one output port), an external memory controller with memory, and four of our TLM buses.
Figure 5. PHASE1 Block Diagram
Simulation Results
The results shown here were obtained with previously written code, developed under GNU or Microsoft Visual Studio*, with no modifications based on the chosen compiler. The first table focuses on the results for the first three performance test scenarios described in “The Environment and the Testbench.”
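Across the three tests, the Intel compiler consistently outperforms both GNU versions. Its gains over the GNU 2.95.3 baseline range from about 17% to about 27%, depending on the test and the optimization level.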
| PERF_TEST1 | |||||
| Tool/Compiler | Average Sim Time (s) | Average Sim CPS (MHz) | Gain | Total TLM Events | Average Sim EPS (MHz) |
| OSCI 2.0.1/GNU 2.95.3 – O2 | 67.4 | 7.418 | 0.00% | 261,538,454 | 3.880 |
| OSCI 2.0.1/GNU 3.3 – O2 | 64.2 | 7.788 | 4.99% | 261,538,454 | 4.074 |
| OSCI 2.0.1/Intel 7.1 – O2 | 57 | 8.772 | 18.25% | 261,538,454 | 4.588 |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations | 55.6 | 8.993 | 21.23% | 261,538,454 | 4.704 |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations with PGO | 54.4 | 9.191 | 23.90% | 261,538,454 | 4.808 |
| PERF_TEST2 | |||||
| OSCI 2.0.1/GNU 2.95.3 – O2 | 78.6 | 6.361 | 0.00% | 374,999,995 | 4.771 |
| OSCI 2.0.1/GNU 3.3 – O2 | 75.2 | 6.649 | 4.53% | 374,999,995 | 4.987 |
| OSCI 2.0.1/Intel 7.1 – O2 | 67 | 7.463 | 17.32% | 374,999,995 | 5.597 |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations | 64.6 | 7.740 | 21.68% | 374,999,995 | 5.805 |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations with PGO | 61.8 | 8.091 | 27.20% | 374,999,995 | 6.068 |
| PERF_TEST3 | |||||
| OSCI 2.0.1/GNU 2.95.3 – O2 | 88.6 | 5.643 | 0.00% | 437,499,995 | 4.938 |
| OSCI 2.0.1/GNU 3.3 – O2 | 86.6 | 5.774 | 2.32% | 437,499,995 | 5.052 |
| OSCI 2.0.1/Intel 7.1 – O2 | 72.8 | 6.868 | 21.71% | 437,499,995 | 6.010 |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations | 70 | 7.143 | 26.58% | 437,499,995 | 6.250 |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations with PGO | 71.2 | 7.022 | 24.44% | 437,499,995 | 6.145 |
Table 1. PERF_TEST1, PERF_TEST2, PERF_TEST3 Results
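The derived columns in Table 1 can be reproduced from the measured simulation time. The sketch below is a reconstruction, not the authors' script. It assumes each PERF_TEST run simulates 500 million cycles (CPS multiplied by simulation time gives that figure in every row) and that Gain is the throughput increase relative to the GNU 2.95.3 row. The function names are hypothetical.

```cpp
#include <cassert>

// Simulated cycles per wall-clock second, in millions ("Sim CPS (MHz)").
double sim_cps_mhz(double total_cycles, double sim_seconds) {
    assert(sim_seconds > 0.0);
    return total_cycles / sim_seconds / 1e6;
}

// TLM events per wall-clock second, in millions ("Sim EPS (MHz)").
double sim_eps_mhz(double total_events, double sim_seconds) {
    assert(sim_seconds > 0.0);
    return total_events / sim_seconds / 1e6;
}

// Throughput gain over the baseline run, in percent. For a fixed amount of
// simulated work this equals baseline_time / time - 1.
double gain_percent(double baseline_seconds, double sim_seconds) {
    assert(sim_seconds > 0.0);
    return (baseline_seconds / sim_seconds - 1.0) * 100.0;
}
```

For the PERF_TEST1 rows, `sim_cps_mhz(500e6, 67.4)` is about 7.418, `sim_eps_mhz(261538454.0, 67.4)` is about 3.880, and `gain_percent(67.4, 57.0)` is about 18.25, matching the table to its printed precision.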
The more complicated models, PHASE0 and PHASE1, spend approximately 50% of the simulation time in a core-model library that was already compiled using ICC. Because that portion is the same in every run, the improvement for the SystemC components alone is more significant (~50%) than the overall gain and than the gains achieved in the PERF_TESTx tests. The results show the overall simulation improvement while booting a WinCE image that has two processes executing after boot. WinCE is also configured with a 1 ms timer interrupt.
| PHASE0 | |||
| Tool/Compiler | Total MCycles | Average Sim Time (seconds) | Gain |
| OSCI 2.0.1/GNU 2.95.3 – O2 | 500 | 172.8 | 0.00% |
| OSCI 2.0.1/GNU 3.3 – O2 | 500 | 161.8 | 6.37% |
| OSCI 2.0.1/Intel 7.1 – O2 | 500 | 148.6 | 14.00% |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations | 500 | 142.2 | 17.71% |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations with PGO | 500 | 150.2 | 13.08% |
Table 2. PHASE0 Results
| PHASE1 | |||
| Tool/Compiler | Total MCycles | Average Sim Time (seconds) | Gain |
| OSCI 2.0.1/GNU 2.95.3 – O2 | 500 | 273.6 | 0.00% |
| OSCI 2.0.1/GNU 3.3 – O2 | 500 | 269.2 | 1.63% |
| OSCI 2.0.1/Intel 7.1 – O2 | 500 | 230.8 | 15.64% |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations | 500 | 223.6 | 18.27% |
| OSCI 2.0.1/Intel 7.1 Advanced Optimizations with PGO | 500 | 218.4 | 20.18% |
Table 3. PHASE1 Results
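The Gain values for the Intel rows in Tables 2 and 3 are consistent with a reduction in simulation time relative to the GNU 2.95.3 baseline, rather than with an increase in throughput. The sketch below reproduces that calculation and adds a back-of-the-envelope step: if some fraction of the baseline time is spent in code that is identical in every run (the text gives roughly 50% for the precompiled core-model library), the reduction in the remaining portion is larger than the overall figure. Both function names are hypothetical, and the second calculation is an assumption-based estimate, not necessarily how the paper's ~50% figure was obtained.

```cpp
#include <cassert>

// Reduction in simulation time relative to the baseline run, in percent.
double time_reduction_percent(double baseline_seconds, double sim_seconds) {
    assert(baseline_seconds > 0.0);
    return (1.0 - sim_seconds / baseline_seconds) * 100.0;
}

// If `fixed_fraction` of the baseline time is unaffected by the compiler
// change, all of the overall reduction comes from the remaining portion, so
// that portion shrinks by overall / (1 - fixed_fraction) percent.
double variable_part_reduction_percent(double overall_reduction_percent,
                                       double fixed_fraction) {
    assert(fixed_fraction >= 0.0 && fixed_fraction < 1.0);
    return overall_reduction_percent / (1.0 - fixed_fraction);
}
```

For example, `time_reduction_percent(273.6, 218.4)` gives about 20.18, the PHASE1 value with PGO. Under the 50% assumption, `variable_part_reduction_percent(20.18, 0.5)` puts the reduction in the SystemC portion at roughly 40%.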
Conclusion
The results show that effective modeling and simulation methodologies are the number one contributor to the performance of any simulation. Simple functions (SC_METHOD) will outperform threads (SC_THREAD) or clocked threads (SC_CTHREAD). The highest priority must be given to the model development methodologies (such as TLM). There is no substitute for well-designed models with attention given to simulation performance.
However, the quickest return on investment to gain speed is to use the highest frequency processor, such as a Pentium 4 processor, with an appropriate compiler that takes advantage of the hardware features. Hardware has been steadily improving, more than doubling in speed every two years. This paper has shown that by simply compiling with Intel compilers, a 25+% performance increase can be achieved without any additional effort.
Many EDA vendors are working on “specialized” or faster solutions for SystemC. Modelers will achieve a competitive advantage by using an Intel compiler and existing model libraries that are compiled with an Intel compiler. Speed of EDA vendor tools is consistently a high priority and many times a clear differentiator when choosing between competing products.
ICC-compiled libraries can be linked by either GNU or Microsoft compilers, so any stable or release libraries should be compiled and released using ICC.
When simulation speed is a modeler’s concern, the improvement gained using the Intel compilers (20-35% in SystemC) should not be ignored.
References
SystemC v2.0.1 Language Reference Manual, version 1.0, Copyright © 2003 Open SystemC Initiative. www.systemc.org*
Empirical Study of SystemC, Ando Ki, R&D Center, Dynalith Systems, April 23, 2003.
System Design with SystemC, Thorsten Grotker, Stan Liao, Grant Martin, and Stuart Swan. Kluwer Academic Publishers, 2002.
About the Authors
Paul Klein is a Sr. Design Engineer with Intel Corporation, Intel Communications Group (ICG), working on next-generation wireless architectures and architecture modeling capabilities. Paul has been with Intel for four years and has developed various EDA tools and capabilities over the past eight years. His current interests include architecture modeling methodologies and power estimation.
Zafer Kadi, Ph.D., is an Engineer with Intel Corporation, Consumer Electronics Group (CEG), working on video decoder architecture. Zafer has been with Intel for four years and has modeled, simulated, and analyzed next generation wireless handset and consumer electronics architectures. His current areas of interest are computer graphics, image and video processing, and wireless technologies.
