Lately I have been working on several custom micro-benchmarks for industrial automation vendors. These benchmarks measure performance characteristics of embedded platforms that matter for achieving hard real-time response in high-frequency (8-32 kHz) motion control applications. Some of the benchmarks and the runtime could be useful to a wider audience, so I am publishing them now.
The RTBench suite contains a collection of micro-benchmarks, scripts that run them and process the results, and a baremetal runtime that is used to run the benchmarks without any disturbance from a host OS. To make running and debugging the benchmarks easier, it is also possible to compile and run them as Linux user-mode applications. The benchmarks are portable and can be compiled with GCC or ICC.
The micro-benchmarks included in the release are the following:
1. PLC compiler backend simulation. Most PLC (IEC 61131-3) compilers do not have an optimizing backend comparable to the state-of-the-art backends of popular mainstream compilers such as gcc, llvm and icc. As a result, the PLC code I see in the field very often contains patterns that stress the CPU front end and I-cache, causing front-end stalls. The micro-benchmark does not do any meaningful work; it simulates these patterns.
2. Motion control. Motion control workloads involve periodic floating-point calculations of a certain type, and some I/O. This code simulates control of 16 axes.
3. Message passing between cores via shared memory. In hard real-time systems, multicore is not yet as prevalent as elsewhere, but it is starting to emerge. Cores usually communicate by message passing over shared memory (e.g. MCAPI). This benchmark measures the speed of message passing and synchronization across multiple cores.
4. Interrupt jitter. If an Intel I210 NIC is installed, it is possible to run this test, which measures the jitter in the time the system takes from raising the interrupt to the start of ISR execution. It does not measure actual latency! Still, jitter is also an important metric. It may vary between platforms, even platforms with the same CPU and chipset. This benchmark is not yet stable and is not included in the current release.
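To make the message-passing item (3) more concrete, here is a minimal sketch of a single-producer, single-consumer ring buffer placed in shared memory. It is not the RTBench code and it is not MCAPI; the names `spsc_ring`, `ring_send`, `ring_recv` and the slot count are made up for illustration, and it assumes a compiler with C11 `<stdatomic.h>` support.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u /* hypothetical size; must be a power of two */
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0, "RING_SLOTS must be a power of two");

/* One-directional queue: exactly one core sends, exactly one core receives. */
typedef struct {
    _Atomic uint32_t head;        /* next slot to write, advanced only by the sender   */
    _Atomic uint32_t tail;        /* next slot to read, advanced only by the receiver  */
    uint64_t slots[RING_SLOTS];   /* message payloads                                  */
} spsc_ring;

static void ring_init(spsc_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Sender side. Returns false if the ring is full. */
static bool ring_send(spsc_ring *r, uint64_t msg) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1u)] = msg;
    /* Release: the payload write becomes visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Receiver side. Returns false if the ring is empty. */
static bool ring_recv(spsc_ring *r, uint64_t *msg) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *msg = r->slots[tail & (RING_SLOTS - 1u)];
    /* Release: the slot is handed back to the sender only after it was read. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

A benchmark built on such a queue would typically have one core spin on `ring_recv` while another timestamps each `ring_send`. A real implementation would also keep `head` and `tail` on separate cache lines to avoid false sharing.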
For all the benchmarks above, the relevant metrics are the average and worst-case execution times in CPU cycles. In a perfect world these two numbers are equal, but when the code runs on a modern out-of-order CPU that features power management, shared caches, hyperthreading, etc., the worst-case number is always higher than the average. Switching off hyperthreading, C-states and all other power management features, controlling all SMI sources, etc. helps tremendously, but still does not eliminate every cause of longer execution times. The size of these outliers is an important metric when selecting a platform capable of high-performance real-time computing.
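As an illustration of how average and worst-case cycle counts can be collected, here is a minimal sketch. It is not taken from RTBench: `read_cycles`, `measure` and `summarize` are hypothetical names, the x86 path uses a plain (non-serializing) `rdtsc`, and on other architectures it falls back to nanoseconds from `timespec_get`, which are not cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Summary of a series of per-iteration timings. */
typedef struct {
    uint64_t min;
    uint64_t max; /* worst case */
    double   avg;
} cycle_stats;

/* Read a monotonically increasing counter. */
static inline uint64_t read_cycles(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback for non-x86 builds: wall-clock nanoseconds, not CPU cycles. */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Run the workload n times and store the duration of each iteration. */
static void measure(void (*workload)(void), uint64_t *samples, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = read_cycles();
        workload();
        samples[i] = read_cycles() - t0;
    }
}

/* Reduce the samples to minimum, worst case and average. */
static cycle_stats summarize(const uint64_t *samples, size_t n) {
    assert(n > 0);
    cycle_stats s = { samples[0], samples[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += (double)samples[i];
    }
    s.avg = sum / (double)n;
    return s;
}
```

The gap between `max` and `avg` is the outlier size discussed above. A single SMI or cache eviction during one iteration shows up in `max` while barely moving `avg`.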
There is always some disturbance from the OS, so to make sure we measure only the platform's performance, a baremetal runtime is used. The runtime is designed to run alongside Linux. On boot, Linux allocates a physical memory hole to be used by baremetal applications. When running the tests for baremetal targets, we offline all cores but one, so Linux does not route MSI-X interrupts to the offlined cores and does not schedule threads on them. There is still some cross-interference via the shared cache, IMC and PCIe, but it can be contained. This architecture is a fair compromise between ease of use and development and the non-disturbing nature of a true baremetal environment. (A real baremetal environment can be used too; see the TARGET=__baremetal_standalone__ build.) I therefore think the runtime might be useful for applications other than benchmarking, e.g. for real-time workload consolidation and for porting microcontroller loops to x86. Cross-talk is a major concern in workload consolidation scenarios, so it was worth adding a special run mode in which its effect on performance can be measured. The benchmark scripts contain a reference to the STREAM benchmark (not included), which can be run on core 0 simultaneously to measure the effect of sharing the last-level cache and memory bandwidth.
Some of the baremetal runtime code is based on the JamesM tutorial, which is BSD-licensed code. (Thanks a lot, James Molloy!) Also many thanks to Patrick Lu, James Coleman and Neil Stroud.
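For readers who want to try the core-offlining step by hand, here is a minimal sketch that uses the standard Linux CPU hotplug files in sysfs. The RTBench scripts may do this differently; the function names are hypothetical, and writing to these files requires root.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of the hotplug control file for one CPU.
 * Returns 0 on success, -1 if the buffer is too small. */
static int cpu_online_path(char *buf, size_t len, unsigned cpu) {
    assert(buf != NULL);
    int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write "1" (online) or "0" (offline) to the control file. Needs root. */
static int set_cpu_online(unsigned cpu, int online) {
    char path[64];
    if (cpu_online_path(path, sizeof path, cpu) != 0)
        return -1;
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        return -1;
    }
    int rc = (fputs(online ? "1" : "0", f) >= 0) ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

/* Offline every CPU in [1, ncpus), leaving CPU 0 to Linux.
 * Returns the number of CPUs that could not be offlined. */
static int offline_all_but_cpu0(unsigned ncpus) {
    int failures = 0;
    for (unsigned cpu = 1; cpu < ncpus; cpu++)
        if (set_cpu_online(cpu, 0) != 0)
            failures++;
    return failures;
}
```

Calling `offline_all_but_cpu0(4)` on a four-core machine would leave only core 0 under Linux control. Writing "1" to the same files brings the cores back.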
The to-do list is the following:
1. Fix the Linux kernel issue with the memory hole pre-allocated on boot. Some newer 3.1+ kernels crash on memory allocation.
2. Add a VxWorks target.
3. Finish the I210 interrupt jitter test.
4. Finalize and freeze the benchmarks, so that results can be compared between generations of embedded platforms.
5. Integrate Intel PCM into the baremetal runtime to make performance tuning easier.
Update: a real baremetal target, bootable from GRUB, was added in 0.32b.