Optimizing the High Frequency Trading GatiRT* Application on the latest Intel® Architecture Server

By: Aditi Rathi (Intel) and Shailender Sharma (Gati)

 

ABSTRACT

High frequency trading (HFT) is a form of algorithmic trading in which trades are carried out in microseconds, with low latencies achieved using high-end servers and highly efficient algorithms. In the trading world, fast is never fast enough, since traders profit by getting information (i.e., bids and offers) and placing trades faster than their competitors. GatiRT*, from Gati Technologies, is a set of applications that achieve single-order microsecond latencies by deploying highly efficient algorithms tuned and optimized for the latest Intel® architecture. This paper describes two applications, GatiRT Feed-Handler (GatiRT FH) and GatiRT Order-Book/Super-Book builder (GatiRT OB), that address the most time-consuming portions of the forward trading path.

GatiRT FH and GatiRT OB were specifically designed to be single-threaded and double-threaded, respectively. GatiRT FH is a kernel module, while GatiRT OB is a user-space application that uses only two cores of a dual-socket machine with 12 cores per socket (leaving the rest for trading strategies and order routing modules). With certain system- and OS-level tunings on the latest Intel® Xeon® processor E5-2600 v2, the applications together achieved a 36.5% latency reduction. The optimized version performed 15% faster on the latest Intel® Xeon® processor E5-2600 v2 than on the older Intel® Xeon® processor E5-2600. These performance measurements were made on a test set-up on Intel® Software Development Platforms at Intel’s Performance Lab in Bangalore, India.

Ω Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

INTRODUCTION

Application Architecture

GatiRT Feed-Handler (GatiRT FH) is a kernel module that starts up immediately after the Ethernet driver and performs all feed-specific packet processing and packet normalization into a feed-independent format. The module also performs intelligent filtering, or symbol subscription-based filtering, in which packets are filtered out based on selected symbols and order-ids before complete messages are processed.
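
The subscription-based filtering described above can be illustrated with a minimal sketch. This is not GatiRT code (the real feed handler is a kernel module); the class name, its fields, and the rule that an order-id is remembered once it has been seen with a subscribed symbol are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SubscriptionFilter:
    """Hypothetical filter: keeps messages for subscribed symbols or tracked order-ids."""
    symbols: Set[str] = field(default_factory=set)
    order_ids: Set[int] = field(default_factory=set)

    def accept(self, symbol: Optional[str], order_id: Optional[int]) -> bool:
        # A message naming a subscribed symbol passes, and its order-id is
        # remembered so later messages that carry only the id still pass.
        if symbol is not None and symbol in self.symbols:
            if order_id is not None:
                self.order_ids.add(order_id)
            return True
        # Follow-up messages (for example executions) may reference only an order-id.
        return order_id is not None and order_id in self.order_ids


flt = SubscriptionFilter(symbols={"INTC"})
print(flt.accept("INTC", 1001))  # True: subscribed symbol
print(flt.accept(None, 1001))    # True: follow-up on a tracked order
print(flt.accept("XYZ", 2002))   # False: unsubscribed symbol
```

Rejecting a packet on a set lookup before decoding the full message is what keeps the per-packet cost of unsubscribed traffic low.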

GatiRT Order-Book/Super-Book builder (GatiRT OB) runs in user space. It receives generic feed-independent processed data from GatiRT FH and builds the Order-Book and Super-Book. The Order-Book is a live snapshot of bids and offers that are available and executed for specific symbols, and the Super-Book is a cumulative order book of multiple symbols. Once the Order-Book is available, we can run multiple trading algorithms and execute trades on the exchange based on the profitability of the symbol (stock). GatiRT also provides an API to read the Order-Book via a shared memory interface for easy integration with a customer’s trading strategies and other third-party software modules.
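
As a rough illustration of the data structures involved, the sketch below keeps an aggregated book per symbol and groups several symbols into one container. The class and method names, the integer price representation, and the reading of the Super-Book as a collection of per-symbol books are assumptions; the actual GatiRT layout and its shared-memory API are not described in this paper.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class OrderBook:
    """Hypothetical per-symbol book: aggregated size at each price level."""

    def __init__(self) -> None:
        self.bids: Dict[int, int] = defaultdict(int)    # price -> total size
        self.offers: Dict[int, int] = defaultdict(int)  # price -> total size

    def add(self, side: str, price: int, size: int) -> None:
        levels = self.bids if side == "B" else self.offers
        levels[price] += size

    def reduce(self, side: str, price: int, size: int) -> None:
        # Called on an execution or cancel; empty levels are dropped.
        levels = self.bids if side == "B" else self.offers
        levels[price] -= size
        if levels[price] <= 0:
            del levels[price]

    def best(self) -> Tuple[Optional[int], Optional[int]]:
        best_bid = max(self.bids) if self.bids else None
        best_offer = min(self.offers) if self.offers else None
        return best_bid, best_offer


class SuperBook:
    """Hypothetical container holding one OrderBook per symbol."""

    def __init__(self) -> None:
        self.books: Dict[str, OrderBook] = defaultdict(OrderBook)

    def snapshot(self) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
        return {symbol: book.best() for symbol, book in self.books.items()}


sb = SuperBook()
sb.books["INTC"].add("B", 2500, 100)   # bid at 25.00, prices held in cents
sb.books["INTC"].add("S", 2502, 200)   # offer at 25.02
sb.books["INTC"].reduce("B", 2500, 100)
print(sb.snapshot())  # {'INTC': (None, 2502)}
```

Integer prices avoid floating-point comparisons on the hot path; a production book would use a structure with cheaper best-price lookup than `max`/`min` over a dictionary.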

The following diagram explains how GatiRT applications interconnect with various other software modules to form the overall trading process.

The GatiRT OB application processes a variety of feeds (ITCH, BATS, Direct Edge - DEA and DEX, NYSE*, ARCA, and consolidated feeds such as CQS and UQDF) and creates order books for various symbols in single-digit microsecond latencies.

As the diagram shows, in the forward path of an HFT solution, the main areas of latency are:
(a) Data path from exchange to server
(b) Data feed from network port to application within the server
(c) Application processing

When an order is created, the process executes in reverse, i.e., from the client server back to the exchange where the order is processed. As this is a very competitive market, most companies are co-located within the exchange, so latency (a) is the same for all co-located traders. The application processing latency (c) is also very small on high-speed servers, with some variation depending on the algorithms and calculations the traders decide to use; typically, latency (c) is less than a microsecond. The main area of concern is latency (b). Without feed handlers, feeds can incur double-digit latencies, coupled with jitter, before reaching the application.

GatiRT solves that problem by providing a deterministic single-digit microsecond solution. This reduces latency (b) by a factor of a few thousand and makes it deterministic. Building an order book and super book simplifies and speeds up initial data processing under (c). These characteristics make GatiRT a unique solution with a significant competitive advantage over other contemporary solutions.

The GatiRT OB is composed of the following components:

  • Kernel module: Performs feed analysis for the supported feeds, filters based on symbol subscription or order-ids, and creates a common format for all feeds so that the user-space application can perform generic operations irrespective of the feed.
  • User-space module: Creates the Order Book and Super Book for all feeds on a per-symbol basis. This module also exposes an API that makes the Order Book available to third-party software, giving end users the flexibility to attach a trading strategy. The module also generates various logs with data for network and overall latency analysis.
  • Subscription binary: Used to subscribe to only those symbols that the user needs for trading.

The GatiRT OB application logs extensive filtering statistics for every subscription. This facilitates debugging of missing messages and various other kinds of errors. Feed latencies per message type can be logged into separate feed-specific latency_feed.txt files. The logged statistics can also be displayed via graphical network management system (NMS) tools such as Cacti or Nagios.

Intel® Architecture

The GatiRT OB application was ported and tuned on both 2-socket E5-2600 v2 and E5-2600 platforms. Because the kernel module becomes part of the OS stack, the user-space module was tied to two cores to ensure it occupies only those two. The idea is to keep the standard modules as lightweight as possible and leave as many cores as possible available for other software modules, such as the customer’s strategy engine, risk management, and order routing modules, giving customers maximum flexibility and efficiency on the same machine. This is where the 50% additional cores of the Intel® Xeon® processor E5-2600 v2 are a big benefit. The relevant micro-architectural benefits of the Intel® Xeon® processor E5-2600 v2 over the Intel® Xeon® processor E5-2600 are shown in the diagram below. The E5-2600 v2 comes with 50% more cores, 50% more LLC, and support for faster memory speeds than the previous generation. There are many other micro-architectural benefits, some of which are tabulated below.


Intel® Xeon® Processor E5-2600 v2 (IVB) Features

Microarchitecture: Intel’s industry-leading 22nm 3-D Tri-Gate transistor technology
Energy-efficient Performance: Up to 45% more energy efficient and up to 50% greater performance with 50% extra cores and 50% extra last-level cache (LLC)
Intel® Advanced Vector Extensions (Intel® AVX): CPU instructions for accelerating floating point operations used in a variety of computing applications. Float 16 support added to accelerate data conversion between 16-bit and 32-bit floating point formats.
Intel® Turbo Boost Technology 2.0^: Accelerates processor performance for peak loads; 2.0 comes with a more intelligent algorithm that separates I/O waits
Intel® Hyper-Threading Technology (Intel® HT Technology): Thread-level parallelism benefits multi-threaded and concurrently running applications
PCI Express* (PCIe) 3.0 ports: Extra capacity and flexibility for storage and networking connections. Up to double the I/O bandwidth of the prior-generation PCIe 2.0
Serial ATA 3.0 (SATA 3.0): Provides faster data access, system startups, and application load times. Doubles data throughput of the previous generation for faster hard drive performance
Intel® Direct Data I/O Technology (Intel® DDIO): Improves I/O performance by intelligently directing I/O packets to the processor cache, skipping main system memory
Intel® Integrated I/O: Integrates the I/O controller into the processor. Improves I/O bandwidth through support for PCIe 3.0.

^ Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo

∞ Available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

For more information on the Intel® Xeon® processor E5-2600 v2, see http://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v2-product-family-technical-overview. For more information on various financial applications optimized on various Intel technologies, please visit www.intel.com/financialservices.

See Appendix A for a table comparing system configuration details for the two processors.

Test Cases and Analysis of the Test Results

To test the GatiRT OB application, we created the following setup, where Gati02 is an Intel® Xeon® processor E5-2600 v2-based server running the GatiRT application and Gati01 is the traffic generator system.

All test cases were executed by replaying a packet capture file containing live data from all the exchange feeds using the tcpreplay tool. Identical real-time feeds were sent via the following command for all tests:
tcpreplay --limit=50000000 --multiplier=3 --stat=5 -i eth0 20110811_2_evening.pcap

The output of this command is the data rate and the number of packets sent per second. A snapshot is shown below:

At the end of the input traffic, we run average latency calculation scripts that calculate average latency per feed for all packets logged in latency_.txt files and the overall average latency for all feeds. The following snapshot shows a sample of the average latency calculation.
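
The averaging step can be sketched as follows. The actual scripts and log format are not shown in this paper, so the function names and the in-memory sample data are hypothetical. The sketch assumes the overall figure is the unweighted mean of the per-feed averages, which is consistent with the numbers reported for Test Case 1 below (six per-feed averages whose mean rounds to 3708 nanoseconds).

```python
from statistics import mean
from typing import Dict, Iterable


def per_feed_average(latencies_ns: Dict[str, Iterable[int]]) -> Dict[str, float]:
    """Average latency of the logged packets for each feed."""
    return {feed: mean(list(values)) for feed, values in latencies_ns.items()}


def overall_average(feed_averages: Dict[str, float]) -> float:
    """Unweighted mean of the per-feed averages (assumed definition)."""
    return mean(feed_averages.values())


# Illustrative samples only, not measured data.
samples = {"itch": [4000, 5000], "nyse": [1800, 2000, 2200]}
averages = per_feed_average(samples)
print(averages)                   # {'itch': 4500, 'nyse': 2000}
print(overall_average(averages))  # 3250
```

An unweighted mean treats every feed equally regardless of packet count; weighting by packets would give a different overall figure, so the definition matters when comparing runs.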

We conducted the following five test cases with different system and OS tunings on Intel® Xeon® processor E5-2600 v2 architecture, then calculated the average latency of three consecutive iterations for each one.

Test Case 1. Default Settings on Intel® Software Development Platform (SDP). See Appendix A for configuration details.

BIOS and OS Settings: All default settings used (Intel® Turbo Boost enabled, Intel® Hyper-Threading enabled, C-states enabled, NUMA enabled, balanced power and performance settings, etc.).
Output:
  ArcaPackets: 594957, AvgLatency_arca: 6737 nanoseconds
  BatsPackets: 1563686, AvgLatency_bats: 3575 nanoseconds
  DE_APackets: 777859, AvgLatency_de_a: 2538 nanoseconds
  DE_XPackets: 496947, AvgLatency_de_x: 2727 nanoseconds
  ITCHPackets: 3781083, AvgLatency_itch: 4763 nanoseconds
  NYSEPackets: 2425767, AvgLatency_nyse: 1907 nanoseconds
Average latency of all feeds: 3708 nanoseconds

Test Case 2. BIOS Tuning and Disabling Unneeded Features

BIOS and OS Settings: The BIOS tunings are listed in Appendix B. Intel® Hyper-Threading Technology was disabled, since multiple cores are not required until integrated testing with our customer modules is done.
Output:
  ArcaPackets: 594957, AvgLatency_arca: 5505 nanoseconds
  BatsPackets: 1563686, AvgLatency_bats: 3408 nanoseconds
  DE_APackets: 777859, AvgLatency_de_a: 2427 nanoseconds
  DE_XPackets: 496947, AvgLatency_de_x: 2616 nanoseconds
  ITCHPackets: 3781083, AvgLatency_itch: 4506 nanoseconds
  NYSEPackets: 2425767, AvgLatency_nyse: 1871 nanoseconds
Average latency of all feeds: 3389 nanoseconds

Test Case 3. Disable Intel® Turbo Boost Technology^

BIOS and OS Settings: Since only two cores were in use, enabling Intel® Turbo Boost Technology caused jitter of about 7%. It was therefore tentatively disabled in favor of deterministic latency, even at the cost of higher average latency. It would be worth re-enabling and retesting it once the system is more heavily utilized with additional modules running, since enabling Turbo Boost and affinitizing critical-path processes normally results in better performance.
Output:
  ArcaPackets: 594957, AvgLatency_arca: 5584 nanoseconds
  BatsPackets: 1563686, AvgLatency_bats: 3468 nanoseconds
  DE_APackets: 777859, AvgLatency_de_a: 2427 nanoseconds
  DE_XPackets: 496947, AvgLatency_de_x: 2665 nanoseconds
  ITCHPackets: 3781083, AvgLatency_itch: 4638 nanoseconds
  NYSEPackets: 2425767, AvgLatency_nyse: 1870 nanoseconds
Average latency of all feeds: 3450 nanoseconds

Test Case 4. Linux Power Governor set to “performance” (default being “ondemand”)

BIOS and OS Settings: ACPI idle driver turned on; Linux power governor set to performance mode, which ensures that all CPU cores run at the highest rated frequency.
Output:
  ArcaPackets: 594957, AvgLatency_arca: 3755 nanoseconds
  BatsPackets: 1563686, AvgLatency_bats: 3022 nanoseconds
  DE_APackets: 777859, AvgLatency_de_a: 2184 nanoseconds
  DE_XPackets: 496947, AvgLatency_de_x: 2386 nanoseconds
  ITCHPackets: 3781083, AvgLatency_itch: 3984 nanoseconds
  NYSEPackets: 2425767, AvgLatency_nyse: 1620 nanoseconds
Average latency of all feeds: 2825 nanoseconds
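
One way to apply the governor setting from Test Case 4 is through the Linux cpufreq sysfs interface. The sketch below is an assumption about how this could be scripted, not the procedure used in the tests; the function name and the cpu_root parameter are introduced here for illustration, and writing to the real sysfs tree requires root privileges and an active cpufreq driver.

```python
from pathlib import Path
from typing import List


def set_governor(governor: str = "performance",
                 cpu_root: Path = Path("/sys/devices/system/cpu")) -> List[Path]:
    """Write `governor` to every cpuN/cpufreq/scaling_governor under cpu_root.

    Returns the list of files that were updated.
    """
    updated = []
    for path in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        path.write_text(governor + "\n")
        updated.append(path)
    return updated
```

Calling set_governor() as root updates every core that exposes a cpufreq directory. Passing a different cpu_root makes it possible to try the function against a scratch directory first.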

Test Case 5. IRQ Balance turned off

BIOS and OS Settings: IRQ Balance turned off. Application affinitized to two CPU cores of the same socket using numactl so that referenced memory pages are also affinitized to the same socket. Soft and hard network IRQs affinitized (by setting the appropriate mask in smp_affinity) to all cores on the same socket other than the two to which the application threads were tied. Using the same socket ensures that network data is shared through the last-level cache (LLC).
Output:
  ArcaPackets: 594957, AvgLatency_arca: 3160 nanoseconds
  BatsPackets: 1563688, AvgLatency_bats: 2536 nanoseconds
  DE_APackets: 777860, AvgLatency_de_a: 1788 nanoseconds
  DE_XPackets: 496947, AvgLatency_de_x: 1963 nanoseconds
  ITCHPackets: 3781081, AvgLatency_itch: 3421 nanoseconds
  NYSEPackets: 2444526, AvgLatency_nyse: 1260 nanoseconds
Average latency of all feeds: 2355 nanoseconds
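
The smp_affinity mask used in Test Case 5 is a hexadecimal bitmask in which bit N selects CPU N. The sketch below computes such masks under an assumed layout (socket 0 holds CPUs 0 to 11 and the application threads own CPUs 0 and 1). The real numbering depends on the platform and should be checked, for example with lscpu, before anything is written to /proc/irq/<N>/smp_affinity.

```python
from typing import Iterable


def affinity_mask(cpus: Iterable[int]) -> str:
    """Return the hex bitmask (as written to smp_affinity) for the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


# Assumed layout: socket 0 holds CPUs 0-11 and the application owns CPUs 0 and 1.
socket_cpus = range(12)
app_cpus = {0, 1}
irq_cpus = [c for c in socket_cpus if c not in app_cpus]

print(affinity_mask(app_cpus))  # 3   (CPUs 0 and 1)
print(affinity_mask(irq_cpus))  # ffc (CPUs 2 through 11)
```

Keeping the IRQ cores and the application cores on the same socket but disjoint means interrupt handling does not preempt the application threads while packet data can still be shared through the LLC.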

These test results are represented in a graphical manner below for easier comparison.

Conclusion

Gati Technologies Inc. believes in tightly coupling its software releases to Intel’s ever-evolving technology roadmap. Its application is designed to scale well, so with only system- and OS-level tunings the software can be optimized for every new processor generation. Tests on GatiRT Feed-Handler (GatiRT FH) and GatiRT Order-Book/Super-Book builder (GatiRT OB) showed an approximate 36.5% latency reduction from out-of-box performance to tuned performance on the Intel® Xeon® processor E5-2600 v2.
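
The 36.5% figure can be reproduced from the per-feed averages reported for Test Case 1 (default settings) and Test Case 5 (fully tuned). The sketch below assumes the overall latency is the unweighted mean of the six per-feed averages; the variable and function names are illustrative.

```python
from statistics import mean

# Per-feed average latencies (ns) as reported for Test Case 1 and Test Case 5,
# in the order ARCA, BATS, DE_A, DE_X, ITCH, NYSE.
default_ns = [6737, 3575, 2538, 2727, 4763, 1907]
tuned_ns = [3160, 2536, 1788, 1963, 3421, 1260]


def reduction_percent(before: float, after: float) -> float:
    """Relative latency reduction, in percent."""
    return (before - after) / before * 100.0


result = reduction_percent(mean(default_ns), mean(tuned_ns))
print(f"{result:.1f}%")  # 36.5%
```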

We performed the same tests and tunings on both the Intel® Xeon® processor E5-2600 and the latest E5-2600 v2 architectures for an apples-to-apples comparison. Even with an under-utilized system, the optimized version performed 15% faster on the latest Intel® Xeon® processor E5-2600 v2 than on the older Intel® Xeon® processor E5-2600. On a busy system (with all the other modules of a trading system integrated), the gain is expected to be larger because all resources are better utilized. Other tunings that should help in that case include using the faster memory supported by the E5-2600 v2, and turning on Intel® Hyper-Threading Technology and Intel® Turbo Boost Technology while affinitizing cores and memory pages for the different modules (affinitization reduces jitter).

These performance measurements were made on a test set-up on Intel® Software Development Platforms at Intel’s Performance Lab in Bangalore, India.

About Gati Technologies, Inc.

Gati Technologies is a leading global company providing solutions for High Frequency Trading. The GatiRT™ ultra-low latency feed handler is among the fastest feed handlers in the world. It delivers low single-digit microsecond latency from port to application memory. GatiRT™ is built using patent-pending technology on commercially available Intel®-based servers. GatiRT™ provides not only the fastest quotes but also an unprecedented cost/performance advantage. Gati provides services to its clients, allowing them to quickly integrate GatiRT™ with their trading platforms.

Appendix A

Comparison of the Two Intel® Xeon® Processors and SDPs

System Configurations | Intel® Xeon® processor E5-2600 v2 product family-based servers (codename: Ivy Bridge-EP) | Intel® Xeon® processor E5-2600 product family-based servers (codename: Sandy Bridge-EP)
Platform | 2S IVB Romley-EP | 2S SNB Romley-EP
Processor | Intel® Xeon® processor E5-2697 v2 | Intel® Xeon® processor E5-2680
Chipset | Patsburg A+ | Patsburg C1-T
Motherboard | Canoe Pass (E99552-561,DA0S6CMB8F2 Rev. F) | Canoe Pass (E99552-503,DA0S6CMB8F1 Rev. F)
Stepping | C0 | C1
CPU Frequency | 2.7 GHz | 2.7 GHz
Number of Cores | 12 | 8
Processor TDP | 130W | 130W
BIOS Version | RMLSDP.86I.R2.28.D690 | SE5C600.86B.99.99.x044
Memory | 64GB 1600MHz | 64GB 1600MHz
Operating System | RHEL 6.2, 64 bit | RHEL 6.2, 64 bit

Appendix B

List of BIOS tunings in Test Case 2:

  • NUMA enabled
  • Turbo Boost enabled
  • Hyper Threading disabled
  • EIST (Enhanced Intel® SpeedStep Technology) turned on
  • CPU - C state controls (CPU Pkg C State Limit, Core C1/C3, Core C1/C6) all disabled
  • CPU - C states (ALL) – all disabled
  • Energy Efficient P-states – all disabled
  • Mwait/Monitor - disabled
  • Under Memory Thermal, Memory Power Savings Mode disabled, MDLL OFF disabled, MEMHOT throttling mode disabled, Memory Electrical Throttling disabled
  • Long Term Power Limit Override disabled
  • Advanced PM Tuning set to manual
  • Under CPU – Advanced PM Tuning, Power Performance Tuning disabled, ENERGY_PERF_BIAS_CONFIG mode set to PERF, Power/Performance Switch disabled, PERF_P_LIMIT_EN on E5-2600 and PERF_PLIMIT_DISABLE on E5-2600 v2 disabled
  • CPU Power and Performance Policy on E5-2600 turned to Performance
  • Workload Configuration set to UMA on E5-2600 and Default on E5-2600 v2
  • Active CPU cores set to 12 on E5-2600 v2
  • All virtualization-related features (VT, VT-x, VT-d) and AES-NI support disabled
  • Under QPI Phy & Link Layer, Link L1 Enable enabled, Link Speed Mode set to Fast
  • QPI Frequency Select set to Auto Max on E5-2600
  • Memory Performance set to High Frequency, Memory Voltage set to Auto, Memory Frequency set to 1600 on both E5-2600 and E5-2600 v2; Memory Power Optimization set to Performance Optimized on E5-2600
  • Under memory settings, Attempt Fast Boot disabled, Multi-threaded MRC set to Auto
  • Under ACPI settings, BDAT ACPI Table Support disabled, and Enable Hibernation disabled

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, the Intel® logo, and Xeon® are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

Appendix C

Additional Resources

For more complete information about compiler optimizations, see the Optimization Notice.