STAC-A2: oneAPI on Dell PowerEdge Servers with Intel GPU Speeds Up Finance Market Risk Analysis

Developer Product Marketing Manager

Intel Corporation

The Strategic Technology Analysis Center (STAC®) performed an independent audit of the STAC-A2™ Benchmark suite on a system featuring Intel® Data Center GPUs and published the results in mid-October (SUT ID: INTC230927). The Intel-authored STAC Pack (the implementation of the benchmark specification) uses SYCL, an open cross-platform abstraction layer, in conjunction with oneAPI, a unified, multiarchitecture programming model and performance library specification.

As the press release highlights, the performance of the STAC-A2™ Benchmark Implementation for oneAPI on the Dell® PowerEdge™ Server sets records. [1]

The record-setting highlights include computation speed on large problem sizes [2] and energy efficiency [3].

Even compared to an SUT running on 8 NVIDIA* H100 GPUs (SUT ID: NVDA230721), this STAC-A2 solution running on only 4 Intel® Data Center GPU Max 1550 GPUs shows a computational speed advantage for large problem sizes [2]. This is paired with higher space efficiency [4], measured as HPORTFOLIO throughput relative to the physical dimensions of the computational setup. You can access the detailed STAC report here.

The test hardware consists of a Dell PowerEdge XE9640 dual-socket system configured with two Intel® Xeon® Platinum 8468 processors running at 2.1 GHz, four Intel® Data Center GPU Max 1550 GPUs, and 32 GiB of memory. The operating system installed on this system is Ubuntu* 22.04.3 LTS.

STAC-A2 Benchmark Implementation for oneAPI (Rev O)

Ubuntu* Linux 22.04.3 LTS

Dell PowerEdge XE9640 with

4 x Intel® Data Center GPU Max 1550
2 x Intel® Xeon® Platinum 8468 processors @ 2.10 GHz
2 x 16 GiB DDR5 DIMMs @ 4800 MT/s (32 GiB total)

STAC-A2 is a compute- and memory-intensive industry benchmark in the field of financial market derivatives risk analysis. STAC provides technology research and testing tools based on community-source standards with input from leading financial sector trading companies, universities, and high-performance computing vendors. The STAC Benchmark Council™ consolidates these inputs and defines the STAC-A2 benchmark specifications.

The oneAPI Specification and implementation projects bring compatibility with SYCL using an open source compiler implementation based on the LLVM* and Clang* projects. This makes it possible to enable multi-vendor targets while taking advantage of the power and performance portability offered by the SYCL programming model.

The oneAPI Specification is now part of the Unified Acceleration (UXL) Foundation hosted by the Linux Foundation’s Joint Development Foundation. The foundation builds on the significant contributions of the oneAPI specification and existing open source projects.

oneAPI spec elements used for developing the optimized STAC-A2™ benchmark implementation on the Dell PowerEdge Server include:

SYCL	oneAPI's core language for programming accelerators and multiprocessors.
oneDPL	A companion library for programming oneAPI devices with APIs from C++ standard library (STL), Parallel STL, and parallel programming extensions.
oneTBB	Library for adding thread-based parallelism to complex applications on multiprocessors.
oneMKL	High-performance math routines for science, engineering, and financial applications.
Level Zero	Low-level direct-to-metal system interfaces for oneAPI languages and libraries.

STAC-A2 Benchmark Fundamentals

STAC-A2 comprises three sets of reference workloads for performance measurement.

Base Components: square roots, exponentials, logarithms, unit-normal random number generation, correlated random number generation, path generation, and payoff computation with early exercise.
Greeks: Monte Carlo estimation of Heston-based Greeks for path-dependent, multi-asset options with early exercise, consisting of Theta, Rho, Delta, Gamma, Cross-Gamma, Model Vega, Correlation Vega, and All Greeks implementations.
HPORTFOLIO: A portfolio of options with homogeneous problem size to measure throughput.
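
To make one of the base components concrete: correlated random number generation is commonly done by multiplying independent unit-normal draws by the Cholesky factor of the target correlation matrix. The following is a minimal pure-Python sketch of that general technique. It is not taken from the STAC Pack, and the function names `cholesky` and `correlated_normals` are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Return the lower-triangular factor L such that L * L^T equals matrix.

    Assumes matrix is symmetric and positive definite (for example,
    a valid correlation matrix).
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def correlated_normals(lower, rng):
    """Draw one vector of correlated unit normals using the factor `lower`."""
    n = len(lower)
    # Independent unit-normal draws.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Multiplying by the Cholesky factor imposes the target correlation.
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
```

For a two-asset correlation of 0.6, `cholesky([[1.0, 0.6], [0.6, 1.0]])` yields the rows `[1.0, 0.0]` and `[0.6, 0.8]` up to floating-point rounding, and repeated calls to `correlated_normals` produce pairs whose sample correlation approaches 0.6.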

This set of reference benchmarks is considered a very good representation of the computational challenges inherent in market risk assessment. The STAC Test Harness* is anchored by the Greeks, a set of partial derivatives used to calculate the sensitivity of the price of financial instruments, such as options, to underlying market changes.

Understanding and predicting options pricing is critical for many aspects of today's financial markets. Many financial instruments like equities, equity indices, commodities, bonds, currencies, convertible debt contracts, and even derivatives like swaps and futures depend on and are influenced by the options market. This makes it critically important for financial markets to be able to model the behavior of options reliably.

“The Greeks” is a set of risk parameters associated with options. A Greek symbol is used to designate each of these risks. There is a long list of risks that are being tracked. The most common ones are:

Theta: time decay, or sensitivity of an option's value to the passage of time.
Rho: sensitivity of an option's value to interest rate changes.
Delta: sensitivity of an option’s value to changes in the underlying asset price.
Gamma: second-order sensitivity of an option's value to the underlying asset price, that is, the rate of change of the option's delta as the underlying asset's price changes.
Vega (nu): sensitivity of an option’s value to underlying asset price volatility.
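
Because each Greek is a partial derivative of the option value, it can be estimated by "bump and revalue": nudge one input, reprice, and take a finite difference. The sketch below illustrates the idea only. As a loudly stated assumption, it uses the closed-form Black-Scholes price of a European call as the pricing function because that makes the results easy to check, whereas STAC-A2 itself uses the Heston model with Monte Carlo. The names `bs_call` and `bump_greeks` and the bump size `h` are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(spot, strike, rate, vol, maturity):
    """Black-Scholes price of a European call (illustration only)."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def bump_greeks(price, spot, strike, rate, vol, maturity, h=1e-3):
    """Estimate Greeks of `price` by central finite differences."""
    ds = h * spot  # relative bump for the spot price
    base = price(spot, strike, rate, vol, maturity)
    up = price(spot + ds, strike, rate, vol, maturity)
    down = price(spot - ds, strike, rate, vol, maturity)
    return {
        "delta": (up - down) / (2.0 * ds),
        "gamma": (up - 2.0 * base + down) / (ds * ds),
        "vega": (price(spot, strike, rate, vol + h, maturity)
                 - price(spot, strike, rate, vol - h, maturity)) / (2.0 * h),
        "rho": (price(spot, strike, rate + h, vol, maturity)
                - price(spot, strike, rate - h, vol, maturity)) / (2.0 * h),
        # Time passing shortens the maturity, hence the reversed sign.
        "theta": (price(spot, strike, rate, vol, maturity - h)
                  - price(spot, strike, rate, vol, maturity + h)) / (2.0 * h),
    }
```

Calling `bump_greeks(bs_call, 100.0, 100.0, 0.05, 0.2, 1.0)` returns a dictionary with one estimate per Greek. With a Monte Carlo pricer in place of `bs_call`, every bump requires a full re-simulation, which is one reason the Greeks workloads are so compute-intensive.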

With STAC-A2, the Heston model is used to calculate the set of Greek derivatives rather than the simplified Black-Scholes model. The Heston model describes the evolution of the volatility of a financial asset over time, assuming that asset valuation and volatility are neither constant nor deterministic but rather stochastic, following a random pattern of change.
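
For reference, the Heston model is usually written as the following pair of stochastic differential equations. This is the standard textbook form for a single asset under the risk-neutral measure, not a formula quoted from the STAC-A2 specification, and the symbols are assumptions introduced here: $S_t$ is the asset price, $v_t$ its variance, $r$ the risk-free rate, $\kappa$ the speed of mean reversion, $\theta$ the long-run variance, $\xi$ the volatility of variance, and $\rho$ the correlation between the two Brownian motions.

```latex
\begin{aligned}
dS_t &= r\,S_t\,dt + \sqrt{v_t}\,S_t\,dW_t^{S},\\
dv_t &= \kappa\,(\theta - v_t)\,dt + \xi\,\sqrt{v_t}\,dW_t^{v},\\
d\langle W^{S}, W^{v}\rangle_t &= \rho\,dt.
\end{aligned}
```

Setting $\xi = 0$ and $v_0 = \theta$ keeps the variance constant, which recovers the Black-Scholes dynamics as a special case.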

Like many models in finance, the Heston model leads to a partial differential equation for the option price that cannot be solved exactly, except in special cases. Using it to price options therefore requires numerical approximation techniques. One of the most popular is the Monte Carlo method, which estimates the result by simulating a large number of random walks driven by the model's input parameters and averaging the outcomes.
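
A minimal sketch of the Monte Carlo approach for the Heston model follows. It prices a plain European call on a single asset with an Euler scheme that floors negative variance at zero. That is a deliberate simplification of the path-dependent, multi-asset, early-exercise options that STAC-A2 specifies, and it is not the STAC Pack's algorithm. The function name `heston_call_mc` and all parameter names (`v0` for initial variance, `kappa` for mean-reversion speed, `theta` for long-run variance, `xi` for volatility of variance, `rho` for correlation) are assumptions chosen for illustration.

```python
import math
import random


def heston_call_mc(spot, strike, rate, maturity, v0, kappa, theta, xi, rho,
                   n_paths=20000, n_steps=50, seed=0):
    """Estimate a European call price under Heston dynamics by Monte Carlo."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    sqrt_dt = math.sqrt(dt)
    rho_bar = math.sqrt(1.0 - rho * rho)
    payoff_sum = 0.0
    for _path in range(n_paths):
        log_s = math.log(spot)
        v = v0
        for _step in range(n_steps):
            # Two unit normals with correlation rho.
            z1 = rng.gauss(0.0, 1.0)
            z2 = rho * z1 + rho_bar * rng.gauss(0.0, 1.0)
            v_pos = max(v, 0.0)  # floor the variance at zero
            vol_step = math.sqrt(v_pos) * sqrt_dt
            # Advance the log-price, then the variance, by one time step.
            log_s += (rate - 0.5 * v_pos) * dt + vol_step * z1
            v += kappa * (theta - v_pos) * dt + xi * vol_step * z2
        payoff_sum += max(math.exp(log_s) - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths
```

Every path is independent of every other path, so the outer loop can be distributed across as many execution units as the hardware offers. In this sketch the paths run sequentially, which is exactly the part a data-parallel implementation would spread across GPU work items.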

This very compute-intensive iterative process makes financial asset risk modeling an ideal candidate for highly parallel programming models.

The STAC-A2 Implementation for oneAPI

The Stack under Test (SUT) uses the STAC-A2 Test Harness in conjunction with Intel’s STAC-A2 implementation for oneAPI (Rev O) targeting CPUs and GPUs. It was designed from the ground up using oneAPI, an open, standards-based programming model.

The base components and Greeks code heavily leverage C++ with SYCL to create flexible GPU-friendly algorithms within a cross-platform abstraction layer framework that can support CPU, GPU, and accelerator architectures, dispatching routines to the compute engine most suited for the task. The implementation and optimization work for parallel execution, GPU device selection, and SYCL queue assignment was guided by the book:

James Reinders et al., Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL, 2nd ed., Springer Apress, 2023, ISBN-13: 978-1484296905

and the oneAPI GPU Optimization Guide.

Optimization

The performance of the Greeks algorithm benefits from the SYCL interface of the oneAPI Math Kernel Library (oneMKL). A variety of readily available optimized routines for random number generation (RNG) and transcendental functions are employed.
oneAPI Threading Building Blocks (oneTBB) is used to structure the parallel execution of the C++ code base and apply graph parallelism using Flow Graphs.
The oneAPI DPC++ Library (oneDPL) provides the parallel C++ STL extensions and the std::for_each-based range iterator implementation.
The HPORTFOLIO throughput benchmark component of STAC-A2 uses the Intel® MPI Library to control node dispatch latency and run 8 options simulations in parallel using a single MPI run per GPU tile without decomposition.
The Profiling Tools Interfaces for GPU (PTI for GPU) open-source project and its onetrace Tracing and Profiling Tool were used to find and address imbalances in discrete GPU utilization.
Finally, the Intel® VTune™ Profiler helped to identify performance bottlenecks and optimize for reduced GPU offload memory latency and the best work item batch size. The data views used were GPU Offload, Memory Latency, and Stall Reasons Analysis.

Compilation

The SUT was compiled using the Intel® oneAPI DPC++/C++ Compiler, a oneAPI-compliant C/C++ compiler with full SYCL support based on Clang* and LLVM*. LLVM’s modular and open architecture makes it an ideal vehicle for Intel to contribute our code optimization insights to the open-source community while also being part of a thriving ecosystem of software developers. It also allows the compiler to embrace oneAPI’s vision: a unified, standards-based, open programming model that achieves productivity and performance portability through a common open source approach to multiarchitecture software development.

Execution

The Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL* Driver handles optimized parallel execution and accelerator dispatch onto the four available high-density discrete Intel® Data Center GPU Max 1550 devices.

An Open Vision for Accelerated Compute

The results of the STAC-A2 derivatives risk SUT performance measurements speak not only for the capabilities of the Intel® Data Center GPU Max Series but also for the potential of the Unified Acceleration (UXL) Foundation’s vision and for the richness of the oneAPI Specification and its Intel reference implementation in delivering high-performance code for parallel accelerated computing.

Become part of the effort to make high-performance cross-architecture compute transparent and flexible. Include SYCL as the accelerator and GPU offload solution in your code path. Adopt oneAPI as the means to productivity and maintainability in software development.

The Intel® DPC++ Compatibility Tool and the CUDA to C++ with SYCL Migration Portal are convenient starting points for your own migration to SYCL.

Download the Intel® oneAPI Base Toolkit and Intel® HPC Toolkit today.

[1] Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

[2] STAC-A2.β2.GREEKS.10-100k-1260.TIME.WARM and STAC-A2.β2.GREEKS.10-100k-1260.TIME.COLD

[3] STAC-A2.β2.HPORTFOLIO.ENERG_EFF

[4] STAC-A2.β2.HPORTFOLIO.SPACE_EFF

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Results may vary.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

“STAC” and all STAC names are trademarks or registered trademarks of the Strategic Technology Analysis Center.

*Other names and brands may be claimed as the property of others.
