Heterogeneous STAC-A2* on the Intel® Xeon® Processor and Intel® Xeon Phi™ Coprocessor

Introduction

STAC-A2 is a set of specifications that describe workloads used in market-risk analysis. The specification of the benchmark is maintained by the Securities Technology Analysis Center (STAC) and is defined by leading financial institutions, academia, and hardware vendors. The models used in STAC-A2 represent realistic market-risk analysis workloads.

STAC-A2 evaluates risk sensitivities, or hedge parameters, called Greeks. They are called Greeks because they are often represented by Greek letters. The Greeks describe the sensitivity of the price of American-style options to changes in parameters of the underlying market, such as changes in the interest rate. For the most common types of options, called vanilla options, the Delta Greek is a measure of the rate of change of an option value with respect to changes in the underlying asset’s price. Other examples include Vega, a measure of sensitivity to volatility; Theta, a measure of sensitivity to the passage of time; and Rho, a measure of sensitivity to the interest rate.
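In standard option-pricing notation (a textbook definition quoted here for reference, not part of the STAC-A2 specification), these first-order Greeks are partial derivatives of the option value V with respect to the market parameters:

\[
\Delta = \frac{\partial V}{\partial S}, \qquad
\mathrm{Vega} = \frac{\partial V}{\partial \sigma}, \qquad
\Theta = \frac{\partial V}{\partial t}, \qquad
\rho = \frac{\partial V}{\partial r},
\]

where S is the underlying asset’s price, σ the volatility, t time, and r the interest rate. In a Monte Carlo setting, such derivatives are typically estimated by finite differences, i.e., by re-pricing the option with bumped inputs.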

There are also whole sets of second- and third-order Greeks that are even more computationally intensive, since they are calculated with second- and third-order derivatives, respectively. The specification requires the computation of a total of seven types of Greeks: Theta, Rho, Delta, Gamma, Cross-Gamma, Model Vega, and Correlation Vega. The price of the American-style option is computed with a Least Squares Monte Carlo algorithm: the pricing model is based on the Heston stochastic volatility model [1], and the payoff calculation is based on the Longstaff-Schwartz method [2]. All computations use double-precision floating-point data.
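For reference, the Heston model [1] evolves the asset price S_t together with a stochastic variance v_t; in its standard form (quoted for context, not from the STAC-A2 specification):

\[
dS_t = r S_t\,dt + \sqrt{v_t}\,S_t\,dW_t^S, \qquad
dv_t = \kappa(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^v, \qquad
dW_t^S\,dW_t^v = \rho\,dt,
\]

where κ is the mean-reversion speed, θ the long-run variance, ξ the volatility of variance, and ρ the correlation of the two Brownian motions. The Monte Carlo loops discussed in the following sections simulate discretized paths of these equations.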

The largest banks run these types of calculations every day on grids of thousands of computers. Traditionally this was done overnight in batch mode, but it is increasingly shifting to intra-day or, where possible, real-time processing, defined as less than one second of processing time.

The Three Levels of Parallelism

In “Three Layer Cake for Shared-Memory Programming” [3], the authors describe common styles of shared-memory programming: SIMD, fork-join, and message passing. They argue that a single model is often not sufficient for an application, so these layers must be created and composed (Figure 1). STAC-A2 is an example of such an application. Moreover, the latest heterogeneous implementation of STAC-A2 extends the model to non-shared-memory programming while maintaining the three levels.


Figure 1: The three-layer cake for shared-memory programming.

The SIMD Layer

One of the major features introduced in the OpenMP* 4.0 specification [4] is the ability to explicitly enable vectorization in a program. The #pragma omp simd directive (Figure 2) instructs the compiler to vectorize a given loop that it would not normally auto-vectorize. The directive is designed to minimize the amount of source code modification required to obtain SIMD code.

#include <cstddef>

void sum(const int* in1, const int* in2, std::size_t size, int* out)
{
    // Explicitly request vectorization of the loop.
    #pragma omp simd
    for (std::size_t i = 0; i < size; ++i) {
        out[i] = in1[i] + in2[i];
    }
}

Figure 2: Operating principle of the OpenMP* 4.0 SIMD directive.

However, a real application of #pragma omp simd may require a loop transformation. For example, in the Monte Carlo price generator used in STAC-A2, the innermost loop has data dependencies, and therefore #pragma omp simd cannot be applied directly. The solution is to perform a loop interchange so that the Monte Carlo path loop becomes the innermost loop (Figure 3).

Original Code

for (unsigned int p = 0; p < nPaths; ++p)
{
    double mV[nTimeSteps];
    double mY[nTimeSteps];
    // ...
    for (unsigned int t = 0; t < nTimeSteps; ++t) {
        double currState = mY[t];
        // ...
        double logSpotPrice = func(currState, /* ... */);
        mY[t + 1] = logSpotPrice * A[t];
        mV[t + 1] = logSpotPrice * B[t] + C[t] * mV[t];
        price[t][p] = logSpotPrice * D[t] + E[t] * mV[t];
    }
}

Figure 3(a): Dependency resolution by loop transformation.

Modified Code

double mV[nTimeSteps][nPaths];
double mY[nTimeSteps][nPaths];
// ...
for (unsigned int t = 0; t < nTimeSteps; ++t) {
    #pragma omp simd
    for (unsigned int p = 0; p < nPaths; ++p)
    {
        double currState = mY[t][p];
        // ...
        double logSpotPrice = func(currState, /* ... */);
        mY[t + 1][p] = logSpotPrice * A[t];
        mV[t + 1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
        price[t][p] = logSpotPrice * D[t] + E[t] * mV[t][p];
    }
}

Figure 3(b): Dependency resolution by loop transformation.

The Fork-Join Layer

Intel® Threading Building Blocks (Intel® TBB) [5] is used to support the fork-join parallelism. STAC-A2 uses the loop-level constructs tbb::parallel_for and tbb::parallel_reduce. Although #pragma omp simd is part of the OpenMP 4.0 specification, it does not require OpenMP threading to be used; the directive coexists with the Intel TBB parallel algorithms. Figure 4 contains the extended version of the vectorized loop from Figure 3, which includes the tbb::parallel_for() construct.

parallel_for(blocked_range<int>(0, nPaths),
    [&](const blocked_range<int>& r) {
        const int block_size = r.size();
        double mV[nTimeSteps][block_size];
        double mY[nTimeSteps][block_size];
        // ...
        for (unsigned int t = 0; t < nTimeSteps; ++t) {
            #pragma omp simd
            for (int p = 0; p < block_size; ++p)
            {
                double currState = mY[t][p];
                // ...
                double logSpotPrice = func(currState, /* ... */);
                mY[t + 1][p] = logSpotPrice * A[t];
                mV[t + 1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
                price[t][r.begin() + p] = logSpotPrice * D[t] + E[t] * mV[t][p];
            }
        }
    });

Figure 4: SIMD-enabled loop enhanced by the parallel_for construct.
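Since the text above also names tbb::parallel_reduce, the following sketch shows how it could be used to average the per-path present values into a single option price. This is an illustration, not code from the STAC-A2 sources; the pathPV array and averagePrice function are hypothetical.

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <functional>

// Hypothetical helper: average per-path present values into one price.
double averagePrice(const double* pathPV, int nPaths)
{
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, nPaths),
        0.0,                                      // identity for the sum
        [=](const tbb::blocked_range<int>& r, double acc) {
            for (int p = r.begin(); p != r.end(); ++p)
                acc += pathPV[p];                 // reduce over this chunk
            return acc;
        },
        std::plus<double>());                     // combine partial sums
    return sum / nPaths;
}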

The Message Passing Layer

The heterogeneous implementation of STAC-A2 uses the Intel TBB flow graph API [6], which expresses dependency and data streaming graphs. An Intel TBB flow graph consists of a graph object, nodes, and edges. The graph object is used as a handle to the graph, the nodes represent the computations, and the edges represent the channels for messages between the nodes.

Figure 5 shows a simple “Hello World” flow graph. Two nodes are created, hello, which prints “Hello”, and world, which prints “ World”, and an edge is created between them. Whenever a message is received by hello, an Intel TBB task is spawned to execute its body. When the body of hello completes, it sends a message to world, triggering world’s execution.


#include "tbb/flow_graph.h"
#include <iostream>

using namespace std;
using namespace tbb::flow;

int main() {
    graph g;
    continue_node<continue_msg> hello(g,
        [](const continue_msg&) { cout << "Hello"; });
    continue_node<continue_msg> world(g,
        [](const continue_msg&) { cout << " World\n"; });
    make_edge(hello, world);
    hello.try_put(continue_msg());
    g.wait_for_all();
}

Figure 5: Intel® Threading Building Blocks flow graph "Hello World" example.

The Intel TBB flow graph spawns Intel TBB tasks to execute the bodies of its nodes. These tasks are scheduled by the same work-stealing scheduler that is used to execute the tasks spawned by the Intel TBB fork-join algorithms.

Extending Intel® Threading Building Blocks Flow Graph for Intra-Node Communications

The Intel TBB flow graph, and Intel TBB in general, was initially targeted at multicore systems with shared memory. For the heterogeneous implementation of STAC-A2, the shared-memory model is extended with support for distributed memory, as required by the Intel® Xeon Phi™ coprocessor. Because Intel® Xeon® processors and Intel Xeon Phi coprocessors can both be independently programmed with standard C/C++, OpenMP directives, and Intel TBB to express all levels of the parallelism hierarchy, it is unnecessary to add a new programming model to the mix. Instead, the Intel TBB flow graph infrastructure is extended to support graph execution on systems with distributed memory.

Even if the same programming models are supported on all devices, there are still two challenges in extending the Intel TBB flow graph for a distributed memory environment:

  • To execute nodes on a remote device, the flow graph must transfer the input and output data of the node. User-defined data types require user-provided functions for serialization (see the sketch after this list).
  • The work (nodes) must be distributed across the devices in a way that achieves good performance. On a heterogeneous system, the work scheduling must take into account that nodes may execute with different performance on different devices. The optimal work distribution depends on the particular application and input data, and a mistake in work distribution may cause significant performance degradation.
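The user-provided serialization functions mentioned in the first bullet might look like the sketch below. The message type, function names, and byte-buffer representation are assumptions made for illustration; the actual registration interface of the extended flow graph is not part of the public Intel TBB API.

#include <cstring>
#include <vector>

// Hypothetical message type sent between host and coprocessor.
struct PricingTask {
    int    greekId;     // which Greek this task contributes to
    int    scenarioId;  // which scenario/bump to apply
    double bumpSize;    // finite-difference shift
};

// Hypothetical serialization pair a user would register; a production
// version would also handle endianness and versioning.
std::vector<char> serialize(const PricingTask& t)
{
    std::vector<char> buf(sizeof(PricingTask));
    std::memcpy(buf.data(), &t, sizeof(PricingTask));
    return buf;
}

PricingTask deserialize(const std::vector<char>& buf)
{
    PricingTask t;
    std::memcpy(&t, buf.data(), sizeof(PricingTask));
    return t;
}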

Inter-Device Work Distribution: distributor_node

To address the first challenge, a new node type called distributor_node was implemented. This node type identifies a sub-graph that can be executed either on an Intel Xeon processor or on an Intel Xeon Phi coprocessor. It gives users the ability to explicitly mark the points where communication across devices is safe and desirable, serves as the place where serialization functions are registered, and acts as the point where user-defined scheduling decisions are applied. The basic idea of the distributor_node is illustrated in Figure 6: a local copy of the flow graph resides on every device available for execution.


Figure 6: An example of distributor_node usage.

The distributor_node is responsible for three main steps: (1) selecting the device the data should go to, (2) serializing/de-serializing the data, and (3) performing the data transfer. The implementation of the distributor_node overlaps computation performed by Intel TBB threads with communication, which is done outside Intel TBB in a dedicated service thread. An Intel TBB async_node [7] allows the flow graph to communicate asynchronously with this communication thread.
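The async_node pattern described above can be sketched as follows using the tbb::flow::async_node interface; submitToCommThread is a placeholder for whatever mechanism hands a message to the dedicated communication thread, and the buffer type is an assumption.

#include <tbb/flow_graph.h>
#include <functional>
#include <vector>

using namespace tbb::flow;

using buffer_t = std::vector<char>;  // serialized message bytes

// Placeholder: hands a buffer to the communication thread, which later
// delivers the reply through the supplied callback.
void submitToCommThread(const buffer_t& msg,
                        std::function<void(const buffer_t&)> onReply);

using async_t = async_node<buffer_t, buffer_t>;

graph g;
async_t transfer(g, unlimited,
    [](const buffer_t& msg, async_t::gateway_type& gw) {
        gw.reserve_wait();                    // keep the graph alive
        auto* gateway = &gw;                  // the gateway outlives this call
        submitToCommThread(msg, [gateway](const buffer_t& reply) {
            gateway->try_put(reply);          // re-enter the flow graph
            gateway->release_wait();
        });
    });
// make_edge(...) calls would connect `transfer` into the rest of the graph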

Work Scheduling: Tokens

To address the second challenge, a technique known as token-based scheduling [8] is used. This technique is commonly used to solve complex load-balancing scenarios, and it has been found suitable for STAC-A2. In the STAC-A2 implementation, each compute device is represented by a number of tokens, determined by the device type. The number of tokens allocated for the main/host processor is based on the number of physical cores. For example, the Intel Xeon processor E5-2697 v3 has 14 cores; therefore, 28 tokens are allocated for a dual-socket system, and 15 tokens are allocated for each of the coprocessors.

Some Greeks require “base prices” to be created in advance and used later by independent payoff tasks. Other Greeks don’t require the pre-generated prices and can be executed immediately.

To minimize the overall execution time, the coprocessors should start executing as soon as possible, so this second group of tasks is immediately sent for execution on the Intel Xeon Phi coprocessors. This is achieved by adding a priority field to the token object and managing the token pool as a priority queue.
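A token pool with such execution preference can be sketched with tbb::concurrent_priority_queue. The device_token_t layout below mirrors the attributes described in the text (device identity plus a priority field) but is an assumption, not the actual STAC-A2 type.

#include <tbb/concurrent_priority_queue.h>

// Hypothetical token: identifies a device and carries a scheduling priority.
struct device_token_t {
    int deviceId;   // 0 = host, 1..N = coprocessors
    int priority;   // higher value = preferred for dispatch
};

// Pop higher-priority tokens (e.g., coprocessor tokens) first.
struct token_compare {
    bool operator()(const device_token_t& a, const device_token_t& b) const {
        return a.priority < b.priority;
    }
};

using token_pool_t =
    tbb::concurrent_priority_queue<device_token_t, token_compare>;

A dispatcher would try_pop a token before sending a task to a device and push the token back once the task completes.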

STAC-A2 Flow Graph Construction

The distributor_node, as depicted in Figure 7, is specialized with a custom scheduling policy that handles the token-based distribution scheme. The new node type offload_node has two input ports, the input data and the token, and two output ports, the output data and the used token. During execution, the offload_node does not consume a token until data are available on the data input port. For this purpose, a join_node with the reserving policy is used. Once both the data and the token are available, a pair is created and transferred to the distributor_node. The distributor_node parses the token attributes, transfers the data to the appropriate compute device, and triggers the execution on that device. If the target device is the host, no data transfer is required and the task is executed immediately.


Figure 7: offload_node topology.
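The reserving join described above can be reproduced with standard Intel TBB flow graph building blocks, as in the following sketch; the payload and token types are illustrative placeholders.

#include <tbb/flow_graph.h>

using namespace tbb::flow;

struct task_data_t    { int greekId; };                 // illustrative payload
struct device_token_t { int deviceId; int priority; };  // illustrative token

int main()
{
    graph g;
    buffer_node<task_data_t>    dataBuffer(g);  // pending Greek tasks
    buffer_node<device_token_t> tokenPool(g);   // available device tokens

    // Reserving join: a token is consumed only when task data is also
    // available, so idle tokens are not drained prematurely.
    join_node<tuple<task_data_t, device_token_t>, reserving> pairUp(g);
    make_edge(dataBuffer, input_port<0>(pairUp));
    make_edge(tokenPool,  input_port<1>(pairUp));

    // Downstream, the (data, token) pair would feed the distributor_node,
    // and the used token would be recycled back into tokenPool.

    dataBuffer.try_put(task_data_t{0});
    tokenPool.try_put(device_token_t{1, 0});
    g.wait_for_all();
}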

The BASELINE test of STAC-A2 computes 7 Greeks for 5 assets. Some Greeks, like Theta and Rho, require a single task; others, like Cross-Gamma and Correlation Vega, require up to 20 tasks. When multiple tasks are required for a Greek, a join_node synchronizes the task outputs. After the results are collected, the Greek value is evaluated and transferred to the Result Collector. Figure 8 depicts the Intel TBB flow graph constructed for the STAC-A2 benchmark.


Figure 8: STAC-A2* flow graph diagram.

By design, each compute device in the STAC-A2 distributed graph holds the full set of tasks. The host determines which device executes a task based on the type of the available token. Tokens are managed by the Token Pool, a priority queue with execution preference for the Intel Xeon Phi coprocessors. The Price node, which computes the base price values, executes only on the host; upon its completion, the dependent tasks are triggered for execution. Tasks that do not depend on the base price values are triggered by the Start node.

Using the proposed distributor_node with the offload feature would not be feasible without the heterogeneous support of the Intel® C++ Composer XE compiler. The compiler supports multi-target compilation and produces binaries for both the host and the target device. Figure 9 shows the code of a single node that is defined for both targets.

#pragma offload_attribute(push, target(mic))
typedef execution_node<tbb::flow::tuple<std::shared_ptr<GreekResults>, device_token_t>,
                       double> execution_node_theta_t;
// ...
void CreateGraph(/* ... */) {
    // ...
    theta_node.reset(std::make_shared<execution_node_theta_t>(
        _g, [arena, pWS, randoms](const std::shared_ptr<GreekResults>&,
                                  const device_token_t& t) -> double {
            double pv = 0.;
            std::shared_ptr<ArrayContainer<double>> unCorrRandomNumbers;
            randoms->try_get(unCorrRandomNumbers);
            const double deltaT = 1.0 / 100.0;
            pv = f_scenario_adj<false>(pWS->r, /* ... */, pWS->A, unCorrRandomNumbers);
            return pv;
        },
        true));
    // ...
}
#pragma offload_attribute(pop)

Figure 9: Heterogeneous node code example.

The functional code is enclosed in #pragma offload_attribute(…) push/pop pairs, which direct the compiler to generate code for both devices.

Summary

The current implementation of STAC-A2 makes it possible to measure the performance scaling of the System Under Test (SUT). The SUT consists of two Intel Xeon processors E5-2697 v3 and two Intel Xeon Phi coprocessors 7120P.


Figure 10: STAC-A2* System Under Test scalability results.

The blue “Single (cold) run” line measures the time it takes to compute the first iteration and therefore includes the application initialization time. The baseline for the report is the execution time of a minimal system consisting of a single Intel Xeon Phi coprocessor.

The orange “Mean of warm runs” line represents the average of the next four iterations and is more suitable for measuring performance scaling. The scaling data show that the heterogeneous STAC-A2 implementation effectively uses additional compute devices. The complete system, measured with two Intel Xeon processors and two Intel Xeon Phi coprocessors and using dynamic load balancing and distributed execution, achieves 4.5x better performance than the minimal configuration.

Intel’s heterogeneous implementation INTC151028 [11] and the recently audited Intel® Xeon Phi™ processor 7250 (code-named Knights Landing) set a record for GREEKS.TIME, which expresses the latency of the risk computation.


Table 1: STAC-A2* audit summary.

Moreover, when measured by the GREEKS.SPACE_EFFICIENCY test, Intel® processor-based platforms have significantly higher space efficiency than competitive platforms (see Table 1).

References

  1. S. L. Heston, "A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options," Review of Financial Studies, vol. 6, no. 2, pp. 327-343, 1993.
  2. F. A. Longstaff and E. S. Schwartz, "Valuing American Options by Simulation: A Simple Least-Squares Approach," Review of Financial Studies, vol. 14, no. 1, pp. 113-147, 2001.
  3. A. D. Robison and R. E. Johnson, "Three Layer Cake for Shared-Memory Programming," in Proceedings of the 2nd Annual Conference on Parallel Programming Patterns (ParaPLoP), Carefree, AZ, 2010.
  4. OpenMP Architecture Review Board, "OpenMP Application Program Interface, Version 4.0," July 2013. [Online]. Available: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
  5. Intel Corporation, "Intel Threading Building Blocks (Intel TBB)," Intel, [Online]. Available: https://www.threadingbuildingblocks.org. [Accessed 30 March 2016].
  6. A. Kukanov, V. Polin and M. J. Voss, "Flow Graphs, Speculative Locks, and Task Arenas in Intel Threading Building Blocks," The Parallel Universe, no. 18, pp. 15-26, 2014.
  7. Intel, "async_node Template Class," [Online]. Available: https://www.threadingbuildingblocks.org/docs/help/reference/appendices/community_preview_features/flow_graph/async_node_cls.htm. [Accessed 26 March 2016].
  8. P. Borovska and M. Lazarova, "Token-Based Adaptive Load Balancing for Dynamically Parallel Computations on Multicomputer Platforms," in CompSysTech, Ruse, Bulgaria, 2007.
  9. https://stacresearch.com/INTC160428
  10. https://stacresearch.com/INTC160314
  11. https://stacresearch.com/INTC151028
  12. https://stacresearch.com/NVDA141116
  13. https://stacresearch.com/IBM150305