Vectorization: A Key Tool To Improve Performance On Modern CPUs

By Evgueny Khartchenko

Published: 01/25/2018   Last Updated: 01/25/2018

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

The Rise of Parallelism

For the past decade, Moore’s law has continued to prevail, but while chip makers have continued to pack more transistors into every square inch of silicon, the focus of innovation has moved away from greater clock speeds and towards multicore and manycore architectures.

A great deal of attention has been given to engineering applications that can exploit the growing number of CPU cores by running multi-threaded or grid-distributed calculations. This type of parallelism has become a routine part of designing performance-critical software.

While multicore chip design has given rise to task parallelism in software design, chipmakers have also been increasing the power of a second type of parallelism: data parallelism at the instruction level. Alongside the trend towards higher core counts, the width of SIMD (single instruction, multiple data) registers has been steadily increasing. The software changes required to exploit this second form of parallelism are known as ‘vectorisation’.

The most recent processors have many cores/threads and the ability to apply a single instruction to an increasingly wide set of data elements (SIMD width).

A key driver of these architectural changes was the power/performance trade-off of the alternative approaches:

• Wider SIMD – Linear increase in transistors & power

• Multi core – Quadratic increase in transistors & power

• Higher clock frequency – Cubic increase in power

SIMD provides a way to increase performance using less power.

Software design must adapt to take advantage of these new processor technologies. Multi-threading and vectorisation are each powerful tools on their own, but only by combining them can performance be maximised. Modern software must leverage both Threading and Vectorisation to get the highest performance possible from the latest generation of processors.

Why Vectorise?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (a vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD). For example, a CPU with a 512-bit register can hold sixteen 32-bit single-precision floating-point values and operate on all of them with a single instruction. For that calculation, this is up to 16 times the throughput of processing one value per instruction. Combining this with threading and multicore CPUs can lead to orders-of-magnitude performance gains.
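As a rough illustration of the 512-bit example above, the following C++ sketch writes the same element-wise calculation in a scalar form and in a form blocked by the number of 32-bit lanes in one register. The function names, the constant `kLanes`, and the `a * x + y` calculation are assumptions chosen for illustration and do not come from the article. Whether a compiler actually emits 512-bit SIMD instructions for either loop depends on the optimization flags and the target CPU.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Assumed for illustration: a 512-bit SIMD register.
constexpr std::size_t kRegisterBits = 512;
// Number of float lanes per register: 16 when float is 32 bits wide.
constexpr std::size_t kLanes = kRegisterBits / (CHAR_BIT * sizeof(float));

// Scalar form: conceptually one element per iteration.
void scale_add_scalar(const std::vector<float>& x, const std::vector<float>& y,
                      std::vector<float>& out, float a) {
    assert(x.size() == y.size() && y.size() == out.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
}

// Blocked form: the inner loop has a fixed trip count of kLanes and no
// dependency between iterations, which is the shape a vectorizing compiler
// can map onto one SIMD multiply and one SIMD add per block.
void scale_add_blocked(const std::vector<float>& x, const std::vector<float>& y,
                       std::vector<float>& out, float a) {
    assert(x.size() == y.size() && y.size() == out.size());
    const std::size_t n = x.size();
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[i + lane] = a * x[i + lane] + y[i + lane];
        }
    }
    // Remainder loop for the last n % kLanes elements.
    for (; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```

Both functions compute the same result. The blocked version only makes the vector structure explicit, and in practice an optimizing compiler will often vectorize the plain scalar loop on its own.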

Implementing Vectorization

There is a range of alternatives and tools for implementing vectorisation. They vary in complexity, flexibility and future compatibility. The simplest way to implement vectorisation is to start with Intel’s 6-step process. This process uses Intel tools to provide a clear path for transforming existing code into modern, high-performance software that takes advantage of multicore and manycore processors.

Applying Vectorization to CVA Aggregation

The finance domain provides many good candidates for vectorization. A particularly good example is the aggregation of Credit Value Adjustment (CVA) and other measures of counterparty risk.

The most common general-purpose approach to calculating CVA is based on a Monte Carlo simulation of the distribution of forward values for all derivative trades with a counterparty. The evolution of market prices over a series of forward dates is simulated, then the value of each derivative trade is calculated at each forward date using the simulated market prices. This gives a ‘path’ of projected values over the lifetime of each trade. By running a large number of these randomized simulated ‘paths’, we can estimate the distribution of forward values, giving both the expected and extreme ‘exposures’.

The simulation step results in a 3-dimensional array of exposures (one value per trade, path and date). The task of calculating CVA from these exposures occurs in several steps: netting, collateralisation, integration over paths, and integration over dates.
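The four aggregation steps named above can be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not the method used in the whitepaper: the `CvaInputs` struct, its field names, the single netting set, the fixed collateral amount, and the discretised sum CVA ≈ LGD × Σ DF(t) × EE(t) × PD(t) are all simplifications introduced here for illustration. The path index is stored innermost so that the inner loops run over contiguous memory, which is the access pattern SIMD units handle best.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical inputs for illustration; none of these names come from the article.
struct CvaInputs {
    std::size_t trades, dates, paths;
    // Simulated trade values, flattened as value[(t * dates + d) * paths + p].
    std::vector<double> value;
    std::vector<double> discount;      // discount factor per date
    std::vector<double> default_prob;  // marginal default probability per date
    double collateral;                 // fixed collateral held (simplifying assumption)
    double loss_given_default;
};

double cva(const CvaInputs& in) {
    assert(in.value.size() == in.trades * in.dates * in.paths);
    assert(in.discount.size() == in.dates && in.default_prob.size() == in.dates);
    assert(in.paths > 0);

    std::vector<double> net(in.paths);
    double total = 0.0;
    for (std::size_t d = 0; d < in.dates; ++d) {
        // Netting: sum all trade values on each path (one netting set assumed).
        std::fill(net.begin(), net.end(), 0.0);
        for (std::size_t t = 0; t < in.trades; ++t) {
            const double* v = &in.value[(t * in.dates + d) * in.paths];
            for (std::size_t p = 0; p < in.paths; ++p) {
                net[p] += v[p];  // unit-stride, independent per path
            }
        }
        // Collateralisation and integration over paths: reduce the netted value
        // by the collateral, floor at zero, and average across paths.
        // This is a floating-point reduction; compilers typically vectorize it
        // only when allowed to reorder the additions.
        double sum = 0.0;
        for (std::size_t p = 0; p < in.paths; ++p) {
            sum += std::max(net[p] - in.collateral, 0.0);
        }
        const double expected_exposure = sum / static_cast<double>(in.paths);
        // Integration over dates: weight by discount factor and default probability.
        total += in.discount[d] * in.default_prob[d] * expected_exposure;
    }
    return in.loss_given_default * total;
}
```

With a trade-major or date-major layout where paths are not contiguous, the same arithmetic would require strided loads, so the choice of which index is innermost matters as much as the loop structure itself.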

More Details

Check out this whitepaper (PDF).

See also the complete webinar (on Quantifi's site) and the associated slide deck (PDF).

