Matrix Vector Multiplication and Multi-threading Benefits

by Kiefer Kuah


Dual-core and Quad-core processors are fast becoming the staple in desktop and mobile computing. In this article, performance data comparing single-threaded and multi-threaded matrix-vector multiplication are presented. In addition to multi-threading the routine, it was also optimized using Single Instruction Multiple Data (SIMD) instructions. The different implementations were tested on an Intel® Core™ 2 Extreme quad-core processor, QX6700. Measurable speedups were observed; the highest speedup was 20.8x. Next, the impact of increasing the data set to sizes beyond the last level cache was also characterized.


The multiplication of a matrix and a vector is a common operation in applications such as in the skinning and physics code of 3D graphics games. We investigated a few ways to write the code for this operation and assess the performance of each version on a 2.66GHz Intel® Core™ 2 Extreme quad-core processors.

Figure 1. Various ways of implementing this matrix-vector multiplication were investigated

| m00 m01 m02 m03 |             | v0 |

| m10 m11 m12 m13 |      *      | v1 |

| m20 m21 m22 m23 |             | v2 |

| m30 m31 m32 m33 |             | v3 |


Six different versions of the code were written. The first version was written using C++ code. It involved two nested loops, iterating through each element of the data sets. This version would be used as the reference by which the performance of other versions would be measured. This version was then multithreaded using Windows* threading functions.

The SIMD versions were written in assembly and they operated on 4 floating point data in each loop. One version assumed that the data was in the array of structure (AOS) format while the other assumed the data was in the structure of array (SOA) format. Figure 2 shows an example of the structure of array construct. Two other versions were subsequently derived from these SIMD versions by converting them to multi-threaded code.

Figure 2. An example of the 'structure of array' format. An array of this structure was declared to obtain an array of 'structure of array' to store all the vectors.

struct vectorsoa


float w[4];

float x[4];

float y[4];

float z[4];




The time, in milliseconds, to compute the multiplication using the different versions was recorded in Table 1. The test was done using 40,000 randomly generated vectors and matrices. This number was chosen to ensure that the data set would fit into the 2 x 4MB L2 cache. This gives a better assessment of the different code without having to account for the impact of cache misses.

Using the data for the C++ version as the baseline, the speedup for the 4-thread SOA SIMD version was the highest at 20.8x, as shown in Table 1. This was a significant gain. If the data was not arranged in the SOA format, the gain from using SIMD instructions and threading was 13.56x. Speedups from multi-threading alone scaled almost linearly with the number of threads. The benefit of using SIMD instructions was also evident in the data. Comparing the C++ version and the SIMD version, the speedup was 3.76x.

Table 1. The time it took to run the different versions for 100 iterations on an Intel® Core™ 2 Extreme quad-core processor was recorded. The highest speedup was 20.8x. The fourth column shows only the threading speedups, excluding speedups from the other optimizations. Using SIMD instructions improved performance by 3.76x over the C++ code.

Elapsed time, msSpeedup from single threaded C++ codeSpeedup from threading
C++, 1 thread132.081.001.00
C++, 2 threads68.281.931.93
C++, 4 threads34.583.823.82
SIMD, 1 thread35.143.761.00
SIMD, 2 threads17.907.381.96
SIMD, 4 threads9.7413.563.61
SOA SIMD, 1 thread20.746.371.00
SOA SIMD, 2 threads9.7613.532.13
SOA SIMD, 4 threads6.3520.803.27


Next, the impact of the size of the data set was investigated. As the number of vectors and matrices was increased to more than what can fit in the 2 x 4MB L2 cache, a significant impact on the speedups was observed, as shown in Figure 3. For the single threaded versions, the impact of the data size increase was observed sooner. Their speedups began to diminish when the data size was 5.49MB. Since they were single threaded, they ran only one core, thus using only half of the total L2 cache that is shared with another core.

Figure 3. As the data size was increased beyond the 2 x 4MB L2 cache, the magnitude of the speedup began to diminish.


It was shown that multi-threading and the use of SIMD instructions with the right data structure could yield speed ups up to 20.8x on the Intel® Core™ 2 Extreme quad-core processor, QX6700. Single threaded functions will utilize only one of the four cores. In order to reap the full benefits of multi-core processors, software applications need to be multi-threaded and should be designed with parallelism in mind. The Intel® Core™ 2 Extreme quad-core processors offer software applications the computational parallelism and muscle that can be used to speed up applications and add new features.

Related Resources


Matrix-Vector Multiplication Code Sample (ZIP file: 2,585 bytes)


For more complete information about compiler optimizations, see our Optimization Notice.

1 comment

todd-bezenek's picture

Thank you for this very useful article!


Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.