Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9
Published: 12/23/2016 Last Updated: 12/23/2016
Download the code:
Download sample code from here.
Introduction
Matrix multiply is commonly used as a benchmark because it is simple, easily parallelized, and useful. This makes it ideal as a showcase for optimization techniques that can be used in many other applications.
The psuedocode for a basic (square) matrix multiply C=A*B can be written as
for j=1 to n
for i=1 to n
for k=1 to n
C(i,j)=C(i,j)+A(i,k)*B(k,j)
The order of operations is flexible and there are many options for concurrency. It can be easily tiled. It vectorizes well. For many processors (including Intel processor graphics execution units) the inner loop can be implemented using fused multiply add. Because this standard algorithm is O(N^{3}) it is easy to make the number of operations require measurable time no matter how much compute resources are available.
These characteristics make matrix multiply an ideal starting point for optimization, to help you use the often untapped potential of Intel(R) Processor Graphics GPUs. The performance boost available can be significant, especially for the largest GPU options. The processor graphics design is scalable, with a slice architecture allowing flexibility to fit a range of price, power, and performance requirements. The Iris™ Pro 580, the largest GPU currently available for the Core line with 72 EUs, illustrates how important the EUs are as a component of overall performance.
The other processor lines (Xeon, Atom, etc.) also have options including processor graphics. Making use of the EUs is an important part of taking advantage of the full capabilities of the processors which have them. We hope that matrix multiply can be a starting point to understanding the level of performance boost possible, and that it gives some hints for how to approach taking advantage of Intel Processor Graphics hardware for your applications.
Optimizations
The code included in this sample shows a naïve, unoptimized version for comparison. It also shows the results of extensive research to find an optimal implementation of this algorithm. As such it is missing many of the intermediate steps developers might proceed through to achieve better performance. These would be excellent topics for future articles. The optimizations included will be only briefly summarized here.
Note: these numbers are not official benchmarks. They are a snapshot summary of the output of running the sample on an Intel® Core™ i7-6770HQ processor with CentOS 7.2.1511 and the SRB4 driver multiplying two square 1024x1024 matrices. The intent is to show the magnitude of relative speedup observed.
Algorithm | GFLOPS | Speedup |
Unoptimized | 45 | |
L3_SIMD_4x8x8 | 692 |
15.5x |
MediaBlockRW_SIMD_2x32 | 825 |
18.5x |
MediaBlockRead_SIMD_1x16_2_fp16 | 1489 |
33.3x |
The main message here is that optimizations can make a huge difference. The next paragraphs summarize some of the optimization approaches used to achieve these speedups.
More operations per work item. Computing a single scalar output per work item can be inefficient in terms of thread overhead. This overhead comes from multiple places, some of them hiding in plain sight. For example, input and output buffer address calculations for each element can be expensive. These "hidden" per-thread costs can be spread out by launching fewer work items. Improved unrolling opportunities with more work per work item is another route to saving work and improving performance.
Switching to SIMD can help, though the reasons why may be unintuitive. The compiler includes a "scalarization" pass before auto-vectorization, so you don't necessarily gain an advantage from using vector types. Gen instructions generated can in many cases be close to identical for scalar and vectorized OpenCL source code for the same algorithm.
One of the main advantages to expanding work items to larger "tiles" is memory I/O. This is especially true for memory bound kernels. However, no matter what the ratio of compute to I/O, memory access strategy is always important for kernel performance. Memory loads/stores are an important exception to the scalarize-then-autovectorize approach used by the compiler. The compiler and driver do a great job with automatically coalescing memory I/O, but there is no substitute for efficiently architecting data flow – which may include a combination of buffers and images.
Of course the push toward larger work items must be balanced. If the number of work items launched is too small it will be harder to keep all EUs busy. More complex kernels may also increase the use of shared resources like registers and local memory, limiting opportunities for concurrency. However, the simplest implementation many kernels start with is often too small.
Half floats. Intel® Processor Graphics Architecture Gen9 adds support for 16 bit (half) floats. These have much lower precision than standard 32 or 64 bit floating point representations. Where this is acceptable, this new data type provides great opportunities for improved floating point performance.
- Reduced data I/O: 2x more data points for the same transfer size vs. 32 bit floats.
- More operations per cycle: both FPUs can compute 8 half float operations per cycle vs. 4 for 32 bit floats.
Since standard C/C++ does not include a 16 bit float type, data must be converted to half format (usually stored in unsigned short integers) before access by the GPU. This sample uses CPU intrinsics for this step. More info can be found for them here: https://software.intel.com/en-us/node/524287.
Where a large range of unique integer values is not needed, similar optimizations are available for all processor graphics architecture generations using signed/unsigned short and char types.
Subgroups. Subgroups enable coordination and data transfers at finer granularity than work groups, often with a closer match to underlying hardware. Both Intel and Khronos offer subgroup extensions with different APIs. Workgroups are composed of one or more subgroups. Work items in a subgroup execute together, while this is not guaranteed for a workgroup. This means work items in the same subgroup can share data without using local memory and work group barriers, and they have more direct access to the hardware. The broadcast functions used in this sample allow efficient data sharing between work items in the subgroup. The subgroup block data I/O shown here represents the close-to-the-HW performance which the compiler and driver otherwise attempt to achieve automatically.
Out of order queue execution: The default in-order execution model means that only one kernel can run at a time, even if EUs are left idle. Each kernel must run to completion before the next one begins. For processors with larger integrated graphics (such as the 72 EU Intel® Core™ i7-6770HQ), the in-order approach becomes more limiting. It works well when a lot of work needs to be done by the same kernel launch, but the amount of work needed to keep the GPU fully busy increases with the capability of the hardware. The out of order model means more opportunities to fully use the hardware when there are more kernels in the queue, since kernels enqueued separately (and working on separate data) can start on available EUs without waiting for each other to finish. Dependencies between kernels can be managed by the application with events.
Example results:
These numbers are not official benchmarks. They are a snapshot from an Intel® Core™ i7-6770HQ processor with CentOS 7.2.1511 and the SRB4 driver intended only to show relative speedup from out of order queues. Units are GFLOPS as reported by the sample output.
Matrix Size: | 64 | 128 | 256 | 512 | 1024 |
---|---|---|---|---|---|
in-order SIMD_4x8x8 | 24.1 | 123.7 | 456.7 | 595.7 | 671.3 |
out-of-order SIMD 4x8x8 | 58.0 | 252.1 | 536.3 | 699.5 | 692.0 |
OOQ Speedup | 2.4X | 2.0X | 1.2X | 1.2X |
The sample queues up thousands of matrix multiply kernels, so out of order mode allows a new kernel to start without waiting for the previous one to complete. Effects are most beneficial for small matrix sizes, but improvements are measurable even when there is much more work to do.
This optimization is easy to implement:
cl_queue_properties properties[]={CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,0};
queue = clCreateCommandQueueWithProperties(context, device, properties, &err);
Running the sample
If run without parameters, the results should look like this.
$ ./gemm
# device name: Intel(R) HD Graphics
# device slm size: 65536
# device max work group size: 256
# Max compute units (GPU): 72
# Max clock freqency (GPU): 950.000000
# Peak float perf (GPU): 1094.400000
# build options: -cl-mad-enable -cl-fast-relaxed-math
# matrix size: 512x512x512
# name time(ms) GFLOPS Efficiency
Unoptimized 5.8 46.1 4.2 %
L3_SIMD_4x8x8 0.5 593.3 54.2 %
MediaBlockRW_SIMD_2x32 0.4 619.7 56.6 %
MediaBlockRead_SIMD_1x16_2_fp16 0.2 1157.5 105.8 %
Results here are for a 512x512 matrix. You can select a specific algorithm as well as a different matrix size from the command line.
Algorithm choices:
- all
- unoptimized
- SIMD_4x8x8
- SIMD_ImagesRW_2x32
- SIMD_Images_1x16_2_fp16
Matrix sizes allowed are powers of 2, >=64: 64, 128, 256, 512, 1024, ... Rectangular sizes can also be selected but each dimension must follow the same rules.
To run all tests with 2 64x64 matrices:
$ ./gemm all 64
To run the half-float kernel only, multiplying 2 1024x1024 matrices.
$ ./gemm SIMD_Images_1x16_2_fp16 1024
To enable out of order queues, uncomment this line at the top of the code and recompile:
//#define USE_OUT_OF_ORDER_QUEUE
More test iterations can lead to more stable times and will ensure that the GPU has reached its highest frequency, but also increase total runtime. The # of runs for each kernel can also be adjusted by changing the define at the top of the code.
Conclusion
This article provided an overview of the optimizations showcased in the attached matrix multiply sample. While matrix multiply is useful in a wide range of applications, the main goal is to provide an example using best known methods for optimization. Many of these optimizations are generally beneficial and we hope they can be applied to your kernels too.
Many thanks to Insoo Woo, Krzysztof Laskowski, Ben Ashbaugh, and Robert Ioffe for code contributions and concepts.
References:
/content/dam/develop/external/us/en/documents/the-compute-architecture-of-intel-processor-graphics-gen9-v1d0-706450.pdf
/content/www/us/en/develop/videos/optimizing-simple-opencl-kernels-sobel-kernel-optimization.html
/content/www/us/en/develop/articles/introduction-to-gen-assembly.html
/content/www/us/en/develop/articles/sgemm-for-intel-processor-graphics.html
/content/www/us/en/develop/articles/box-blur-filter-using-intel-subgroup-extensions-in-opencl.html
http://www.iwocl.org/wp-content/uploads/iwocl-2015-poster-accelerating-sgemm-intel.pdf
http://www.iwocl.org/wp-content/uploads/IWOCL2016-employing-ooqueues.pdf
A previous version of this sample (December 2016) can be found here.
For more complete information about compiler optimizations, see our Optimization Notice.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.