Vectorization and Threading Are Crucial to Performance
On modern processors, software must be both vectorized, using single instruction, multiple data (SIMD) instructions such as Intel® Advanced Vector Extensions (Intel® AVX), and threaded to realize the full performance potential of the processor. In some cases, code that is both vectorized and threaded can be up to 130 times faster than code that is neither, and much faster than code that is only threaded or only vectorized. That gap is growing with every new processor generation.
Benchmark Source: Intel Corporation. See Configurations. See notes and disclaimers below.¹
Threaded plus vectorized can be much faster than either one alone. The gap is growing with each new hardware generation. For more information, see Details.
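As a minimal sketch of what "threaded plus vectorized" can look like in source code, the loop below uses the OpenMP combined construct `parallel for simd`. The function name `saxpy_threaded_vectorized` and the choice of OpenMP are assumptions made for illustration; the text above does not prescribe a particular programming model, and the speedups it cites come from Intel's benchmarks, not from this snippet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * "parallel for" divides the iterations among threads, and "simd" asks the
 * compiler to vectorize the iterations assigned to each thread.
 * If OpenMP is not enabled at compile time, the pragma is ignored and the
 * loop runs serially with the same result. */
void saxpy_threaded_vectorized(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With GCC or Clang, OpenMP is typically enabled with `-fopenmp`; Intel compilers use `-qopenmp`. Whether the loop is actually vectorized, and with which instruction set, depends on the compiler and target flags, which is the kind of question the compiler reports and Intel Advisor are meant to answer.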
Intel Advisor gives you data to forecast the performance gain before you invest significant effort in implementation. Implement only the options that have a high return on investment.
Data-Driven Vectorization Optimization and Threading Design
You need good data to make good design decisions. Which loops should be threaded and vectorized first? Is the performance gain worth the effort? Will the threading performance scale to larger core counts? Does this loop have a dependency that prevents vectorization? What are the trip counts and memory access patterns? Have you vectorized efficiently with the latest Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions? Or are you using older SIMD instructions?
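To illustrate the question about dependencies, the sketch below contrasts a loop whose iterations are independent with one that carries a dependency from each iteration to the next. Both function names are hypothetical and the example is a generic illustration, not output from Intel Advisor.

```c
#include <assert.h>

/* Independent iterations: out[i] depends only on in[i], so iterations can
 * run in any order, in SIMD lanes, or on different threads. The restrict
 * qualifiers promise the compiler that the two arrays do not overlap;
 * without that promise it may have to assume a dependency. */
void scale_independent(int n, const float *restrict in,
                       float *restrict out, float k)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        out[i] = k * in[i];
    }
}

/* Loop-carried dependency: a[i] reads a[i - 1], which the previous
 * iteration has just written. The iterations must execute in order, so
 * the loop cannot simply be vectorized or threaded as written. */
void running_sum_dependent(int n, float *a)
{
    assert(n >= 0);
    for (int i = 1; i < n; ++i) {
        a[i] = a[i] + a[i - 1];
    }
}
```

A dependency of the second kind usually calls for restructuring the algorithm rather than adding a pragma, which is why it helps to know about it before investing effort.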
Vectorization Optimization: Guidance to Speed Up Your Application
Quickly find what’s blocking vectorization in the locations that matter the most. Intel Advisor sorts your loops by potential gain, makes compiler reports easier to read by showing their messages alongside your source code, and gives you tips for effective vectorization. It provides key data such as trip counts, data dependencies, and memory access patterns to let you vectorize safely and efficiently.
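One reason memory access patterns matter can be sketched with two data layouts. In the array-of-structures version the loop touches every third float (a non-unit stride), while in the structure-of-arrays version the same values are contiguous (unit stride), which is generally easier to vectorize efficiently. The type names, function names, and the trip count `N_POINTS` are all assumptions chosen for illustration.

```c
#include <assert.h>

#define N_POINTS 8  /* hypothetical trip count, kept small for illustration */

/* Array of structures: consecutive x values are three floats apart. */
struct point_aos { float x, y, z; };

float sum_x_aos(const struct point_aos *p, int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += p[i].x;          /* strided access: stride of 3 floats */
    }
    return sum;
}

/* Structure of arrays: all x values sit next to each other in memory. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

float sum_x_soa(const struct points_soa *p, int n)
{
    assert(n >= 0 && n <= N_POINTS);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += p->x[i];         /* unit-stride access */
    }
    return sum;
}
```

Both functions compute the same sum; only the layout, and therefore the access pattern seen by the hardware, differs.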
Find effective optimization strategies using Intel Advisor’s cache-aware roofline analysis. It visualizes actual performance against hardware-imposed performance ceilings (rooflines), such as memory bandwidth and compute capacity. If the application does not make optimal use of the available memory and compute resources, roofline analysis identifies the bottlenecks that limit performance and the loops that will benefit the most from optimization.
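The idea of a roofline can be sketched with the basic roofline formula: attainable performance is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity (floating-point operations per byte moved). The function below is a simplified illustration of that formula only; the function name and parameters are assumptions, and Intel Advisor's cache-aware analysis measures real ceilings for the machine rather than taking them as inputs like this.

```c
#include <assert.h>

/* Hypothetical helper: attainable performance in GFLOP/s under the basic
 * roofline model, min(peak compute, bandwidth * arithmetic intensity).
 *   peak_gflops     - compute ceiling in GFLOP/s
 *   bandwidth_gbs   - memory bandwidth ceiling in GB/s
 *   flops_per_byte  - arithmetic intensity of the loop in FLOP/byte */
double roofline_attainable_gflops(double peak_gflops,
                                  double bandwidth_gbs,
                                  double flops_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flops_per_byte >= 0.0);

    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For example, with illustrative ceilings of 1000 GFLOP/s and 100 GB/s, a loop with an intensity of 0.25 FLOP/byte is limited by memory to 25 GFLOP/s, while a loop with an intensity of 20 FLOP/byte is limited by compute at 1000 GFLOP/s. A loop measured well below its ceiling is a candidate for optimization.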