Counting FLOPs on Knights Landing is not directly supported by the hardware because there is no accounting for the values in mask registers when AVX-512 instructions are counted. Certain capabilities of the Intel® Advisor tools can make up for this lack of direct hardware support.
FLOP/s (floating-point operations per second) is a key measure of the efficiency of a workload or of its individual loops and kernels. Measured FLOP/s can be compared against the peak floating-point performance of the target hardware, which is normally also documented as a FLOP/s value.
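The comparison against peak can be sketched in a few lines of C. This is only an illustration of the arithmetic: the function names `flops_per_second` and `fraction_of_peak` are hypothetical, and the operation count, elapsed time, and peak value are inputs the caller must supply (for example, from a profiler report and the hardware documentation).

```c
#include <assert.h>

/* Hypothetical helper: achieved rate in FLOP/s, given a measured
   floating-point operation count and the elapsed time in seconds. */
double flops_per_second(double flop_count, double seconds) {
    assert(seconds > 0.0);
    return flop_count / seconds;
}

/* Hypothetical helper: achieved rate as a fraction of the documented
   peak rate. Both arguments are in FLOP/s. */
double fraction_of_peak(double achieved, double peak) {
    assert(peak > 0.0);
    return achieved / peak;
}
```

For instance, 2.0e9 operations measured over 2.0 seconds give 1.0e9 FLOP/s; against an assumed peak of 4.0e9 FLOP/s, that is a fraction of 0.25.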
While masked computations make a much wider class of codes vectorizable, they also complicate FLOPs accounting (and performance analysis more generally), because traditional ways of measuring FLOPs or efficiency cannot tell how many vector lanes were actually shut off during execution. For example, code with a significant FLOP/s value may perform no computations at all, because an all-0 mask was used.
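The gap between the two accounting styles can be illustrated with a small portable C sketch. It does not use AVX-512 intrinsics and is not how any tool measures FLOPs; it only models the counting, assuming a 512-bit register with 16 single-precision lanes and a 16-bit lane mask. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of active lanes, i.e. set bits in the mask. */
unsigned active_lanes(uint16_t mask) {
    unsigned n = 0;
    while (mask) {
        mask &= (uint16_t)(mask - 1);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Traditional accounting: every lane of the vector instruction counts,
   regardless of the mask. */
unsigned traditional_flops(unsigned lanes, unsigned ops_per_lane) {
    assert(lanes <= 16);  /* assumption: 16 single-precision lanes */
    return lanes * ops_per_lane;
}

/* Mask-aware accounting: only lanes enabled by the mask count. */
unsigned mask_aware_flops(uint16_t mask, unsigned ops_per_lane) {
    return active_lanes(mask) * ops_per_lane;
}
```

With an all-0 mask, the traditional count for one 16-lane operation is 16 while the mask-aware count is 0; with mask `0x00FF`, the mask-aware count is 8.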
Intel Advisor introduces the capability to precisely measure the number of FLOPs and the FLOP/s for the user program as a whole, as well as for individual functions and loops. At the same time, it introduces the Mask Utilization Profiler for AVX-512 (implemented in the same analysis type as FLOPs profiling). Together, the FLOPs and Mask Profilers make it possible to account for both mask-aware FLOP/s, which show the number of floating-point operations effectively executed, and traditional FLOP/s.
The FLOP/s metric tells how many “useful computations” were executed in a given amount of time. It is a good indicator of optimization progress: successful vectorization should normally increase FLOP/s, especially when comparing a vectorized version to a scalar one. On the other hand, low FLOP/s values often indicate significant latencies and overall performance bottlenecks.
FLOP/s analysis provides unique insight into the performance landscape, but it still has various limitations:
- Workloads with low mask register utilization may have an overcounted FLOPs value. Intel Advisor addresses this with the mask-aware FLOP/s metric. Pay attention to loops and functions with a low mask-aware FLOP/s count but a high traditional FLOP/s value; such code is often a good candidate for simplifying control flow (via loop splitting or loop padding) or for low-hanging tuning with compiler hints (#pragma loop_count, __builtin_expect).
- “More operations” does not always mean “more useful work”. For example, to vectorize traditional SIMD reduction loops, compilers often have to add special post-processing, which can be seen as expensive vectorization overhead. Various kinds of vectorization overhead can make the whole vectorization unprofitable, while the FLOP/s of the vectorized loop will still be higher than the FLOP/s of the original scalar loop. That is why, to estimate the profitability and speed-up of highly vectorized codes, Intel Advisor provides a special mask-aware “Efficiency” metric, which characterizes the quality of vector code generation done by the Intel Compiler. To estimate “Efficiency”, Intel Advisor enriches static compiler SIMD cost and benefit estimates with its own dynamic knowledge of trip counts, mask utilization, and memory accesses, providing unique hybrid static+dynamic SIMD code efficiency characteristics (see Figure 10.18). It is highly recommended to analyze both the Efficiency and FLOP/s metrics when estimating optimization progress for highly vectorized codes.
- Workloads with computationally intensive integer data processing will have low FLOP/s, although this naturally does not mean that the code or hardware performs poorly. Typical FLOPs analysis is applicable only to FP-centric HPC codes.
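As an illustration of the loop splitting mentioned in the first limitation above, the sketch below shows a loop with a per-iteration condition and an equivalent split form. The function names and the arithmetic are hypothetical, and whether a given compiler actually uses lane masks for the first form (or splits it on its own) depends on the compiler and its options; the point is only that the split form leaves no condition inside either loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: the condition is evaluated in every iteration. A
   vectorizing compiler may implement it with lane masks, so some lanes
   can be inactive in the iterations around the split point. */
void update_branchy(float *a, size_t n, size_t split) {
    assert(a != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i < split)
            a[i] = a[i] * 2.0f + 1.0f;
        else
            a[i] = a[i] - 1.0f;
    }
}

/* Split form: two loops, each with a uniform, branch-free body. */
void update_split(float *a, size_t n, size_t split) {
    assert(a != NULL);
    if (split > n)
        split = n;  /* clamp so both loops stay in bounds */
    for (size_t i = 0; i < split; i++)
        a[i] = a[i] * 2.0f + 1.0f;
    for (size_t i = split; i < n; i++)
        a[i] = a[i] - 1.0f;
}
```

Both functions produce the same array contents; only the control flow inside the loops differs.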
Because of these and some other limitations, FLOP/s analysis should be considered just one part of the bigger performance characterization picture provided by the various components of Intel Advisor, such as the CPU time profile, Mask-aware FLOPs Report, Vector Efficiency, Memory Access Pattern, and Roofline Analysis Graph. The combination of these components enables a deep understanding of the balance between the memory and compute aspects of user code, and it provides a solid foundation for the informed optimization advice given to end users in the Intel Advisor Recommendations feature.