Floating point operations (FLOP) rate is used widely by the High Performance Computing (HPC) community as a metric for analysis and/or benchmarking purposes. Many HPC nominations (e.g., Gordon Bell) require the FLOP rate be specified for their application submissions.
The methodology described here DOES NOT rely on Performance Monitoring Unit (PMU) events/counters. It is an alternative, software-based methodology that evaluates the FLOP count using the Intel® Software Development Emulator (Intel® SDE).
- We split the FLOP (Floating point Operations) count into two categories:
- Unmasked FLOP: For Intel® Architectures that do not support masking feature
- Unmasked + Masked FLOP: For Intel® Architectures that do support masking feature
- Examples of Intel® Architectures that do not support the masking feature:
  - 2nd gen Intel® Core™ processor family
  - 3rd gen Intel® Core™ processor family
  - 4th gen Intel® Core™ processor family
- Examples of Intel® Architectures that do support the masking feature:
  - Intel® Xeon Phi™ coprocessor
- There is some debate on what is considered to be a floating point instruction/operation.
- Provided below is the list of general floating point instructions used in this method: ADD, SUB, MUL, DIV, SQRT, RCP, FMA, FMS, DPP, MAX, MIN (each has many flavors)
- The high level idea is:
- Decode every floating point instruction to identify the following:
- Vector (packed) vs. Scalar
- Data Type (Single Precision vs. Double Precision)
- Register Type Used (xmm – 128 bits, ymm – 256 bits, zmm – 512 bits)
- Masking – masked vs. unmasked instruction
- Use the above information with its “dynamic execution” count to evaluate the FLOP count for that instruction.
Example: vfmadd231pd zmm0, zmm30, zmm1 executed 500 times
- p – packed instruction (vector), without any mask
- d – double precision data type (64 bit)
- zmm – operating on 512 bit registers
- fma – fused multiply and add (2 floating point operations)
- The FLOP count for the above instruction = 8 (elements) * 2 (FMA) * 500 (execution count) = 8000 FLOP.
- You do not need to parse/decode all of the above for every floating point instruction to evaluate the FLOP count for your application.
- Intel SDE’s instruction mix histogram and dynamic mask profile provide a set of pre-evaluated counters (using the methodology described above + more) that can be used to evaluate the FLOP count on your application.
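Still, for illustration, the per-instruction arithmetic described above can be sketched in a few lines of Python. The function name and the attribute encoding are hypothetical, not part of Intel SDE:

```python
# Hypothetical sketch of the per-instruction FLOP formula described above.
# The attribute encoding (register bits, element bits) is illustrative.

def flops_for_instruction(register_bits, element_bits, is_fma, exec_count):
    """FLOP = elements * (2 if FMA else 1) * dynamic execution count."""
    elements = register_bits // element_bits     # e.g. 512 // 64 = 8 for zmm/double
    ops_per_element = 2 if is_fma else 1         # FMA = multiply + add
    return elements * ops_per_element * exec_count

# vfmadd231pd zmm0, zmm30, zmm1 executed 500 times:
print(flops_for_instruction(512, 64, True, 500))  # -> 8000
```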
The next section describes the details on this.
Instructions to Count Unmasked FLOP
- This is applicable for all Intel architectures (Sandy Bridge, Ivy Bridge, Haswell, Knights Landing, etc.)
- Obtain the latest version of Intel SDE here.
- Generate the instruction mix histogram for your application using Intel SDE as follows:
- sde -<arch> -iform 1 -omix myapp_mix.out -top_blocks 5000 -- ./myapp.exe
- <arch> is the architecture that you want to run on (e.g., ivb, hsw, knl).
- Compile the binary correctly for the architecture you are running on.
- Supports multi-threaded runs
sde -knl -iform 1 -omix myapp_knl_mix.out -top_blocks 5000 -- ./myapp.knl.exe
- In the instruction mix output (e.g., myapp_mix.out), under the “EMIT_GLOBAL_DYNAMIC_STATS” section, check for the following pre-evaluated counters:
- *elements_fp_(single/double)_(1/2/4/8/16)[_masked]
- The different counters mean the following:
- elements_fp_single_1 – single precision floating point instructions with one element (likely scalar) and no mask
- elements_fp_double_4 – double precision floating point instructions with four elements and no mask (ymm)
- elements_fp_double_8 – double precision floating point instructions with eight elements and no mask (zmm)
- elements_fp_single_16_masked – similar to the above but with masks (Note: the mask counts appear only on architectures + ISA that support masking)
- The above by itself is not sufficient since the Fused Multiply and Add instruction (FMA) is counted as 1 FLOP by the above counters.
- “EMIT_GLOBAL_DYNAMIC_STATS” section also prints dynamic counts of every type/flavor of FMA executed in your application. Look for the following:
- scalar, double precision, on xmm (128 bit) = 1 element
- packed, double precision, on ymm (256 bit) = 4 elements
- packed, single precision, on zmm (512 bit) = 16 elements
- Other flavors of FMA, like VFNMSUB132PD_YMMqq_YMMqq_MEMqq, VFNMADD231SD_XMMdq_XMMq_XMM, etc., may also be present.
- Counting FLOPs
- For each data type (single/double), take the “dynamic” instruction count corresponding to each of the above counters and multiply it by the number of elements (1/2/4/8/16) to get the FLOP count.
Intel SDE (Haswell) Instruction Mix output (snapshot) of a Molecular Dynamics code from Sandia Mantevo Suite (look for the below section under EMIT_GLOBAL_DYNAMIC_STATS).
Unmasked FLOP (Double Precision) =
(23513724690 * 1 + 274320019 * 2 + 37317021308 * 4) = ~173.3304 GFLOP
- The above by itself is not sufficient since the Fused Multiply and Add instruction (FMA) is counted as 1 FLOP (see “Step 2” on how to take that into account).
- For Intel® AVX-512 (KNL) instruction mix output you may see “*elements_fp*_masked” counters as well. Counting masked FLOP is covered in the next section.
- Also, the “masked” counters above do not specify the actual mask values, so they cannot be taken into account here.
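The Step 1 tally above can be scripted. Below is a hedged sketch in Python that scans mix-output lines for the “*elements_fp_*” counters and sums elements * dynamic count per precision. The counter names follow the examples in this document, but the exact file layout may differ between SDE versions:

```python
import re

# Hedged sketch of a Step 1 parser for the "EMIT_GLOBAL_DYNAMIC_STATS"
# counters. The "_masked" variants deliberately do not match, since masked
# FLOP counting is handled separately.

def unmasked_flop(lines):
    # Matches lines like: "*elements_fp_double_4   37317021308"
    pattern = re.compile(r"\*elements_fp_(single|double)_(\d+)\s+(\d+)")
    totals = {"single": 0, "double": 0}
    for line in lines:
        m = pattern.search(line)
        if m:
            precision = m.group(1)
            elements = int(m.group(2))
            count = int(m.group(3))
            totals[precision] += elements * count
    return totals
```

Feeding it the three counters from the Mantevo snapshot above reproduces the ~173.33 GFLOP double precision total.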
- Step 2
- Taking into account FMA and its flavors
- For each FMA flavor, use the data type (single vs. double), packed vs. scalar, and register type as described above, together with the “dynamic” instruction count for that FMA, to compute the corresponding FLOP and add it “just one more time” to the FLOP count computed in Step 1.
Intel SDE (Haswell) Instruction Mix output (snapshot) of a Molecular Dynamics code from Sandia Mantevo Suite (look for the VFM* section under EMIT_GLOBAL_DYNAMIC_STATS).
Unmasked FMA FLOP (Double Precision) =
(1728000 * 2 + 47496488 * 4 + 825422220 * 1 + 5733116808 * 1 + 432000 * 2 + 3189961568 * 4 + 4 * 1 + 475482133 * 1 + 1594141168 * 4 + 47064488 * 4 + 1656723 * 1) = ~26.5546 GFLOP
- The multiplier used above (1, 2, 4, …) is based on the type of FMA instruction (PD – packed double on XMM/YMM; SD – scalar double on XMM; …).
- For Intel AVX-512 (KNL/SKL) instruction mix output all FMA (and its flavors) instructions (masked or full vectors) will be marked as masked (e.g., VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
- The next section “Instructions to Count Masked FLOP” will cover that.
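As a sketch of Step 2, the element multiplier can be inferred from the FMA mnemonic itself (register width and precision). The regular expression below is illustrative and covers only mnemonics shaped like the examples in this document:

```python
import re

# Illustrative Step 2 helper: derive the "extra" FLOP for one FMA flavor
# from its mnemonic (e.g. VFNMSUB132PD_YMMqq_YMMqq_MEMqq) and its dynamic
# execution count. The mnemonic format mirrors the examples in this
# document and is not guaranteed to cover every SDE version.

REG_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}
ELEM_BITS = {"S": 32, "D": 64}   # single vs. double precision

def fma_extra_flop(mnemonic, exec_count):
    m = re.match(
        r"VF[N]?M(?:ADD|SUB|ADDSUB|SUBADD)\d{3}([PS])([SD])_([XYZ]MM)",
        mnemonic)
    if not m:
        return 0                               # not an FMA flavor
    packed, precision, reg = m.groups()
    if packed == "S":                          # scalar: always 1 element
        elements = 1
    else:                                      # packed: register / element bits
        elements = REG_BITS[reg] // ELEM_BITS[precision]
    return elements * exec_count               # added "one more time"
```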
- Add the FLOP counted in step 1 and step 2.
Example (for the Advection routine):
Total Unmasked FLOP (Double Precision) = 173.3304 + 26.5546 = 199.885 GFLOP
- If running on an architecture that does not support masking, then you have your total FLOP count (can skip the next section).
- For floating point operation per second (FLOPS), divide the FLOP count computed using the above method by the application run time measured on appropriate hardware.
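As a trivial illustration of that division (the 10-second run time below is a made-up number, not a measured one):

```python
# FLOPS = FLOP / run time. Measure the run time on the target hardware;
# the 10-second figure here is hypothetical, for illustration only.
def flop_rate(flop_count, runtime_seconds):
    return flop_count / runtime_seconds

print(flop_rate(199.885e9, 10.0) / 1e9)  # GFLOPS for a hypothetical 10 s run
```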
- On another note, the FLOP count of an application will most likely be the same irrespective of the architecture it runs on (unless the compiler generates completely different code that impacts the FLOP count for the two binaries, which is rare). Thus, to find the FLOP count for an application, you can compute it as described above on Ivy Bridge (or Haswell), which has no hardware masking feature, and use the same count for other architectures (like Knights Landing). This way you do not have to deal with masking at all while evaluating the FLOP count.
- But if you still need to evaluate the FLOP count on architecture with masking support, refer to the next section, which describes how to count masked FLOP using the dynamic mask profile feature from Intel SDE.
Instructions to Count Masked FLOP
- Intel SDE has a dynamic mask profile feature that evaluates and prints the number of operations for each executed instruction with a mask.
- Generate the dynamic mask profile for your application using Intel SDE as follows:
- sde -<arch> -iform 1 -odyn_mask_profile myapp_msk.out -top_blocks 5000 -- ./myapp.exe
- <arch> is the architecture that you want to run on (e.g. ivb,hsw,knl).
- Compile the binary correctly for the architecture you are running on.
- Supports multi-threaded runs.
sde -knl -iform 1 -odyn_mask_profile myapp_knl_msk.out -top_blocks 5000 -- ./myapp.knl.exe
The dynamic mask profile is an XML output, with a summary table per thread of the different categories of instructions with and without masking and their total instruction and operation count.
In addition, the mask profile also prints the dynamic instruction count and operation count per instruction.
Summary Table (Dynamic Mask Profile)
Example: Intel® SDE (Knights Landing) dynamic mask profile output (snapshot below):
The columns of the summary table mean the following:
- Column 1: classifies masked instructions vs. unmasked instructions
- Column 2: classifies the instruction category (e.g., memory instructions (data transfer), sparse (gather/scatter), and computational (mask))
- Column 3: specifies the vector register width
- Column 4: specifies the maximum number of elements possible in the vector register (given the vector length in column 3 and the element size in column 5)
- Column 5: specifies the element size (or data type) in bits (e.g., 64b = 64 bits = 8 bytes)
- Column 6: classifies the element type (e.g., fp – floating point vs. int – integer)
- Column 7: total instruction count for each category/type
- Column 8: corresponding computation count for the executed instructions of each category/type
- Column 9: % vector lane utilization for each category/type
- For example, in the above snapshot only the highlighted rows have to be used for the “masked” FLOP count.
- Note that in your run, for the masked FLOP count you mainly need to look for “masked” instructions with the “mask” category and “element_t = fp”.
- The “comp_count” number is basically the masked FLOP count.
- But again FMA is counted as only one FLOP in the comp_count counter.
- See the next section on how to take into account masked FMA (to count them as 2 FLOP).
- Per Instruction Details (Dynamic Mask Profile)
- In addition to the “summary table” per thread, the dynamic mask profile also prints the computational count on a per instruction basis.
- Below is a snapshot of it.
- In this case, the masked “vfmadd213pd” instruction has an execution count of 862280 and a computation count of 5052521 (less than 862280 * 8 = 6898240). Thus the executions of this instruction are NOT using all the vector lanes.
- In the snapshot above, the “vfmadd213pd” instruction has an execution count = 4000 and the computation count = 32000 (4000 * 8). Thus all the executions of this instruction are using all the vector lanes in this case (no mask).
- Since the summary table accounts for the FMA instructions (and its flavors) as only 1 FLOP, you have to add the computation count for all the masked FMA instructions from the instruction-details (as above) “one more time” to account for 2 FLOP per FMA.
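The lane-utilization reasoning above can be checked with a one-line ratio (numbers taken from the two snapshots discussed; 8 is the maximum number of double precision elements per zmm register):

```python
# Vector lane utilization = comp_count / (exec_count * max_elements).
def lane_utilization(comp_count, exec_count, max_elements):
    return comp_count / (exec_count * max_elements)

# Masked vfmadd213pd from the snapshot (8 double lanes per zmm):
print(round(lane_utilization(5052521, 862280, 8), 3))   # ~0.732 -> masked
# Unmasked vfmadd213pd from the snapshot:
print(round(lane_utilization(32000, 4000, 8), 3))       # 1.0 -> full vectors
```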
- Counting Masked FLOP
- From the summary table add the “comp_count” value from all “masked” instructions with “mask” category and “element_t = fp”.
- Parse all the FMA instructions with mask, from per instruction-details and add the “computation-counts” to the above sum evaluated in Step 1 one more time.
- Thus you have the total Masked FLOP count.
- As mentioned in the previous section, in Intel AVX-512 (KNL/SKL) instruction mix output all FMA (and its flavors) instructions (masked or full vectors) are marked as masked (e.g. VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
- Thus you can use the dynamic mask profile “instruction-details” to evaluate the “computation-count” for all FMA instructions (masked or unmasked – full vectors).
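Putting the two steps together, here is a hedged sketch of the masked-FLOP bookkeeping. The dictionary layout is hypothetical — the real dynamic mask profile is XML whose schema may vary between SDE versions — so treat the input format as a stand-in for whatever your parser extracts:

```python
# Hypothetical sketch: total masked FLOP from (1) the summary table's
# comp_count for masked/mask/fp rows, plus (2) the comp_count of every
# masked FMA from the per-instruction details, added one more time
# (FMA = 2 FLOP, but comp_count tallies it once).

FMA_PREFIXES = ("vfmadd", "vfmsub", "vfnmadd", "vfnmsub")

def masked_flop(summary_rows, instr_details):
    # Step 1: "masked" rows with category "mask" and element_t "fp".
    total = sum(r["comp_count"] for r in summary_rows
                if r["masked"] and r["category"] == "mask"
                and r["element_t"] == "fp")
    # Step 2: add masked FMA computation counts one more time.
    total += sum(d["comp_count"] for d in instr_details
                 if d["masked"] and d["mnemonic"].startswith(FMA_PREFIXES))
    return total
```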
The above methodology may look a bit overwhelming at first, but the reason for such detailed instructions is so that you can write your own simple scripts to parse the above information. We hope to provide the scripts (currently used internally) to evaluate FLOP count as part of the Intel SDE releases in the future.
Below is a summary of the FLOP count validation on some applications.
- The error margin is the difference between the reference count and the FLOP count evaluated using Intel SDE.
- The difference can stem from factors such as theoretical evaluation vs. actual code generation, which instructions are counted as FLOP, etc. We have not looked into the details of this difference.
- As you can see, the error margin is minimal.
Masking: Even on Intel® AVX/AVX2 (Ivy Bridge/Haswell) the compiler supports “masking” internally with blends and so forth. Thus, in vectorized loops with conditionals there will be unused computations (e.g., the compiler computes both the true and false branches and then blends them, throwing away the unused parts). This means the FLOP count will overestimate the useful computation. Arguably the masked counts (KNL/SKL) are more accurate, since the pop count of the mask is exact (assuming the compiler uses masks everywhere).