Estimating FLOPS using Event Based Sampling (EBS)

FLOPS (also written flops or flop/s) is an acronym for floating point operations per second and is a measure heavily used in high performance computing. FLOPS is a common way of measuring the performance and computational capability of a given microprocessor.

In this article, you will find out how hardware-based Event Based Sampling (EBS) technology can help developers estimate the floating point operations per second executed by their applications. Here, FLOPS refers to 32-bit and 64-bit floating point operations, where the operations are either addition or multiplication (computational).

Intel® VTune™ Amplifier XE is a performance analysis tool that helps software developers analyze their applications to identify algorithmic and microarchitectural performance issues. VTune™ Amplifier XE uses the processor's Performance Monitoring Unit (PMU) to sample processor events, and some of these events can be used to statistically sample the number of computational floating point operations at execution.


Figure 1: Scalar processing vs. SIMD (Single Instruction Multiple Data) processing

Figure 2: Intel® Architecture integer, floating point, MMX and SSE (Streaming SIMD Extensions) registers.

Note: The figure doesn't show the latest AVX extension and registers.

As Figures 1 and 2 demonstrate, floating point operations can be performed on legacy x87 registers or on SSE registers, depending on how the compiler generates the code. If the floating point instructions execute on SSE registers, they can be either scalar or packed operations. Table 1 (below) gives the PMU event names that can be used to statistically estimate the computational floating point operations executed by the hardware. Keep in mind that, due to the speculative nature of the architecture, not all of the executed instructions counted by these events are retired. Therefore, it is possible for these events to overcount.

Intel® Core™ 2 processor family (Intel® Core™ 2 Duo/Quad, etc.):

  FP operations using legacy x87:  X87_OPS_RETIRED.ANY
  Packed 64-bit SIMD:              SIMD_COMP_INST_RETIRED.PACKED_DOUBLE
  Packed 32-bit SIMD:              SIMD_COMP_INST_RETIRED.PACKED_SINGLE
  Scalar 64-bit SIMD:              SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE
  Scalar 32-bit SIMD:              SIMD_COMP_INST_RETIRED.SCALAR_SINGLE

Intel® Core™ architecture (Intel® Core™ i7, i5, i3; a.k.a. Nehalem):

  FP operations using legacy x87:  FP_COMP_OPS_EXE.X87
  Packed 64-bit SIMD:              FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION
  Packed 32-bit SIMD:              FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION
  Scalar 64-bit SIMD:              FP_COMP_OPS_EXE.SSE_FP_SCALAR
  Scalar 32-bit SIMD:              FP_COMP_OPS_EXE.SSE_FP_SCALAR

Table 1: PMU events used to count the computational floating point operations at execution. (On Nehalem, the scalar SSE event does not distinguish precision, so the same event appears for both 64-bit and 32-bit scalar operations.)

VTune™ Amplifier XE can use any of these events, or all of them at the same time, to estimate the FLOPS achieved by an application. To measure elapsed time, the CPU_CLK_UNHALTED (a.k.a. clockticks) event can be used. If the processor frequency is constant during the measurement period, you can use the clockticks event to calculate the elapsed wall-clock time. Keep in mind that the exact CPU_CLK_UNHALTED event name varies by processor architecture.

Alternatively, CPU_CLK_UNHALTED.REF, which counts the number of reference cycles, can be used. The difference between the reference clocktick event and the clocktick event is that the reference event increments at a fixed reference frequency, so changes in the thread's actual frequency (for example, due to Intel® Turbo Boost or Enhanced Intel SpeedStep® technology) do not distort the count.

Estimating FLOPS

The FLOPS formula can be given as follows:

FLOPS = (number of FP operations per counted event * total computational FP event count) / Elapsed Time

Elapsed Time = CPU_CLK_UNHALTED / Processor-Frequency / Number-of-Cores

Note: Only the cores with a non-zero CPU_CLK_UNHALTED event count need to be considered for this formula.

To demonstrate how EBS technology can be used to estimate the FLOPS, a simple multi-threaded matrix multiplication will be used. Each thread in the thread pool executes the following code.

double a[NUM][NUM];
double b[NUM][NUM];
double c[NUM][NUM];
...
slice = (unsigned int) tid;
from  = (slice * NUM) / NUM_THREADS;
to    = ((slice + 1) * NUM) / NUM_THREADS;

for (i = from; i < to; i++) {
    for (j = 0; j < NUM; j++) {
        for (k = 0; k < NUM; k++) {
            // 2 FP ops per iteration: 1 add, 1 multiply
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
...

The application also reports the measured flops by dividing the total number of FP operations (2 per iteration * NUM * NUM * NUM) by the elapsed time. The elapsed time covers only the matrix multiplication part and does not include the initialization and thread creation overhead.

To collect samples only for the relevant code section, the __itt_pause() (pauses the collection) and __itt_resume() (resumes the collection) APIs are used. Please refer to the VTune™ Amplifier XE documentation on how to use the user APIs.

VTune™ Amplifier XE can be configured as follows on an Intel® Core™ i7 (x980) based system (3.33 GHz, 6 cores with Hyper-Threading enabled):


Using x87 Registers

The sample application is compiled in release mode (optimization level /Ox) on a Windows* system using Visual Studio.

The application reports the following when analyzed under VTune™ Amplifier XE.


These results give us insight into how the compiler generated the code. In this run, we can clearly see that samples were collected only on FP operations using x87.

If we plug the numbers into the formula:

MFLOPS = FP_COMP_OPS_EXE.X87 / 1x10^6 / Elapsed Time

Elapsed Time = CPU_CLK_UNHALTED.THREAD / Processor-Frequency / Number-of-Cores

Elapsed Time = 607,652,000,000 / 3.33x10^9 / 12 = 15.206 secs

MFLOPS = 18,470,000,000 / 1x10^6 / 15.206 secs = 1,214.652 MFLOPS

Using SSE registers

Now, let's look at the same application when SSE registers are used.  If we compile the application using Intel® compiler version 12.0, we see the following results under the VTune™ Amplifier XE.



One thing you will notice right away in the new results is the difference in the function names where the samples occur. In the earlier example, the samples fell in the matrixMultiply function, but now we see them in the threadPool function. This is due to inlining (for more information: http://en.wikipedia.org/wiki/Inline_expansion). Drilling down into threadPool makes this clear.

We multiply the FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION event count by 2 because each packed double precision operation on a 128-bit XMM register performs two floating point operations. For single precision, the packed single precision event count needs to be multiplied by 4.

MFLOPS = 2 * FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION / 1x10^6 / Elapsed Time

Elapsed Time = CPU_CLK_UNHALTED.THREAD / Processor-Frequency / Number-of-Cores

Elapsed Time = 66,178,000,000 / 3.33x10^9 / 12 = 1.656 secs

MFLOPS = 2 * FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION / 1x10^6 / 1.656 secs = 11,053.140 MFLOPS

For more complete information about compiler optimizations, see our Optimization Notice.