Interpreting the AVX counter results

As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not? Or is it correct to assume that the compiler does its job well and that cases where the vector is not filled occur rarely (e.g. when we run out of data at the end of a loop)?

Can you use VTune Amplifier XE 2011 to do event-based sampling with the PMU events named SIMD_FP_256?

Review the counts to see whether the results match your expectations.

| Event Name Extension | Mask | Description | Counter | Counter (HT off) |
|---|---|---|---|---|
| PACKED_SINGLE | 0x01 | This event counts the number of AVX-256 computational FP single-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128. | 0,1,2,3 | 0,1,2,3,4,5,6,7 |
| PACKED_DOUBLE | 0x02 | This event counts the number of AVX-256 computational FP double-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128. | 0,1,2,3 | 0,1,2,3,4,5,6,7 |

Yes, I've done the profiling using VTune. The thing is that I'm analyzing the performance of a huge application. In particular, I'm trying to understand whether the code uses many FP operations and whether it has been vectorized successfully. I got the following result for one of the runs:

CPU_CLK_UNHALTED.REF_TSC 4,983,560,000,000
CPU_CLK_UNHALTED.THREAD 5,670,360,000,000
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 1,164,920,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE 21,200,000,000
FP_COMP_OPS_EXE.X87 223,200,000,000
INST_RETIRED.ANY 7,926,840,000,000

And the SIMD_FP_256 counters are all zero.

I've also measured the HPL code and got the following results:

CPU_CLK_UNHALTED.REF_TSC 2,675,264,000,000
CPU_CLK_UNHALTED.THREAD 2,922,426,000,000
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 2,816,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 18,080,000,000
FP_COMP_OPS_EXE.X87 460,000,000
INST_RETIRED.ANY 7,581,522,000,000
SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000

What I don't understand is how to interpret the results. What is the difference between FP_COMP_OPS_EXE and SIMD_FP_256? Is it justified to say that each increment of the counter means that 4 flops were actually executed (for DP)? And can 2 increments occur during one processor cycle (one for add and one for mul)?

Any clarifications on the subject would be appreciated!

SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000: counts SSE, AVX-128 FP and AVX-256 FP computational double-precision uops issued.
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000: counts SSE and AVX-128 FP computational double-precision uops issued, only.

Is it correct that the operations counted by FP_COMP_OPS_EXE are a subset of the operations counted by SIMD_FP_256? And that by subtracting the former from the latter I get the number of 256-bit operations only?

I think that the answer is "Yes": the result is for AVX-256 only :-)
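To make the arithmetic discussed above concrete, here is a rough sketch that turns the HPL counter values quoted in this thread into a flop estimate. The flops-per-uop weights (1 for scalar and x87, 2 for 128-bit packed double, 4 for 256-bit packed double) are the assumption raised in the question, not a documented guarantee, and the `subset_reading` switch (a hypothetical name) applies the subtraction suggested above only if you accept that reading of the counters. The function and dictionary names are made up for illustration.

```python
# Counter values quoted in this thread for the HPL run.
HPL_COUNTS = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2_816_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 18_080_000_000,
    "FP_COMP_OPS_EXE.X87": 460_000_000,
    "SIMD_FP_256.PACKED_DOUBLE": 4_582_812_000_000,
}

# ASSUMPTION: flops performed per counted uop (vector width in doubles).
FLOPS_PER_UOP = {
    "FP_COMP_OPS_EXE.X87": 1,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,
    "SIMD_FP_256.PACKED_DOUBLE": 4,
}


def estimate_flops(counts, subset_reading=False):
    """Weighted sum of the FP counters; counters without a weight are ignored."""
    counts = dict(counts)  # work on a copy
    if subset_reading:
        # Reading suggested in the thread: 128-bit packed uops are also
        # included in SIMD_FP_256, so subtract them to get 256-bit only.
        avx256 = counts.get("SIMD_FP_256.PACKED_DOUBLE", 0)
        sse = counts.get("FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE", 0)
        counts["SIMD_FP_256.PACKED_DOUBLE"] = max(0, avx256 - sse)
    return sum(FLOPS_PER_UOP[name] * value
               for name, value in counts.items()
               if name in FLOPS_PER_UOP)


if __name__ == "__main__":
    print("independent counters:", estimate_flops(HPL_COUNTS))
    print("subset reading:      ", estimate_flops(HPL_COUNTS, subset_reading=True))
```

For this run the two readings differ by well under 1%, so the choice matters little here. Either number is best treated as an upper-bound estimate, because these counters count uops issued rather than operations completed.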

What is the equivalent of SIMD_FP_256.PACKED_DOUBLE on Haswell?

It appears that all of the floating-point performance counters (with the exception of Event 0xCA "Floating Point Assists") have been removed from the Haswell-based products.

These counters are known to systematically overcount in Sandy Bridge and Ivy Bridge processors whenever the input registers are not ready (e.g., due to cache misses).   I have seen overcounting by anywhere from ~3% to 10x, depending on the average latency for loads feeding into the FP instructions.

We still use these counters on our 6400-node Sandy Bridge system to monitor whether codes are using SSE or AVX, how well the codes vectorize, and whether they are running with 32-bit or 64-bit floating-point arithmetic.  The accuracy is good enough for this classification process, and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.
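The kind of coarse classification described above could be sketched as follows. This is only an illustration, not the tooling used on that system: the function name `fp_mix` is hypothetical, the grouping relies on the substrings in the event names that appear in this thread, and each counter is weighted equally (one uop, regardless of vector width).

```python
def fp_mix(counts):
    """Return the share of FP uops per category, from a dict of FP counters only.

    ASSUMPTION: categories are inferred from the event names used in this
    thread (SIMD_FP_256.*, *PACKED*, *SCALAR*, *.X87, *SINGLE*, *DOUBLE*).
    """
    total = sum(counts.values())
    if total == 0:
        return {}

    def share(predicate):
        return sum(v for name, v in counts.items() if predicate(name)) / total

    return {
        "avx256": share(lambda n: n.startswith("SIMD_FP_256.")),
        "packed": share(lambda n: "PACKED" in n),
        "scalar": share(lambda n: "SCALAR" in n or n.endswith(".X87")),
        "double": share(lambda n: "DOUBLE" in n),
        "single": share(lambda n: "SINGLE" in n),
    }


# FP counter values quoted earlier in this thread.
APP_COUNTS = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 358_000_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1_164_920_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 21_200_000_000,
    "FP_COMP_OPS_EXE.X87": 223_200_000_000,
}
HPL_COUNTS = {
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2_816_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 18_080_000_000,
    "FP_COMP_OPS_EXE.X87": 460_000_000,
    "SIMD_FP_256.PACKED_DOUBLE": 4_582_812_000_000,
}

if __name__ == "__main__":
    for label, counts in (("application", APP_COUNTS), ("HPL", HPL_COUNTS)):
        mix = fp_mix(counts)
        print(label, {k: round(v, 3) for k, v in mix.items()})
```

Because only ratios are used, a roughly uniform overcount affects the result much less than it would affect an absolute flop count, which is consistent with the point that the accuracy is good enough for classification.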

Intel is certainly aware of the accuracy issues with these counters and is likely to fix the existing problems in some future products.  Section 19.2 of Volume 3 of the SW Developer's Guide (document 324384-053, January 2015) shows that Broadwell gets a few FP events back:

  • Event 0x14, Umask 0x01: ARITH.FPU_DIV_ACTIVE -- cycles that the divide unit is active
  • Event 0xC0, Umask 0x02: INST_RETIRED.X87 -- x87 Floating-Point operations that are retired without generating exceptions.
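If you want to try events like these outside VTune, one option is the raw event syntax of Linux `perf` (an assumption about your tooling; nothing in this thread uses it). On Intel cores the raw code places the umask in bits 8-15 and the event select in bits 0-7. The helper name `raw_event` below is hypothetical.

```python
def raw_event(event, umask):
    """Build a perf-style raw event string from an event select and a umask."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    # umask goes in bits 8-15, event select in bits 0-7.
    return f"r{(umask << 8) | event:04x}"


if __name__ == "__main__":
    # The two Broadwell events listed above.
    print("ARITH.FPU_DIV_ACTIVE:", raw_event(0x14, 0x01))
    print("INST_RETIRED.X87:    ", raw_event(0xC0, 0x02))
```

The resulting string would be passed as, for example, `perf stat -e <raw event> ./app`. Whether a given code is actually implemented on your processor has to be checked against the event tables for that model.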

I have not heard any definitive statements on when improved support for floating-point counts will make it into shipping products.

John D. McCalpin, PhD "Dr. Bandwidth"

>>>As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not?>>>

I presume that you are referring to the XMMx/YMMx registers. In this case you can use a debugger to see whether a specific register is filled with 4 or 8 scalars.

Thank you for your answer

 

>>>and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.

Do you have any idea how to get flops on the Haswell architecture?

 

>>>Do you have any idea how to get flops on the Haswell architecture?>>>

Do you mean to count how many GFLOPS were executed?

>>Do you mean to count how many GFLOPS were executed?

Yes, to count the GFLOPS of an application, and the number of single-precision and double-precision flops that were executed.

I think that John answered your question.
