As far as I understand, during the execution of packed AVX instructions the vector may be only partly filled. Is there a way to determine whether a vector was completely filled or not? Or is it correct to assume that the compiler does its job well and that cases where the vector is not filled occur rarely (e.g. when we run out of data at the end of a loop)?
Interpreting the AVX counter results
Can you use VTune Amplifier XE 2011 to do event-based sampling with the PMU events named SIMD_FP_256?
Review the counts to see whether the results match your expectations.
Event name extension: PACKED_SINGLE
Mask: 0x01
Description: This event counts the number of AVX-256 computational FP single-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.
Counter: 0,1,2,3
Counter (HT off): 0,1,2,3,4,5,6,7

Event name extension: PACKED_DOUBLE
Mask: 0x02
Description: This event counts the number of AVX-256 computational FP double-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.
Counter: 0,1,2,3
Counter (HT off): 0,1,2,3,4,5,6,7
Yes, I've done the profiling using VTune. The thing is that I'm analyzing the performance of a huge application. In particular, I'm trying to understand whether the code uses many FP operations and whether it has been vectorized successfully. I got the following result for one of the runs:

CPU_CLK_UNHALTED.REF_TSC 4,983,560,000,000
CPU_CLK_UNHALTED.THREAD 5,670,360,000,000
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 1,164,920,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE 21,200,000,000
FP_COMP_OPS_EXE.X87 223,200,000,000
INST_RETIRED.ANY 7,926,840,000,000

And the SIMD_FP_256 counters are all zero. I've also measured the HPL code and got the following results:

CPU_CLK_UNHALTED.REF_TSC 2,675,264,000,000
CPU_CLK_UNHALTED.THREAD 2,922,426,000,000
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 2,816,000,000
FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE 18,080,000,000
FP_COMP_OPS_EXE.X87 460,000,000
INST_RETIRED.ANY 7,581,522,000,000
SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000

What I don't understand is how to interpret the results. What is the difference between FP_COMP_OPS_EXE and SIMD_FP_256? Is it justified to say that each increment of the SIMD_FP_256.PACKED_DOUBLE counter means that 4 flops were actually executed (for DP)? And can 2 increments occur during one processor cycle (one for an add and one for a mul)? Any clarification on the subject would be appreciated!
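To make the 4-flops-per-increment question concrete, here is a rough sketch that turns the HPL counts quoted above into a flop estimate. The per-uop weights (4 for a 256-bit packed double, 2 for a 128-bit packed double, 1 for scalar and x87) are assumptions taken from the question, not something the event descriptions confirm, and the helper names are made up for illustration. The events count uops issued, so the estimate cannot tell whether every lane carried useful data.

```python
# Assumed flops per counter increment. These weights are a hypothesis
# (vector width / 64 bits for double precision), not a documented guarantee.
FLOPS_PER_UOP = {
    "SIMD_FP_256.PACKED_DOUBLE": 4,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.X87": 1,
}

# Counter values copied from the HPL run in the post above.
HPL_COUNTS = {
    "CPU_CLK_UNHALTED.THREAD": 2_922_426_000_000,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2_816_000_000,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 18_080_000_000,
    "FP_COMP_OPS_EXE.X87": 460_000_000,
    "SIMD_FP_256.PACKED_DOUBLE": 4_582_812_000_000,
}


def estimate_flops(counts):
    """Sum counter values weighted by the assumed flops per uop."""
    return sum(
        FLOPS_PER_UOP[name] * value
        for name, value in counts.items()
        if name in FLOPS_PER_UOP
    )


def flops_per_cycle(counts, cycles_event="CPU_CLK_UNHALTED.THREAD"):
    """Estimated flops divided by unhalted thread cycles."""
    return estimate_flops(counts) / counts[cycles_event]


def share_256bit(counts):
    """Fraction of the estimated flops that comes from the 256-bit event."""
    wide = FLOPS_PER_UOP["SIMD_FP_256.PACKED_DOUBLE"] * counts.get(
        "SIMD_FP_256.PACKED_DOUBLE", 0
    )
    return wide / estimate_flops(counts)


if __name__ == "__main__":
    print("estimated flops:", estimate_flops(HPL_COUNTS))
    print("estimated flops per cycle:", round(flops_per_cycle(HPL_COUNTS), 2))
    print("share from 256-bit uops:", round(share_256bit(HPL_COUNTS), 4))
```

Under these assumptions almost all of the estimated flops in the HPL run come from the 256-bit event, which is consistent with the code being well vectorized. If the events overlap (for example, if some uops are counted by both event groups), the total would be an overestimate.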
FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000: this event counts only the SSE and AVX-128 FP computational double-precision uops issued.
Is it correct that the operations counted in FP_COMP_OPS_EXE are a subset of the operations counted by SIMD_FP_256? And that by subtracting the former from the latter I get the number of 256-bit operations only?
I think that the answer is "Yes": the result is for AVX-256 only :-)



