I'm trying to determine the number of FLOPs a program performs from the processor's performance counters.
For instance, I have this subroutine:
subroutine vec_mul(a, b, c)
  integer, parameter :: N = 1024*1024
  double precision, dimension(N) :: a, b, c
  do i=1, 1000000
    c(i)=a(i) * b(i)
  end do
end subroutine vec_mul
When I run it, I get ~950,000 SIMD_FP_256.PACKED_DOUBLE events (measured with perf).
I assume each event corresponds to 4 operations, so the counter is reporting ~3.8 million operations instead of 3 million.
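To make my arithmetic explicit, here is a minimal sketch of the conversion I'm doing. It assumes every counted event is one packed instruction that fills all 4 double-precision lanes of a 256-bit register, which may not be how the counter actually behaves. The module and function names (`flop_estimate`, `implied_flops`) are made up for illustration, and the two figures are the approximate numbers quoted above.

```fortran
module flop_estimate
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Operations implied by a packed-instruction event count,
  ! ASSUMING each event is one instruction that fills every lane.
  pure function implied_flops(events, lanes) result(flops)
    integer(int64), intent(in) :: events
    integer, intent(in) :: lanes
    integer(int64) :: flops
    flops = events * int(lanes, int64)
  end function implied_flops
end module flop_estimate

program check_counter
  use, intrinsic :: iso_fortran_env, only: int64
  use flop_estimate, only: implied_flops
  implicit none
  ! Approximate perf reading quoted in the question.
  integer(int64), parameter :: events = 950000_int64
  ! Assumption: 4 doubles per 256-bit packed instruction.
  integer, parameter :: lanes = 4
  ! Operation count I expected, as stated in the question.
  integer(int64), parameter :: expected = 3000000_int64
  integer(int64) :: implied

  implied = implied_flops(events, lanes)
  print *, 'implied operations: ', implied
  print *, 'excess over expected:', implied - expected
end program check_counter
```

With the numbers above this gives 3,800,000 implied operations, i.e. 800,000 more than I expected.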
Why is there such a difference?