What's EMON and Where to download it?

What's EMON? Is it a tool and a part of VTune? I want to download and evaluate it. Where can I find it? Can you help me?

-Xie Bo
Email:bxie@sjtu.edu.cn


As the summary of the IDS paper on Performance Monitoring Tools shows, emon is a tool for logging event counters against a time base. Unlike VTune, emon doesn't let you associate the events with a group of instructions ("hot spots"), and it doesn't (to my knowledge) have customer support or a well-developed user interface.

 

I am using emon to profile my code, and it works well on Sandy Bridge and Ivy Bridge.

On Haswell, however, I can't find the counters I use to calculate the FLOPS of an application:

FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE

FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE

SIMD_FP_256.PACKED_DOUBLE

FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE

FP_COMP_OPS_EXE.SSE_PACKED_SINGLE

SIMD_FP_256.PACKED_SINGLE

Do you have any suggestions for calculating FLOPS on Haswell using emon?
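For context, on Sandy Bridge and Ivy Bridge the events listed above are usually combined into a FLOP estimate by weighting each event with its vector width. The sketch below assumes each event increments once per instruction and that the weights are the number of elements per vector (1 for scalar, 2 or 4 for 128-bit, 4 or 8 for 256-bit). The function names and the weight table are illustrative, not part of emon, and the result is only as accurate as the counters themselves.

```python
# Assumed FLOPs per counted event: one operation per vector element.
FLOPS_PER_EVENT = {
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,   # 128-bit, 2 doubles
    "SIMD_FP_256.PACKED_DOUBLE": 4,           # 256-bit, 4 doubles
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 1,
    "FP_COMP_OPS_EXE.SSE_PACKED_SINGLE": 4,   # 128-bit, 4 floats
    "SIMD_FP_256.PACKED_SINGLE": 8,           # 256-bit, 8 floats
}


def estimate_flops(counts):
    """Weighted sum of the event counts; unknown event names raise KeyError."""
    return sum(FLOPS_PER_EVENT[name] * value for name, value in counts.items())


def estimate_gflops_per_second(counts, elapsed_seconds):
    """Convert the weighted sum into GFLOP/s over the measured interval."""
    return estimate_flops(counts) / elapsed_seconds / 1e9
```

The counts would come from whatever emon logged over the interval; parsing emon output is deliberately left out because its format is not described in this thread.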

 

 

My understanding is that emon is currently undocumented but might become a tool we could use. I'm trying to figure out how I can turn it on and off from inside a program, but so far, no luck. It can't be a very big program: it is only 200 kB of instructions in total, and when disassembled it is only 3 MB of assembly code, so the original C++ source would not appear to be very big. There are calls to stdio routines and calls to functions with an OSI_ prefix, which I assume are operating-system interface calls meant to be portable across Windows/Linux/Mac for things like file times, launching a child process, waiting for a child, and opening a DLL. Then there are calls to functions starting with PISE_ and functions starting with SMRK_.

 

Re: Haswell

 

Haswell and upcoming Intel processors no longer have hardware counters for floating point. The Ivy Bridge and Sandy Bridge counters were problematic (incorrect counting in many, many cases). On Haswell, perf indicates that there are no native floating-point hardware counters.

 

 

When I run readelf -d on emon, I can see that it is a small wrapper program around a few calls to the Intel libraries:

 

 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libamplxe_tbrw_smrk_1.2.so]
 0x0000000000000001 (NEEDED)             Shared library: [libamplxe_tbrw_1.2.so]
 0x0000000000000001 (NEEDED)             Shared library: [libamplxe_sampling_pax_3.15.so]
 0x0000000000000001 (NEEDED)             Shared library: [libamplxe_sampling_pise_3.15.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc_osi.so]
 0x0000000000000001 (NEEDED)             Shared library: [libamplxe_sampling_utils_1.0.so]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

PaX is the loadable kernel module Intel provides for Linux to allow VTune to access the hardware counters, along with sep3_15 and vtsspp.

 

If I had an API to call the functions in these libraries, I could build some remarkable performance tools. As it is, perf and PCM don't work for me, since they require root access, or a different kernel patch and setting the perf_event_paranoid system flag to 0 or lower. Admins don't seem to like doing that, but they can't explain why beyond mumbling about "...timing attacks...."
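A quick way to see where a given Linux machine stands is to read the perf_event_paranoid setting directly. This is a minimal sketch: the /proc/sys/kernel/perf_event_paranoid path is the standard location of that sysctl, while the function names and the "0 or lower" threshold check are just illustrations of the condition described above.

```python
from pathlib import Path

# Standard location of the sysctl on Linux; absent on other systems.
DEFAULT_PATH = "/proc/sys/kernel/perf_event_paranoid"


def read_perf_event_paranoid(path=DEFAULT_PATH):
    """Return the current setting as an int, or None if the file is missing."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


def meets_post_threshold(level):
    """True if the setting is 0 or lower, the threshold mentioned in the post."""
    return level is not None and level <= 0


if __name__ == "__main__":
    level = read_perf_event_paranoid()
    print("perf_event_paranoid:", level)
    print("0 or lower:", meets_post_threshold(level))
```

Reading the value does not need root; changing it does, which is where the conversation with the admins starts.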

 

Hi Brian:

A couple of things. First, it is true that Haswell does not have floating-point counters. I believe that was due to an erratum. It's not true that future processors won't have them. For example, in the Broadwell tuning guide you can see that the FP Arithmetic metric is calculated using this formula of events:

Formula:
FP x87 % = INST_RETIRED.X87 / UOPS_RETIRED.RETIRE_SLOTS
FP Scalar % = ( FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE ) / UOPS_RETIRED.RETIRE_SLOTS
FP Vector % = ( FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE ) / UOPS_RETIRED.RETIRE_SLOTS

So, there are plenty of FP counters in Broadwell. ;)
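To make the Broadwell formula concrete, here is a small sketch that evaluates the three ratios from a dictionary of raw event counts. The event names are the ones quoted above; the function name and the dictionary layout are assumptions for illustration, and the ratios are returned as fractions (multiply by 100 for a percentage).

```python
def fp_arith_metrics(counts):
    """Evaluate the FP x87 / Scalar / Vector ratios from raw event counts."""
    slots = counts["UOPS_RETIRED.RETIRE_SLOTS"]
    scalar = (
        counts["FP_ARITH_INST_RETIRED.SCALAR_SINGLE"]
        + counts["FP_ARITH_INST_RETIRED.SCALAR_DOUBLE"]
    )
    vector = (
        counts["FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE"]
        + counts["FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE"]
        + counts["FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE"]
        + counts["FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE"]
    )
    return {
        "fp_x87": counts["INST_RETIRED.X87"] / slots,
        "fp_scalar": scalar / slots,
        "fp_vector": vector / slots,
    }
```

Note that these are shares of retire slots, not FLOP counts; turning them into FLOPs would need per-event weights for vector width, which the formula above does not include.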

Regarding accessing the VTune Amplifier drivers because you don't have root access, I'm afraid that is just not something we are able to publish, even if we wanted to.  I mean, we don't have any documentation because we never intended for it to be used that way, and it would expose our IP.  Your best bet is to figure out how to use the command line to run the collection and then use the pause/resume APIs to only collect during specific periods of activity.  You can use the "sample counts" times the SAV (sample after value) to determine how many events occurred during the different intervals.  You can also use the Event APIs to mark regions and maybe use the command line to dump data collected between those markers (I've not tried that before, so I don't know if the command line supports it - look at the "filter" and "group" options).  However, remember that VTune Amplifier's use of the event counters is "sampling-based" and not "counting mode" like emon.
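As a rough illustration of the "sample counts times SAV" arithmetic, the sketch below estimates event totals per interval. The function names, the interval labels, and the example numbers are made up for illustration; the result is a statistical estimate, not an exact count like emon would give.

```python
def estimate_event_count(sample_count, sample_after_value):
    """Each sample stands for roughly one SAV's worth of events."""
    return sample_count * sample_after_value


def estimate_per_interval(samples_by_interval, sample_after_value):
    """Apply the same estimate to every named interval."""
    return {
        name: estimate_event_count(n, sample_after_value)
        for name, n in samples_by_interval.items()
    }


if __name__ == "__main__":
    # Hypothetical sample counts for two regions bracketed by pause/resume.
    samples = {"warm-up": 150, "solver": 4200}
    print(estimate_per_interval(samples, sample_after_value=2_000_003))
```

The fewer samples an interval has, the coarser the estimate, so short regions may need a smaller SAV to be meaningful.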

Regards,
MrAnderson


 

 

Good news about Broadwell. It sounds something like the formula we used on Ivy Bridge. It would be good to verify that the formula is correct on a variety of codes by comparing SDE against the FP counter formula, particularly for highly vectorized codes. I'm pretty sure your formula is wrong, since it counts an FMA as 1 INST_RETIRED, but it is not just FMA counting that goes awry. That's how we discovered counting errors on the Cray XC30 platforms. We've had counting errors for FP on Intel since Nehalem.

 

As to sampling vs counting, this is precisely what I need and exactly what is not working for me.  I have been having discussions with Dmitry about what is and is not possible.  While VTune responds to the ittnotify calls, it seems emon ignores __pause and __start.  I have verified this with a few benchmark programs.    Hence I can't attach performance to code regions, and my uncore counters are swamped by warm-up traffic. 

Hence my plea to open source just emon. The program itself can't be very big, and all the intellectual property is in the VTune libraries and modules. If possible, I could work under an NDA that I already have with Intel through the Department of Energy and hand the source back to you once I've upgraded emon to respect ittnotify calls. ittnotify is a public API. That way it can be checked for security and you still don't open up your library APIs.

 

Brian VS

Lawrence Berkeley National Lab

Computing Research Division

 
