Bloated Instruction counts in SDE as compared with that from HW PMC 0xC0

I have noted in multiple (infrequent, but frequent enough) circumstances that the instruction counts for execution of a binary in SDE and those reported by PMC 0xC0 differ by ORDERS of magnitude.  I just ran a version of hmmer compiled with Intel 14.0, and SDE (v5.38, run as "sde -mix -top_blocks 3000" on a Haswell system) is reporting to me that hmmer took 60 trillion instructions to execute.  I know that number is bogus, since in Open64 it took only 1.05 trillion to execute as reported by the PMCs.  SDE also reports that "perlbench" took 5.6T instructions to execute, but likewise the PMCs on HW reported that only 2.1T instructions were executed in Open64.  I don't think Intel 14 is taking more than 2x the # of instructions.  Something is inflating the instruction counts reported by SDE in both "perlbench" and "hmmer".  I've also noted this happening with GCC v4.6 in the same benchmarks.

Does anyone have an idea as to what's happening in SDE and why I'm observing these bloated instruction counts?  I thought I'd report it to make you aware of the issue and to seek a solution.  Thanks.

perfwise

Mark Charney (Intel)

I could not really follow your note. First suggestion: please update to the latest SDE http://www.intel.com/software/sde . Possibly we fixed some bugs since the version you are using.

Are you using parallelized code?  If so, you are going to get slow-downs from JIT'ing that will often show up as increased instruction counts for spin loops, etc.

Mark,

I'm using SDE 5.38.  I'm running SPEC FP 06 binaries built with Intel 14.0.0 through SDE, generating mix instruction output using "sde -mix -top_blocks 5000 -- exe", and I'm noting that in some benchmarks, "hmmer" and "perlbench", I'm for whatever reason getting obscene instruction counts.  I've also previously observed this with binaries for these same benchmarks built with GCC 4.6 using just the -O2 compiler flag.

These binaries are not parallelized; they are serial/1T binaries.  I'm just inquiring whether you've observed similar issues in the past.  I could maybe give you a binary test to try, if you like.  I'll try your latest SDE, but are you clear now as to what I'm saying?  I can measure the instructions executed via HW with PMC 0xC0, and I routinely compare that with SDE and get good agreement.  Unfortunately, something is going on in perlbench and hmmer which causes a very large discrepancy.

Perfwise

Mark Charney (Intel)

If you use the performance monitor counters to count the SDE runs, you'll be counting a whole lot of stuff (JIT, analysis) that you do not want to count.  I'm guessing you are not doing that though since you say you often get good agreement. Just have to rule that out.

Another option is that the compilation options you are choosing are using different code-paths when running under SDE because SDE emulates CPUID, but I would not expect as much difference as you are observing.

FWIW, the fastest SDE icounting option is just to use "path-to-kit/sde -icount -- yourapp" and look in the sde-icount.txt file. That should be close to the output from the very bottom of the mix output file, modulo normal  run-to-run variation due to stack alignment and loader timing loops, etc.  But this might be another sanity check you could try.

% kits/current/sde -icount -- /bin/ls
...

% cat sde-icount.txt

$$ TID: 0 ICOUNT: 517796
# $eof

% kits/current/sde -mix -- /bin/ls
...
% tail -3 sde-mix-out.txt
*total 517797

# END_GLOBAL_DYNAMIC_STATS

Mark,

    Just downloaded the latest version. I'll update you later today as to what I observe.

    I think you misunderstand what I'm saying, though.  I do 2 runs of hmmer.  In 1) I don't use SDE: I simply run it, collecting PMC stats and measuring only PMC 0xC0, and get a count of the total number of instructions executed by the binary.  This count doesn't use SDE in any manner; it's the reference.  In 2) I use SDE and get the count using -mix -top_blocks, which is probably similar to what you're doing, since at the bottom of that file there's the "GLOBAL_DYNAMIC" count you are referring to above.  I compare those 2.  Comparing 2) to 1), I'm saying something is very far off between what SDE 5.38 reports and what real hardware reports.

Perfwise

Mark,

    I utilized the latest SDE (v6.7) and ran "perlbench" compiled with Intel 14.0.0 through it.  I observed the following instruction counts through the 3 separate executions of "perlbench" in the SPEC06 v1.2 harness:

4,881,901,329,955

317,976,374,491

5,618,076,436,804

then I ran it through "perf" and measured how many instructions are executed in PMC 0xC0, and that number for the whole application is:

1,893,113,445,300

So I'm still observing this overcounting in SDE (which is Pin-based), as compared with the PMCs, in "perlbench", and I'd venture to say it's still happening in "hmmer" as well.

Any new ideas?  Using -mix and -icount delivered the same results in this test.

Perfwise

Mark Charney (Intel)

Hi, I sure hope you left a digit out of your 2nd number (317B), else that looks like an incomplete run. I'm still surprised to see that much variance: 4.8T to 5.6T instructions for the 1st and 3rd measurements. How are you running SDE via runspec? Are you using the "submit" mechanism in the cfg file?

Not sure how you can get icount to agree when there is that much variance. Can you elaborate?

According to the SDM, PMC 0xC0 counts a rep-string operation as one instruction. In contrast, SDE (Pin, really) counts each iteration of a rep-string operation as one instruction. You could look in the mix histogram to see how many rep-string ops there were; look for "*category-STRINGOP" in the final histogram.  That might account for some of the difference.

Mark,

    When I run and collect the # of instructions via PMC, I'm running "perlbench" and "hmmer" in the harness.  When I run SDE, since SDE only analyzes what's immediately provided in the command-line argument, I am running it from the "run" directory in the SPEC06 harness.  I looked at the # of string ops, which is something I had also considered in the past, and in my run the # of string ops reported in the -mix output was negligible (a very small % of the total number).  So, how do we proceed, or do we?  I could provide you the necessary files for you to replicate this, if you so wish.  Let me know.  It just seems to me this is worth understanding: how do we get 2.5x more instructions reported in perlbench, and in hmmer a number like 20x?  Shouldn't be hard to understand and likely worth knowing.

Perfwise

Mark Charney (Intel)

I suspect you are having some methodology differences. For programs that have multiple inputs, you must either look at the output for each input separately or sum up each of the SDE and perfmon outputs.

You should use the runspec command for everything. Here's how to modify your runspec cfg file:

use_submit_for_speed = 1

submit = path-to-sde-kit/sde  -mix -i -- $command

command_add_redirect = 1

expand_notes = 1

The -i will make the PID (and the parent PID) be included in the output file names so you can get one output file per spec input.

I presume you can do something similar for running the benchmarks under the performance counters.
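For the perfmon side, a parallel submit line might look like the following (a sketch assuming the Linux "perf" tool is the counter reader; the per-run output naming via $$ is my guess at a convention, not something from this thread):

```
use_submit_for_speed = 1
submit = perf stat -e instructions -o perf.$$.txt -- $command
command_add_redirect = 1
expand_notes = 1
```

That way each SPEC input gets its own perf output file, which can be summed and compared input-by-input against the per-PID SDE output files.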

One of our internal performance guys sanity checked your perfmon number for perl and hmmer and they are close to what we see for SDE when you add up the different outputs. 

Mark,

    When I run SDE, and in "perlbench" there are 3 separate executions of the perlbench binary, I run each separately, and I'm reporting above the instruction count for each separately.  I can't understand how this makes a difference in how we're running it, and I'm confident in the results I provided you.  SDE is reporting 5T instructions just to do the following in the run directory:

sde -icount -- ./EXE -I./lib checkspam.pl 25000 5 25 11 150 1 1 1 1 > checkspam.25000.5.25.11.150.1.1.1.1.out

What is incorrect about that?  When I ran the whole perlbench benchmark with "runspec", the harness is mostly negligible and I'm getting 1.9T.  Something seems very broken with SDE running this binary.

Now, we also used Pin v2.13-61206 and the "inscount2_mt" pintool, which is included in the source/tools/SimpleExamples directory.  Replacing the "sde -icount" command above with "pin -t inscount2_mt.so" yielded the correct # of instructions running in the "run" directory of SPEC (avoiding the harness completely).  Pin yielded an instruction count of 0.987T for the command above, while SDE yielded close to 5T.  Again, I can provide you the binary and you can run it if you like, up to you.

So Pin worked, but something is not working in the latest SDE.  Why?  That's the question.

Also, understand I'm an ardent fan of SDE and Pin, but I've run across this in the past a couple of times and now again, and wanted to bring it to your attention.  Hope this helps.

perfwise

Mark Charney (Intel)

Can you try running with the -hsw knob (option)  to sde?  Then it should at least take a similar code path (hopefully the same code path) that the non-SDE runs are taking. 

Mark,

    I ran perlbench in the "run" directory, using -hsw, without any of the -icount or -mix switches, just trying to run the perlbench binary through SDE on the "first test".  The first test executes:

./{perlbench_exe} -I./lib checkspam.pl 25000 5 25 11 150 1 1 1 1

Now, when I just type:

sde -- {command}

it executes (although we know that it's reporting the incorrect # of instructions).  But it's doing something and producing stdout.

When I type:

sde -hsw -- {command}

I get a core dump.  I'm executing on a Haswell CPU, so SDE seems to be having some issues here.  Again, do you want me to give you a binary so you can reproduce?  I'm executing a binary built with the Intel 14.0.0 compiler on a HSW CPU.

Perfwise

Mark Charney (Intel)

The core dump is interesting. It shows me that your program is running different code when the CPUID is haswell and when it is not hsw. While the core dump could be a bug in SDE (or pin), the core dump could be a bug in the compiled version of your program as well. Either way, the core dump certainly bolsters my theory about why you are seeing different instruction counts with and without SDE.  I'm still surprised the icounts would be that much different based on the CPUID, but until we know the reason, it will remain a theory.

It would be interesting to see the error message you get.  SDE has some tools that sometimes help. For example, if one runs with -ptr-check, SDE will check each memop for accessibility before executing it. It takes much longer to run the program, but it'll attempt to give you a nice error when a program bug or compilation bug causes a wild pointer reference. We see this a lot in early enabling of new software; wild pointer references are the most common error.

Thank you for the offer, but I am reluctant to take your binaries and run them on my systems inside Intel.  I have my own spec binaries and I will try to run a few of them and see what I can see.

Mark,

    Just to let you know, these binaries run fine on Haswell hardware, so I think the fault is either in the CPUID checks in the Intel code path or in SDE itself.  Let me know what you find.  If you like, I can provide the config file I used to compile these binaries, for your reproduction of the issues.  The only 2 bmks I've seen with this issue are perlbench and hmmer.  The latter ran in SDE for 3 days without finishing; I killed it at that point.  SDE's a great tool, I just want to make it better by getting this issue addressed, and thanks for your assistance.

perfwise

Mark,

Any results from your testing?  Were you able to reproduce the instruction bloat in "perlbench" and "hmmer"?  Also, were you able to find that running with -hsw causes a seg fault?  Here's the output from my trying to run the first run of the perlbench binary in the SPEC benchmark:

sde -hsw -- ./{EXE} -I./lib checkspam.pl 25000 5 25 11 150 1 1 1 1 > checkspam.25000.5.25.11.150.1.1.1.1.out
Segmentation fault (core dumped)

perfwise

Mark,

    Just inquiring and keeping this topic alive.. were you able to reproduce my observations?

perfwise

Mark Charney (Intel)

I ran some icc14.0-compiled spec binaries for AVX1 on a sandybridge and AVX2 on a haswell using the "perf" tool for reading the perfmon counters. I also ran them through SDE's mix tool.  Things match pretty well. Here's what I see for the perlbench workload. 

SNB

perf 991198851145

mix 991148701035

HSW

perf 989182775187

mix 988739280216

Mark,

    I sent you a message with more details to reproduce.  My counterpart already found one bug in Pin, and I'm sure there's something here.  It's unusual that, on my Haswell, running the binary produced with the details I provided you personally causes SDE to crash.  Maybe you can look a bit further.  Either way, I appreciate the effort, and maybe you can reproduce what I'm finding in perlbench and hmmer with GCC v4.6 and Intel 14.0.0.  Thanks.

perfwise

Any news as to fixing this issue with -hsw (which you replicated, Mark)?  Were you ever able to verify that the instruction count was excessive with the binary I provided?

perfwise
