I don't quite understand this term. I looked in the help manual, but I couldn't find an explanation of what the verb "retire" means for an instruction. Can anybody give me a hint?
Each instruction takes a different number of clockticks to execute. The "Instructions Retired" event shows how many instructions were completely executed between two clocktick event samples.
Modern processors execute many more instructions than the program flow needs. This is called "speculative execution".
The instructions that are then "proven" to be needed by the program flow are "retired".
You can think of "retired" instructions as only those instructions the program flow actually needs.
I guess "retired instructions" means those instructions that are actually executed and completed by the CPU. The CPU makes some kind of prediction about the instructions to be executed and puts them into some place like a "pool". But not all of these instructions will be executed. Is this correct?
Almost correct, but the CPU really does speculatively execute many more instructions. Results are "stored" only for retired instructions.
By using the "retired instruction" event, VTune gives you a much "fairer" count: it does not take into account the execution of "unneeded" instructions.
BTW, you might also want to experiment and collect the "Instructions decoded" event. Then you can compare it with "Instructions Retired" and see the difference.
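To make the difference concrete, here is a toy model in Python. It is only an illustration, not a description of any real pipeline: the function name, the block lengths, and the fixed wrong-path length are all made up for the example. Every basic block on the real program path is executed and retired; when the branch ending a block is mispredicted, some wrong-path instructions are executed as well, but they are squashed and never retire.

```python
def count_instructions(blocks, wrong_path_len):
    """Toy model: return (executed, retired) instruction counts.

    blocks         -- list of (block_length, branch_predicted_correctly)
    wrong_path_len -- hypothetical number of wrong-path instructions the
                      CPU executes before a misprediction is detected
    """
    executed = 0
    retired = 0
    for length, predicted_ok in blocks:
        # Instructions on the real program path are executed and retired.
        executed += length
        retired += length
        if not predicted_ok:
            # Wrong-path work is executed speculatively, then thrown away.
            executed += wrong_path_len
    return executed, retired


if __name__ == "__main__":
    blocks = [(10, True), (10, False), (5, True)]
    executed, retired = count_instructions(blocks, wrong_path_len=8)
    print(executed, retired)  # prints "33 25"
```

In this model the retired count depends only on the program, while the executed count also depends on how well the prediction worked.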
What is the formula to get instructions retired for each 'core'? Or please point me to any documentation available on this.
There is no "formula" to get instructions retired for each 'core', but it can be measured using the hardware performance counters. (There are known bugs in this performance counter event on Haswell processors. It is usually correct, but I have some cases that are systematically off by as much as 20%.)
"Instructions Retired" is available on each logical processor using fixed-function performance counter 0 or using a programmable performance counter. Fixed-function performance counter 0 is accessible by reading MSR 0x309, or by executing the RDPMC instruction with the ECX register set to 0x40000000 (the count is returned in EDX:EAX).
Programmable counters are also accessible by either RDMSR instructions (kernel only) or RDPMC instructions. Program any of the performance counter event select registers (0x186-0x189 on most Intel products) to 0x004300c0; the counts are then returned by the corresponding MSR in the 0xc1-0xc4 range, or by executing the RDPMC instruction with the ECX register set to the counter number (0, 1, 2, or 3, corresponding to whichever performance counter event select register you decided to use).
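The magic numbers above can be unpacked with a few lines of Python. This is a minimal sketch, not a complete tool. It assumes the architectural event select layout from the Intel SDM (event code in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17, EN in bit 22) and, for the MSR read, a Linux system with the "msr" kernel module loaded and root privileges. The function names and the path_template parameter are my own inventions for illustration. Reading MSR 0x309 only gives useful numbers if something has already enabled the fixed-function counter.

```python
import os
import struct

IA32_FIXED_CTR0 = 0x309   # fixed-function counter 0: instructions retired
IA32_PERFEVTSEL0 = 0x186  # first programmable event select register
IA32_PMC0 = 0xC1          # first programmable counter


def perfevtsel(event, umask=0, usr=True, os_mode=True, enable=True):
    """Build an event select value from its main fields."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= 1 << 16   # count while in user mode
    if os_mode:
        value |= 1 << 17   # count while in kernel mode
    if enable:
        value |= 1 << 22   # enable the counter
    return value


def rdpmc_index(counter, fixed=False):
    """Counter index handed to RDPMC; bit 30 selects the fixed counters."""
    return (1 << 30) | counter if fixed else counter


def read_msr(msr, cpu=0, path_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR on one logical processor (Linux msr driver)."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        raw = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    print(hex(perfevtsel(0xC0)))            # 0x4300c0
    print(hex(rdpmc_index(0, fixed=True)))  # 0x40000000
```

With root access, read_msr(IA32_FIXED_CTR0, cpu=3) would return the raw counter value on logical processor 3. Take two readings around the code of interest and subtract them, keeping the thread pinned to that processor in between.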
At the whole-program level, Linux systems can count instructions retired using a simple "perf stat a.out" command, but this aggregates the counts over all of the logical processors that participate in running "a.out" (which could be all of them if "a.out" is a threaded code). Root permissions are required to get counts for all processes running in the system (adding "-a" to the "perf stat" command), and in this mode the aggregation of the counts can be disabled by adding the "-A" option to the "perf stat" command. So a simple "perf stat -a -A a.out" still gets you instructions retired on every logical processor in the system while "a.out" is running.
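If you want the per-processor numbers in a script rather than on the screen, something like the following Python sketch might help. It assumes perf's CSV mode ("-x,") with "-A" prints one line per logical processor in the order CPU identifier, counter value, unit, event name, as described in the perf-stat man page. I have not checked this against every perf version, so treat the field positions as an assumption. The function names are made up for the example, and the sample numbers in the usage part are invented.

```python
import subprocess


def parse_per_cpu_instructions(csv_text):
    """Map CPU identifier -> instructions retired from perf CSV output."""
    counts = {}
    for line in csv_text.splitlines():
        fields = line.strip().split(",")
        # Assumed layout: CPU id, value, unit, event name, ...
        if len(fields) < 4 or not fields[0].startswith("CPU"):
            continue
        cpu, value, event = fields[0], fields[1], fields[3]
        # Skip entries such as "<not counted>" that carry no number.
        if "instructions" in event and value.isdigit():
            counts[cpu] = counts.get(cpu, 0) + int(value)
    return counts


def run_perf_stat(command):
    """Run a command under system-wide, per-CPU perf stat (needs root)."""
    result = subprocess.run(
        ["perf", "stat", "-a", "-A", "-x,", "-e", "instructions",
         "--", *command],
        stderr=subprocess.PIPE, text=True, check=False)
    # perf stat writes its counts to stderr.
    return parse_per_cpu_instructions(result.stderr)


if __name__ == "__main__":
    sample = (
        "CPU0,1000,,instructions,500,100.00,,\n"
        "CPU1,<not counted>,,instructions,0,0.00,,\n"
    )
    print(parse_per_cpu_instructions(sample))  # {'CPU0': 1000}
```

On a real system you would call run_perf_stat(["./a.out"]) as root instead of parsing the sample string.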
FYI: we added a topic aggregating this knowledge to the product help: https://software.intel.com/en-us/vtune-amplifier-help-instructions-retir...