CPI architecture/port concepts

Hi.

What are the concepts for designing ports within a processor to minimize CPI? Normally, for multi-core processors, a CPI below ~1.0 is targeted for better performance. For Xeon processors, it has been suggested that CPI should be targeted around ~0.75 - 0.50.

Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?

The above question is asked with full awareness of how to improve CPI: I know the procedure from a programming point of view, and also how to improve CPI using various events. Here the question is framed from an architecture or port-design point of view.

Note: CPI can get as low as 0.25 cycles per instruction on current Intel processors.

~BR


Minimizing CPI is not a very interesting goal. For one thing, it excludes use of parallel instructions (e.g. vectorization). It is usually possible to use more instructions than would be required by the most efficient way to finish the job, if efficiency were measured by number of clock ticks or instructions. So, you can minimize CPI by adding unnecessary instructions, taking care only that the extra instructions are faster than the useful ones.
Minimizing CPI could be even more counter-productive than maximizing threaded performance scaling by choosing a method which maximizes serial execution time. That kind of goal also excludes use of vectorization in combination with threaded parallel execution.
In MPI applications, the low CPI which you favor is seen in MPI_Wait spin loops, where CPI is immaterial, unless you see an advantage in maximizing the number of instructions. We do favor spin waits for profiling convenience, but in release configurations, sched_yield() (or a Windows equivalent) is invoked after a small elapsed time, so as to give up the CPU to other potential uses, rather than hogging it to execute and discard the maximum number of instructions. It's easy to increase the number of instructions by setting the environment variables to increase the spin-wait time prior to sched_yield().
Certain vendors' MPI implementations even look at the general load on the system, preferring spin waits (high number of instructions, thus low CPI) when the system is dedicated to one task, but yielding the CPU sooner when other activity is detected. So you can't get low CPI when the system is busy with multiple tasks, nor would you have any reason to try, when the objective is throughput, not low CPI.

Quoting - tim18
Minimizing CPI is not a very interesting goal. For one thing, it excludes use of parallel instructions (e.g. vectorization). [...]

Tim.

Yeah, I totally agree with you. I have also seen that with explicit vectorization, the CPI of code executed on SMP systems increases, and it also has a negative impact on bus utilization. Articles on VTune from Intel (David Levinthal) recommend keeping CPI within ~0.75 - 0.5; I am not sure whether this claim is empirical or a realistic benchmark.

I did ask Intel how they came up with such empirical targets for CPI and the relations to other cycle events; somehow I don't see any mathematical relation, analytical proof, or other concrete proof from them.

But the query I had asked was: what are the architectural/port features that keep CPI below ~0.75 by design for multi-core processors, and what would be the targeted range of CPI for Nehalem (45nm) and Sandy Bridge (32nm)?

~BR

Even if vectorization raises CPI from 0.7 to 0.9, it would usually be a clear win, as at least twice as much useful work is accomplished for each instruction. I think I disagree with your idea about "negative impact on bus utilization." Vectorization does tend to run up against bus bandwidth limitations. When care isn't taken to avoid splitting loops, the same data may be required to cross the bus again. In that case bus utilization numbers are useless; it clearly takes longer to do the job over multiple times. But, there is no virtue in reducing bus saturation by slowing down the rate at which useful work is accomplished, if that is what you advocate.
Nehalem is even more dependent on vectorization for good performance. I wouldn't be surprised to see increases in CPI on Sandy Bridge, with wider vector instructions.
I doubt that the designers of the new chips set reduced CPI in general, rather than real performance gains, as a goal. Intel learned from the P4 experience that marketing oriented goals like increasing CPU clock frequency and number of instructions executed for a given job, without regard to useful performance and power consumption, was not the best way to go. I can be derogatory when that message hasn't reached software performance workers.
The bug which began to be dealt with in linux compiler 11.0/081, where vectorized loops with multiple assignments were often distributed (split) down to a separate loop for each assignment, kept CPI artificially depressed. It could require 60% more than the optimum number of instructions to accomplish the job. So, you would never find the performance problem, if you looked only at CPI.

My comments are -
1. The CPI ratio is not the first consideration when optimizing your code - that is architecture-level optimization.
2. For a multi-core system, the first consideration is parallelism (algorithm-level optimization).
Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores

Regards, Peter

Quoting - Peter Wang (Intel)

My comments are -
1. The CPI ratio is not the first consideration when optimizing your code - that is architecture-level optimization.
2. For a multi-core system, the first consideration is parallelism (algorithm-level optimization).
Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores

Regards, Peter

Peter,

As quoted by you -

"For a multi-core system, the first consideration is parallelism (algorithm-level optimization).
Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores"

(a) In the EBS selection of events, I don't see "CPU_CLK_UNHALTED.TOTAL_CYCLES"; the only events I see are CPU_CLK_UNHALTED.CORE.samples, CPU_CLK_UNHALTED.CORE% and CPU_CLK_UNHALTED.CORE.events. Did I miss something?

(b) Which of the above three events with the CPU_CLK_UNHALTED.CORE.xxxxx suffix should I consider for "CPU_CLK_UNHALTED.CORE"?

(c) My system is a Quad Core 5300 machine, which means it has 2 dies with 4 cores each, so in total the Quad Core 5300 has 8 cores. So, I can check whether parallelism has been successful with the above formula, as you said.

Will the value obtained from the above formula suggest that parallelism has been 100% effective on the Quad Core 5300?

Could you quote some examples?

~BR

Quoting - tim18
Even if vectorization raises CPI from 0.7 to 0.9, it would usually be a clear win, as at least twice as much useful work is accomplished for each instruction. [...]

Tim,

I had a case where CPI went above 1.0 with vectorization, even after properly using compiler options too. Targeting CPI ~0.5 - 1.0 is good, no doubt; I have been doing that with proper tuning of code and achieved it successfully in some cases.

Thanks for your inputs.

~BR

Quoting - Peter Wang (Intel)

1. The CPI ratio is not the first consideration when optimizing your code - that is architecture-level optimization.
2. For a multi-core system, the first consideration is parallelism (algorithm-level optimization).
Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores
Regards, Peter

CPI was listed as the 2nd most important consideration in the .pdf posted here last week. Do you agree that's too high?
Effective use of parallel instructions (vectorization) should be undertaken before threaded parallelism. I don't think you meant that, as it's not in your formula.

Quoting - tim18
CPI was listed as the 2nd most important consideration in the .pdf posted here last week. Do you agree that's too high?
Effective use of parallel instructions (vectorization) should be undertaken before threaded parallelism. I don't think you meant that, as it's not in your formula.

Tim/Peter.

As quoted: "CPI was listed as the 2nd most important consideration in the .pdf posted here last week."

Can I have the link to the PDF Tim is talking about?

I have an article published on VTune which talks about CPI and other optimization scenarios, "Using Intel VTune Performance Analyzer Events/Ratios & Optimizing Applications": http://software.intel.com/en-us/articles/using-intel-vtune-performance-a...

~BR

Quoting - srimks

(a) In the EBS selection of events, I don't see "CPU_CLK_UNHALTED.TOTAL_CYCLES"; the only events I see are CPU_CLK_UNHALTED.CORE.samples, CPU_CLK_UNHALTED.CORE% and CPU_CLK_UNHALTED.CORE.events. Did I miss something? [...]

The event names I mentioned are for Intel Core 2 Duo processors.

I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you the corresponding event names.

Regards, Peter

Quoting - Peter Wang (Intel)

The event names I mentioned are for Intel Core 2 Duo processors.

I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you the corresponding event names.

Regards, Peter

Another thing I recommend: developers can use Intel Thread Profiler to learn the concurrency level (CL) of their code. Based on the results, the developer can 1) change serial code to parallel; 2) reduce overhead on sync objects; 3) reduce wait time; 4) balance the workload across threads/processors, etc.

Regards, Peter

Quoting - Peter Wang (Intel)

The event names I mentioned are for Intel Core 2 Duo processors.

I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you the corresponding event names.

Regards, Peter

Peter,

I think, being an Intel guy, you shouldn't answer the query by saying "I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you the corresponding event names." Rather, you should direct it to some other Intel person who can answer the query w.r.t. the Quad Core 5300, and say "someone from Intel will be responding soon here" rather than making it negative.

Intel people should take responding to queries on ISN seriously.

Often, better answers and quicker responses are given by non-Intel people in this ISN forum than by Intel people themselves, as I have observed in my last 4 months here. I really appreciate those people for their inputs and time.

~BR

Quoting - srimks

Can I have the link to the PDF Tim is talking about? [...]

BR,

I think section 2 of this article corresponds to what Peter was trying to point out: you need to ensure that your application is properly threaded (application level) before you start worrying about CPI (architecture level). VTune can assist you in verifying this, if you measure how many of the available clockticks you are actually using. Intel Thread Profiler is another tool that can help you in this stage.

CPI is merely a measure of how well the hardware is able to execute the instruction flow. Looking at the CPI may guide you to the portions of your code where you can take better advantage of the underlying CPU architecture. However, CPI doesn't tell you how useful the executed instructions actually are. For example, a different algorithm might result in a much better running time - and at the end of the day, this is what you care about, isn't it? Similarly, different instructions, like vector instructions, can improve your running time. If your CPI increases but your running time decreases by switching to vector instructions, who cares?

A high CPI just tells you that there is room for improvement on the architectural level. It doesn't tell you that there isn't any other way to improve the application.

The value 0.75-0.5 is based on experience of what you can achieve in well-tuned CPU-bound applications. In other words, if you already have a CPI of 0.5 for a function, don't be frustrated if you cannot improve on that. On the other hand, if you have a function with a CPI of 10, and it is one of the hot functions in your application, and you have exploited all the other means to improve on the system and application level, then you should look into it.

Kind regards

Thomas

Quoting - srimks

(a) In the EBS selection of events, I don't see "CPU_CLK_UNHALTED.TOTAL_CYCLES"; the only events I see are CPU_CLK_UNHALTED.CORE.samples, CPU_CLK_UNHALTED.CORE% and CPU_CLK_UNHALTED.CORE.events. Did I miss something? [...]

Hi,

Today I found an Intel Core 2 Quad machine, and confirmed that the event CPU_CLK_UNHALTED.TOTAL_CYCLES exists on this system (actually, the events on Core 2 Quad are similar to Core 2 Duo).

Do you use the latest product, v9.1 Update 1?
I think that 5300 is Core 2 Quad, T5300 is Core 2 Duo, and E5300 is Pentium (which has no CPU_CLK_UNHALTED.TOTAL_CYCLES).

You can use the vtl command to export the supported event names on your system - "vtl query -c sampling" - to check whether CPU_CLK_UNHALTED.TOTAL_CYCLES exists.

Regards, Peter

Quoting - Peter Wang (Intel)

Today I found an Intel Core 2 Quad machine, and confirmed that the event CPU_CLK_UNHALTED.TOTAL_CYCLES exists on this system. [...]

Peter,

I tried the command as suggested but got the message below -
-----
$ vtl query -c sampling
VTune Performance Analyzer 9.1 for Linux* build 152
Copyright (C) 2000-2008 Intel Corporation. All rights reserved.

Could not get NUM_PHYSICAL_CPUS value from environment XML file
-----

The processor I am using is the "Intel Xeon CPU X5355 @ 2.66GHz", which is an 8-core machine.

The only events I see with this machine in GUI mode are -

CPU_CLK_UNHALTED.CORE%
CPU_CLK_UNHALTED.BUS%
CPU_CLK_UNHALTED.CORE.events
CPU_CLK_UNHALTED.BUS.events
CPU_CLK_UNHALTED.BUS.samples


I did perform all the EBS event runs - Advanced Performance Tuning, Basic Tuning, etc. - with VTune v9.1.

Could you suggest the same parallelism approach, as quoted by you, for the "Intel Xeon CPU X5355 @ 2.66GHz" machine -

"Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores"

~BR

Quoting - srimks

I tried the command as suggested but got "Could not get NUM_PHYSICAL_CPUS value from environment XML file". The processor I am using is the "Intel Xeon CPU X5355 @ 2.66GHz". [...]

Thanks for the detailed processor info - this is an Intel Xeon processor, a quad-core server processor which was launched two years ago. This processor is a first-generation product of the dual-core architecture - it is NOT in the Intel Core 2 Quad family. That is why there is no event named CPU_CLK_UNHALTED.TOTAL_CYCLES.

In my other thread, I suggested using Intel Thread Profiler to measure your code's parallelism.

Thanks, Peter

Quoting - Peter Wang (Intel)

Thanks for the detailed processor info - this is an Intel Xeon processor, which is NOT in the Intel Core 2 Quad family. That is why there is no event named CPU_CLK_UNHALTED.TOTAL_CYCLES. [...]

You can also get an impression of the concurrency level of your application by looking at the "Sampling Over Time" view in VTune. It shows which threads are working over time. In order to use it on Linux, you need VTune 9.0 Update 1 or later and the environment variable VTUNE_OVER_TIME set.

There are certain pitfalls with this methodology, e.g. you might get the impression that all threads are working when in fact they are waiting in busy loops. And even when you see that a thread is waiting, it is usually hard to identify, using sampling, why the thread is waiting.

The advantage over Thread Profiler is that the complete system is monitored. This is important if there are several applications involved. Furthermore, the overhead is lower, and you can restrict your measurement to a time interval instead of the complete run.

Kind regards
Thomas

Quoting - Thomas Willhalm (Intel)

You can also get an impression of the concurrency level of your application by looking at the "Sampling Over Time" view in VTune. [...]

Thanks to all for your responses, but the query asked at the beginning was -

"Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?" - which was basically asking what the empirically determined RATIO & LIMITS numbers are for commonly used EBS events on Nehalem (Core i7).

Do the event RATIOS & LIMITS presented in http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/ apply to Nehalem Core i7 VTune analysis? Please confirm.

Also, are the numbers mentioned in the above link empirical, or do they carry justification from analysis done with different test cases of executed applications? Please confirm.

~BR

Quoting - srimks

Do the event RATIOS & LIMITS presented in http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/ apply to Nehalem Core i7 VTune analysis? Please confirm. [...]

The range for CPI is still the same for Core i7 (Nehalem) and the theoretical limit is still 0.25. The recommendation is based on measurements of well-tuned CPU-bound applications.

Other ratios in this text do not apply to Core i7 anymore, e.g. the FSB is replaced by QPI with completely different events in the uncore.

Quoting - Thomas Willhalm (Intel)

The range for CPI is still the same for Core i7 (Nehalem) and the theoretical limit is still 0.25. The recommendation is based on measurements of well-tuned CPU-bound applications.

Other ratios in this text do not apply to Core i7 anymore, e.g. the FSB is replaced by QPI with completely different events in the uncore.

Thomas,

Thanks for making a correction w.r.t. this article's LIMITS & RATIOS for Core i7. Could you elaborate more on "the FSB is replaced by QPI with completely different events in the uncore"? What is the uncore here?

True, the FSB has been replaced by QPI, so the LIMITS & RATIOS numbers will change. In this article, "45nm Hi-k" has been mentioned, which also refers to Nehalem - please correct me if I am wrong?

~BR

Quoting - Thomas Willhalm (Intel)

The range for CPI is still the same for Core i7 (Nehalem) and the theoretical limit is still 0.25. The recommendation is based on measurements of well-tuned CPU-bound applications.

Other ratios in this text do not apply to Core i7 anymore, e.g. the FSB is replaced by QPI with completely different events in the uncore.

Thomas/Peter,

Nehalem (Intel Core i7) is one of Intel's latest processors, built on "45nm Hi-k" silicon technology. Referring to the links for Nehalem, it seems that the old tradition of a Front Side Bus (FSB) in Intel processors has been dropped in favor of QPI (QuickPath Interconnect).

This article discusses the LIMITS & RATIOS of events w.r.t. the FSB, so its analysis can't be applied to Nehalem, but it certainly gives insight into the key EBS events to watch while using VTune to profile an application on a micro-architecture.

The only thing from this article that can be applied to Nehalem is the theoretical CPI limit of ~0.25, as quoted by you (Thomas); the remaining LIMITS & RATIOS content can't be applied to Nehalem, because the article doesn't consider measurements w.r.t. QPI.

I haven't seen any VTune articles (by David Levinthal or anyone else from Intel) distinguishing VTune profiling analysis for specific Intel multi-core processors (Intel Xeon, Core Duo, Core 2 Duo, Core 2 Quad, Core i7, Intel Pentium D and Pentium), to help users better understand profiling with VTune. I think Intel should consider publishing VTune articles specific to each multi-core processor's EBS events, for better learning by its users.

~BR

Quoting - srimks
Nehalem (Intel Core i7) is one of Intel's latest processors, built on "45nm Hi-k" silicon technology. Referring to the links for Nehalem, it seems that the old tradition of a Front Side Bus (FSB) in Intel processors has been dropped in favor of QPI (QuickPath Interconnect).

The LIMITS and RATIOS article has a lot of good information but is already starting to show its age. The use of "45nm Hi-k Intel Core processor" is clearly a way of writing a public article about Nehalem before the Intel Core i7 nomenclature was announced, and the article should be re-edited at least to use Intel Core i7 (you can find the same thing in some VTune analyzer documentation, or at least the "45nm" part). The focus on CPI as the first thing to look at when doing performance tuning, which has been of value in certain contexts, is less important as a general diagnostic aid. One place where it was used a lot was in transaction processing, but usually in close association with path length (the average number of instructions in the transaction processing loop), and only because CPI * path length = a measure of time (some number of cycles), which is the average transaction time. The goal then is to minimize both CPI and path length to improve performance. Achieving the minimum CPI, which hasn't changed from Intel Core to Intel Core i7, is an idealized and impossible goal, since it would mean retiring four instructions every cycle (these are all four-wide issue machines), and would require a lot of processing to cover the latency of even a single memory reference. (To be precise, these are four-wide issue micro-instruction architectures, but many instructions translate to a single micro-instruction, so the numbers are usually close.)

Much of this primary focus on CPI has been superseded by newer techniques, like looking for stalls coincident with hot spots. And newer techniques are coming online as Performance Monitoring Unit improvements become available, so the picture will evolve. However, the more things change, the more they stay the same. Among the new features of Intel Core i7 is QPI, but it still bears the same general relation to the core architecture, even though it connects to different and more things (along with the integrated memory controllers), which complicates the formulae; there may be similar ratios (e.g., QPI saturation?) which have bearing on some kinds of stalls.

I haven't seen any VTune articles (by David Levinthal or anyone else from Intel) distinguishing VTune profiling analysis for specific Intel multi-core processors (Intel Xeon, Core Duo, Core 2 Duo, Core 2 Quad, Core i7, Intel Pentium D and Pentium), to help users better understand profiling with VTune. I think Intel should consider publishing VTune articles specific to each multi-core processor's EBS events, for better learning by its users.


I assume you are aware of the reference section of the VTune analyzer documentation called Processor Events and Advice? It may not have the tutorial level you're looking for, or assemble all the formulae together into some complete cycle-accounting whole, but it's a start. More should be forthcoming as we dig deeper and find the time to write about it. Stay on us. We appreciate your enthusiasm.

Quoting - Robert Reed (Intel)

The LIMITS and RATIOS article has a lot of good information but is already starting to show its age. The use of "45nm Hi-k Intel Core processor" is clearly a way ofwriting a public article about Nehalem before the Intel Core i7 nomenclature was announced, and the article should be reedited at least to use Intel Core i7 (you can find the same thing in someVTune analyzer documentation, or at least the "45nm" part). The focus on CPI as the first thing to look at when doing performance tuning, which has been of value in certain contexts, is less important as a general diagnostic aid. One place where itwasused a lot was in transaction processing,but usually in close association with path length (the average number of instructions in the transaction processing loop) and only because CPI * path length = a measure of time (some number of cycles) which is the average transaction time. The goal then is to minimize both CPI and path length to improve performance. Achieving the minimum CPI, which hasn't changed from Intel Core to Intel Core i7, is an idealized and impossible goal, since it would mean retiring four instructions every cycle (these are all four-wide issue machines), and would require a lot of processing to cover the latency of even a single memory reference. (To be precise, these are four-wide issue micro-instruction architectures, but many instructions translate to a single micro-instruction, so the numbers are usually close.)

Much of this primary focus on CPI has been superseded by newer techniques like looking for stalls coincident with hot spots. And newer techniques are coming online as Performance Monitoring Unit improvements become available, so the picture will evolve. However, the more things change, the more they stay the same. Among the new features of Intel Core i7 is QPI, but it still bears the same general relation to the core architecture even though it connects to different and more things (along with the integrated memory controllers), which complicate the formulae; there may be similar ratios (e.g., QPI saturation?) which have bearing on some kinds of stalls.
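To make the "CPI * path length = average transaction time" relationship above concrete, here is a small sketch. All the numbers in it (CPI, path length, clock rate) are hypothetical, chosen only for illustration:

```python
# Average transaction time from CPI and path length, as described above.
# All numbers here are hypothetical, for illustration only.

def transaction_time_us(cpi, path_length, clock_hz):
    """CPI * path length = cycles per transaction; divide by the clock to get seconds."""
    cycles = cpi * path_length          # average cycles per transaction
    return cycles / clock_hz * 1e6      # convert seconds to microseconds

# Example: CPI of 0.75, 50,000 instructions per transaction, 2.66 GHz clock.
t = transaction_time_us(0.75, 50_000, 2.66e9)
print(f"{t:.2f} us per transaction")   # ~14.10 us
```

The sketch also makes the tuning goal obvious: halving either factor halves the transaction time, which is why both CPI and path length matter together.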

I haven't seen any VTune articles (by David Levinthal nor by anyone from Intel) distinguishing the VTune profiling analysis w.r.t specific Intel multi-core processors (Intel Xeon, Core Duo, Core2 Duo, Core2 Quad, Core i7, Intel Pentium D and Pentium) for a better understanding of profiling with VTune. I think Intel should consider publishing articles on VTune specific to each multi-core processor's EBS events, for better learning for its users.


I assume you are aware of the reference section of the VTune analyzer documentation called Processor Events and Advice? It may not have the tutorial level you're looking for, or assemble all the formulae together into some complete cycle-accounting whole, but it's a start. More should be forthcoming as we dig deeper and find the time to write about it. Stay on us. We appreciate your enthusiasm.

Hello Peter/Thomas/Robert.

Thanks for responding.

I am looking to do some profiling of an 8,000-10,000 line, multi-file C++ application on Nehalem, and to compare with an "Intel Xeon CPU X5355 @ 2.66GHz" processor. Could you suggest the key things to compare between both processors, and finally the key things to check for performance on Nehalem, as the Nehalem processor has some new features in comparison with older Intel processors?

Do you think I should start a new thread (VTune Profiling on Nehalem), or can I go ahead in this thread?

~BR
Mukkaysh Srivastav

Quoting - srimks
I am looking to do some profiling of an 8,000-10,000 line, multi-file C++ application on Nehalem, and to compare with an "Intel Xeon CPU X5355 @ 2.66GHz" processor. Could you suggest the key things to compare between both processors, and finally the key things to check for performance on Nehalem, as the Nehalem processor has some new features in comparison with older Intel processors?

Do you think I should start a new thread (VTune Profiling on Nehalem), or can I go ahead in this thread?

If you want to see how a program compares on two architectures, the place to start is with a comparative hot spot analysis of the same program running on comparable instances of the architectures, to see how individual functions scale. You might see a uniform scaling, or you might see some hot spots get hotter or cooler. Focus on those and drill down to the source code level to find the regions that are taking more or less time within the function. These changing hot spots are the most important to understand, since they'll have the biggest effect on your program. Cycle accounting, stall analysis: it's been called various things, but figuring out what's delaying the instructions is the next step. Though the architectures are different, they are also similar and have similar debug events that may be more or less effective in determining the state of the corresponding stages: all have a front end (instruction decoding) and a back end (resource scheduling, dispatch, retirement), but the number of events of interest, particularly with the Intel Core™ i7 processor, is too large to enumerate here.

A couple of Intel tools provide the means to directly compare runs. Both Intel Parallel Amplifier and PTU (available on whatif.intel.com for VTune™ analyzer license holders) offer tools to compare runs. PTU also comes with some predefined sample groups to collect events of significance, called configurations, which use selected events per architecture. Besides the basic collections, the one I'm looking at has six special configurations for Intel Core 2 processors and ten for the Intel Core i7 processor. These configurations are provided to look for specific types of stalls. For example, the Intel Core 2 processor configuration called "Bandwidth" collects BUS_DRDY_CLOCKS.THIS_AGENT, BUS_TRANS_BURST.SELF and the ubiquitous CPU_CLK_UNHALTED.CORE. This sounds pretty close to what you're looking for.
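As an aside, once event counts have been collected by whatever tool, the basic CPI ratio itself is a simple division of two counters. A minimal sketch, using hypothetical (not measured) counts for CPU_CLK_UNHALTED.CORE and INST_RETIRED.ANY:

```python
# Compute CPI from two sampled event counts.
# The counts below are hypothetical, for illustration only; real values
# would come from a VTune/PTU event collection run.

def cpi(unhalted_core_cycles, instructions_retired):
    """CPI = unhalted core cycles / instructions retired."""
    return unhalted_core_cycles / instructions_retired

counts = {
    "CPU_CLK_UNHALTED.CORE": 3_200_000_000,
    "INST_RETIRED.ANY": 4_000_000_000,
}
value = cpi(counts["CPU_CLK_UNHALTED.CORE"], counts["INST_RETIRED.ANY"])
print(f"CPI = {value:.2f}")
# On these four-wide machines the theoretical floor is 0.25
# (four instructions retired per cycle), as discussed earlier in the thread.
```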

Quoting - srimks
Thomas/Peter,

Nehalem (Intel Core i7) is one of Intel's latest processors, built on "45nm Hi-k" silicon technology. While referring to the links for Nehalem, it seems that in Nehalem the old tradition of having an FSB (Front Side Bus) in Intel processors has been removed by incorporating QPI (QuickPath Interconnect).

This article discusses the LIMITS & RATIOS of events w.r.t the FSB, so this analysis can't be applied to Nehalem, but the article certainly gives insight into the key EBS events to take care of while using VTune to profile an application on a micro-architecture.

The only thing which can be carried over from this article to Nehalem is the theoretical limit of CPI ~ 0.25, as quoted by you (Thomas); the remaining contents of LIMITS & RATIOS can't be applied to Nehalem as written, because the article doesn't consider measurements done w.r.t QPI.

I haven't seen any VTune articles (by David Levinthal nor by anyone from Intel) distinguishing the VTune profiling analysis w.r.t specific Intel multi-core processors (Intel Xeon, Core Duo, Core2 Duo, Core2 Quad, Core i7, Intel Pentium D and Pentium) for a better understanding of profiling with VTune. I think Intel should consider publishing articles on VTune specific to each multi-core processor's EBS events, for better learning for its users.

~BR

BR,

the latest version of VTune, Intel VTune Performance Analyzer 9.1 Update 2 for Linux, contains predefined ratios for Intel Core i7 processors. This is not quite what you are looking for yet, but we are heading in this direction.

Kind regards
Thomas

Quoting - Robert Reed (Intel)

If you want to see how a program compares on two architectures, the place to start is with a comparative hot spot analysis of the same program running on comparable instances of the architectures, to see how individual functions scale. You might see a uniform scaling, or you might see some hot spots get hotter or cooler. Focus on those and drill down to the source code level to find the regions that are taking more or less time within the function. These changing hot spots are the most important to understand, since they'll have the biggest effect on your program. Cycle accounting, stall analysis: it's been called various things, but figuring out what's delaying the instructions is the next step. Though the architectures are different, they are also similar and have similar debug events that may be more or less effective in determining the state of the corresponding stages: all have a front end (instruction decoding) and a back end (resource scheduling, dispatch, retirement), but the number of events of interest, particularly with the Intel Core™ i7 processor, is too large to enumerate here.

A couple of Intel tools provide the means to directly compare runs. Both Intel Parallel Amplifier and PTU (available on whatif.intel.com for VTune™ analyzer license holders) offer tools to compare runs. PTU also comes with some predefined sample groups to collect events of significance, called configurations, which use selected events per architecture. Besides the basic collections, the one I'm looking at has six special configurations for Intel Core 2 processors and ten for the Intel Core i7 processor. These configurations are provided to look for specific types of stalls. For example, the Intel Core 2 processor configuration called "Bandwidth" collects BUS_DRDY_CLOCKS.THIS_AGENT, BUS_TRANS_BURST.SELF and the ubiquitous CPU_CLK_UNHALTED.CORE. This sounds pretty close to what you're looking for.

Thanks Robert/Thomas.

I will certainly explore Core i7 using Intel VTune v9.1 (Update 2), following your suggestions.

~BR
Mukkaysh Srivastav

Quoting - Peter Wang (Intel)

Hi,

Today I found an Intel Core 2 Quad machine, and confirmed that the event CPU_CLK_UNHALTED.TOTAL_CYCLES exists on this system. (Actually, the events on Core 2 Quad are similar to those on Core 2 Duo.)

Do you use the latest product, v9.1 Update 1?
I think that 5300 is Core 2 Quad, T5300 is Core 2 Duo, and E5300 is Pentium (which has no CPU_CLK_UNHALTED.TOTAL_CYCLES).

You can use the vtl command to list the supported event names on your system ("vtl query -c sampling") to check whether CPU_CLK_UNHALTED.TOTAL_CYCLES exists.
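A small script can automate that check across machines. This sketch only parses text in the shape of "vtl query -c sampling" output; the sample lines below are hypothetical, and in practice you would capture the real output (e.g. via subprocess):

```python
# Check whether a given event name appears in captured output of
# "vtl query -c sampling". The sample output below is hypothetical,
# for illustration only.

def has_event(query_output, event_name):
    """Return True if event_name appears on any line of the captured output."""
    return any(event_name in line for line in query_output.splitlines())

sample_output = """\
CPU_CLK_UNHALTED.CORE
CPU_CLK_UNHALTED.TOTAL_CYCLES
INST_RETIRED.ANY
"""

print(has_event(sample_output, "CPU_CLK_UNHALTED.TOTAL_CYCLES"))  # True
```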

Regards, Peter

Peter, Thanks.

I checked with Nehalem (Core i7): it has this event, and Core i7 is also populated with many sampling events which were not there in older Intel processors. I also see some SIMD-related sampling events.

Thanks to the Intel developer team who brought these sampling events to the Core i7 processor.

~BR
