Performance Optimization and Platform Monitoring

Introduction


This discussion covers the needs and implications that drive platform optimization and management. It describes opportunities to influence the ultimate performance of a computer system at the architectural, platform, and software levels, and explains the benefits that result.

What is Performance Optimization?


Performance optimization is the process of making a computer system work more efficiently or use fewer resources [i, ii]. Optimizations can take place at the architectural level, at the application level, while writing software code, and on a platform-by-platform basis.

Why do we want to optimize performance? There are many reasons including:

  • Get your results quicker,
  • Make systems more responsive,
  • Use less power (more efficient systems frequently do),
  • Get more bang for your buck (save money),
  • Sell more software (if you are a software company),
  • Sell more systems (if you are a hardware company).

So, to optimize performance, we need to monitor the computer platform and its resources.

What is Platform Monitoring?


Monitoring is ‘the act of observing something’. A platform is the hardware architecture and software framework (both operating system and application) that allows software to run [iii]. Platform Monitoring observes the hardware and software of a computer system.

Platform monitoring covers many areas, including:

  1. Operating System (OS) level data, such as data from Microsoft* perfmon counters and Linux* sysstat counters [iv],
  2. Power usage by the whole system or by components (the CPU, disks, or memory),
  3. Software application performance such as transactions per second for transaction oriented software or frames per second for games,
  4. Errors such as machine exceptions or memory errors,
  5. CPU level counters such as instructions retired, CPI (clockticks per instruction retired), and memory bandwidth.

To learn more, refer to manuals describing how to program the events in the CPU counters and documents describing the events and how to derive meaningful performance measures such as memory bandwidth.
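As a minimal sketch of what deriving such measures involves, the Python below computes CPI and an approximate memory bandwidth from raw counts. The function names, the sample counts, and the 64-byte transfer size are assumptions made for illustration; the actual events to program and the number of bytes each event represents depend on the processor and are given in its documentation.

```python
def cpi(clockticks, instructions_retired):
    """Clockticks per instruction retired (lower is generally better)."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


def memory_bandwidth_bytes_per_sec(transfers, elapsed_seconds, bytes_per_transfer=64):
    """Approximate bandwidth from a count of memory transfers.

    bytes_per_transfer=64 assumes one cache line per counted event.
    This is an assumption; check the documentation for the event you use.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return transfers * bytes_per_transfer / elapsed_seconds


# Made-up counts for illustration only, not measurements.
sample_cpi = cpi(clockticks=3_000_000, instructions_retired=2_000_000)
sample_bw = memory_bandwidth_bytes_per_sec(transfers=1_000_000, elapsed_seconds=0.5)
print(f"CPI = {sample_cpi:.2f}")
print(f"Bandwidth = {sample_bw / 1e6:.1f} MB/s")
```

Keeping each derived metric in its own small function makes it easy to track the same measure across tuning runs.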

Performance Optimization Basics


The Closed Loop Cycle
One of the important methodologies in performance optimization is the ‘Closed Loop Cycle’ (see figure 1).


The steps in the Closed Loop Cycle are:

  1. Gather performance data so that you can accurately describe the current performance.
    1. Select a performance metric (like transactions per second or frames per second) that is important to the workload you are running.
    2. Use this metric to track performance during each iteration of the cycle.
    3. The metric should be reproducible, that is, if you rerun the workload with no changes, you should get the same performance level within some tolerance.
    4. The performance level shouldn’t depend on how long the test runs: a longer or shorter run should give the same result.
  2. Analyze the performance data and identify performance issues.
    1. Sometimes when you make changes, you might make something worse that was previously fixed. Performance can be like a balloon that, when you push down on one part, another part expands. You have to check all the performance data, even sections of data that you thought you had ‘fixed’.
  3. Generate alternatives which may enhance the performance.
    1. There may be several possible enhancements. Pick one with a good pay back and try to keep it simple.
    2. Do the easy stuff first.
  4. Implement the enhancement.
  5. Rerun the workload and go to step 1.

An important aspect of performing the Closed Loop Cycle is to change only one parameter per cycle. If you change more than one parameter and the performance changes, you won’t know which change was responsible.
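The cycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: `run_workload` stands for whatever runs your benchmark and returns the chosen metric (higher is better), and the names `is_reproducible`, `measure`, `closed_loop`, and the 5% tolerance are hypothetical choices made for this example.

```python
import statistics


def is_reproducible(samples, tolerance=0.05):
    """True if every sample is within `tolerance` (a fraction) of the median."""
    median = statistics.median(samples)
    return all(abs(s - median) <= tolerance * median for s in samples)


def measure(run_workload, params, runs=3, tolerance=0.05):
    """Step 1: gather the metric several times and insist it is reproducible."""
    samples = [run_workload(params) for _ in range(runs)]
    if not is_reproducible(samples, tolerance):
        raise RuntimeError("metric is not reproducible; fix the test before tuning")
    return statistics.median(samples)


def closed_loop(run_workload, baseline_params, candidate_changes):
    """Apply one parameter change per cycle; keep it only if the metric improves."""
    params = dict(baseline_params)
    best = measure(run_workload, params)
    for name, value in candidate_changes:
        trial = dict(params)
        trial[name] = value  # exactly one parameter differs from the current best
        score = measure(run_workload, trial)
        if score > best:
            params, best = trial, score
    return params, best


def fake_metric(params):
    """Hypothetical stand-in for a benchmark; returns 'transactions per second'."""
    tps = 100.0
    if params["threads"] == 8:
        tps += 20.0
    if params["buffer_kb"] == 16:
        tps -= 5.0
    return tps


tuned, tps = closed_loop(
    fake_metric,
    {"threads": 4, "buffer_kb": 64},
    [("threads", 8), ("buffer_kb", 16)],
)
print(tuned, tps)
```

Because each trial differs from the current best configuration by a single parameter, any change in the metric can be attributed to that parameter.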

The Top-Down Tuning Methodology


What performance data should you gather? A tried-and-true tuning methodology is the top-down methodology. See figure 2.

Figure 2: Top-down Tuning Methodology


With the top-down methodology, you start with the slowest items (like network traffic and disk operations) at the system level and work your way down to the fastest operations (like memory bandwidth and events within the CPU like branches retired). You’ll need to collect data for each level in order to see whether items at that level are a bottleneck (slowing you down).

At the system level:

  1. On Microsoft Windows*, you can use tools like Microsoft perfmon.
  2. On Linux, you can use the sysstat package (for more data) or vmstat (for less data).
  3. Using the above data, you can see whether network traffic is slowing you down. If it is, get faster or more network cards, or reduce the amount of traffic.
  4. If disk traffic is slowing you down, get more or faster disks, or reduce the amount of disk traffic needed.
  5. System level bottlenecks are sometimes the easiest to fix and can provide the biggest speedup.
  6. Once you are satisfied with the system level performance, move on to application level performance.
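As a rough sketch of how system level data can be screened, the Python below parses text in the layout printed by the Linux vmstat command and flags two common symptoms. The sample text contains made-up numbers, and the function names and the 20% I/O wait threshold are assumptions for illustration rather than recommended values.

```python
def parse_vmstat(text):
    """Parse vmstat-style text into a list of {column_name: int} dicts."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # the line that names the columns
            columns = fields
        elif (columns and len(fields) == len(columns)
              and all(f.isdigit() for f in fields)):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def system_level_hints(rows, wait_threshold=20):
    """Return rough hints; the threshold is an illustrative assumption."""
    hints = []
    if not rows:
        return hints
    average_wait = sum(r.get("wa", 0) for r in rows) / len(rows)
    if average_wait >= wait_threshold:
        hints.append("high I/O wait: check disk traffic")
    if any(r.get("si", 0) or r.get("so", 0) for r in rows):
        hints.append("swapping: check memory usage")
    return hints


# Made-up sample in vmstat's layout, for illustration only.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  3      0 803480  80632 958204    0    0  9200  4100  910 1250  5  2 58 35  0
 0  4      0 801200  80640 958900    0    0  9800  3900  950 1310  4  3 53 40  0
"""

rows = parse_vmstat(SAMPLE)
hints = system_level_hints(rows)
print(hints)
```

The parser reads column names from the header line instead of hard-coding positions, since the set of columns can vary between vmstat versions.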

Application level performance:

  1. Microsoft perfmon and Linux sysstat (pidstat) data includes lots of info at the process and thread level. You can identify which process is using the disk or network, how many threads each process has, how much memory each process is using, etc.
  2. There are many other tools, such as APIMON.
  3. This is usually a good time to look at the CPU time for your application. Using a tool like Intel® VTune™ or the Linux utility OProfile, you can display the time spent in each module and function. Sometimes you’ll be surprised by what is actually running slowly. As you try to get your software to scale to more cores or handle bigger workloads, you may find algorithms in the code that work OK for light loads but not for heavy loads.
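To illustrate the idea of a per-function time breakdown, here is a small sketch that uses Python's built-in cProfile module in place of the tools named above. The workload and the functions `slow_lookup` and `fast_lookup` are hypothetical; they stand in for an algorithm that is acceptable for light loads but scales poorly.

```python
import cProfile
import io
import pstats


def slow_lookup(items, targets):
    """Hypothetical hot spot: membership in a list costs O(n) per lookup."""
    return sum(1 for t in targets if t in items)


def fast_lookup(items, targets):
    """Same result using a set, which scales better under heavy load."""
    item_set = set(items)
    return sum(1 for t in targets if t in item_set)


def workload():
    items = list(range(2000))
    targets = list(range(0, 4000, 2))
    return slow_lookup(items, targets), fast_lookup(items, targets)


def profile(func):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile(workload)
print(report)
```

In the report, the two functions return the same answer, but their cumulative times can be compared directly, which is the kind of evidence that points to an algorithm worth changing.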

Microarchitectural level performance:

  1. Intel® VTune™ (Windows or Linux), Intel® Performance Tuning Utility (Intel® PTU), and OProfile (Linux) can provide data at the microarchitecture level.
  2. Using the event data from tools like VTune™, you can view important info such as the memory bandwidth, mispredicted branches, cache misses, CPI, and many other performance metrics.
  3. Microarchitectural tuning is generally challenging. A good compiler can fix or avoid many of the problems; for the rest, you usually have to change the source code.

Summary


Within Intel, we use the Closed Loop methodology and the top-down tuning method. Intel engineers work with many different software packages to analyze the performance. Even though we are primarily concerned with microarchitectural performance, we still have to go through the system level and application level before looking at the microarchitectural level. This is because changes at the system level or application level can completely change the microarchitectural performance.

As computers try to make things simpler for the end-user, the hardware and software are becoming more complicated. Computer servers supporting clouds of users might have dozens or hundreds of cores. Computer processors are increasingly integrating graphics capabilities onto the processor. A smartphone processor may actually integrate the functions of dozens of different chips onto the same processor. Intel is creating complete ‘systems on a chip’ for cell phones. This wide variety of hardware and software will create new platform monitoring and performance optimization challenges. This website will be your source for Intel-specific platform monitoring and performance optimization assistance.

i http://en.wikipedia.org/wiki/Program_optimization
ii http://en.wikipedia.org/wiki/Performance_tuning
iii http://en.wikipedia.org/wiki/Computing_platform
iv http://sebastien.godard.pagesperso-orange.fr/
* Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see the Optimization Notice.