Measuring Performance of Applications on Virtualized Systems Under Test (SUTs)


Virtualization is now ubiquitous in data centers for consolidating multiple workloads onto a single platform. However, there are very few performance studies of ISV application workloads running virtualized. One goal of Intel’s virtualization program is to advise ISVs on how to carry out these performance studies. This white paper is one piece of collateral toward that aim.

In this paper, we outline some key concepts:

  • Introduction to the importance of virtualization for ISVs
  • Performance metrics for SUTs and potential issues with these under virtualization
  • Virtualization use cases and suggested test configurations
  • Techniques to measure virtualization overhead
  • Simple recommendations for ISV studies

Introduction: Why is Virtualization Important Today

The use of virtualization is growing rapidly in the computer industry today. Virtualization offers the opportunity for better server manageability, provisioning and cost. A number of use cases are growing rapidly in datacenter environments, the most popular being server consolidation, which involves moving multiple server workloads onto a single platform. Numerous virtual machine monitors (VMMs or hypervisors) from VMware, XenSource, Microsoft, and others are now widely available for Intel Architecture platforms to manage the virtual machines. In addition, Intel has added key hardware features [Intel VT reference] that make virtualization very attractive in terms of cost tradeoffs.

However, from a performance standpoint, many new issues can arise under virtualization. Although virtualization strives to keep overhead as low as possible through sophisticated software and hardware techniques, it does add some overhead to application performance. In this paper we specifically address how performance measurement methods might have to change to yield the most useful data when measuring a single application under virtualization.

In the consolidation case, the performance of each virtual machine may be significantly affected by the other virtual machines running on the same platform. Since each of the virtual machines can be running an entirely different operating system and workload, the overall performance behavior of consolidated scenarios will differ significantly from that of traditional server workloads that run a dedicated application on a platform. While such complex testing setups are beyond the scope of this paper, many of the techniques and methods outlined below are also appropriate for testing them.

Our key goal is to identify specific issues that can occur when applying classic computer performance measurement to an application that is running on a Guest OS over a Virtual Machine Monitor (VMM). We also suggest workarounds to these issues and list best known methods to make the testing process as painless as possible.

Performance Metrics for SUTs

This white paper assumes that you already have a reasonable performance methodology in place to measure and analyze your computer system performance.

Many good references exist for computer performance measurement, including:

  • R. Jain, The Art of Computer Systems Performance Analysis. J. Wiley, 1991
  • The Software Optimization Cookbook, Second Edition: High-Performance Recipes for IA-32 Platforms by Richard Gerber, Aart J.C. Bik, Kevin B. Smith and Xinmin Tian, Intel Press, 2005
  • Improving .NET Application Performance and Scalability, J.D. Meier, Srinath Vasireddy, Ashish Babbar, and Alex Mackman, Microsoft Corporation.

For virtualized performance measurement, there are fewer references available, but some very good ones do exist. In particular, the paper Redefining Server Performance Characterization for Virtualization Benchmarking first outlined many of the “issues” we talk about below. Some of the key virtualization issues that this paper walks through in detail include:

  • Time Drift: A Guest OS may not report accurate time
  • Availability & Usage of Performance Monitor tools is poor.
  • Multitude of configurations
  • Resource contention w/ other Virtual Machines
  • Consistency of results
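
As a rough illustration of the time-drift issue, the following Python sketch compares the elapsed time reported by a guest clock with the elapsed time reported by a trusted external clock (for example, the load-driver machine), and computes throughput against the external clock only. The function names and the sample readings are hypothetical and are not taken from the referenced paper.

```python
def drift_ratio(ref_start, ref_end, guest_start, guest_end):
    """Return guest-elapsed / reference-elapsed time (1.0 means no drift)."""
    ref_elapsed = ref_end - ref_start
    guest_elapsed = guest_end - guest_start
    if ref_elapsed <= 0:
        raise ValueError("reference interval must be positive")
    return guest_elapsed / ref_elapsed


def throughput_vs_reference(transactions, ref_start, ref_end):
    """Throughput computed against the external reference clock only."""
    return transactions / (ref_end - ref_start)


# Hypothetical readings in seconds: the guest clock ran slow during the run.
ratio = drift_ratio(0.0, 600.0, 0.0, 588.0)
tps = throughput_vs_reference(120000, 0.0, 600.0)
print(f"guest clock ran at {ratio:.1%} of reference speed")
print(f"throughput against reference clock: {tps:.1f} tx/s")
```

Timing the run from outside the SUT sidesteps the guest clock entirely; the drift ratio is then only a diagnostic that indicates how much guest-side timestamps can be trusted.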

Other Key References

General Performance Metrics

Your application’s performance metrics should be defined by the business usage and usage environment, and their definition should not vary between native and virtualized mode. Key metrics can be categorized into Throughput, Latency, Response Time, etc., and include such examples as:

  • Maximum Number of concurrent users
  • Maximum Number of transactions per second (throughput)
  • Minimum response time (latency)
  • Maximum response time (latency)
  • Average response time (latency)
  • Maximum number of dropped requests, packets, users, etc.
  • Minimum processor utilization in %
  • Maximum processor utilization in %
  • Processor utilization in %

In addition, sophisticated workloads often define a metric of achieving the maximum performance (users, transactions/sec) while keeping other metrics within reasonable criteria (for example, average user response time less than 2 seconds). Under virtualization, these Quality of Service (QoS) constraints can occasionally be violated in different performance ranges than under non-virtualized situations.
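
A minimal sketch of such a QoS-constrained metric is shown below: from a set of measured load points it picks the highest throughput whose average response time stays within the limit. The function name, the tuple layout, and all of the numbers are hypothetical illustration data, not measurements.

```python
def max_load_within_qos(samples, max_avg_response_s=2.0):
    """Return the highest-throughput sample that meets the response-time limit.

    samples: iterable of (users, throughput_tps, avg_response_s) tuples.
    Returns None if no sample meets the limit.
    """
    passing = [s for s in samples if s[2] <= max_avg_response_s]
    if not passing:
        return None
    return max(passing, key=lambda s: s[1])


# Hypothetical load points: (users, transactions/sec, average response time in s)
native = [(100, 95, 0.4), (200, 190, 0.9), (300, 280, 1.8), (400, 330, 3.1)]
virtualized = [(100, 93, 0.5), (200, 182, 1.3), (300, 255, 2.4), (400, 290, 4.0)]

print("native best within QoS:     ", max_load_within_qos(native))
print("virtualized best within QoS:", max_load_within_qos(virtualized))
```

In this made-up data the virtualized run violates the 2-second limit at a lower user count than the native run, which is the kind of shift in range described above.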

Special Issues with Performance Studies under Virtualization

Changes to the Workload Definition

One possible reaction to virtualization is to radically change your workload to match real end-user use cases, for instance, by mimicking consolidation configurations. See the vConsolidate example in the Appendix for some flavor of this change. However, these types of configurations become increasingly complex to capture in a workload definition.

Changes in the overall measurement methodology

Virtualization can offer, or can demand, new methods of performance assessment on your current workloads.  Refer to the example in the Appendix describing performance and utilization graphs.  In the past, you may have only been chartered to look at high watermark levels of your workload. Now it might be very useful to look at your graduated workload along points of varying CPU utilization.  This data directly supports consolidation use cases.

On the plus side, building a single VM image with your test setup pre-defined may be of some logistic help in testing “the same setup” on different platforms – now you can avoid full re-installation of your software stack on test systems. At the very least, virtualization can offer new ways of managing your setups.

Virtualization offers a multitude of new configurations and possible scenarios for you to test

There may be a number of new configuration types you could add to your base test suite.  You may need to vary key parameters associated with the VMM, such as:

  • Number of Virtual Machines
  • Configuration of individual VMs (Processors, Memory, NIC, HBA, HDD)

In addition, setting key affinity properties (Intel® VT-d device assignments of network cards, disks, etc. to VMs) as well as the CPU affinity of VMs may have large impacts on performance outcomes.
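
To get a feel for how quickly the configuration space grows, the parameters above can be enumerated programmatically. The sketch below is only an illustration: the parameter values and the platform limits are hypothetical, and the filter simply drops combinations that would oversubscribe cores or memory.

```python
from itertools import product

# Hypothetical platform limits and parameter ranges for illustration only.
PHYSICAL_CORES = 4
PHYSICAL_MEMORY_GB = 8
VM_COUNTS = [1, 2, 4]
VCPUS_PER_VM = [1, 2]
MEMORY_GB_PER_VM = [2, 4]


def build_test_matrix():
    """List every VM configuration that fits within the physical platform."""
    configs = []
    for vms, vcpus, mem_gb in product(VM_COUNTS, VCPUS_PER_VM, MEMORY_GB_PER_VM):
        fits_cpu = vms * vcpus <= PHYSICAL_CORES
        fits_memory = vms * mem_gb <= PHYSICAL_MEMORY_GB
        if fits_cpu and fits_memory:
            configs.append({"vms": vms, "vcpus": vcpus, "memory_gb": mem_gb})
    return configs


matrix = build_test_matrix()
print(f"{len(matrix)} configurations to test")
for cfg in matrix:
    print(cfg)
```

Even with three small parameter lists the matrix has nine entries; adding NIC, HBA, disk, and affinity choices multiplies it further, which is why a first study should pin most of these to fixed values.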

Virtualization brings another layer of setup

Even if you don't "test" alternatives to the VMM configuration, as above, you at least need to set and maintain these new parameters and options in your testing tools. These “best known settings” should be passed on to your customers.

Achieving highwater performance results under virtualization might require different tunings from native

Besides the normal procedures and options to set application parameters, there are additional parameters to consider in tuning.

Likewise, the application parameters themselves may need to vary, given the perturbations (say, in the timing or latency of I/O requests) that may now be hidden beneath the Guest OS.

Resource Utilization Measurements might be problematic

Tools used in the Guest OS (the Windows perfmon CPU utilization graph, Linux sar) may not give an accurate picture of the full resource utilization. This is especially true with multiple VMs running at the same time on the same platform.

Some VMMs provide ways to view the full platform resource (CPU, memory, etc.) usage. For instance, Xen has an oprofile port called "xenoprof", which allows accurate measurement of the CPU use of everything on the system.
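
One guest-side cross-check, offered here as a sketch under stated assumptions rather than a replacement for VMM-level tools, is to look at "steal" time on Linux guests whose kernel and hypervisor report it. The aggregate cpu line of /proc/stat lists tick counters in the order user, nice, system, idle, iowait, irq, softirq, steal; the two sample lines below are hypothetical.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(v) for v in fields[1:]]
    # Pad so that kernels that omit later fields still yield 8 counters.
    return ticks + [0] * max(0, 8 - len(ticks))


def busy_and_steal_percent(before, after):
    """Return (busy %, steal %) for the interval between two samples."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    delta = [y - x for x, y in zip(b, a)]
    # Guest time is already counted inside user/nice, so sum only 8 fields.
    total = sum(delta[:8])
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle = delta[3] + delta[4]  # idle + iowait
    steal = delta[7]
    return 100.0 * (total - idle - steal) / total, 100.0 * steal / total


# Hypothetical samples taken some seconds apart inside a guest.
sample_1 = "cpu 1000 0 500 8000 100 0 0 400 0 0"
sample_2 = "cpu 1600 0 700 8500 100 0 0 700 0 0"
busy, steal = busy_and_steal_percent(sample_1, sample_2)
print(f"busy {busy:.2f}%  steal {steal:.2f}%")
```

A non-trivial steal percentage indicates that the guest wanted CPU time it did not receive, which is a hint that guest-reported utilization understates contention on the platform.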

VMM design issues or per VM constraints may limit performance

For instance, many VMMs allow each VM to use only a subset of the full set of hardware resources (say, cores) on the platform. Thus SMP scaling may be poor or not supported.

Memory scaling may be limited; the actual memory needed may be slightly higher than normal to account for VMM usage.

New “performance” bottlenecks (perhaps due to latency increases) may arise in the application, due to issues with VM entries/exits and I/O virtualization.

Virtualized Devices may offer less performance / features than native access

If your application deals directly with hardware or talks directly to specialized hardware drivers, then these features may not be available directly in the Guest OS. Your driver or hardware may need to be virtualized by the VMM.

Virtualization Use Cases and Recommended SUT Configurations

Application Virtualization

The Application can be run – using one virtual machine – in a virtualized environment to achieve nearly equivalent performance to non-virtualized execution.

SUT Configurations

This is simple. Add a VMM layer hosting the current OS as the Guest OS.  Test as usual. Report maximum / highwater performance and compare to native performance on the same platform. Add measurements at varying load levels (CPU utilizations) as appropriate.  See the next section on “Virtualization Overhead” for more details.

Application Scaling

The Application can be replicated – using more than one virtual machine – in a virtualized environment on a platform to achieve performance scalability that the Application, OS and Platform cannot achieve natively. This use case is especially applicable to applications whose performance is difficult to scale because they are not threaded or because of coherency issues on large SMP configurations.

SUT Configurations

Start simply with one VM and then scale to 2 VMs, etc., to determine scalability. The goal is to fully consume the first constraining resource (CPU or I/O bandwidth) on the platform. Configure each VM to match the single native OS image, and set the VMM allocations to match the native need for each resource. Report maximum scaling and scaling per VM. Also, report the maximum native performance achievable on the same platform.
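
The reporting described above reduces to a few ratios. The sketch below computes, for each VM count, the speedup over one VM, the scaling efficiency per VM, and the ratio to peak native performance. The function name and all throughput figures are hypothetical.

```python
def scaling_report(native_peak, throughput_by_vm_count):
    """Return rows of (vms, total, speedup_vs_1vm, efficiency_per_vm, vs_native).

    throughput_by_vm_count: dict mapping number of VMs -> aggregate throughput.
    """
    base = throughput_by_vm_count[1]
    rows = []
    for vms in sorted(throughput_by_vm_count):
        total = throughput_by_vm_count[vms]
        speedup = total / base
        rows.append((vms, total, speedup, speedup / vms, total / native_peak))
    return rows


# Hypothetical aggregate throughput (transactions/sec) on one platform.
rows = scaling_report(native_peak=1000, throughput_by_vm_count={1: 900, 2: 1700, 3: 2300})
for vms, total, speedup, efficiency, vs_native in rows:
    print(f"{vms} VM(s): {total} tps, speedup {speedup:.2f}x, "
          f"efficiency {efficiency:.0%}, {vs_native:.2f}x native peak")
```

A falling per-VM efficiency as VMs are added is the signal that the first constraining resource is being approached.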

Tier Consolidation

Application components running on separate systems (each as one of N systems in a Tier or each as a Tier) can be mapped to, and deployed in, multiple virtual machines running on fewer systems (N-1 tiers). This may or may not also translate into a performance improvement. For example, this use case could be used to lower infrastructure support costs by mapping web/application and database server components running separately on two physical systems (2 tiers) to two separate virtual machines running on a single physical system (1 tier).

SUT Configurations

Here the best configuration to choose will be driven by the application tiers and their definition. A simple solution is to virtualize each application tier instance (those that would run on a separate physical system) and consolidate them all onto a single platform. This may well be a “worst case” test, as certain components (say, the database layer) might not be virtualized in a real end-user installation. In this case, it is hard to get “apples to apples” hardware setups between the native and virtualized configurations. Report the native performance together with the number and type of machines required, and the virtualized performance on the consolidation platform.

Server Consolidation

The most ubiquitous use case for servers in the enterprise is "server consolidation". We will define that here as: redeploying a set of running applications from a set of servers to a smaller subset of servers by encapsulating each server into its own VM and hosting the VM on the target subset of platforms. Of course, applying this in a real situation assumes that the current servers are severely underutilized, but that appears to be a common situation in many enterprises today.

SUT Configurations

Testing server consolidation, in a real world usage scenario, would involve loading a variety of applications, with varying temporal load.  This is often too difficult and time consuming to feasibly do in a virtualization study, and so is not included in the Intel® Technology Enabling Program (TEP) for Virtualization.

Other Enterprise IT-Oriented Use Cases

IT-oriented use cases are specific to IT’s needs to manage its enterprise. We mention these types of use cases only briefly here for completeness; this kind of testing effort is beyond the scope of this white paper. These could include:

  • Failover, Backup,  & Disaster Recovery
  • QA & Test Servers Consolidation
  • Patch management
  • Load Balancing
  • Etc.

For each, performance metrics (time to failover, time to release patches, etc.,) can be defined.

Other ISV Product-Oriented Use Cases

Likewise, we shouldn’t forget about the potential positive impact that new uses of virtualization can bring to the direct business process undertaken at an ISV. 

Ideas such as using Virtual Appliances to simplify the delivery of application functionality to end customers may entail large changes to product QA processes. Again, we mention these here only for completeness, as we don’t expect these to have general SUT testing plans.

Measuring “Virtualization Overhead”

While many of the above use cases are “end-user” oriented, potentially against a large system landscape “sprawl” of servers and applications, good engineering data can be obtained by assessing a single application on a single server. This is the simplest and fastest way to get data on the performance impact of having end users run your application virtualized.

Comparing performance under a single VM against native performance on the same platform is a subset of the “Application Virtualization” use case. Many of the first published studies, driven by the VMM vendors, took this form and used it to measure the “cost of virtualization”, that is, the VMM/VM overhead. This overhead shows directly the extra hardware and software cost that virtualization imposes on performance. It is also one of the simplest measurement types to employ.

An additional criterion for performance studies of this type is to maintain “similarly configured” setups between native and virtualized modes. Under this “similarly configured” system criterion you should attempt to have:

  • The number of Virtual CPUs = number of Physical CPUs in native mode and
  • The physical memory assigned to Guest OS = Physical memory in native mode
  • All other important resource constraints (IO bandwidth) equalized also

To do this, you may have to remove or limit the amount of memory and/or the number of processors in the native case, to get these direct comparisons.
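
As a minimal sketch, the overhead can be expressed as the relative loss in a metric between the native and virtualized runs, after first checking that the two setups are similarly configured. The function names, dictionary keys, and numbers below are hypothetical; the formula is one common way to express overhead, not a definition mandated by this paper.

```python
def config_mismatches(native, guest, keys=("cpus", "memory_gb")):
    """Return the resource keys on which native and guest setups differ."""
    return [k for k in keys if native[k] != guest[k]]


def virtualization_overhead(native_metric, virtualized_metric, higher_is_better=True):
    """Relative overhead as a fraction (0.08 means 8% worse than native).

    Use higher_is_better=True for throughput-style metrics and False for
    elapsed-time or latency-style metrics.
    """
    if higher_is_better:
        return 1.0 - virtualized_metric / native_metric
    return virtualized_metric / native_metric - 1.0


# Hypothetical setups and results.
native_setup = {"cpus": 4, "memory_gb": 8}
guest_setup = {"cpus": 4, "memory_gb": 8}
assert not config_mismatches(native_setup, guest_setup)

print(f"throughput overhead:   {virtualization_overhead(1000, 920):.1%}")
print(f"elapsed-time overhead: {virtualization_overhead(300, 330, higher_is_better=False):.1%}")
```

The elapsed-time form is the one that applies to non-graduated workloads, where wall clock duration is the only metric available.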

Recommendations for ISV Studies

  • In the first round experiments, do not change your workloads or make major changes to your measurement methodology.
  • Reuse current “test methods” and “performance measurement” harnesses
  • Use the KISS principle in your experimental design – get simple single machine measurement of “virtualization overhead” as your first engineering goal
  • Start with one VMM (Viridian, VMware ESX, etc.)
  • Start by using only one GuestOS/ VM rather than a sprawl of them
  • Keep to a minimum any re-tunings of your application in virtualized mode
  • Start tuning VMM settings with a non-“high-water” level of performance, say 50% utilization, if you have a graduated workload.
  • Do NOT oversubscribe any VM resources (especially memory; oversubscribed memory can be devastating to performance)
  • Use Domain0 or VMM tools to measure full system load, especially if you have multiple Guest OSs
  • Report and maintain good application-internal timing / performance data (say, how long you hold locks, if your code does lots of locking). This can serve as a great performance diagnostic, whether in native or virtualized mode
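
The last recommendation, recording application-internal timings such as lock hold times, might look like the following sketch. The helper name timed_lock and the lock name "order_table" are hypothetical. Keep in mind that timers read inside a guest are themselves subject to the time-drift issue noted earlier.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

# Hold-time samples in seconds, keyed by a caller-chosen lock name.
hold_times = defaultdict(list)


@contextmanager
def timed_lock(lock, name):
    """Acquire `lock`, run the body, and record how long the lock was held."""
    lock.acquire()
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record while still holding the lock so writers to this name serialize.
        hold_times[name].append(time.perf_counter() - start)
        lock.release()


order_lock = threading.Lock()
for _ in range(3):
    with timed_lock(order_lock, "order_table"):
        time.sleep(0.01)  # stand-in for real work done under the lock

samples = hold_times["order_table"]
print(f"order_table: n={len(samples)}, max hold {max(samples) * 1000:.1f} ms")
```

Comparing the distribution of these samples between native and virtualized runs can point to where added latency is being absorbed.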

A simple flow for verification of virtualization is given in a separate white paper and the step by step flow it suggests is also appropriate for performance studies.

Please also see additional details on filling out the Statement of Work and final report for Intel’s Technology Enabling Program (TEP) for Virtualization. 

Appendix: Benchmarking Concepts

Workload versus Benchmark

We use these terms interchangeably in the white paper.

A workload is a set of jobs or tasks to be done by a computer system under test which represents the actions and tasks that an end user would place upon the software and hardware.  It defines a performance metric which allows comparisons between SUTs.

A benchmark is a workload that has been standardized to allow industry-wide comparisons among platforms. Any benchmark that has been in existence long enough to become "industry-standard" has been extensively studied and may also have a standards controlling body. In the virtualization space, some benchmarks include vConsolidate (vCon) and VMmark. SPEC has a working committee to help define a standard benchmark for the virtualization area.

High Watermark workload result

A High Watermark refers to the “best” performance metric result that can be (or has been) measured over all possible configuration tunings on a given physical SUT.

Virtualization brings additional tuning parameters that the tester must worry about, and it might also require new settings for existing application tuning parameters to achieve the best high watermark.

Graduated vs. Non-Graduated Workload

A graduated workload is one in which the tester can set the amount of work done per unit of time on each run (e.g., by driving the SUT with different loads), so that a pre-set target performance level or a target CPU utilization is achieved.

A graduated workload allows more thorough performance testing and testing at varying loads. This is especially important for replicating virtualization consolidation use cases, which are most appropriate when individual applications are not fully utilizing their current host platforms.

For non-graduated workloads, often only the total wall clock time (duration) that the benchmark takes to execute is measured. If a non-graduated workload is CPU-bound (as opposed to I/O-bound), it often drives the system to 100% CPU while it is running (during its high-load phase). Thus for non-graduated workloads, your only alternative for virtualization measurements is to report the elapsed time of the benchmark.

High-load phase of a workload

The high-load phase of a workload is that period of the workload’s execution during which performance is designed to be measured. In a graduated workload, there are generally “ramp up” and “ramp down” periods preceding and following the high-load phase, which allow the benchmark to reach a steady state.

One issue with the high-load phase definition under virtualization is that, on any particular execution run, the phase may start or stop sooner or later than expected from the native mode run. Generally, most workload tools will properly account for this.
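
If your harness does not already do this, one simple way to isolate the high-load phase is to trim fixed ramp-up and ramp-down intervals measured from the actual first and last samples of each run, rather than from fixed wall clock times. The sketch below assumes per-interval throughput samples; the function name and data are hypothetical.

```python
def high_load_mean(samples, ramp_up_s, ramp_down_s):
    """Mean of sample values inside the high-load window of one run.

    samples: list of (timestamp_s, value) pairs in time order.
    The window is anchored to this run's own first and last timestamps.
    """
    start = samples[0][0] + ramp_up_s
    end = samples[-1][0] - ramp_down_s
    window = [value for t, value in samples if start <= t <= end]
    if not window:
        raise ValueError("ramp periods leave no high-load samples")
    return sum(window) / len(window)


# Hypothetical per-second throughput for a short run.
run = [(0, 10), (1, 50), (2, 100), (3, 102), (4, 98), (5, 100), (6, 40), (7, 5)]
print(f"steady-state throughput: {high_load_mean(run, ramp_up_s=2, ramp_down_s=2):.1f}")
```

Anchoring the window to each run's own timestamps keeps native and virtualized runs comparable even when the virtualized run starts or finishes at a different time.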

Appendix: vConsolidate: An example of a consolidation benchmark definition

Intel has proposed a complete Enterprise virtualization server consolidation workload (vConsolidate). This benchmark definition can be used to give you some of the details required to define consolidation-type benchmarks.

vConsolidate consists of a compute-intensive workload, a web server, a mail server and a database application running simultaneously on a single platform (see Figure).

Each of these workloads runs in its own VM. To emulate a real-world environment, an idle VM is added to the mix, since datacenters are not fully utilized all the time. The compute-intensive VM runs SPECjbb. Typically SPECjbb is a CPU-intensive workload that consumes as much CPU as it possibly can. However, in this environment (vCon), SPECjbb has been modified to consume roughly 75% of the CPU by inserting random sleeps every few milliseconds. This is to represent workloads that are more realistic. The database VM runs Sysbench, an OLTP workload running transactions against a MySQL database. The web server VM runs Webbench, which uses the Apache web server. The mail VM is a Microsoft Exchange workload that runs transactions on Outlook with 500 users logged in simultaneously. A configuration as described above, with 5 VMs running the different workloads, comprises a Consolidated Stack Unit (CSU). The diagram in the Figure above represents a 1 CSU configuration.
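
The idea of throttling a CPU-bound workload to a target utilization with randomized sleeps can be sketched as follows. This is not the actual vConsolidate or SPECjbb modification, whose details are not given here; it only shows the duty-cycle arithmetic, and the function names and the 3 ms busy interval are hypothetical.

```python
import random


def mean_sleep_for_target(busy_s, target_utilization):
    """Sleep time that makes busy / (busy + sleep) equal the target."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return busy_s * (1.0 - target_utilization) / target_utilization


def random_sleep(busy_s, target_utilization, jitter=0.5, rng=random):
    """Draw a sleep uniformly around the mean so the average hits the target."""
    mean = mean_sleep_for_target(busy_s, target_utilization)
    return rng.uniform(mean * (1.0 - jitter), mean * (1.0 + jitter))


# For 3 ms of work at a 75% target, the mean sleep works out to about 1 ms.
mean = mean_sleep_for_target(0.003, 0.75)
print(f"mean sleep: {mean * 1000:.2f} ms")
print(f"one random draw: {random_sleep(0.003, 0.75) * 1000:.2f} ms")
```

A real workload would call time.sleep with the drawn value after each busy interval; actual utilization will deviate somewhat because of timer granularity, particularly inside a guest.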

Appendix: Example: Virtualization’s Potential to Impact Platform Sizing Studies: Platform Utilization vs. Performance Graphs for a Single Application

In the server consolidation use case, an end user reduces the number of platforms they have deployed by shrinking their sprawl of machines and removing idle time from the machines that remain. There is an implied tradeoff in this: resource utilization on a single machine will increase by more than the sum of the individual native workloads placed upon it. This is because even single-application virtualization can by itself increase the amount of machine resources required to achieve a given performance level while meeting the constraining Quality-of-Service goals. This does not take into consideration any negative interaction between VMs on the same machine, which may require assigning increased resources (memory, CPU time, etc.) to resolve.

This tradeoff can drive the definition of an individual application’s scaling study.  A key metric to help size workloads is the Performance (Load, Trans/sec, Users ...) against platform Utilization (CPU). This must be measured for a few data points for both the virtualized and native, non-virtualized workload, on the same platform.  Once it is measured, detailed estimates of consolidations and platform sizing and configurations can be made.

As an example experimental methodology, the data one might choose to gather would represent single-application scaling and is shown in the following figure:

The data was generated (hypothetically) by an engineer working on her application with a graduated workload.

  • She developed a tuned benchmark setup in native mode,
  • She took native performance measurements at 10, 20, 30, 40 and 50% CPU utilization
  • She tuned the virtualized setup and loaded the platform near 90% utilization. In this example, three VMs were required to raise the application scaling to this level (if you can do this with 1 VM running your workload, that is best, but if you need multiple VMs to achieve application scaling, then use the minimum number of VMs)
  • She took measurements at 10% thru 90% utilization configurations in steps of 10% in virtualized mode.

Now, with this engineering sizing study in hand, various tradeoff calculations can be made. For instance, suppose you want to consolidate two servers running this application into a single server. Suppose you currently run two servers at 20% utilization (giving an average performance need of 200+200=400 units). To meet this average performance on a single virtualized server (of the same type), from the data above, the user must run at least 80% utilization and run 3 VMs.
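
The same calculation can be sketched in code by interpolating the measured performance-versus-utilization curves. Because the figure is not reproduced here, the data points below are invented so that they agree with the worked example (200 units at 20% native; 400 units at 80% virtualized); they are not the original data, and the function names are hypothetical.

```python
# (CPU utilization %, performance units); invented to match the example above.
NATIVE = [(10, 100), (20, 200), (30, 300), (40, 400), (50, 500)]
VIRTUALIZED = [(10, 50), (20, 100), (30, 150), (40, 200), (50, 250),
               (60, 300), (70, 350), (80, 400), (90, 450)]


def perf_at(curve, utilization):
    """Linearly interpolate performance at a given utilization."""
    for (u0, p0), (u1, p1) in zip(curve, curve[1:]):
        if u0 <= utilization <= u1:
            return p0 + (p1 - p0) * (utilization - u0) / (u1 - u0)
    raise ValueError("utilization outside the measured range")


def utilization_needed(curve, target_perf):
    """Lowest utilization that delivers target_perf, or None if unreachable."""
    for (u0, p0), (u1, p1) in zip(curve, curve[1:]):
        if p0 <= target_perf <= p1 and p1 > p0:
            return u0 + (u1 - u0) * (target_perf - p0) / (p1 - p0)
    return None


# Two native servers at 20% utilization are to be consolidated onto one host.
required = 2 * perf_at(NATIVE, 20)
needed = utilization_needed(VIRTUALIZED, required)
print(f"required performance: {required:.0f} units")
print(f"virtualized utilization needed: {needed:.0f}%")
```

A result of None means the target lies outside the measured virtualized range, in which case the consolidation cannot be sized from this data.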
