Measuring Application Performance on Multi-core Hardware


Introduction

On the surface, benchmarking applications in a multi-core environment appears very similar to benchmarking applications in a single-core environment. In reality it is more complex, in particular because careful analysis of benchmark results on multi-core systems can reveal performance issues that must be fixed to attain the desired performance.

This paper is primarily intended for engineers who are benchmarking applications on multi-core systems, and for those using benchmark results to tune the performance of threaded applications. It may also be good background for those interpreting the output from such teams. This paper will not help those who wish to make purchasing decisions based on benchmark results. After completing this paper, the reader should possess a basic understanding of how to gather meaningful benchmark data and tune applications on multi-core systems.


Prerequisites

Before benchmarking your application on a multi-core system, we need to cover some basics. First, we need to make sure we understand the characteristics of the application being measured. Then we need to make sure we have workloads that actually give us the performance data we need to collect.


Definitions

When talking about multi-core systems, a number of terms come up that need to be defined:

  • Concurrency. This is computing in which multiple tasks are active at the same time, but may not be executing at the same time. Concurrency is used in modern operating systems to hide latencies, to provide high availability when combined with redundancy, etc.
  • Threads. Operating-system constructs that allow a single process to execute more than one stream of instructions concurrently. The application developer usually needs to explicitly code the use of threads into an application.
  • Parallel Computing. The execution of multiple threads on multiple processing units. The purpose is to decrease program runtimes, which is one of the key benefits of multi-core systems.
  • Throughput Parallel Computing. In this type of parallel computing, the goal is to improve the aggregate performance of a collection of jobs or tasks. This is usually implemented using software that monitors one or more machines, and assigns work to these machines from a job queue when any of them has free resources. Such software allows multiple jobs in the collection to be executed in parallel, which usually takes less time than if they were executed serially (one after another).
  • Turnaround (latency-sensitive) Parallel Computing. For this type of parallel computing, the goal is to decrease the runtime of a single task or job by causing some of its operations to complete in parallel.
  • Serial Code. Portions of an application where only a single thread executes. This may be a portion of the code that cannot be threaded, due to algorithmic restrictions, within a multithreaded application. As we will see below, serial execution ultimately limits the scaling of a multithreaded application as the number of cores increases.

 


Questions to Ask

Here are some important questions to ask before attempting to benchmark an application:

Does your application make use of threading?

This is an important question to answer. Currently, most applications are not threaded. If the application is not threaded, then the performance of a single instance of that application will not change no matter how many cores are in the system (all other things being equal – clock speed, CPU architecture, memory subsystem, etc.).

In this case, the benefit of a multi-core system is not one of increasing single-application performance, but rather how many serial applications can run on the system at once without interference. Those other applications can be additional copies of the application under test, or they can be entirely different processes (virus scans, music playback, printing, etc. – in other words, this is an example of “throughput parallel computing”).  Performance analysis under such conditions will typically try to identify cases where the operation of one serial task interferes with the operation of another serial task.

If the application is multithreaded, then, in theory (all other things being equal), the performance of the application will increase (its runtime will decrease) as the number of cores increases in a system (this is the idea behind “turnaround parallel computing”). How much the performance of the application actually increases will depend on what else is running on the system, how effectively the application was threaded, the percentage of threaded code that is exercised by the benchmarking workloads, and the nature of the hardware platform.

The rest of this paper will assume that your application is multithreaded.

How many threads are used by your application?

If your application is multithreaded, it makes sense to periodically review the implementation of that parallelism. One very important question to ask is whether the application is designed to make use of all cores in a multi-core system, or whether it caps the number of threads it creates to some necessary or artificial limit. Knowing if such a limit exists in your code will help you predict the scaling of your application as multi-core systems come out with more and more cores, and can help you identify areas in your application where the threading implementation may need to be reviewed.
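As a minimal sketch of the question above, the following Python snippet shows one way an application might choose its thread count from the number of cores the system reports. The function name `choose_thread_count` and the `hard_cap` parameter are hypothetical and stand in for whatever limit your own code applies.

```python
import os


def choose_thread_count(hard_cap=None):
    """Return the number of worker threads to create.

    hard_cap is a hypothetical application-level limit; None means no limit.
    """
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    if hard_cap is None:
        return cores
    return max(1, min(cores, hard_cap))


if __name__ == "__main__":
    print("logical cores reported:", os.cpu_count())
    print("threads, uncapped:", choose_thread_count())
    print("threads, capped at 4:", choose_thread_count(hard_cap=4))
```

If a cap like `hard_cap=4` exists in your code, scaling will stop improving beyond four cores regardless of how well the threaded regions perform.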

Knowing performance on a single-core platform, what is the expected performance on a system with N cores?

This is the key question. The benchmarking process below will help you answer this question, but you should try to predict it in advance, using what you know about the architecture of your application and rough predictive mechanisms like Amdahl’s Law. (Amdahl’s Law, discussed below, predicts how the performance of a parallel application changes as the number of cores increases, based on the amount of serial and parallel work in that application.) Mismatches between measured parallel application performance and predicted performance are a red flag that should be investigated. Application analysis is required to discover the causes of poor application scaling with increasing numbers of cores.

When will the work complete on a system with 2N cores?

This question is similar to the previous one, with one important distinction. If your customer is running on a system with N cores, you should use the data you have at hand to predict, and then actually test, the performance of your application on a system with twice as many cores. The purpose of this experiment is to allow you to identify and fix any scaling problems in your application before your customer discovers them on upgrading to new hardware with 2N cores. Above all other things, you do not want the performance of your application to decrease when your customer upgrades from a system with N cores to one with 2N cores.

Thus, you should always benchmark your multithreaded application (if possible) on systems with more cores than your typical customer configuration. As discussed below, this involves testing the performance of your application with 1, 2, 4, ..., 2N active cores in order to get an accurate picture of how your application scales.

Does the application force the use of specific system resources, or does it adapt to the system it is run on?

For example, if your application has been threaded, it is possible that your development team used a method called “processor affinity” to make sure that each thread of your application always runs on a specific core, rather than whichever core is free as scheduled by the operating system. One reason to do this is to attempt to ensure that data critical to the thread stays in a given core’s cache, which can improve performance. Why bring this topic up? Because you need to make sure that your developers have made your application adaptive, so that it accounts for the number of cores, processor architecture, etc., of a given hardware platform. Incorrect assumptions can result in unexpected benchmark performance and possibly even performance decreases.
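The following Python sketch illustrates the "adaptive" approach: it asks the system which cores are available instead of hard-coding core numbers. It relies on `os.sched_getaffinity`, which exists only on some platforms (Linux, for example), and falls back to `os.cpu_count()` elsewhere. The helper `plan_pinning` is hypothetical; it only computes a thread-to-core plan and does not pin anything.

```python
import os


def usable_cores():
    """Return the number of cores this process is allowed to run on."""
    # os.sched_getaffinity is not available on every platform.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def plan_pinning(num_threads, cores):
    """Map thread index -> core id, wrapping around when threads > cores."""
    cores = sorted(cores)
    return {t: cores[t % len(cores)] for t in range(num_threads)}


if __name__ == "__main__":
    n = usable_cores()
    print("usable cores:", n)
    print("pinning plan:", plan_pinning(n, range(n)))
```

A plan built this way follows the hardware it runs on, whereas a table of fixed core numbers written for one machine can oversubscribe some cores and leave others idle on another.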


Benchmark workloads

Critical to the success of any performance measuring effort is the use of good workloads in your benchmark. These workloads must be:

  1. Representative. Make sure your workloads contain data sets that mirror actual customer use of your application (e.g., if your customer typically operates on 100MB of data during a particular task, don’t create a workload that tests the same task with only a 5MB data set). Testing with non-representative data sets (both in terms of amount and content) can lead you to incorrect conclusions about what parts of your code are critical to the overall performance of your application.
  2. Thorough. The workloads must cover all major application usage models and functional areas in approximately the percentage seen during typical customer use. It is important to have benchmark workloads that measure performance in all areas important to the customer, especially when you start measuring the effects of changes you hope will improve performance (you don’t want to miss code changes that speed up one workload, but slow down another).
  3. Repeatable. A benchmark workload is pretty much meaningless if it is not repeatable. This means that it should complete in the same amount of time when run over and over again on the same system, and that it should exercise the same application functionality in the same way every time it runs. This can be challenging to do, especially with multi-tier applications, since it may involve rolling a database back to a specific state every time the workload is run, or require that a server application be driven by a networked client with the same data over and over again. You may need to use automation software to benchmark complex systems in a repeatable manner.

 

For unit testing, your benchmark workloads need only to be repeatable. For any sort of overall performance measurement, analysis, and tuning, they need to be representative and thorough as well. This holds for any benchmarking activities, no matter what the nature of the hardware (multi-core, single-core) or application (threaded, multi-tier, serial).

Benchmark workloads should generate some sort of performance metric (FLOPS, MIPS, etc.). Short of that, their runtime should be accurately measured from start to completion by some automated mechanism.
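A minimal sketch of such an automated timing mechanism, written in Python, might look like the following. It runs a workload several times and reports the mean runtime together with the spread between the fastest and slowest run, which gives a rough check on repeatability. The names `time_workload` and `example_workload` are hypothetical, and the example workload is only a stand-in for a real benchmark.

```python
import statistics
import time


def time_workload(workload, repeats=5):
    """Run workload() repeats times; return (mean_seconds, relative_spread)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    # Spread is (slowest - fastest) as a fraction of the mean runtime.
    spread = (max(samples) - min(samples)) / mean if mean > 0 else 0.0
    return mean, spread


def example_workload():
    # Hypothetical stand-in for a real benchmark workload.
    sum(i * i for i in range(200_000))


if __name__ == "__main__":
    mean, spread = time_workload(example_workload)
    print(f"mean {mean:.4f} s, spread {spread:.1%} of mean")
```

A large spread suggests the workload is not yet repeatable enough to support conclusions about scaling.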


Benchmarking

When benchmarking applications on multi-core systems, you need to pay attention not only to the overall performance of the application, but also to how the performance of the application changes as the number of cores increases. This latter information will help you predict performance on a similar system with more cores and help you identify areas where performance tuning might be necessary. Getting the best performance out of multithreaded code on a multi-core system is much more complicated than simple performance optimization of a serial application on a single-core system. Understanding how the performance of an application changes as the number of cores changes also makes for more accurate comparisons of performance on two different systems.

Serial Baseline
If your application is threaded, it is very valuable to run your benchmark workloads with the application running on a single core. If at all possible, this should be done on a multi-core system where the operating system has been booted to make use of only one core. On Windows this is done by rebooting the OS with “/NUMPROC=1” in the BOOT.INI file; on Linux, by booting with the “maxcpus=1” boot parameter. Both of these require that you installed your operating system with an SMP kernel at the start.

The data from these single-core baselines is used to calculate how the performance of your application changes as the number of cores increases. This, in turn, will help you identify and fix parallel performance issues in your application.

Gather Scaling Data
Now that you have the single-core baseline, re-run your benchmark workloads on the test system, increasing the number of active cores by powers of two (2, 4, 8, …) until you reach the maximum number of cores in your system (ideally 2 times the typical customer configuration). The performance data generated will tell you how much runtime has decreased on this platform by implementing parallel computing in your application, and how the application runtime changes as the number of cores increases.
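The sequence of core counts to test can be generated with a few lines of Python. This is a sketch under one assumption not stated above: if the maximum core count is not a power of two, it is still included as the final test point. The function name `core_counts_to_test` is hypothetical.

```python
def core_counts_to_test(max_cores):
    """Return 1, 2, 4, ... up to max_cores, always ending with max_cores."""
    if max_cores < 1:
        raise ValueError("max_cores must be at least 1")
    counts = []
    n = 1
    while n < max_cores:
        counts.append(n)
        n *= 2
    counts.append(max_cores)  # assumption: always test the full machine
    return counts


if __name__ == "__main__":
    for cores in (1, 6, 8):
        print(cores, "->", core_counts_to_test(cores))
```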

Gather OS and Platform Performance Data
Another important set of data to gather when benchmarking an application on a multi-core system is operating system performance data. Using your favorite operating system-specific tools, collect system performance metrics such as overall processor utilization, user-level runtime, system-level (kernel) runtime, interrupts per second, context switches per second, I/O levels, and memory traffic. Collect as many of these as possible during each of the benchmark runs detailed above. The purpose of this data is to help you understand how much of your application is running in parallel, and what system stresses or limits, if any, are being reached due to increased use of system resources by your application.
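As a rough illustration only, the Python sketch below records wall-clock, user-level, and system-level time for the calling process around a workload, using `os.times()`. It does not replace operating system tools: it sees only the current process and reports nothing about interrupts, context switches, I/O, or memory traffic. The function names are hypothetical.

```python
import os
import time


def measure_cpu_split(workload):
    """Return wall, user, and system seconds consumed while running workload()."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    workload()
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    return {
        "wall": wall,
        "user": cpu_after.user - cpu_before.user,
        "system": cpu_after.system - cpu_before.system,
    }


def average_busy_cores(sample):
    """CPU seconds per wall second; about 1.0 means one core was kept busy."""
    if sample["wall"] <= 0:
        return 0.0
    return (sample["user"] + sample["system"]) / sample["wall"]


if __name__ == "__main__":
    sample = measure_cpu_split(lambda: sum(i * i for i in range(500_000)))
    print(sample, "busy cores ~", round(average_busy_cores(sample), 2))
```

A value of `average_busy_cores` well below the number of active cores is one hint that much of the run was serial or that threads were waiting.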

Rerun if you change your application code or benchmarks
It is important to re-gather your performance data, in particular the single-core baseline and the case when your application is using all available cores, if you make changes to your application code or benchmark workloads. Your changes may alter the application’s serial performance, as well as (hopefully) decreasing the runtime when using all the cores on the system. As a result, you need to re-collect these data before you can come to any conclusions about how the scaling of your application changed due to your modifications.

Analysis
You now have data in hand showing how your application scales as the number of processor cores increases, how much its runtime changed by running on a multi-core system, and OS performance data that gives you a sense for how the system as a whole performs under your multithreaded benchmark workloads.

Warning: It may be tempting at this point to compare your data to that collected on other systems. Technically, such comparisons are only really useful if there is only one variation between the two systems (such as clock speed, amount of memory installed, etc.). Especially in the case of multithreaded code on multi-core systems, significant differences between two platforms can make it nearly impossible to answer why one system was “faster” than another…only that it was, which can be a problem if you want developers to improve performance on the slower system.

Estimate serial portion of benchmark workload
Using the collected performance data, you need to look for indications of poor or unexpected scaling as the number of executing threads and cores increases. To do this, use your OS performance data to roughly estimate the amount of serial (non-threaded) work done for each benchmark workload. It is usually easiest to approximate this from the performance data you collected when only two threads were running on two cores.

 

Figure 1

 

In Figure 1, you can see that roughly 75% of our runtime was spent in serial code, while there was (poor) parallel execution for the remaining 25% of runtime. How long will the benchmark workload run if all the work is done by a single thread on only one core, and how much of that runtime is actually spent in the serial (non-concurrent) code?

This is easy to find. The serial runtime will remain constant, while the parallel work will now take twice as long (ideally).

Total runtime is thus: 0.75 + (0.25*2) = 1.25
Serial percentage of runtime = 0.75 / 1.25 = 0.6 or about 60%
“Parallel” percentage of runtime = (0.25*2) / 1.25 = 0.4 or about 40%
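The arithmetic above can be written as a short Python sketch. It assumes, as the text does, that the parallel region scales ideally, so that work done by two threads takes twice as long on one. The function name `single_thread_estimate` is hypothetical.

```python
def single_thread_estimate(serial_time, parallel_time, threads):
    """Estimate the one-thread runtime and its serial/parallel split.

    serial_time and parallel_time are the portions of runtime measured
    with `threads` threads active. Ideal scaling is assumed.
    """
    parallel_on_one_thread = parallel_time * threads
    total = serial_time + parallel_on_one_thread
    return total, serial_time / total, parallel_on_one_thread / total


if __name__ == "__main__":
    total, serial_frac, parallel_frac = single_thread_estimate(0.75, 0.25, 2)
    print(f"total {total:.2f}, serial {serial_frac:.0%}, parallel {parallel_frac:.0%}")
```

Called with the Figure 1 values (0.75, 0.25, and 2 threads), it reproduces the total of 1.25 and the 60% / 40% split.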

Calculate ideal scaling for 2..N cores
The key insights here are that, as the number of cores (and thus active threads) increases, the runtime taken by the serial code will remain unchanged, and the runtime taken by the concurrent code will be the time to run that code in serial divided by the number of threads, N (in the ideal case). Thus, the total runtime with N cores and threads for our example will be approximately:
Total runtime = 0.75 + (0.25*2)/N = 0.75 + 0.5/N

Or more generally:

Total runtime = serial_runtime + (time_in_parallel_region_while_code_running_in_serial / N)

This is basically a restatement of Amdahl’s Law, which says that:

Tpara = {(1-P) + P/N}Tserial
N = number of processors
P = fraction of the serial runtime spent in the concurrent region of the code
1-P = fraction of the serial runtime spent in the serial region of the code
Tpara = parallel runtime
Tserial = serial runtime
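As a sketch, the formula translates directly into Python. Here `p` is a fraction between 0 and 1 rather than a percentage, and the function name `amdahl_runtime` is hypothetical.

```python
def amdahl_runtime(t_serial, p, n):
    """Tpara = ((1 - P) + P / N) * Tserial, with P given as a fraction."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return ((1.0 - p) + p / n) * t_serial


if __name__ == "__main__":
    # Running example: Tserial = 1.25, of which 40% is in the concurrent region.
    for n in (1, 2, 4, 8):
        print(n, "cores:", round(amdahl_runtime(1.25, 0.4, n), 4))
```

With Tserial = 1.25 and P = 0.4 this gives the same values as 0.75 + 0.5/N from the example, including a runtime of 1.0 on two cores.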

Note that Amdahl’s Law is too optimistic: it ignores overhead, assumes an even division of computation between threads and cores, and it assumes 100% processor utilization in the parallel region of code (which we can clearly see is not the case in our example above).  

We can estimate the speedup (also called scaling factor) of the workload on a given number of threads and equal number of cores by:

Speedup = Tserial / Tpara

 

From these equations, you can generate Figure 2, which shows speedup curves for applications with different degrees of serial code as you increase the number of threads and cores:

 

Figure 2

 

As you can see, applications must have very small amounts of serial code to scale well on systems with more than two cores.
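Speedup curves of this kind can be tabulated with a short Python sketch. The serial fractions and core counts below are illustrative choices, not necessarily those plotted in Figure 2, and the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(serial_fraction, n):
    """Speedup = Tserial / Tpara = 1 / (serial_fraction + (1 - serial_fraction) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


if __name__ == "__main__":
    cores = (1, 2, 4, 8, 16)
    print("serial   " + "  ".join(f"{n:>5}" for n in cores))
    # Illustrative serial fractions (assumed, not taken from the figure).
    for serial_fraction in (0.0, 0.05, 0.10, 0.25, 0.50):
        row = "  ".join(f"{amdahl_speedup(serial_fraction, n):5.2f}" for n in cores)
        print(f"{serial_fraction:>6.0%}   {row}")
```

Because the serial term does not shrink with N, the speedup can never exceed 1 / serial_fraction, however many cores are added.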

Compare your prediction to reality

Having collected all your benchmark data, and used your performance data to predict your application’s runtime (and thus speedup) for N processors, we can now compare the two and look for anything unexpected. Figure 3 shows the speedups measured on an application with several different workloads with 2 and 4 cores active.

 

Figure 3

 

This graph (plotted with your data – an example is shown in Figure 3) is extremely useful, and tells you a number of things:

Looking at the measured data only, you can see which benchmark workloads have poor multithreaded performance on the 4-core system, and which perform really well. The reasons for either extreme should be examined in more detail.

Looking at the speedup estimates we get from Amdahl’s Law (the green bars - very idealistic and approximate), we can identify workloads where something unexpected is happening in the predicted vs. actual speedup. These are the workloads where you want to do further analysis and performance tuning if you wish to improve application performance on a multi-core system.

In the example above, you will note that the performance (scaling factor) of Workloads 1 and 2 increases as the number of threads (and thus cores) increases, and that the predicted and measured 4-core performance numbers match well. For Workload 11, the prediction was that the workload should show little performance improvement when run on 4 cores (this workload exercises very few parallel regions, and thus runs almost entirely on a single thread no matter how many cores are in your multi-core system). However, we find that the application actually slows down when running on 4 cores compared to when it runs on 2 cores (yellow arrow in Figure 3). This should be a cause for some concern, but depending on how often a similar workload is expected to be run by users of the application, it may not be worth the time to investigate and fix.

Workloads 5 and 12, however, show the two most worrying cases where predictions do not match reality (red arrows in Figure 3). Workload 5 showed a respectable speedup from one thread (scaling factor 1) to two threads, but stayed flat going from two threads to four. This contradicts our predictions from Amdahl’s Law, which would suggest much higher performance of the workload on a quad-core system. Workload 12 is distressing because the performance of this workload on a quad-core system is drastically lower than its performance on a single-core system! Again, this is not what we would have predicted from Amdahl’s Law.

As you can see, the value of a graph, like the one in Figure 3, is that it helps you organize a large amount of data in a way that will enable you to prioritize additional analysis and tuning efforts. It also gives you a sense for the performance of your application as a whole on a multi-core system. This is why it is so important to have a set of representative, repeatable, and thorough benchmark workloads before you measure performance on a multi-core system.
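The same triage can be done programmatically. The Python sketch below flags workloads whose measured speedup falls well short of the prediction, drops below the single-core baseline, or decreases as cores are added. The 0.8 tolerance, the function name `flag_workloads`, and the numbers in `example_results` are invented for illustration; they are not the values from Figure 3.

```python
def flag_workloads(results, tolerance=0.8):
    """Return {workload: [reasons]} for workloads that need a closer look.

    results maps a workload name to {"predicted": {cores: speedup},
    "measured": {cores: speedup}}.
    """
    flags = {}
    for name, data in results.items():
        measured, predicted = data["measured"], data["predicted"]
        counts = sorted(measured)
        reasons = []
        for n in counts:
            if measured[n] < 1.0:
                reasons.append(f"slower than single-core at {n} cores")
            if n in predicted and measured[n] < tolerance * predicted[n]:
                reasons.append(f"below prediction at {n} cores")
        for prev, cur in zip(counts, counts[1:]):
            if measured[cur] < measured[prev]:
                reasons.append(f"slows down from {prev} to {cur} cores")
        if reasons:
            flags[name] = reasons
    return flags


# Invented placeholder data, for illustration only.
example_results = {
    "workload_a": {"predicted": {2: 1.8, 4: 3.0}, "measured": {2: 1.7, 4: 2.9}},
    "workload_b": {"predicted": {2: 1.8, 4: 3.0}, "measured": {2: 1.7, 4: 1.7}},
    "workload_c": {"predicted": {2: 1.1, 4: 1.1}, "measured": {2: 0.9, 4: 0.6}},
}

if __name__ == "__main__":
    for name, reasons in flag_workloads(example_results).items():
        print(name, "->", "; ".join(reasons))
```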

Tuning

The process for identifying the cause of performance problems revealed by benchmark workloads, like those illustrated in Figure 3 above, is too complex to go into detail here. Up to this point we were able to use simple system monitoring tools provided by the operating system to get a sense for how our application runs on a multi-core system.

Correcting performance issues in multithreaded code, however, requires much more complex tools and a fair amount of skill. Performance issues in multithreaded code are difficult to find using a conventional debugger because:

  • Nothing is really breaking, so it’s hard to know what is causing an issue even if you manage to set a breakpoint at the code that is causing the performance issue.
  • Running a debugger (or any heavy-weight performance-analysis tool) on multithreaded code can totally change the behavior of your code. Nonetheless, this is what we often need to do.

 

As you saw above, the key to wringing more performance from a multithreaded application is to minimize the amount of serial code executed during each benchmark workload (if possible). In general, this can be done by going through steps like the following:

  • Run the benchmark workload whose performance you wish to improve on a multi-core system while running a performance-analysis tool such as Intel® VTune™ Performance Analyzer. This tool will tell you what parts of your code are being run during this benchmark workload.
  • Relate the performance analysis data for this workload to your knowledge of the application architecture. There were likely good reasons why the code you see running in serial does not run in parallel. Those decisions may need to be re-evaluated if you wish to get maximum application performance on a multi-core system.
  • If the performance analysis shows that your application is mostly running in parallel, and it still did not scale well, then you likely have some synchronization issue that is limiting application performance. Synchronization issues can “serialize” your application just as effectively as writing a serial algorithm. In this case you’ll need to use a tool that specializes in locating performance issues in multithreaded code, like Intel® Thread Profiler.
  • Once you know what synchronization issues are holding back the performance of your application on a multi-core system, you will again need to go back to the architecture of your application and try to figure out a way to minimize this synchronization.
  • Make the necessary changes and then re-run your application with only one core active, and with all cores active (with as many threads as cores on your system), to see if overall performance and scaling have improved.
  • Remember that if you make changes to code that is used by your other benchmark workloads you’ll want to recollect the data for those workloads again as well – you want to make sure that any changes you made to improve one benchmark workload didn’t hurt the performance of other workloads.
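The last step above, checking that a change did not hurt other workloads, can be sketched in Python as a comparison of runtimes before and after the change. The 5% threshold and the function name `find_regressions` are assumptions made for illustration.

```python
def find_regressions(before, after, threshold=0.05):
    """Return {workload: fractional_slowdown} for runtimes that grew too much.

    before and after map workload names to runtimes in seconds.
    """
    regressions = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # not re-run yet; re-collect before drawing conclusions
        change = (new - old) / old
        if change > threshold:
            regressions[name] = change
    return regressions


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, for illustration only.
    before = {"tuned_workload": 10.0, "other_workload": 10.0}
    after = {"tuned_workload": 8.0, "other_workload": 11.0}
    for name, change in find_regressions(before, after).items():
        print(f"{name}: {change:.0%} slower")
```

Running this for both the single-core and all-cores data sets shows whether a change affected serial performance, parallel performance, or both.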

As you might expect, the performance tuning process is iterative. The difference between this process for a serial application on a single-core system and a multithreaded application on a multi-core system is that you need to take more care, because threading adds more things that need to be optimized and introduces new ways in which well-intended optimizations can go wrong.


Conclusion

Measuring performance on multi-core systems using benchmark workloads is not all that different from the way you measure performance on single-core systems. You need to decide which benchmark metrics are important to you, and then test your application using benchmark workloads that provide meaningful benchmark results.

Benchmarking on multi-core systems also introduces the new concept of speedup, which boils down to asking how the performance of your application changes as the number of cores increases, and whether you find those results acceptable. In general, you want application performance to increase with the number of cores, and you need to understand (and maybe fix) any situations where performance levels off or decreases.

* Other brands and names may be claimed as the property of others.

 

