Intel® Hyper-Threading Technology (Intel® HT Technology)¹ is a hardware feature supported in many Intel® architecture-based server and client platforms that enables one processor core to run two software threads simultaneously. Also known as Simultaneous Multi-Threading, Intel HT Technology improves throughput and efficiency for better performance and performance per watt. Intel HT Technology provides greater benefits in Intel® Core™ i7 processors, and other processors based on the Nehalem core including the Xeon 5500 series of server processors, than was possible in the Pentium® 4 processor era when it was first introduced, producing better performance and power efficiency gains across a broad range of applications.
This paper takes a look at the effects of HT Technology on a server workload that is transactional in nature. Performance of the workload with HT enabled and disabled will be compared along with profile data of the application and other performance metrics of the workload. This paper assumes the reader has previously read and understood the whitepaper entitled, “Performance Insights to Intel® Hyper-Threading Technology” which summarizes how HT Technology works and how it provides performance benefits to software.
The discussion here refers to Intel® 64 architecture, with particular emphasis on Intel processors based on the Nehalem core, including the Core i7 processor and the Xeon 5500 series of processors. In addition, the techniques discussed here also apply to future Intel processors derived from the Nehalem core. This paper does not apply to the Atom® processor family, nor does it apply to the Itanium® processor family, as both Atom and Itanium processors have significantly different Hyper-Threading Technology implementations.
Hardware Mechanisms of Intel HT Technology
Intel HT Technology allows one physical processor core to present two logical cores to the operating system, which allows it to support two threads at once. The key hardware mechanism underlying this capability is an extra architectural state supported by the hardware, as shown in Figure 1.
Figure 1. Intel® HT Technology enables a single processor core to maintain two architectural states, each of which can support its own thread. Many of the internal microarchitectural hardware resources are shared between the two threads.
The block diagram of the Core i7 processor below shows multiple cores. Each core has two threads when HT is enabled. Processors based on the Core i7 architecture exist with varying numbers of cores, as shown in the graphic.
Case Study: Interpreting Profile Data With Versus Without Hyper-Threading Technology
This section explores how to correctly interpret profile data when Intel HT Technology is enabled and how to compare it to the Intel HT Technology-disabled case.
The following example is from a two-tier transactional workload. The server under test (SUT) is fed transaction requests across a network originating from another small server that emulates actual users. The workload and applications used are not disclosed due to confidentiality requirements. Profile data is examined. All data has been modified by an arbitrary scaling factor while keeping relative performance constant to obfuscate the original application and workload while retaining the realism of the analysis.
The workload analyzed here is a steady-state server workload, meaning that, with the exception of ramp-up and ramp-down time, the SUT is doing similar work at any time interval during the steady state period. If we measured a one second or 10 second interval anywhere during the steady-state time of the workload, we would see nearly identical characteristics in areas such as instructions retired, CPU utilization, I/O rates, and thread behavior. In the workload analyzed here, more than 100 transactions are being serviced at the same time by the server. While the transactions are not all identical, on average, over a 10-second interval, performance characteristics remain constant.
This SUT is at or near 100% CPU utilization for the entire run. Performance data is collected using VTune Performance Analyzer and then normalized to the measured transaction rate of the workload. The result is that the data is displayed as the number of events per transaction. This approach allows easy comparison of the runs with Intel HT Technology turned on to the runs with Intel HT Technology turned off.
The Benefits of Normalizing Profile Data
The straightforward way to analyze profile data is to take a profile of a running application, make a change, profile again, and compare the profiles. This works sufficiently well in many cases, though it is not a completely accurate way to compare data. The issues that arise can make it especially difficult to analyze workloads that have a flat profile. This is the case with many server workloads, where it is not unusual for the hottest function to account for less than 5 or 10% of total CPU utilization and for the top 30 or 40 functions to add up to only 50 or 60%, with the rest spread out across possibly hundreds of other functions.
The main issue that arises can be seen from a simple example. Imagine two workload runs where some modifications have been made to the application between the runs. VTune profiles are collected for 30 seconds on each workload. Assuming that both workloads run at 100% utilization, the number of CPU cycle events collected over the 30-second interval will be the same for the two workloads regardless of the performance difference between them. Now imagine all functions increased in performance due to a compiler optimization switch. The profile in % of CPU cycles may not appear much different even though the performance delta was significant.
What we really want to know is how many cycles each function is taking relative to how much work is being done. To do this, we need to normalize the profile data. We then can compare profile data from multiple runs and indeed across multiple systems or systems with different numbers of cores and still have a meaningful way to compare the performance data.
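The point can be made concrete with a small sketch. All of the numbers and function names below are hypothetical and chosen only for illustration; they are not measurements from the workload in this paper. Both runs collect the same total cycles at 100% utilization and show the same percentage profile, yet the normalized cycles per transaction differ by the full performance delta.

```python
# Hypothetical numbers chosen only to illustrate the argument; they are not
# measurements from the workload discussed in this paper.
SAMPLE_SECONDS = 30
TOTAL_CYCLES = 1_000_000_000_000  # same in both runs: both are at 100% utilization

# Share of CPU cycles per function. Identical in both runs because every
# function is assumed to have sped up by the same factor.
cycle_share = {"parse": 0.10, "lookup": 0.06, "commit": 0.04, "other": 0.80}

# Transactions per second reported by the (hypothetical) workload driver.
runs = {"before": 10_000, "after": 12_500}


def cycles_per_transaction(share, tx_per_sec):
    """Cycles spent in one function per completed transaction."""
    return share * TOTAL_CYCLES / (SAMPLE_SECONDS * tx_per_sec)


normalized = {
    run: {fn: cycles_per_transaction(s, rate) for fn, s in cycle_share.items()}
    for run, rate in runs.items()
}

for fn in cycle_share:
    before = normalized["before"][fn]
    after = normalized["after"][fn]
    print(f"{fn:8s} share={cycle_share[fn]:4.0%}  "
          f"cycles/tx before={before:12,.0f} after={after:12,.0f} "
          f"ratio={after / before:.2f}")
```

The percentage column is the same for both runs, while the normalized columns show each function taking fewer cycles per transaction after the change.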
Normalizing Profiling Data to Units of Work
For our example, we will normalize the profile data to units of work. Since this is a transactional workload, we will normalize the data to transactions. This will give us performance data in terms of the number of cycles to complete a transaction, the number of instructions to complete a transaction, and so on for all other collected counters. In actuality, the workload consists of several different transactions of varying complexity, so our normalization is more like a weighted average of the number of cycles to complete the different transactions in the system. For this technique to work, we must know that the workload exhibits steady-state behavior. By this we mean that the work being done “on average” over any time slice is essentially identical in behavior to any other time slice during the interval. In this case an interval size of 2 or 3 seconds is sufficient for this to be true. We sample our data for 30 seconds, so we are confident that the data analysis technique is valid. Though it is not shown here, the high-level results are repeatable. There are other methods used to determine whether a workload is a “steady state” workload that are outside the intended scope of this paper.
Normalization is accomplished by dividing the total event counts from the profile data by the number of work units completed during the sampling interval. In this example, VTune profile data was collected for 30 seconds. The score from the workload driver for the HT-enabled case was 12,387 transactions per second (Tx/sec). To normalize the data:
Normalized Count Result = EventCount /(SampleInterval * Score)
For this case:
Cycles per Tx = 1,473,928,000,000 (cycles) / (30 (seconds) * 12,387 (Tx/second)) = 3,966,330 Cycles/Tx
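The same calculation can be written as a short Python sketch. The function name `normalize` is our own choice for illustration; the inputs are the values quoted above.

```python
def normalize(event_count, sample_interval_s, score_per_s):
    """Return events per unit of work: EventCount / (SampleInterval * Score)."""
    return event_count / (sample_interval_s * score_per_s)


# Values quoted in the text for the HT-enabled run.
cycles_per_tx = normalize(
    event_count=1_473_928_000_000,  # total CPU cycle events in the profile
    sample_interval_s=30,           # profile duration in seconds
    score_per_s=12_387,             # workload driver score in Tx/sec
)
print(f"{cycles_per_tx:,.0f} cycles/Tx")
```

The same function applies unchanged to any other collected counter, such as instructions retired.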
The easiest way today to normalize the data is to save the VTune view data as comma separated text files. These can then be loaded into Excel and calculations implemented to do the normalizations.
Counter data should be taken from the Process View within VTune to calculate the overall normalized values for the entire workload and to calculate core and thread CPI which we discuss later. The process view will give the best overall view and eliminates having to get all function data to make the same calculation.
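As an alternative to a spreadsheet, the exported text can be normalized with a few lines of Python. This is a minimal sketch under stated assumptions: the column names (`Process`, `Cycles`, `Instructions Retired`), the process names, and all values are hypothetical stand-ins, and a real export will need the column names adjusted to match.

```python
import csv
import io

# Stand-in for an exported process-view file. The column names, process
# names, and values are hypothetical; adjust them to match the real export.
EXPORT = """Process,Cycles,Instructions Retired
app_server,900000000000,600000000000
db_engine,300000000000,150000000000
"""

SAMPLE_SECONDS = 30      # profile duration
TX_PER_SECOND = 10_000   # hypothetical workload driver score


def normalize_rows(text, sample_seconds, tx_per_second):
    """Normalize each row's counters to events per transaction."""
    transactions = sample_seconds * tx_per_second
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cycles = int(row["Cycles"])
        instructions = int(row["Instructions Retired"])
        rows.append({
            "process": row["Process"],
            "cycles_per_tx": cycles / transactions,
            "instructions_per_tx": instructions / transactions,
            "thread_cpi": cycles / instructions,
        })
    return rows


rows = normalize_rows(EXPORT, SAMPLE_SECONDS, TX_PER_SECOND)
for r in rows:
    print(f"{r['process']:12s} cycles/tx={r['cycles_per_tx']:12,.0f} "
          f"instr/tx={r['instructions_per_tx']:12,.0f} "
          f"thread CPI={r['thread_cpi']:.3f}")
```

To process a file on disk, the same `csv.DictReader` call can be given an open file object instead of the in-memory string.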
The two tables below show the system configuration for the system under test (SUT) and the high level performance data normalized to the work output.
Table 1 - The following table shows the system configuration for the system under test (SUT):
Table 2 shows high-level data for the server transactional workload with HT disabled and HT enabled, normalized to the workload throughput.
Table 2 - The following table shows the high-level workload performance data:
System configuration between the Intel HT Technology enabled and disabled runs is identical in all respects except for the Intel HT Technology setting in the BIOS (disabled versus enabled.) When comparing performance between Intel HT Technology disabled and enabled configurations, it is important to compare instructions retired, CPU cycles per unit work, and instructions retired per unit work (which we also refer to as path length), in addition to CPU utilization.
The default frequency for the Xeon X5570 processor is 2.93 GHz. Intel® Turbo Boost Technology is enabled, which results in higher-than-nominal frequency for the run. CPU utilization is approximately three percent lower in the Intel HT Technology-enabled run. In this case, it may have been possible to increase performance by two or three percent with additional tuning to increase CPU utilization in the Intel HT Technology-enabled case. CPU utilization is not a primary factor in the performance delta, as the difference is very small compared to the performance gain.
Impact to End User Response Time when Enabling HT Technology
The response time as viewed by the client dropped from 50 ms per transaction to 37 ms per transaction with Intel HT Technology enabled. While this result may seem counter-intuitive to some readers, it is in line with expectations. Please refer to Intel Application Note “Hyper-Threading: Be Sure You Know How to Correctly Measure Your Server’s End-User Response Time” for more information on this topic.
Performance Improvement of Workload due to HT Technology
The performance improvement for this workload is 30 percent (1.30x) when Intel HT Technology is enabled (9,553 trx/sec versus 12,387 trx/sec) as seen in Table 2 (Transaction Rate). The workload consists mostly of integer ALU operations. To determine the source of the performance gain, one should first consider the pathlength, which is the average number of instructions executed to perform one transaction, and the cycles per transaction. In this case, the pathlength is nearly identical between the two runs. This is not always true in the general case, since pathlength depends more on individual workload characteristics than on anything specific to Intel HT Technology. Because the pathlength is the same, the number of instructions executed in each case does not contribute to the performance delta. We next look at cycles per transaction.
Change to Average Cycles per Workload Transaction due to HT Technology
The next data point to consider is cycles per transaction (see Table 2), which is actually the average number of CPU cycles per transaction taken over a large number of transactions. In the Intel HT Technology-disabled case, CPU cycles for one transaction average 2.66 million cycles, while for the Intel HT Technology-enabled case, CPU cycles for one transaction average 3.97 million cycles, which will come as a surprise to many readers. That is, the performance increased by 1.3x, yet the average cycles per transaction increased by 1.5x. While it is true that the cycles per transaction increased by 50 percent with Intel HT Technology enabled, the big difference is that there are twice as many threads running when Intel HT Technology is enabled. Some quick math can help to verify these numbers.
Total performance can be calculated by the following equation:
Total trx/sec = (1/(average cycles/trx)) * (processor frequency in cycles/sec) * (number of threads) * (total CPU utilization)
For the Intel HT Technology-disabled case:
Total trx/sec = (1/2,658,340) * (3.196e9) * 8 * 0.993 = 9,553
For the Intel HT Technology-enabled case:
Total trx/sec = (1/3,966,330) * 3.183e9 * 16 * 0.965 = 12,387
Thus, the test data is in keeping with expected performance. Note that the numbers work out exactly because the cycles per transaction were obtained by counting total cycles over an interval and then normalizing to the work done. It was also possible to accurately measure CPU utilization using performance counters and an internal data-collection tool. The calculations above are shown to illustrate that even though the cycles per transaction increase, the processor has twice the threads available, so overall throughput still increases. If we compare these results with the chart in Figure 4, we see that these results also correlate well with predicted performance increases for Intel HT Technology and the effect on per-thread compute cycles. Following the x-axis in Figure 4 out to 1.3x, and looking at the curve, the chart predicts a 1.5x increase in per-thread compute cycles. (It is important to remember that the chart in Figure 4 is only accurate for workloads at or near 100% CPU utilization.)
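The throughput equation can be checked with a short Python sketch using the values quoted above. The function name is our own. Because the frequency and utilization figures in the text are rounded, the computed rates land within a small fraction of a percent of the reported scores rather than matching them digit for digit.

```python
def transactions_per_second(cycles_per_tx, frequency_hz, threads, utilization):
    """Total trx/sec = (1 / cycles per trx) * frequency * threads * utilization."""
    return (1 / cycles_per_tx) * frequency_hz * threads * utilization


# Values quoted in the text for the two runs.
ht_off = transactions_per_second(2_658_340, 3.196e9, threads=8, utilization=0.993)
ht_on = transactions_per_second(3_966_330, 3.183e9, threads=16, utilization=0.965)

# Growth in per-thread cycles per transaction when HT is enabled.
cycle_ratio = 3_966_330 / 2_658_340

print(f"HT disabled: {ht_off:,.0f} trx/sec (reported 9,553)")
print(f"HT enabled:  {ht_on:,.0f} trx/sec (reported 12,387)")
print(f"Throughput gain: {ht_on / ht_off:.2f}x")
print(f"Per-thread cycles per transaction: {cycle_ratio:.2f}x")
```

Doubling the thread count more than offsets the roughly 1.5x growth in per-thread cycles, which is why throughput still rises by about 1.3x.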
Change to Cycles per Instruction (CPI) due to HT Technology
Now that the data has been validated, insight into why the Intel HT Technology-enabled case is faster can be found by looking at CPI, or average cycles per instruction (the average number of cycles to execute each instruction going through the machine). Please refer to the whitepaper “Performance Insights to Intel® Hyper-Threading Technology” in the section titled “Core CPI vs. Thread CPI” for a broader discussion of Core CPI vs. Thread CPI. Figure 8 shows two lines for CPI, “Thread CPI” and “Core CPI”. Looking at these two lines for the Intel HT Technology-disabled case shows that Thread CPI and Core CPI are identical. This is because there is only one thread running on the core when Intel HT Technology is disabled, so they are logically and physically equivalent. For the Intel HT Technology-enabled case, we see that the Core CPI is half of the Thread CPI.
Thread CPI is calculated the same way for both cases; it is simply the following:
Thread CPI = (average cycles per trx) / (thread pathlength)
The Core CPI calculation is as follows:³
Core CPI = (average cycles per trx) / (thread pathlength * number of threads)
Thus, the performance improvement is due to each core being able to execute more instructions in a given number of cycles. Comparing Core CPI between the Intel HT Technology-disabled and Intel HT Technology-enabled runs, we find that the Intel HT Technology-disabled run Core CPI is 1.234, and the Intel HT Technology-enabled run Core CPI is 0.921. That is, Intel HT Technology-enabled Core CPI is reduced to 0.75x of the Intel HT Technology-disabled Core CPI for this workload. If CPU utilization were 100% and pathlength were identical between both runs, the performance delta should equal the Intel HT Technology-disabled Core CPI divided by the Intel HT Technology-enabled Core CPI, which in this case is 1.33x. In this case, pathlength is identical between the runs, but CPU utilization is almost 3% lower in the Intel HT Technology-enabled case, which explains why the measured performance improvement is 1.30x and not 1.33x.
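The two CPI equations and the resulting performance estimate can be sketched in Python as follows. The function names are our own, and we read "number of threads" in the Core CPI equation as threads per core. The pathlength is not quoted directly in the text, so it is back-calculated here from the reported cycles per transaction and the reported HT-enabled Thread CPI (1.842), purely to exercise the functions. The frequency adjustment at the end follows from the throughput equation given earlier.

```python
def thread_cpi(cycles_per_tx, pathlength):
    """Thread CPI = cycles per transaction / instructions per transaction."""
    return cycles_per_tx / pathlength


def core_cpi(cycles_per_tx, pathlength, threads_per_core):
    """Core CPI = cycles per transaction / (pathlength * threads per core)."""
    return cycles_per_tx / (pathlength * threads_per_core)


# Values reported in the text.
CORE_CPI_HT_OFF = 1.234        # equals Thread CPI with one thread per core
THREAD_CPI_HT_ON = 1.842
CYCLES_PER_TX_HT_ON = 3_966_330

# Assumption: pathlength back-calculated from reported values, not measured.
pathlength_on = CYCLES_PER_TX_HT_ON / THREAD_CPI_HT_ON
core_cpi_on = core_cpi(CYCLES_PER_TX_HT_ON, pathlength_on, threads_per_core=2)

# Expected gain at identical pathlength and 100% utilization.
cpi_ratio = CORE_CPI_HT_OFF / core_cpi_on

# Adjust for the reported utilization and average frequency of each run.
adjusted = cpi_ratio * (0.965 / 0.993) * (3.183e9 / 3.196e9)

print(f"Core CPI, HT enabled: {core_cpi_on:.3f}")
print(f"Core CPI ratio (disabled / enabled): {cpi_ratio:.2f}x")
print(f"Adjusted for utilization and frequency: {adjusted:.2f}x")
```

With one thread per core the two functions return the same value, which matches the observation that Thread CPI and Core CPI are identical when Intel HT Technology is disabled.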
Profile Results and Observations from Profile Data
Now that we know that the performance delta is due to Core CPI improvements with Intel HT Technology enabled, what can we learn from examining VTune analyzer data? Table 3 shows data from the top 15 functions of this server workload collected on the SUT.
Table 3: Profile data from the application normalized to transaction rate.
Explanation of Table 3 Column Headings
The column “Function Profile Order” shows the relative ranking of the functions in terms of hotness, with the data sorted from hottest to least hot for the Intel HT Technology-enabled run. As can be seen from the Profile Order data, the relative ranking can change between the Intel HT Technology-disabled case and the Intel HT Technology-enabled one. Looking at CPU Thread Cycles per Transaction, we see that all the functions take more cycles with Intel HT Technology enabled compared to Intel HT Technology disabled. This result is expected, as we know the total cycles per transaction increased from 2.66 million to 3.97 million when enabling Intel HT Technology. This is a 1.49x increase in cycles for the entire workload. Functions that increased less than 1.49x in cycles contributed more to the average speed increase than those functions that increased more than 1.49x in cycles when enabling Intel HT Technology.
The “% of Total Cycles” column is the number of cycles for the function divided by the total number of cycles. All the displayed data is normalized to the units of work as previously mentioned.
The “Thread CPI” column is calculated by taking the “CPU Thread Cycles per Transaction” and dividing it by the “Instructions Retired per Transaction” for each function.
The “CPU Thread Cycles per Transaction” column and the “Instructions Retired per Transaction” column are the raw counts for each function from the VTune data normalized to the units of work (transactions) as explained in the previous section titled “Normalizing Profiling Data to Units of Work”.
Note that the Thread CPI of each function typically has increased. Once again, this result is expected, as the overall average Thread CPI for the entire workload has increased from 1.234 for the Intel HT Technology-disabled run to 1.842 for the Intel HT Technology-enabled run.
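The per-function calculations described above can be sketched as follows. The function names and per-function counts are hypothetical placeholders, not rows from Table 3; only the overall cycle ratio is derived from values in the text. Each row computes Thread CPI for both runs and flags whether the function's cycle growth is below the overall workload ratio of about 1.49x.

```python
# Overall growth in cycles per transaction, from values quoted in the text.
OVERALL_CYCLE_RATIO = 3_966_330 / 2_658_340  # about 1.49

# Hypothetical per-function data, all normalized per transaction:
# (name, cycles HT off, instructions HT off, cycles HT on, instructions HT on)
functions = [
    ("func_a", 100_000, 100_000, 170_000, 100_000),
    ("func_b", 80_000, 20_000, 96_000, 20_000),
]


def summarize(row):
    """Compute Thread CPI for both runs and the cycle growth for one function."""
    name, cyc_off, ins_off, cyc_on, ins_on = row
    ratio = cyc_on / cyc_off
    return {
        "name": name,
        "thread_cpi_off": cyc_off / ins_off,
        "thread_cpi_on": cyc_on / ins_on,
        "cycle_ratio": ratio,
        # True when the function grew less than the workload as a whole.
        "below_overall_ratio": ratio < OVERALL_CYCLE_RATIO,
    }


summaries = [summarize(row) for row in functions]
for s in summaries:
    print(f"{s['name']:8s} CPI off={s['thread_cpi_off']:.2f} "
          f"on={s['thread_cpi_on']:.2f} cycles x{s['cycle_ratio']:.2f} "
          f"below overall ratio: {s['below_overall_ratio']}")
```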
It is important to note that it is not accurate to talk about Core CPI of any of the functions in the above profile data for the Intel HT Technology-enabled case. (Core CPI is equivalent to Thread CPI for the Intel HT Technology-disabled case.) The reason is that the Core CPI is calculated from the number of instructions running through the core on two separate threads. The profile data shown is average sampled CPI for each function. In fact, the actual Thread CPI for each function at any specific instant will be affected by what code is running on the other thread. For this reason, it may be necessary to increase the amount of sampling time for profile data when Intel HT Technology is enabled versus when Intel HT Technology is disabled in order to ensure a statistically valid result. So, while we were able to divide the Thread CPI by two to get Core CPI for the overall workload results, there is no concept of Core CPI for a function when Intel HT Technology is enabled. Again, the accuracy of this discussion depends upon the assumption that CPU utilization is near 100 percent on a steady-state workload.
One final point can be made from the profile data. If we think about what is occurring on the core when a function with a high CPI is running, it is clear that it will be easier to sneak in a few instructions without impact to the other thread than if the function has a low CPI and is attempting to retire instructions at a high rate. In most cases, functions with low CPI with Intel HT Technology disabled showed a bigger CPI increase when Intel HT Technology is enabled than functions that have a high CPI. In general, functions that have a significantly high CPI with Intel HT Technology disabled tend not to have very much CPI increase when Intel HT Technology is enabled. This was generally true in this data set, and that result is not unusual, although it cannot be considered to be absolute, since multiple factors will determine the CPI increase between the cases where Intel HT Technology is enabled and disabled, the most important of which is what function is running on the other thread.
One final take-away from the profile data above is that it is imperative to understand how Intel HT Technology affects the data collected in order to correctly determine the true effects. As the data shows, Intel HT Technology will increase CPI and total cycles for each unit of work completed and for each function in the profile, even though the performance also increased. Knowing that this result is expected allows one to analyze the data in an accurate and objective manner.
Intel HT Technology boosts performance for many applications, resulting in higher performance and higher efficiency. Applications that scale well with cores will typically also scale well with HT Technology. The Nehalem core brings many improvements that complement HT Technology, allowing significant performance gains. The server workload analyzed here shows what to expect and how to interpret VTune performance data when comparing performance data with HT enabled and disabled. These analysis techniques, including data normalization, are also useful for performance analysis in general.
It is important when evaluating performance of applications running with HT Technology to understand the differences in performance tool data and that many times, comparing data between HT disabled and HT enabled systems requires more than an intuitive understanding to accurately assess the performance implications.
Finally, Intel is committed to helping the ISV community and system users attain the best performance on Intel systems. We encourage you to visit our developer community for any questions not answered here.
About the Authors
Garrett Drysdale is a Sr. Software Performance Engineer for Intel. Garrett has analyzed and optimized software on Intel platforms since 1995 spanning the client, workstation, and enterprise server spaces. Garrett currently works with enterprise software developers to analyze and optimize server applications, and with internal design teams to assist in evaluating the impact of new technologies on software performance for future Intel platforms. Garrett has a BSEE from University of Missouri-Rolla and a MSEE from The Georgia Institute of Technology (Go Jackets!). His email is email@example.com.
Matt Gillespie is an independent technical author working out of the Chicago area, specializing in emerging hardware and software technologies, with a focus on what's new at Intel. Matt has also worked hands-on in IT, implementing voice and data networks. In an earlier life, he was a researcher at financial publisher Morningstar and at the University of California Center for Neuroscience.
Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/.
1 Intel® Hyper-Threading Technology (Intel® HT Technology) requires a computer system with an Intel® Processor supporting Intel HT Technology and an Intel HT Technology enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support Intel HT Technology.
2 Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory.
3 Thus, for one thread per core (Intel HT Technology disabled) the Core CPI equation is equivalent to the Thread CPI equation. The Core CPI equation multiplies the pathlength in the denominator by the number of threads, because, physically, two threads are indeed running the same number of instructions through the core at the same time. Note that this equation for Core CPI is only strictly true at 100% CPU utilization. It is a very accurate assumption above 95% processor utilization. At CPU utilizations of 50% or less with Intel HT Technology enabled, the Core CPI equation above may be accurate, or could underestimate CPI by a factor of two because it is unknown how often two threads are running simultaneously on the same core.