by Thomas E. Martinez, Intel Corporation
This article identifies the features (many of which are new) in the Intel® Xeon™ processor that make it an effective and efficient platform for Microsoft .NET*. The Intel Xeon processor is based on the Intel NetBurst™ microarchitecture, which is also used in the Intel Pentium® 4 processors. This microarchitecture enhances the features of its predecessors. Some of these enhancements are discussed in this paper. A more detailed explanation of the Intel NetBurst microarchitecture can be found in the paper entitled "The Microarchitecture of the Pentium® 4 Processor" (see the References section at the end of this paper).
This article outlines some of the processing that may be required in a Microsoft .NET environment. Microsoft .NET provides a framework that includes a run-time environment that supports cross-language inheritance and provides services for developing and deploying applications that run on and across the Internet and intranets. A more detailed description of Microsoft .NET can be found on the Internet or in many books (including one listed in the References section).
Figure 1. Block diagram of the Intel® NetBurst™ microarchitecture, on which the Intel Xeon™ processor is based.
Intel® Xeon™ Processor Features
Hyper-pipelined execution - Virtually all modern computers use a pipelined architecture that overlaps instruction execution to enhance performance. The Intel Xeon processor has a 20-stage branch-misprediction pipeline, twice as many stages as the P6 microarchitecture on which the Pentium III and Pentium III Xeon processors are based. This increase enables processors to operate at much higher frequencies than their predecessors. At the Intel Developer Forum in Oct. 2002, an Intel executive demonstrated a processor based on the Intel NetBurst microarchitecture running in excess of 3.5 GHz and stated that this core architecture could support processors operating at frequencies up to 10 GHz.
Aggressive branch prediction - In a pipelined architecture, accurately predicting program execution is essential to keeping the instruction pipeline full and operating at peak efficiency. Branches are the points where the pipeline can stall: when a branch is mispredicted, incorrect instructions occupy the pipeline. The Intel NetBurst microarchitecture minimizes the impact of branch misprediction using advanced branch prediction algorithms and a large Branch Target Buffer that stores historical data for taken branches. The Branch Target Buffer on the Intel Xeon processor contains eight times as many entries as the one in the microarchitecture used for Intel Pentium III processors.
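The idea behind history-based prediction can be sketched with a toy two-bit saturating counter, the textbook building block of dynamic branch predictors. This is an illustration only; Intel's actual prediction algorithms are far more sophisticated and are not published at this level of detail.

```python
# Toy 2-bit saturating-counter branch predictor (illustrative only;
# NOT Intel's actual algorithm). Counter states 0-1 predict not-taken,
# states 2-3 predict taken; each outcome nudges the counter one step.

def predict_branches(outcomes):
    """Return how many taken (True) / not-taken (False) outcomes
    in the sequence were predicted correctly."""
    counter = 2                    # start weakly predicting 'taken'
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Saturating update toward the observed outcome
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct

# A loop branch taken 9 times and then falling through is mispredicted
# only on the final iteration:
hits = predict_branches([True] * 9 + [False])
```

The saturating counter captures why loop-heavy code predicts so well: a branch that behaves the same way repeatedly only costs a misprediction when its behavior changes.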
Execution trace cache - The execution trace cache is a new feature in the Intel NetBurst microarchitecture. It replaces the L1 instruction cache present in previous Intel architectures. The trace cache stores sets (traces) of decoded micro-operations. It is most beneficial when executing loops, because the loops are stored sequentially in the trace cache and execute quickly by bypassing the decoding stages. Fortunately, most computer programs use loops extensively; in fact, it is common for 80% of a program's processing to be contained in 20% of its instructions. A secondary use of the trace cache is recovery from branch misprediction. Because code execution tends to be localized to certain sections, the code to be executed after a misprediction is likely already present in the trace cache, which allows the instruction pipeline to be refilled quickly so that efficient execution can continue.
Out-of-order speculative execution - This feature, present in previous Intel architectures, remains in the Intel NetBurst microarchitecture. It allows an instruction to be executed as soon as all of its data is available, so an instruction that is waiting for data does not block subsequent instructions that do not depend on its results. The result is an overall improvement in the throughput of instructions executed within a given time period.
In-order retirement - This feature was also present in previous Intel architectures. Instructions enter the processor in program order and, although they are likely executed out of order, they are retired in program order to ensure programmatic consistency.
System bus enhancements - The bandwidth of the new enhanced system bus is 3.2 GB/s on the initial release of the Intel Xeon processor. This speed is based upon a 100-MHz system clock which is 'quad-pumped' to provide 4 data accesses per clock cycle, yielding 400 million data accesses per second. Since the data path is 8 bytes wide, the resulting system bus bandwidth is 3.2 GB/s.
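The 3.2 GB/s figure follows directly from the numbers given above; the arithmetic can be spelled out step by step:

```python
# Deriving the quoted system-bus bandwidth from its components.
clock_hz = 100_000_000         # 100-MHz system clock
transfers_per_clock = 4        # 'quad-pumped': 4 data accesses per cycle
bytes_per_transfer = 8         # 64-bit (8-byte) data path

accesses_per_second = clock_hz * transfers_per_clock    # 400 million/s
bandwidth = accesses_per_second * bytes_per_transfer    # 3.2 GB/s
```

Note that GB here is used in the decimal sense (10^9 bytes), as is conventional for bus bandwidth figures.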
There is a correlation between bus utilization and memory latency. Memory latency is approximately constant while bus utilization is between 0% and 50%; once the bus is more than 50% utilized, latency increases almost exponentially with utilization. A freeway is an analogous system: the more cars you put on it, the slower it gets. As the number of processes increases, bus utilization rises, memory latency grows, and scalability degrades. The Intel Xeon processor's greater system bus bandwidth improves scalability for multi-processor systems.
New SSE2 Instructions - The Intel NetBurst microarchitecture adds 144 new Streaming SIMD Extensions 2 (SSE2) instructions. Single Instruction Multiple Data (SIMD) instructions perform the same operation simultaneously on multiple data items. Software that benefits from SIMD contains repetitive loops that perform calculations on sequential arrays of data. Examples include speech compression algorithms and filters, speech recognition algorithms, video display and capture routines, rendering routines, 3D graphics (geometry), image and video processing algorithms, spatial (3D) audio, physical modeling (graphics, CAD), workstation applications, and encryption algorithms.
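The SIMD execution model can be illustrated in plain Python. The sketch below models a 4-wide packed integer add in the spirit of SSE2's PADDD instruction; real SSE2 code operates on 128-bit registers in assembly or via compiler intrinsics, so this is a conceptual model only (and it assumes array lengths are a multiple of the lane width).

```python
# Conceptual model of SIMD: one 'instruction' applies the same operation
# to several data lanes at once. Illustration only -- not actual SSE2 code.

def simd_add4(a, b):
    """Add two 4-element 'registers' lane by lane, the way a packed
    SSE2-style add handles four 32-bit integers in one instruction."""
    return [x + y for x, y in zip(a, b)]

def vector_add(xs, ys):
    """Process two long arrays four lanes at a time.
    Assumes len(xs) == len(ys) and is a multiple of 4."""
    out = []
    for i in range(0, len(xs), 4):
        out.extend(simd_add4(xs[i:i + 4], ys[i:i + 4]))
    return out
```

The payoff in hardware is that the loop body issues one instruction per four elements instead of four, which is exactly the "repetitive loops over sequential arrays" pattern described above.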
Hyper-Threading Technology - This feature of Intel Xeon processors enables each physical CPU to function as two logical CPUs. The logical CPUs share some resources and duplicate others. This technology enables two threads to execute on one physical CPU, increasing the utilization of the execution units (such as the ALU and FPU) contained in the CPU.
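Because the operating system simply sees two logical CPUs, any ordinary multithreaded program can exploit Hyper-Threading without modification. A minimal two-worker sketch (the task and worker names are illustrative, not from any particular API):

```python
# Two worker threads draining a shared work queue -- the software shape
# that Hyper-Threading accelerates by running both threads on one
# physical CPU's execution units.
import threading
import queue

tasks = queue.Queue()
for n in range(100):
    tasks.put(n)

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            n = tasks.get_nowait()
        except queue.Empty:
            return                      # queue drained; thread exits
        with lock:                      # protect the shared results list
            results.append(n * n)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The scheduling of threads onto logical CPUs is entirely the operating system's job; the program above is the same whether it runs on one logical CPU or two.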
Processing Required in a .NET* Environment
Security (Encryption and Decryption) - Security is becoming increasingly important in all computing environments. .NET uses public/private key encryption and metadata to validate software modules, and it provides "Passport" services as a security feature. Beyond these, much data is encrypted prior to transmission and must be decrypted upon receipt before it can be processed.
How it is processed on Intel Xeon processors - Encryption and decryption are processed at the execution level as repetitive loops. This type of operation takes full advantage of the Intel Xeon processor's hyper-pipelined architecture, keeping the pipeline full. It also benefits from the trace cache: the instructions to be executed are already decoded and stored there, so they execute even faster. Encryption and decryption can also execute more efficiently using SIMD instructions, including SSE2 instructions, and the enhanced 3.2 GB/s system bus enables the data to be moved quickly.
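The "repetitive loop" structure of a cipher kernel can be seen even in a toy example. The XOR "cipher" below is NOT secure and is used here only to show the loop shape that real ciphers share: a tight pass over every byte of the data.

```python
# Toy XOR stream 'cipher' -- insecure, for illustrating loop structure only.
# Real ciphers (AES, RC4, etc.) have the same tight per-byte/per-block loop.

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the repeating key; the identical loop
    both encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

msg = b"attack at dawn"
ct = xor_cipher(msg, b"secret")     # 'encrypt'
pt = xor_cipher(ct, b"secret")      # same loop 'decrypts'
```

It is precisely this kind of branch-light, data-sequential loop that keeps a deep pipeline full and maps naturally onto SIMD lanes.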
Streaming Video/Audio (compression/decompression) - Modern computer applications and the Internet are primarily designed to transfer and process information. These fundamental tasks should be completed quickly.
How it is processed on Intel Xeon processors - Streaming data involves compression and decompression, which are processed at the execution level as repetitive loops. For the same reasons listed above regarding security, the Intel Xeon processor's hyper-pipeline and trace cache process streaming data efficiently. Compression and decompression algorithms also benefit from SIMD processing, including SSE2, and from the enhanced 3.2 GB/s system bus's ability to transfer data quickly between memory and the CPU.
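A concrete round trip through a general-purpose codec shows the workload in miniature; Python's standard zlib module (DEFLATE) stands in here for the audio/video codecs discussed above, which are the same class of repetitive kernel.

```python
# Compress and decompress a repetitive buffer -- a stand-in for the
# codec loops that dominate streaming-media processing.
import zlib

payload = b"streaming audio frame " * 500    # highly repetitive data
packed = zlib.compress(payload, 6)           # level 6: default trade-off
restored = zlib.decompress(packed)

ratio = len(packed) / len(payload)           # well under 1.0 for this input
```

Highly repetitive input like this compresses dramatically; the compression ratio for real audio or video depends on the codec and the content.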
Common Language Runtime (CLR) - Just In Time (JIT) compilation, Garbage Collection, Managed Environment -
The Common Language Runtime is the core of the .NET Framework with regard to execution of applications. The CLR should be aware of the hardware platform it runs on to take advantage of its features and ensure that it is not a bottleneck for executing your applications.
How it is processed on the Intel Xeon processor - The CLR is the virtual machine that executes managed code. It includes the JIT compiler, which translates intermediate language into native instructions at run time. This involves repetitive processing over many different instructions, so it will likely benefit from the hyper-pipelined architecture and the trace cache. Since the CLR does many different things simultaneously, performance will also benefit from Hyper-Threading Technology. The main point, however, is that the CLR and its components run well on the Intel NetBurst microarchitecture: "When the JIT encounters a genuine Intel processor, the code produced takes advantage of Intel NetBurst microarchitecture innovations and Hyper-Threading Technology. The JIT compiler version 1.1, due to ship later this year, will also take advantage of Streaming SIMD Extensions 2 (SSE2)."¹
XML Processing (XML Parser) - In a world where most computers speak XML, you don't want XML parsing and processing to be a bottleneck.
How it is processed on Intel Xeon processors - XML processing software can follow many different, seemingly random, paths or branches. XML parsing involves reading XML, determining what type of data it is, and storing it in a data structure for future processing. Likewise, processing XML involves navigating a data structure to extract the data needed. Although the execution paths may seem random, most software operates on the 80/20 principle (80% of the processing is done by 20% of the code). Effective branch prediction, using the large branch target buffer and an advanced branch prediction algorithm, helps keep the pipeline full for maximum efficiency. The large trace cache makes it more likely that a recently used code path is already decoded and resident in the trace cache. These two features improve processing performance for branched code like XML by reducing the impact of branch misprediction. If the XML processor is multithreaded, it can benefit from Hyper-Threading Technology.
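The two XML phases described above, parsing into a data structure and then navigating it to extract data, can be shown with Python's standard xml.etree module (the document and element names below are invented for the example):

```python
# Parse XML into a tree, then navigate the tree to extract data --
# the two phases of XML processing described in the text.
import xml.etree.ElementTree as ET

doc = """<orders>
  <order id="1"><item>disk</item><qty>4</qty></order>
  <order id="2"><item>cpu</item><qty>2</qty></order>
</orders>"""

root = ET.fromstring(doc)                       # phase 1: read and build
qtys = {o.get("id"): int(o.findtext("qty"))     # phase 2: navigate/extract
        for o in root.findall("order")}
```

Each element type the parser encounters takes a different branch, which is why branch prediction and the trace cache matter so much for this workload.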
Web Services - Web Services is a broad term that is hard to place in a single box; some Intel Xeon processor features apply to some Web Services, and other features apply to others. The discussion below maps Intel Xeon processor features to Web Services processing in general terms.
How it is processed on Intel Xeon processors - Since Web Services take advantage of loops, they benefit from Intel Xeon processor hyper-pipeline features. Where they use extensive branching, they benefit from the trace cache. For Web Services that run within the CLR, they perform better based on the previous discussion regarding CLR performance on the Intel Xeon processor. When many Web Services run simultaneously or are multithreaded, they will benefit from Hyper-Threading Technology.
The Intel Xeon processor was designed to execute traditional applications (such as integer, floating-point, and graphics workloads) better than its predecessors did. It was designed for excellent performance when processing Internet applications: to execute encryption/decryption and compression/decompression efficiently, and to have the capacity to process increased amounts of data, graphics, streaming data, and so forth. Its design features and processing power make the Intel Xeon processor an excellent hardware platform for Microsoft .NET.
References
- Introducing Microsoft .NET, by David S. Platt, Microsoft Press*, Redmond, WA
- The Microarchitecture of the Pentium® 4 Processor
- Boosting the performance of Microsoft .NET Framework*