Intel® Itanium® Microarchitecture Support for .NET* and Java*

by Matt Gillespie


Introduction

The Intel® Itanium® processor is ideally suited to hosting the application servers for enterprise-class .NET* and J2EE* applications. In addition to supporting the key Managed Runtime Environment (MRTE) application servers (many of which are optimized for it), the processor provides predication, speculation, and explicit parallelism. Working relationships between Intel and Microsoft, BEA, IBM, Oracle, and others ensure that the largest possible number of application servers will perform optimally on 64-bit Intel® architecture.

The Itanium processor's predication feature eliminates many of the performance penalties associated with branched code; this capability is particularly relevant to the branch-intensive code associated with MRTEs such as the .NET and Java platforms. Likewise, those branches benefit from the speculation capabilities of the Itanium processor, which allow data to be pre-fetched from memory very early. The explicit parallelism provided by the Itanium processor allows efficient execution of managed applications built from large numbers of components.


Application Servers Place High Demands on Server Hardware

Application environments based on managed components, including Microsoft .NET* and Java* technology, depend upon large numbers of small method calls. This characteristic introduces a significant amount of overhead associated with frequent branches in the compiled code. This large number of branches increases the incidence of branch mis-prediction by the processor execution unit, which can severely impact performance.

Therefore, controlling the level of branch mis-prediction is of elevated importance in a managed code environment, and the ability of the processor architecture to handle such mis-predictions while still performing acceptably is vital.

In addition, the overhead associated with MRTE features such as garbage collection and security requires additional processor resources during execution. Since the garbage-collection mechanisms in MRTEs require significant computation, those mechanisms are run somewhat sparingly.

In the MRTE environment, memory tends to remain unavailable for some period after it is no longer in actual use, and so the executing platform requires more memory than it would for a corresponding unmanaged (native code) application. It bears mention here that this comparison assumes that the unmanaged application manages memory correctly; simplifying memory management by abstracting it away from the developer is the very reason automatic garbage collection exists.

Given sound coding procedures, however, the scalability of managed enterprise applications is dependent to a higher degree than that of unmanaged applications upon the processor's ability to address large amounts of memory.


Itanium® Microarchitecture Features Meet Specific MRTE Challenges

The Itanium processor has a number of features that address the challenges of branch-intensive code and the need to address large amounts of memory, as described in the preceding section. By breaking through the 4GB barrier of supported memory imposed by traditional 32-bit architectures, the Itanium microarchitecture is able to support the needs of enterprise-scale managed applications to access very large amounts of physical memory. Features that provide robust performance for branch-intensive code include predication, speculation, and the processor's explicitly parallel design.
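The 4GB figure follows directly from pointer width: a 32-bit address can name 2^32 distinct bytes. The following is a minimal sketch of that arithmetic only; the class and method names are illustrative, and the 64-bit result is the theoretical limit of the pointer width, not the amount of physical memory any particular Itanium-based system supports.

```java
import java.math.BigInteger;

final class AddressSpace {
    /** Number of distinct byte addresses a pointer of the given width can name. */
    static BigInteger addressableBytes(int pointerBits) {
        return BigInteger.ONE.shiftLeft(pointerBits); // 2^pointerBits
    }

    public static void main(String[] args) {
        BigInteger gib = BigInteger.ONE.shiftLeft(30); // bytes in one GiB
        System.out.println("32-bit: " + addressableBytes(32).divide(gib) + " GiB");
        System.out.println("64-bit: " + addressableBytes(64).divide(gib) + " GiB");
    }
}
```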

Branch prediction is the traditional means by which processors increase the performance of branched sections of code. By making an informed guess about which branch of the code will be relevant to an individual execution session, processors can execute that branch in advance, thereby having the appropriate output on hand when required.

The better the prediction algorithms are, the better the performance will be, since the processor must discard the output of branches it unnecessarily executes and go back to execute the correct branch. While the benefits of branch prediction nevertheless outweigh the costs of such so-called 'branch mis-prediction', the large number of branches in managed code leads to a significant number of mis-predicted branches, which can reduce performance to unacceptably low levels.

Predication is the Itanium processor's ability to eliminate the performance deficits associated with mis-predicting branches by executing both branches in parallel and discarding the result of the branch that is not needed.
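Java has no syntax for predication, which is applied by the compiler and the hardware, but the idea can be approximated in source form. The sketch below is an analogy under that assumption, with hypothetical class and method names: the first method contains a branch the processor must guess, while the second computes a selector and keeps both candidate results available, discarding the one that is not needed. Whether a given JIT compiler emits predicated instructions for either form is not something this example demonstrates.

```java
final class IfConversion {
    /** Branching form: the processor must predict which path is taken. */
    static int clampBranching(int value, int limit) {
        if (value > limit) {
            return limit;
        }
        return value;
    }

    /** Predicated-style form: both candidates are available; a mask selects one. */
    static int clampPredicated(int value, int limit) {
        // diff is negative exactly when value > limit; long arithmetic avoids overflow.
        long diff = (long) limit - value;
        // Arithmetic shift spreads the sign bit: -1 (all ones) if value > limit, else 0.
        int mask = (int) (diff >> 63);
        // Keep limit where the mask is set, value where it is clear.
        return (limit & mask) | (value & ~mask);
    }

    public static void main(String[] args) {
        System.out.println(clampBranching(10, 5) + " " + clampPredicated(10, 5));
        System.out.println(clampBranching(3, 5) + " " + clampPredicated(3, 5));
    }
}
```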



The simplified figure above demonstrates how branch mis-prediction adds to execution time, and how predication avoids that performance deficit. In the branch-prediction scenario at left, the processor mis-predicts the branch at time point t=1 and must backtrack at t=2 to the other branch, wasting the time it spent executing the first one. Execution then proceeds through two additional branches, one of which the processor predicts correctly and the other of which it mis-predicts. In total, the execution wastes two units of time and finishes at t=9.

By contrast, the predication scenario at right executes both strands of code in each of the three branches, allowing the code to execute without deficits associated with branch mis-prediction and finishing at t=7. In reality, the performance differential between the two models is more significant than what is represented by this diagram, since each branch mis-prediction requires the processor to place instructions at the beginning of the processor pipeline once it discovers the branch mis-prediction. Thus, depending on the specific processor architecture, the output of the remaining branch of code will not be available for a number of clock cycles, until it has passed through the entire pipeline.
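The arithmetic behind the figure can be written as a small cost model. This is a sketch that follows the figure's own simplification, namely that executing both arms under predication costs no extra time; the names are illustrative, and the eight-unit penalty in the last line is an arbitrary value chosen to show the effect of a pipeline refill, not a measured Itanium figure.

```java
final class BranchTimeline {
    /** Total time when each mis-prediction wastes a fixed number of time units. */
    static int predictedTime(int usefulUnits, int mispredictions, int penaltyPerMiss) {
        return usefulUnits + mispredictions * penaltyPerMiss;
    }

    /** Under the figure's simplification, predication loses no time to wrong guesses. */
    static int predicatedTime(int usefulUnits) {
        return usefulUnits;
    }

    public static void main(String[] args) {
        // The figure's numbers: 7 useful units, 2 mis-predictions, 1 wasted unit each.
        System.out.println("prediction:  t=" + predictedTime(7, 2, 1));
        System.out.println("predication: t=" + predicatedTime(7));
        // Hypothetical refill penalty of 8 units per mis-prediction (illustrative only).
        System.out.println("with refill: t=" + predictedTime(7, 2, 8));
    }
}
```

With the figure's values the model reproduces t=9 for branch prediction and t=7 for predication; raising the per-miss penalty shows why the real gap is wider than the diagram suggests.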

Bottlenecks are generally associated, on any processor platform, with the latency involved in exchanging information between the processor and system memory. The continuing advances the computer industry makes in processor speed, relative to memory speed, compound this issue. Various architectures address this latency by implementing pre-fetch and caching techniques to fetch data from memory before the execution engine needs it. In addition to very large multi-level caches, the Itanium microarchitecture also provides speculation.

Speculation is the ability of the Itanium processor to pre-fetch data beyond the next branch in code, unlike many competing architectures, which are only able to fetch data as far in advance as the next branch. By loading all data as far as possible in advance of the actual load instruction, speculation dramatically reduces the effect of latency associated with data exchange between the processor and system memory.

By contrast, standard pre-fetching methods depend upon loading data from memory after the branch begins executing. While pre-fetching increases performance, it is more likely than speculation to cause processor stalls, because the preload may not be complete by the time the processor requires the data. Thus, speculation, like predication, benefits the branch-intensive code associated with .NET and Java application servers.
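Speculation is performed by the compiler and processor rather than by application code, so it cannot be written directly in Java. The sketch below is only a conceptual model, with hypothetical names throughout: the load is issued before the branch, any fault it would raise is recorded instead of thrown, and the result is checked only on the path that actually uses the value. This is loosely analogous to the speculative-load and check instruction pair (ld.s and chk.s) in the Itanium instruction set.

```java
final class SpeculativeLoad {
    /** Result of an early load: a value, or a marker for a deferred fault. */
    static final class Loaded {
        final int value;
        final boolean deferredFault;

        Loaded(int value, boolean deferredFault) {
            this.value = value;
            this.deferredFault = deferredFault;
        }
    }

    /** Attempts the load early; an invalid index is recorded rather than thrown. */
    static Loaded loadSpeculatively(int[] memory, int index) {
        if (index < 0 || index >= memory.length) {
            return new Loaded(0, true);
        }
        return new Loaded(memory[index], false);
    }

    /** Issues the load ahead of the branch and checks it only where it is used. */
    static int sumIfEnabled(int[] memory, int index, boolean enabled, int base) {
        Loaded early = loadSpeculatively(memory, index); // hoisted above the branch
        if (enabled) {
            if (early.deferredFault) {
                // Recovery: repeat the load normally so the real fault surfaces here.
                return base + memory[index];
            }
            return base + early.value;
        }
        return base; // the speculative result and any deferred fault are discarded
    }

    public static void main(String[] args) {
        int[] memory = {10, 20, 30};
        System.out.println(sumIfEnabled(memory, 1, true, 5));
        System.out.println(sumIfEnabled(memory, 99, false, 5));
    }
}
```

The point of deferring the fault is that an early load on a path that turns out not to be taken must never change program behavior; only the path that consumes the value pays for a failed load.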

Applications present instructions to the Itanium processor in bundles of up to three instructions each, which the explicit parallelism of the processor design allows to be executed in parallel. This parallelism enables compilers for managed runtime environments to identify opportunities to bundle together, for concurrent execution, the method calls that are so prevalent in managed code. That simultaneous execution increases code performance, particularly in the presence of a highly optimized Java JIT compiler.

The planned release of the Intel processor code-named Montecito, and the later version code-named Tukwila (formerly known as Tanglewood), will further enhance these parallel processing capabilities. These processors, which were announced by Paul Otellini in his keynote speech at the Fall 2003 Intel® Developer Forum, will put multiple processor cores on a single die. In essence, they will expose two (in the case of Montecito) or more (in the case of Tukwila) Itanium processors to the operating system for each processor chip. These advances will add even further to the performance benefits for managed code that are available from the Itanium microarchitecture.


Development Tools Complete the Picture to Enable Real-World Solutions

A growing number of third-party tools address the needs of 64-bit application development for managed environments, including .NET, Java, or a combination of the two. Rational PurifyPlus*, for example, detects memory problems including pointer errors, corruption, and leaks in managed code.

Pointer errors are a particularly common hazard when porting applications from the 32-bit environment to the 64-bit environment, due to the differences in variable types and pointer addresses. The complexity and expense generally associated with tracing pointer issues make it particularly beneficial to automate the process.

The .NET Framework for 64-bit Intel® architecture, currently in a pre-release version, will provide 64-bit support for Intel Itanium-based computers. Generating 64-bit binary executables is currently possible using Microsoft Visual Studio* .NET. The release of the 64-bit version of the .NET Framework will bring additional .NET functionality to Itanium-based platforms by offering new native functionality.

The Intel® VTune™ Performance Analyzer identifies and aids in the resolution of performance bottlenecks. It integrates directly with Visual Studio .NET for development of .NET applications, and it is also very well suited to Java development environments. An example of its utility in Java development is shown in the Intel® Developer Services article, "Profiling Servlets Running on BEA WebLogic* with Intel® VTune™ Performance Analyzer."


Conclusion

Hardware features that are inherent to the Itanium microarchitecture benefit applications in general, and the managed code in .NET and Java applications in particular. Predication is a great leap forward from branch prediction, since by executing both parts of a branch, it eliminates the costs associated with mis-predicted branches. Speculation allows data to be cached at a far earlier point in application execution than traditional pre-fetch capabilities. The explicit parallelism of the Itanium processor allows bundles of up to three instructions to be executed simultaneously.

These features are very well suited to the nature of enterprise-scale managed applications built on the .NET Framework and the Java platform. The large number of components and method calls, and the branch-intensive properties of such applications, yield good results on Itanium-based systems. Tools from Intel and third-party independent software vendors support the efficient creation of best-of-class applications that are optimized for the platform.

Intel Itanium architecture provides a solid foundation for managed-application performance today, and the rich Itanium processor roadmap promises to improve on that performance steadily in the years to come.

About the Author:

Matt Gillespie is an independent technical author and editor specializing in emerging hardware and software technologies. Before going into business for himself, Matt developed training for software developers at Intel Corporation and worked in Internet Technical Services at California Federal Bank. He spent his early years as a writer and editor in the fields of financial publishing and neuroscience. You can reach him at spanningtree@comcast.net.



