by Paul Steinberg, Sharat Agarwal, and George Vorobiov
In collaboration with Intel, ISVs like Microsoft and BEA are redefining the performance, scalability, and security of solutions based on Managed Runtime Environments.
A Managed Runtime Environment (MRTE) works best when the architecture is tuned to support it. Working with ISVs, including BEA and Microsoft, to understand the relationship of MRTEs to the hardware, Intel has built support for both Java and .NET MRTEs into the Intel® architecture. Future generations of Intel® processors will build on these gains.
The BEA WebLogic JRockit* Java Virtual Machine (JVM)* is optimized for Intel architecture. Intel's work with BEA to optimize this platform for current and future Intel® hardware is ongoing. Some key platform features have already been incorporated into the JVM.
Hyper-Threading Technology Satisfies Java's* Hunger for Threads
With Hyper-Threading Technology, one physical processor functions as two logical processors. Since Java has an insatiable demand for threads, it can benefit from this Intel innovation more than any other commercial language.
When JVMs can schedule threads to execute simultaneously on multiple logical processors, performance improves. The execution resources on the chip can also be used more efficiently than when a single thread consumes those resources. As Hyper-Threading technology becomes more prevalent on desktop processors, Java performance will benefit greatly by interleaving the JVM threads.
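As a rough illustration of scheduling work across logical processors, the sketch below sizes a thread pool to the number of processors the JVM reports and splits a CPU-bound sum across it. The class and method names are hypothetical and not part of JRockit. It assumes that the operating system exposes each Hyper-Threading logical processor to the JVM, so that `Runtime.availableProcessors()` counts them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; not part of any JVM or Intel API.
class LogicalProcessorPool {
    // Sums the integers in [from, to); stands in for any CPU-bound task.
    static long sumRange(long from, long to) {
        long total = 0;
        for (long i = from; i < to; i++) {
            total += i;
        }
        return total;
    }

    // Splits [0, n) into one chunk per reported processor and sums the chunks in parallel.
    static long parallelSum(long n) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            long chunk = (n + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final long from = Math.min(n, w * chunk);
                final long to = Math.min(n, from + chunk);
                Callable<Long> task = () -> sumRange(from, to);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `LogicalProcessorPool.parallelSum(1000)` returns 499500 regardless of how many processors are reported; only the degree of parallelism changes.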
The JRockit* Compiler Generates Native Code Optimized for Intel Platforms
JRockit supports several garbage-collection techniques. Although it is unavoidable, garbage collection is one of the big performance bottlenecks in Java. The JVM needs to reclaim memory in order to do its job effectively. JRockit techniques include generational concurrent, generational copy, single concurrent, single copy, and parallel.
Each of the garbage collection algorithms has its own merits and enhances certain kinds of applications. For instance, a concurrent garbage collector does most of its work at the same time that the Java application executes. Thus, it is ideally suited for applications and workloads with strict response-time criteria, because it introduces minimal latencies.
JRockit also supports two different thread systems: the High-Performance Threading system and Native Threads.
Under the High-Performance Threading system, several Java* threads are run under the same operating-system thread. These Java threads are relatively lightweight and demand fewer system resources than an operating-system thread. JRockit optimizes thread scheduling, thread switching, and thread synchronization with less memory, so a higher number of threads are run more efficiently.
The Native Threads system maps each Java thread directly to an operating-system thread and utilizes the operating system's thread-scheduling and load-balancing policies.
Figure 1: BEA JRockit supports two thread systems: Native Threads (the default) maps one application thread to each operating-system thread; High-Performance Threads maps multiple application threads to each operating-system thread.
Finally, JRockit supports synchronization enhancements. Thin locks provide low-overhead lock acquisition and release mechanisms. A lock is inflated to a "fat" lock only when it is contested, reducing the overhead associated with acquiring or releasing uncontested locks. Fat locks are more than four times as expensive as thin locks, so policies that inflate only when necessary keep costs as low as possible.
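The idea of inflating a lock only under contention can be sketched in plain Java. The class below is a hypothetical teaching model, not JRockit's implementation: its thin path is a single compare-and-set on an owner field, and the first contended acquisition permanently switches the lock to a heavier queued lock (here a `ReentrantLock`, chosen only for illustration). The sketch is deliberately non-reentrant to keep it short.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, non-reentrant lock that starts thin and inflates on contention.
class InflatingLock {
    private final AtomicReference<Thread> thinOwner = new AtomicReference<>();
    private final ReentrantLock fatLock = new ReentrantLock();
    private volatile boolean inflated = false;

    void lock() {
        Thread me = Thread.currentThread();
        // Thin path: one compare-and-set, no queueing, while nobody has contended.
        if (!inflated && thinOwner.compareAndSet(null, me)) {
            if (!inflated) {
                return;
            }
            thinOwner.set(null); // inflation raced with us; back out and go fat
        }
        // Fat path: mark the lock inflated, then queue on the heavier lock.
        inflated = true;
        fatLock.lock();
        while (thinOwner.get() != null) {
            Thread.yield(); // wait for the last thin holder to leave
        }
    }

    void unlock() {
        if (thinOwner.get() == Thread.currentThread()) {
            thinOwner.set(null); // thin release
        } else {
            fatLock.unlock();    // fat release
        }
    }

    boolean isInflated() {
        return inflated;
    }

    // Runs several threads that increment a shared counter under one lock.
    static int contendedCount(int threads, int perThread) {
        InflatingLock lock = new InflatingLock();
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }
}
```

A lock that is only ever used by one thread never leaves the thin path, so `isInflated()` stays false. The contended helper still produces an exact count because every acquisition after inflation goes through the fat lock.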
Thus, a key question in terms of performance is when to invoke the Just-In-Time (JIT) compiler and how often. JIT optimizations may make use of profile-guided or sampling-based methods of JIT-compiling byte code.
JIT Optimizations Add Intelligence to Branching
State-of-the-art JVM implementations include sophisticated JIT-optimizing compilers to minimize application execution time. These tools take advantage of a static call graph that is inferred from the byte-code structure and execution-time profiling to determine the most feasible code-generation methodology.
Because of the intrusive, on-the-fly nature of JIT compilation, optimizations in this phase must be limited. Although branching and indirection can be reduced by intelligent JIT compilation, a large burden still falls on the processor to quickly execute this complex code stream. The Pentium® 4 and Intel® Xeon™ processors provide tailored support for code that is heavy with branching and indirection.
Advanced branch-prediction circuitry can reduce delays for a correctly predicted branch to almost zero clock cycles. The JIT cooperates with the processor by generating code that correctly predicts branches in most cases. An L2 data cache of 512KB manages indirection efficiently. This large cache, combined with support for speculative loads (loading memory before a branch is resolved), results in excellent throughput and utilization of memory bandwidth.
The combination of Java and Intel architecture is designed for speed. On the processor side, the relevant features include a 20-stage execution pipeline, double-clock arithmetic units, a 12K micro-operation instruction-trace cache, a 512KB L2 cache, and support for Hyper-Threading Technology.
To satisfy the needs of Java applications for faster processors, Intel supplies ever-increasing clock rates on both the Pentium 4 and Intel Xeon processors. To keep a cutting-edge processor fed with instructions and data, these processors interface with the system bus at an effective data-transfer rate of 3.2GB per second.
.NET Optimizes Garbage Collection, Threading, and Synchronization
Intel is working with Microsoft to improve CLR performance and to make sure that Microsoft .NET runs best on Intel architecture. Improving the runtime is essential in three areas: garbage collection, threading, and synchronization.
The CLR uses a generational "mark and sweep" garbage-collection algorithm that is based on the assumption that the most recently created objects have the shortest time to live. The premise of this strategy is that the majority of small objects will be reclaimed during the garbage collection for generation 0 (the youngest generation); large objects are placed on a separate heap. As a result of the Intel-Microsoft .NET optimization work, the amount of memory allocated for generation 0 is adjusted dynamically according to cache size.
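The generational premise can be shown with a toy model. The sketch below is written in Java for consistency with the rest of this article and is not the CLR's collector. It assumes only two generations, represents reachability with a simple flag instead of a real mark phase, and takes the generation 0 budget as a constructor argument where a real runtime might derive it from the cache size. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy two-generation heap; a simplified model, not a real collector.
class ToyGenerationalHeap {
    static final class Obj {
        final String name;
        boolean reachable; // stands in for the result of a real mark phase

        Obj(String name, boolean reachable) {
            this.name = name;
            this.reachable = reachable;
        }
    }

    private final List<Obj> gen0 = new ArrayList<>();
    private final List<Obj> gen1 = new ArrayList<>();
    private final int gen0Budget; // assumed to be tuned to the cache size

    ToyGenerationalHeap(int gen0Budget) {
        this.gen0Budget = gen0Budget;
    }

    // Allocates into generation 0, collecting it first if the budget is used up.
    Obj allocate(String name, boolean reachable) {
        if (gen0.size() >= gen0Budget) {
            collectGen0();
        }
        Obj obj = new Obj(name, reachable);
        gen0.add(obj);
        return obj;
    }

    // Promotes survivors to generation 1 and returns how many objects were reclaimed.
    int collectGen0() {
        int reclaimed = 0;
        for (Obj obj : gen0) {
            if (obj.reachable) {
                gen1.add(obj);
            } else {
                reclaimed++;
            }
        }
        gen0.clear();
        return reclaimed;
    }

    int gen0Size() {
        return gen0.size();
    }

    int gen1Size() {
        return gen1.size();
    }
}
```

With a budget of four objects, allocating three short-lived objects and one long-lived object and then collecting generation 0 reclaims three objects and promotes one, which is the pattern the generational assumption expects to be common.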
Another garbage-collection optimization modifies the way the mechanism relocates data that survived collection to align double values on natural boundaries, thereby avoiding cache-line splits.
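The arithmetic behind this alignment is small enough to show directly. The sketch below rounds an address up to the next 8-byte boundary and checks whether an 8-byte value starting at a given address would straddle two cache lines. The 64-byte line size and the class and method names are assumptions made for illustration; the article does not state the line size.

```java
// Hypothetical helper showing natural alignment of 8-byte doubles.
class DoubleAlignment {
    static final int CACHE_LINE = 64;  // assumed cache-line size in bytes
    static final int DOUBLE_SIZE = 8;  // size of a double in bytes

    // Rounds address up to the next multiple of alignment (a power of two).
    static long alignUp(long address, int alignment) {
        return (address + alignment - 1) & ~(long) (alignment - 1);
    }

    // True if a double stored at address would span two cache lines.
    static boolean splitsCacheLine(long address) {
        long firstLine = address / CACHE_LINE;
        long lastLine = (address + DOUBLE_SIZE - 1) / CACHE_LINE;
        return firstLine != lastLine;
    }
}
```

A double placed at address 60 would occupy bytes 60 through 67 and split across two lines, while `alignUp(60, 8)` moves it to address 64, where it fits in one line. Because the line size is a multiple of 8, a naturally aligned double can never split.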
Locking-mechanism optimization uses thin locks to provide an improved lock acquisition/release mechanism for uncontested locks. This method is based on the analysis of different CLR-based managed applications, showing that most of the lock requests were uncontested. Better support for Hyper-Threading technology is provided based on the use of the store instruction as opposed to lock cmpxchg to release the object lock.
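The difference between the two release strategies can be approximated in Java, used here for consistency with the rest of this article. In the hypothetical spin lock below, acquisition needs an atomic compare-and-set, but release can be either a second compare-and-set or a simple ordered store (`AtomicInteger.lazySet`). Whether a given JIT compiles these calls to `lock cmpxchg` and a plain store respectively is implementation-dependent and is assumed here only for illustration; this is not the CLR's code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical spin lock contrasting two ways of releasing a lock word.
class StoreReleaseLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = free, 1 = held

    // Acquisition must be an atomic read-modify-write.
    void lock() {
        while (!state.compareAndSet(0, 1)) {
            Thread.yield();
        }
    }

    // Release with another atomic read-modify-write.
    void unlockWithCas() {
        state.compareAndSet(1, 0);
    }

    // Release with an ordered store; only the holder writes, so no atomic RMW is needed.
    void unlockWithStore() {
        state.lazySet(0);
    }

    // Runs several threads that increment a shared counter, releasing via a store.
    static int countWithStoreRelease(int threads, int perThread) {
        StoreReleaseLock lock = new StoreReleaseLock();
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlockWithStore();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }
}
```

The store-based release is safe because only the current holder ever writes the free value, so there is nothing to compare against.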
The Intel-specific instruction-set features enable SSE-based dynamic code generation when running on Intel architecture. They also modify the CLR JIT mechanism to emit specialized implementations for the Dbl2Lng() method (a common source of CLR hotspots) if SSE extensions are supported by the processor.
Both the Java and .NET MRTEs provide significant benefits to the enterprise. IT managers can achieve better time to market and feel confident in more-secure, lower-cost systems. Developers can be far more productive when runtimes handle mundane concerns such as memory management and security.
For instance, memory management has been one of the trickiest tasks for developers to handle. Indeed, the ability to understand, keep track of, and properly reference and de-reference pointers was the hallmark of a master programmer.
Today, all managed runtimes contain mechanisms for automatic memory management. Typically, these mechanisms include pointer counting, array-bounds checking, and the allocation and de-allocation of objects.
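Array-bounds checking is the easiest of these mechanisms to observe. In the minimal Java sketch below (the class and method names are hypothetical), an out-of-range index is rejected by the runtime with an exception instead of reading memory outside the array.

```java
// Hypothetical example of runtime array-bounds checking.
class BoundsCheckDemo {
    // Returns the element at index, or a message if the runtime rejects the access.
    static String readElement(int[] data, int index) {
        try {
            return "value=" + data[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            // The runtime checked the index before touching memory.
            return "rejected index " + index;
        }
    }
}
```

For a three-element array `{1, 2, 3}`, `readElement(data, 1)` returns `value=2`, while `readElement(data, 3)` returns `rejected index 3`.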
- Managed code is safer: built-in security conventions help prevent abuse by malicious code and careless programming.
- Deployment is faster: the common runtime frameworks enable development for and deployment to many different platforms and languages without complex recoding.
- Applications are stable: both the programmer and the IT staff can take advantage of platform runtimes for optimum code performance and stability.
Abundant computing power makes MRTEs practical now. Managed runtimes will force the developer and the IT manager to rethink what makes an application fast.
- Intel® Developer Zone Java Resources brings together articles and related resources from the software industry's most qualified developers and analysts.
- The February, 2003 issue of the Intel® Technology Journal is dedicated to managed runtime technologies as they relate to the current topics that developers value most [pdf, 976K]: http://download.intel.com/technology/itj/2003/volume07issue01/vol7iss1_managed_runtime_technologies.pdf
- The JavaOne Conference is the pre-eminent Java event, with technical presentations on all that is new in the Java sphere: http://www.oracle.com/us/javaonedevelop/index.html*
Intel, the world's largest chipmaker, also provides an array of value-added products and information to software developers:
- Intel® Software Partner Home provides software vendors with Intel's latest technologies, helping member companies to improve product lines and grow market share.
- Intel® Developer Zone offers free articles and training to help software developers maximize code performance and minimize time and effort.
- Intel Software Development Products include Compilers, Performance Analyzers, Performance Libraries and Threading Tools.
- IT@Intel, through a series of white papers, case studies, and other materials, describes the lessons it has learned in identifying, evaluating, and deploying new technologies.
- The Intel® Academic Community Educational Exchange provides a one-stop shop at Intel for training developers on leading-edge software-development technologies. Training consists of online and instructor-led courses covering all Intel® architectures, platforms, tools, and technologies.