J2EE* Application Tier Tuning

December 8, 2008 8:00 PM PST


by Dan Middleton


Introduction

This article is the second of two that present a systematic technique for optimizing Java code, specifically addressing a three-tier architecture that is extensible to any multi-tier environment. First read J2EE Environment Tuning, which describes how to performance tune a J2EE environment. Complete those steps before you attempt to optimize the application tier on its own.

This document is an intermediate discussion of performance tuning. It is intended for software developers who have a minimum of one year of Java* development experience, familiarity with developing test workloads, and use of performance and monitoring tools.

Professional assistance is available from Intel® Solution Services to tune your environment with the additional benefit of passing on this knowledge to your staff.


Tuning Strategy

The tuning strategy is the same as described in J2EE Environment Tuning. Continue to look at the system counters and conduct experiments in a closed-loop fashion (Figure 1). The only difference is that you are now focusing on a narrower area.


Figure 1. Closed Loop Testing Methodology.

Collect system data from the other tiers throughout the testing; you may need to refer to it later. Moreover, every time you make a significant change to the application tier, you must re-evaluate the other tiers. After the application tier's performance increases, the bottleneck may move to another element of your environment.


Analysis

Begin with the assumption that bottlenecks in the rest of the environment have been removed. Now analyze the health of the application tier itself. These are the goals:

  • Increase the CPU utilization of the server(s).
  • Decrease CPU utilization that is due to overhead.
  • Utilize the CPU more efficiently through code tuning.
  • Tune the JVM parameters.

 

Increasing the CPU Utilization

Why do you want to increase the CPU utilization? Generally speaking, the point of a computer is to drive data through the CPU for processing. If the CPU is not at 100% load, then the computer is not being completely utilized. Eliminating the bottlenecks in the rest of the environment should increase the processor utilization on the application server(s). If that is not the case (which is quite common), then you must discover why.


I/O—Disk

There could be an Input/Output (I/O) bottleneck. It does not take much disk activity to get processes backlogged. If the disk utilization is above 20%, you may be waiting on the disk. If you are using IDE disks, try replacing them with SCSI. If the disk utilization is still high, you might require a RAID device. If you do not have a RAID device available, you can emulate the effect of one by manually spreading your files across as many physical disks as possible. Make sure to spread the files across physical disks and not just logical disks. The goal is to get as many independent spindles and drive heads moving as possible.


I/O—Network

Another I/O-related problem to investigate is network utilization. This bottleneck could be considered part of the application tier or part of the entire environment depending on where it is most constrained. If you suspect that the network is a bottleneck, measure the traffic using a network sniffer. If the network utilization is not close to 100%, then the network is not your problem.

If it is your problem, there are two solutions. The first is to reduce the size of the data that you are transmitting. For some applications this may be a simple task of removing extraneous data such as unnecessary graphics from transmissions. If changing the data size is not an option, a cost-effective approach is to upgrade just a portion of the network (see Figure 2). This can be implemented using most existing hardware through gigabit over copper. Additionally, some switches like the Intel® Express 510T Switch have the capability of inserting a gigabit fiber module. In either case, implement gigabit connectivity between your hardest hit point and the switch. In Figure 2, one application server is sending data to several Web servers. Each of the Web servers can operate unconstrained at 100 Mbps, but the connection between the application server and the switch is saturated.


Figure 2. Partial Gigabit Network


Synchronization

Synchronization is another very likely reason that your CPU utilization is low. Java programs typically suffer from oversynchronization. Rather than having several runnable threads, only one thread is runnable and all the others must wait on that thread. There are a few ways to resolve this. If the synchronization is within your code, you have the means to remove it. Often the synchronization is unnecessary: the method may have been synchronized only for testing purposes and then left that way. If the synchronization cannot be removed, it can probably be shortened. Usually only a small region or a single object needs to be synchronized, but a much larger chunk of code has been wrapped up with it. Paring the synchronized region down to only what is needed will improve the concurrency of your threads.
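As a minimal sketch of shortening a synchronized region, consider the following. The class OrderLog and its methods are hypothetical names invented for this illustration; they do not come from any application server. The assumption is that only the shared list needs protection, while the formatting step touches no shared state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; names are illustrative only.
class OrderLog {
    private final List<String> entries = new ArrayList<>();

    // Coarse version: the lock is held during formatting as well,
    // even though formatting touches no shared state.
    synchronized void recordCoarse(int orderId, double amount) {
        String line = format(orderId, amount);
        entries.add(line);
    }

    // Narrow version: formatting runs outside the lock, so several
    // threads can format concurrently and only the list update is serialized.
    void recordNarrow(int orderId, double amount) {
        String line = format(orderId, amount);
        synchronized (this) { // same lock as the synchronized methods
            entries.add(line);
        }
    }

    synchronized int size() {
        return entries.size();
    }

    private static String format(int orderId, double amount) {
        return "order=" + orderId + " amount=" + amount;
    }

    public static void main(String[] args) throws InterruptedException {
        OrderLog log = new OrderLog();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int base = t * 1000;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    log.recordNarrow(base + i, i * 1.5);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("entries recorded: " + log.size());
    }
}
```

Both versions lock the same object, so they can coexist safely while you migrate. Whether the narrower lock produces a measurable gain depends on how expensive the work moved outside the lock is, so confirm it with the same closed-loop measurements used elsewhere in this article.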

The synchronization may not even be in your code, though. Application servers provide many services in order to supply a wide array of functionality; therefore, the application server may be doing a number of things that are not necessary for your code. Often this leads to an undue amount of synchronization. Fortunately, there are some ways around it. Clustering may be the solution for you. Clustering is essentially running multiple instances of your application server. Technically, clustering involves communication between these instances but that might not be necessary for your application.

There are two main types of clustering: horizontal and vertical. Horizontal, sometimes called scaling out, refers to running the application on multiple servers. Vertical refers to running multiple instances of the application on the same server. The latter will provide benefits without the need to purchase more servers. If your systems are horizontally clustered, still try vertical clustering on those same machines. Application servers provide different mechanisms with different names such as "cloned" and "managed" that provide different flavors of clustering. Check your application server's documentation to find the right one for your application. Typically 2 to 4 instances of the application will provide the maximum efficiency.

If your application is already clustered and it is still suffering from low processor utilization, then look into the communication within the cluster. Vertical clustering can be maintained with memory-based or disk-based communication. If you have a high amount of I/O with the disk, you are going to have less than optimal performance. Look for application server properties like the persistence store type to change these settings.


Decreasing CPU Overhead

Aside from inefficient code, the main contributor to CPU overhead is context switching. A context switch is the process of saving the machine state of one thread and loading the machine state of the next thread. Every time that occurs, the processor is doing housekeeping work (thread management) instead of work meaningful for your application. A % System Time greater than 25% is a good indication that context switching is a problem, and it may still be a problem at anything above 10%.

The number of threads running directly affects the amount of context switching on the server. Application servers tend to default to around 15 threads per virtual machine. If you are running more than one instance of the application server, you will, of course, be multiplying that amount. Sometimes companies increase the number of threads thinking that it will improve concurrency because there are more threads ready to accept work. In fact, the opposite is usually true. There is an optimal number of threads, and anything more than that will just increase the overhead due to context switching. You will have to determine through experimentation what the optimal number is for your environment. A good number to start with is a small multiple of the number of processors. For example, if you are running a 2-processor machine, look for an optimal number of threads between 4 and 8. If you are on an 8-processor machine, look between 8 and 24.
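The experiment described above can be sketched as a simple sweep over candidate pool sizes. This is only an illustration under stated assumptions: the class ThreadCountSweep, the placeholder CPU-bound job, and the task count are all hypothetical, and a real test should drive your actual workload through the application server's own thread settings.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical harness; names and workload are illustrative only.
class ThreadCountSweep {
    static volatile long sink; // keeps the placeholder work from being optimized away

    // Candidate pool sizes: 1x, 2x, ... maxMultiple x the processor count.
    static int[] candidateSizes(int processors, int maxMultiple) {
        int[] sizes = new int[maxMultiple];
        for (int m = 1; m <= maxMultiple; m++) {
            sizes[m - 1] = processors * m;
        }
        return sizes;
    }

    // Placeholder CPU-bound job standing in for real application work.
    static void placeholderJob() {
        long acc = 0;
        for (int i = 0; i < 2_000_000; i++) {
            acc += i % 7;
        }
        sink = acc;
    }

    // Runs `tasks` jobs on a fixed-size pool and returns the elapsed milliseconds.
    static long timeWorkload(int poolSize, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(ThreadCountSweep::placeholderJob);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int size : candidateSizes(cpus, 4)) {
            System.out.println(size + " threads: " + timeWorkload(size, 200) + " ms");
        }
    }
}
```

Change only the thread count between runs, and watch % System Time alongside the elapsed time, so that each run remains a single-variable experiment.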


Tuning the Code

The lowest level operation is tuning the source code. The reason that tuning the code is delayed for so long is that the benefit of small code changes is always limited by how often the changed code is actually executed. If the CPU is not fully utilized, the code will not come into play as often. For example, suppose you speed up a function by 20%, but that code accounts for only 10% of the run time and the CPU is utilized only 50%. Your actual improvement is then 50% of 10% of 20%, or 1%. This limit is described by Amdahl's Law:


Figure 3. Amdahl's Law
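The arithmetic in the example above can be checked with a short sketch. The class and method names are hypothetical. The first method reproduces the article's rough product estimate; the second uses the standard textbook form of Amdahl's Law, speedup = 1 / ((1 - p) + p / s), under the assumption that "20% faster" means the function takes 80% of its original time (s = 1.25).

```java
// Hypothetical helper; names are illustrative only.
class TuningPayoff {
    // The article's rough estimate: function gain x share of run time x CPU utilization.
    static double estimatedImprovement(double functionGain, double timeShare,
                                       double cpuUtilization) {
        return functionGain * timeShare * cpuUtilization;
    }

    // Standard form of Amdahl's Law: overall speedup when a fraction p
    // of the run time is made s times faster.
    static double amdahlSpeedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        double estimate = estimatedImprovement(0.20, 0.10, 0.50);
        System.out.println("article estimate (fraction): " + estimate);
        System.out.println("Amdahl speedup for p=0.10, s=1.25: " + amdahlSpeedup(0.10, 1.25));
    }
}
```

The estimate comes out at about 0.01 (1%), matching the figure in the text. The Amdahl speedup is roughly 1.02, or about 2% at full CPU utilization, which is consistent with 10% of 20%.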

Given that code tuning can be time consuming, ensure that you get the biggest bang for your buck, which means using a profiler program. A profiler attaches to your program and measures the time taken in each part of the program. Before buying a profiler, try out a few products. Because of the wide variety of uses for Java (J2EE, stand-alone applications, applets, servlets, and so on), Java profilers tend not to work well in every environment. For example, application servers tend to cause problems for profilers because the application servers often implement portions natively and step outside of the official specifications. This is another reason it is beneficial to take your application to a professional services group like Intel® Solution Services. They have a number of profilers on hand and can help you find the right one for your application.

Once you have identified a profiler that works well with your application, make it work well for you. Most profilers have several different ways to sort the information they collect. You will probably want to sort the functions by time taken for the function and its descendants. This shows the critical path and provides the largest improvement for your effort.

As you review the output, watch for areas where you may be "object thrashing," such as repeatedly creating and destroying objects, which causes excess overhead. Every time you create and destroy an object you contribute to time wasted doing garbage collection. Keep in mind that resizing an object will actually create a new larger object and destroy the original. Resizing an object on every iteration of a loop is bound to cause unnecessary overhead. Likewise, many objects are actually composites of several smaller objects. Creating and destroying one object with hundreds of parts is probably slowing down your application. Changing immutable objects like strings will also force the creation and deletion of an object. In short, even though it's easy to create objects and forget about them in Java, you must still treat them conservatively.
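A small sketch of the immutable-string case follows. The class name, loop size, and capacity guess are assumptions chosen for illustration. Concatenating with += creates a new String on each iteration, while a single pre-sized StringBuilder is reused throughout.

```java
// Hypothetical example; names and sizes are illustrative only.
class ObjectChurn {
    // Each += builds a brand-new String and discards the previous one.
    static String joinWithConcat(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += i + ",";
        }
        return s;
    }

    // One builder is reused; the initial capacity is a rough guess (assumption)
    // intended to reduce internal resizing.
    static String joinWithBuilder(int n) {
        StringBuilder sb = new StringBuilder(n * 6);
        for (int i = 0; i < n; i++) {
            sb.append(i).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 20_000;
        long t0 = System.nanoTime();
        String a = joinWithConcat(n);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(n);
        long t2 = System.nanoTime();
        System.out.println("same result: " + a.equals(b));
        System.out.println("concat ms:  " + (t1 - t0) / 1_000_000);
        System.out.println("builder ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The timings from a single run like this are only indicative. Use your profiler to confirm that object creation is actually on the critical path before rewriting code.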


Java Virtual Machine Tuning

Changes to the virtual machine can also improve performance. Selecting the Java Virtual Machine (JVM) vendor is important. Some application servers ship with the default Sun JVM. This JVM does not score as well as others on Intel-based platforms according to ECPerf benchmarks. Depending on your application server, you may not have the luxury of changing the underlying JVM.

The next big impact on the JVM is setting the heap size. Generally speaking more is better, but there is often an optimal number. If the heap size is too small, the garbage collector must run frequently; if it is too large, the garbage collector runs infrequently but for a long time. You can often see the effect of garbage collection by monitoring the processor utilization for each processor. Figure 4 shows an 8-processor machine. Each processor is operating at around 75% until the garbage collection kicks off. At that point all but one processor is inactive. Vertical clustering can offset the impact of garbage collection because it will take place independently on each VM. That way there should always be at least one JVM available to do meaningful work.


Figure 4. Typical Effect of Garbage Collection
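One way to observe how heap size affects collection frequency is to read the JVM's own garbage collection counters. The sketch below uses the standard java.lang.management API; the class name GcReport and the allocation loop are hypothetical and stand in for your real workload.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical reporter; the allocation loop is a stand-in for real work.
class GcReport {
    static volatile Object sink; // forces allocations to escape so they are not optimized away

    // Sum of collection counts across all collectors (undefined counts are reported as -1).
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds across all collectors.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        long before = totalCollections();
        for (int i = 0; i < 200_000; i++) {
            sink = new byte[1024]; // short-lived garbage
        }
        System.out.println("max heap (MB): " + maxMb);
        System.out.println("collections during loop: " + (totalCollections() - before));
        System.out.println("total GC time (ms): " + totalCollectionMillis());
    }
}
```

Running the same program with different -Xms and -Xmx settings should show the trade-off described above: smaller heaps tend to produce more collections, larger heaps fewer but potentially longer ones. The exact numbers depend on the JVM and collector in use.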

A number of JVM and compiler options can alter performance. Building your application with the -O flag (optimize) can in principle increase performance, but in practice it generally has no effect. (Some JDK versions completely ignore the -O flag.) The optimization flag instructs the compiler to inline methods (such as methods declared static final) and remove unused methods, though these tasks are increasingly accomplished by runtime features such as HotSpot.

Several JVM options can hurt performance, such as Trace, Verbose, VerboseGC, NoJIT, NoClassGC, NoAsyncGC, MaxJStack, and Verify. Some JVMs provide an incremental garbage collector (-incgc) that offers some improvement. The impact varies from application to application, but generally tuning these options does not provide much of an improvement, so leave them for last and do not spend too much time on them.


Conclusion

The most important lesson to take away from this article is the importance of a structured approach to optimization. Evaluating your software environment from the top down and using the scientific method of one change per experiment is the most efficient use of your efforts. The other critical piece is to understand the indicators of your systems' health such as processor utilization, system time, and context switches. If at any stage of your tuning you would benefit from additional optimization experience, please contact Intel® Solution Services.


About the Author

Dan Middleton has worked in the computer industry for the past eight years. He has been employed on a number of development projects ranging from medical imaging software to enterprise business applications. For the last two years he has been with Intel Corporation where he works with independent software vendors to optimize their products and evaluate their products' scalability. In addition to his work with Java enterprise applications, Dan also specializes in 3D graphics programming.