Tiered compilation, introduced in Java SE 7, brings client startup speeds to the server VM. Normally, a server VM uses the interpreter to collect profiling information about methods that is fed into the compiler. In the tiered scheme, in addition to the interpreter, the client compiler is used to generate compiled versions of methods that collect profiling information about themselves. Since the compiled code is substantially faster than the interpreter, the program executes with greater performance during the profiling phase. In many cases, a startup that is even faster than with the client VM can be achieved because the final code produced by the server compiler may be already available during the early stages of application initialization. The tiered scheme can also achieve better peak performance than a regular server VM because the faster profiling phase allows a longer period of profiling, which may yield better optimization.
Tiered compilation is now the default mode for the server VM. Both 32 and 64 bit modes are supported, as well as compressed oops (see the next section).
An "oop", or ordinary object pointer in Java Hotspot parlance, is a managed pointer to an object. An oop is normally the same size as a native machine pointer, which means 64 bits on an LP64 system. On an ILP32 system, maximum heap size is somewhat less than 4 gigabytes, which is insufficient for many applications. On an LP64 system, the heap used by a given program might have to be around 1.5 times larger than when it is run on an ILP32 system. This requirement is due to the expanded size of managed pointers. Memory is inexpensive, but these days bandwidth and cache are in short supply, so significantly increasing the size of the heap and only getting just over the 4 gigabyte limit is undesirable.
Managed pointers in the Java heap point to objects which are aligned on 8-byte address boundaries. Compressed oops represent managed pointers (in many but not all places in the JVM software) as 32-bit object offsets from the 64-bit Java heap base address. Because they're object offsets rather than byte offsets, they can be used to address up to four billion objects (not bytes), or a heap size of up to about 32 gigabytes. To use them, they must be scaled by a factor of 8 and added to the Java heap base address to find the object to which they refer. Object sizes using compressed oops are comparable to those in ILP32 mode.
The term decode is used to express the operation by which a 32-bit compressed oop is converted into a 64-bit native address into the managed heap. The inverse operation is referred to as encoding.
Compressed oops is supported and enabled by default in Java SE 6u23 and
later.
In Java SE 7, use of compressed oops is the default for 64-bit JVM
processes when -Xmx
isn't specified and for values of
-Xmx
less than 32 gigabytes. For JDK 6 before the 6u23
release, use the
-XX:+UseCompressedOops
flag with the java
command to enable the feature.
When using compressed oops in a 64-bit Java Virtual Machine process, the JVM software asks the operating system to reserve memory for the Java heap starting at virtual address zero. If the operating system supports such a request and can reserve memory for the Java heap at virtual address zero, then zero-based compressed oops are used.
Use of zero-based compressed oops means that a 64-bit pointer can be decoded from a 32-bit object offset without adding in the Java heap base address. For heap sizes less than 4 gigabytes, the JVM software can use a byte offset instead of an object offset and thus also avoid scaling the offset by 8. Encoding a 64-bit address into a 32-bit offset is correspondingly efficient.
For Java heap sizes up around 26 gigabytes, any of Solaris, Linux, and Windows operating systems will typically be able to allocate the Java heap at virtual address zero.
Escape analysis is a technique by which the Java Hotspot Server Compiler can analyze the scope of a new object's uses and decide whether to allocate it on the Java heap.
Escape analysis is supported and enabled by default in Java SE 6u23 and later.
The Java Hotspot Server Compiler implements the flow-insensitive escape analysis algorithm described in:
[Choi99] Jong-Deok Choi, Manish Gupta, Mauricio Seffano, Vugranam C. Sreedhar, Sam Midkiff, "Escape Analysis for Java", Procedings of ACM SIGPLAN OOPSLA Conference, November 1, 1999
Based on escape analysis, an object's escape state might be one of the following:
GlobalEscape
– An object escapes the method
and thread. For example, an object stored in a static field, or,
stored in a field of an escaped object, or, returned as the result
of the current method.ArgEscape
– An object passed as an argument
or referenced by an argument but does not globally escape during a
call. This state is determined by analyzing the bytecode of called
method.NoEscape
– A scalar replaceable object,
meaning its allocation could be removed from generated code.After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does not replace a heap allocation with a stack allocation for non-globally escaping objects.
Some scenarios for escape analysis are described next.
The server compiler might eliminate certain object allocations. Consider the example where a method makes a defensive copy of an object and returns the copy to the caller.
public class Person { private String name; private int age; public Person(String personName, int personAge) { name = personName; age = personAge; } public Person(Person p) { this(p.getName(), p.getAge()); } public int getName() { return name; } public int getAge() { return age; } } public class Employee { private Person person; // makes a defensive copy to protect against modifications by caller public Person getPerson() { return new Person(person) }; public void printEmployeeDetail(Employee emp) { Person person = emp.getPerson(); // this caller does not modify the object, so defensive copy was unnecessary System.out.println ("Employee's name: " + person.getName() + "; age: " + person.getAge()); } }
The method makes a copy to prevent modification of the original
object by the caller. If the compiler determines that the
getPerson
method is being invoked in a loop, it will
inline that method. In addition, through escape analysis, if the
compiler determines that the original object is never modified, it
might optimize and eliminate the call to make a copy.
The server compiler might eliminate synchronization blocks
(lock elision) if it determines that an object is thread
local. For example, methods of classes such as
StringBuffer
and Vector
are synchronized
because they can be accessed by different threads. However, in most
scenarios, they are used in a thread local manner. In cases where
the usage is thread local, the compiler might optimize and remove
the synchronization blocks.
The Parallel Scavenger garbage collector has been extended to take advantage of machines with NUMA (Non Uniform Memory Access) architecture. Most modern computers are based on NUMA architecture, in which it takes a different amount of time to access different parts of memory. Typically, every processor in the system has a local memory that provides low access latency and high bandwidth, and remote memory that is considerably slower to access.
In the Java HotSpot Virtual Machine, the NUMA-aware allocator has been implemented to take advantage of such systems and provide automatic memory placement optimizations for Java applications. The allocator controls the eden space of the young generation of the heap, where most of the new objects are created. The allocator divides the space into regions each of which is placed in the memory of a specific node. The allocator relies on a hypothesis that a thread that allocates the object will be the most likely to use the object. To ensure the fastest access to the new object, the allocator places it in the region local to the allocating thread. The regions can be dynamically resized to reflect the allocation rate of the application threads running on different nodes. That makes it possible to increase performance even of single-threaded applications. In addition, "from" and "to" survivor spaces of the young generation, the old generation, and the permanent generation have page interleaving turned on for them. This ensures that all threads have equal access latencies to these spaces on average.
The NUMA-aware allocator is available on the Solaris™ operating system starting in Solaris 9 12/02 and on the Linux operating system starting in Linux kernel 2.6.19 and glibc 2.6.1.
The NUMA-aware allocator can be turned on with the
-XX:+UseNUMA
flag in conjunction with the selection of
the Parallel Scavenger garbage collector. The Parallel Scavenger
garbage collector is the default for a server-class machine. The
Parallel Scavenger garbage collector can also be turned on
explicitly by specifying the -XX:+UseParallelGC
option.
The -XX:+UseNUMA
flag was added in Java SE 6u2.
When evaluated against the SPEC JBB 2005 benchmark on an 8-chip Opteron machine, NUMA-aware systems showed the following performance increases: