Mandalika's scratchpad



Tuesday, November 21, 2006
 
Java performance on Niagara platform

(The goal of this blog post is not really to educate you on how to tune Java on the UltraSPARC-T1 (Niagara) platform, but to warn you not to rely completely on the out-of-the-box features of Solaris 10 and Java, with a couple of interesting examples.)

Scenario:

Customer XYZ heard very good things about UltraSPARC-T1 (Niagara) based CoolThreads servers and about the out-of-the-box performance of Solaris 10 Update 1 and Java SE 5.0. So he bought a US-T1 based T2000 server and deployed his application on it, running the latest update of Solaris 10 with the latest version of Java SE.

Pop Quiz:
Assuming he didn't tune the application any further, out of blind faith in everything he heard, is he getting all the performance he is supposed to get from the Solaris run-time environment and the underlying hardware?

Answer:
No.

Here is why, with a simple example:

The US-T1 chip supports four different page sizes: 8K, 64K, 4M and 256M.
% pagesize -a
8192
65536
4194304
268435456

As long as the Solaris run-time takes care of mapping the heap/stack/anon/library text of a process onto appropriate page sizes, we don't have to tweak anything for better performance, at least from a dTLB/iTLB hit perspective. However, things are a little different with the Java Virtual Machine (JVM). Java sets its own page size through the memcntl() interface, so the large-page out-of-the-box feature of Solaris 10 Update 1 (and later) has no impact on Java at all. The following mappings of a native process and a Java process confirm this.

eg.,
Oracle shadow process using 256M pages for ISM (Solaris takes care of this mapping):
0000000380000000    4194304    4194304          -    4194304 256M rwxsR    [ ism shmid=0xb ]

Some anonymous mappings from a Java process (the Java run-time takes care of these mappings):
D8800000   90112   90112   90112       -   4M rwx--    [ anon ]
DE000000  106496  106496  106496       -   4M rwx--    [ anon ]
E4800000   98304   98304   98304       -   4M rwx--    [ anon ]
EA800000   57344   57344   57344       -   4M rwx--    [ anon ]
EE000000   45056   45056   45056       -   4M rwx--    [ anon ]
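
(For reference, anonymous mappings like the ones above can be pulled with the pmap command mentioned later in this post; <pid> is the process id of the running Java process.)
% pmap -sx <pid> | grep anon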

Had the Solaris run-time taken care of the above mappings, it would have mapped some of them onto a single 256M page and the rest onto smaller pages. So, are we losing (something we could gain but don't is a potential loss) any performance here by using 4M pages? Yes, we are. The following trapstat output hints that at least 12% of CPU time (check the last column; the minimum of the %tim totals) could be regained by switching to a much larger page size (256M in this example). In reality we cannot avoid memory translations completely, so it is safer to assume that the potential gain from 256M pages would be somewhere between 5% and 10%.

% grep ttl trapstat.txt
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl | 553995 0.9 711 0.0 | 6623798 11.0 4371 0.0 |12.0
ttl | 724981 1.3 832 0.0 | 9509112 16.5 5969 0.1 |17.8
ttl | 753761 1.3 661 0.0 | 11196949 19.7 4601 0.0 |21.1
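
(The trapstat.txt file referenced above can be produced by sampling trapstat for a few intervals and redirecting the output to a file, e.g. with the invocation suggested later in this post.)
% trapstat -T 5 5 > trapstat.txt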

Why didn't the Java run-time use 256M pages, even when it could potentially have used such a large page in this particular scenario?

The answer to this question is pretty simple. Large pages (pages bigger than the default 8K) usually improve the performance of a process by reducing the number of CPU cycles spent on virtual-to-physical memory translations on the fly. The bigger the page size, the better the chances of good performance. However, the improvement in CPU performance from large pages is not completely free: we have to sacrifice a little virtual memory to the page alignment requirements. That is, virtual memory consumption increases with the page size in use. When 4M pages are in use, we might lose ~4M at most. When 256M pages are in use, .. ? Well, you get the idea. Depending on the heap size, the performance difference between 4M and 256M pages might not be substantial for some applications, but there can be a big difference in the memory footprint. Because of this, the Java SE development team chose the 4M page size in favor of a smaller memory footprint, and provided a hook for customers who wish to use a different page size, including 256M, in the form of the -XX:LargePageSizeInBytes=pagesize[K|M] JVM option. That's why Java uses 4M pages by default even when it could use 256M pages.
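
For example, a hypothetical invocation might look like the following. The heap sizes and the main class name are made-up placeholders, not settings from the experiment in this post.
% java -server -Xms2048m -Xmx2048m -XX:LargePageSizeInBytes=256M SomeApp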

It is up to customers to check the dTLB/iTLB miss rates by running the trapstat tool (eg., trapstat -T 5 5) and to decide whether it helps to use 256M pages on Niagara servers with the JVM option -XX:LargePageSizeInBytes=256M. Use pmap -sx <pid> to check the page sizes and the mappings.

eg.,
Some anonymous mappings from a Java process with -XX:LargePageSizeInBytes=256M option:
90000000  262144  262144  262144       - 256M rwx--    [ anon ]
A0000000  524288  524288  524288       - 256M rwx--    [ anon ]
C0000000  262144  262144  262144       - 256M rwx--    [ anon ]
E0000000  262144  262144  262144       - 256M rwx--    [ anon ]

Let us check the time spent on virtual-to-physical memory translations again.
% grep ttl trapstat.txt
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl | 332797 0.5 523 0.0 | 2546510 3.8 2856 0.0 | 4.3
ttl | 289876 0.4 382 0.0 | 1984921 2.7 3226 0.0 | 3.2
ttl | 255998 0.4 438 0.0 | 2037992 2.9 2350 0.0 | 3.3

Now scroll up a little and compare the %tim columns of the 4M and 256M page experiments. There is a noticeable difference in the dtlb-miss rate: more than 8 percentage points. In other words, the gain from merely switching from 4M to 256M pages is roughly 8% of CPU time. Since the CPU is no longer wasting those cycles on memory translations, it can do more useful work, and the throughput or response time of the JVM improves.

Another example:

Recent versions of Java SE support parallel garbage collection via the JVM switch -XX:+UseParallelGC. When this option is specified on the command line, the Java run-time by default starts as many garbage collection threads as there are processors (including virtual processors). A Niagara server acts like a 32-way server (it can run 32 threads in parallel), so running a Java process with -XX:+UseParallelGC may start 32 garbage collection threads, which is probably unnecessarily high. Unless the garbage collection thread count is restricted to a reasonable number with another JVM switch, -XX:ParallelGCThreads=<gcthreadcount>, customers may see very high system CPU utilization (> 20%) and misinterpret it as a problem with the Niagara server. A quick sketch follows.
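
As a rough sketch (the GC thread count of 8 and the main class name are illustrative placeholders, not recommendations), the number of virtual processors can be checked with psrinfo and the garbage collection thread count capped on the java command line:
% psrinfo | wc -l
      32
% java -server -XX:+UseParallelGC -XX:ParallelGCThreads=8 SomeApp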

Moral of the story:

Unless you know the auto-tuning policy of the OS, or of the software that runs on top of it, do NOT just rely on its auto-tuning capability. Measure the run-time performance of the application and tune it accordingly for better performance.


