Mandalika's scratchpad [ Work blog @Oracle | Stock Market Notes | My Music Compositions ]



Tuesday, November 21, 2006
 
Java performance on Niagara platform

(The goal of this blog post is not really to educate you on how to tune Java on the UltraSPARC-T1 (Niagara) platform, but to warn you not to rely completely on the out-of-the-box features of Solaris 10 and Java, with a couple of interesting examples.)

Scenario:

Customer XYZ heard very good things about UltraSPARC-T1 (Niagara) based CoolThreads servers and about the out-of-the-box performance of Solaris 10 Update 1 and Java SE 5.0. So he bought a US-T1 based T2000 server and deployed his application on it, running the latest update of Solaris 10 with the latest version of Java SE.

Pop Quiz:
Assuming he didn't tune the application further, out of blind faith in everything he heard, is he getting all the performance he is supposed to get from the Solaris run-time environment and the underlying hardware?

Answer:
No.

Here is why, with a simple example:

The US-T1 chip supports four different page sizes: 8K, 64K, 4M and 256M.
% pagesize -a
8192
65536
4194304
268435456

As long as the Solaris run-time takes care of mapping the heap/stack/anonymous memory/library text of a process onto appropriate page sizes, we don't have to tweak anything for better performance, at least from a dTLB/iTLB hit perspective. However, things are a little different with the Java Virtual Machine (JVM). Java sets its own page size with the memcntl() interface - so the large page OOB feature of Solaris 10 Update 1 (and later) has no impact on Java at all. The following mappings of a native process and a Java process confirm this.

eg.,
Oracle shadow process using 256M pages for ISM (Solaris takes care of this mapping):
0000000380000000    4194304    4194304          -    4194304 256M rwxsR    [ ism shmid=0xb ]

Some anonymous mappings from a Java process (the Java run-time takes care of these mappings):
D8800000   90112   90112   90112       -   4M rwx--    [ anon ]
DE000000  106496  106496  106496       -   4M rwx--    [ anon ]
E4800000   98304   98304   98304       -   4M rwx--    [ anon ]
EA800000   57344   57344   57344       -   4M rwx--    [ anon ]
EE000000   45056   45056   45056       -   4M rwx--    [ anon ]

Had the Solaris run-time handled the above mappings, it would have mapped some of them onto a single 256M page and the rest onto other pages. So, are we losing any performance here by using 4M pages? (Something we cannot gain is a potential loss.) Yes, we are. The following trapstat output hints that at least 12% of CPU time (check the last column; take the minimum of all the %time values) could be regained by switching to a much larger page size (256M in this example). In reality we cannot avoid memory translations completely, so it is safe to assume that the potential gain from 256M pages would be anywhere between 5% and 10%.
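To see why fewer, larger pages matter, here is a back-of-the-envelope count of the TLB entries needed to cover the anonymous segments listed in the pmap output above (segment sizes are in KB; this is only a sketch, since real mappings must also be suitably aligned):

```shell
# Total KB of the five anonymous 4M-page segments shown above
heap_kb=$((90112 + 106496 + 98304 + 57344 + 45056))
echo "total KB: $heap_kb"                                    # ~388 MB
echo "4M pages needed:   $((heap_kb / 4096))"                # dozens of TLB entries
echo "256M pages needed: $(((heap_kb + 262143) / 262144))"   # just a couple
```

With roughly 97 4M translations competing for a small TLB versus 2 256M translations, the drop in dtlb-miss time in the 256M experiment below is no surprise.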

% grep ttl trapstat.txt
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl       |    553995  0.9       711  0.0 |   6623798 11.0      4371  0.0 |12.0
ttl       |    724981  1.3       832  0.0 |   9509112 16.5      5969  0.1 |17.8
ttl       |    753761  1.3       661  0.0 |  11196949 19.7      4601  0.0 |21.1

Why didn't Java run-time use 256M pages even when it could potentially use that large page in this particular scenario?

The answer to this question is pretty simple. Large pages (anything bigger than the default 8K pages) usually improve the performance of a process by reducing the number of CPU cycles spent on virtual-to-physical memory translations. The bigger the page size, the higher the chances for good performance. However, the improvement in CPU performance from large pages is not completely free - we have to sacrifice a little virtual memory to the page alignment requirements. That is, virtual memory consumption increases with the page size in use. When 4M pages are in use, we might lose ~4M at most per mapping. When 256M pages are in use, .. ? Well, you get the idea.

Depending on the heap size, the performance difference between 4M and 256M pages may not be substantial for some applications - but the difference in memory footprint can be big. For this reason, the Java SE development team chose a 4M page size in favor of a normal/smaller memory footprint, and provided a hook for customers who wish to use other page sizes, including 256M, in the form of the -XX:LargePageSizeInBytes=pagesize[K|M] JVM option. That's why Java uses 4M pages by default even when it could use 256M pages.
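The footprint tradeoff is easy to illustrate with a hypothetical 100 MB heap (the heap size is mine, chosen for illustration): with 4M pages it fits in whole pages exactly, while rounding it up to whole 256M pages more than doubles the reserved virtual memory.

```shell
heap=$((100 * 1024 * 1024))        # hypothetical 100 MB heap = 104857600 bytes
p4=$((4 * 1024 * 1024))
p256=$((256 * 1024 * 1024))
# Round a size up to a whole number of pages of the given size
round_up() { echo $(( ($1 + $2 - 1) / $2 * $2 )); }
round_up "$heap" "$p4"      # 104857600 : exactly 25 x 4M pages, no waste
round_up "$heap" "$p256"    # 268435456 : one 256M page, ~156 MB of slack
```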

It is up to the customers to check the dTLB/iTLB miss rates with the trapstat tool (e.g., trapstat -T 5 5) and to decide whether it helps to use 256M pages on Niagara servers with the JVM option -XX:LargePageSizeInBytes=256M. Use pmap -sx <pid> to check the page sizes and the mappings.

eg.,
Some anonymous mappings from a Java process with -XX:LargePageSizeInBytes=256M option:
90000000  262144  262144  262144       - 256M rwx--    [ anon ]
A0000000  524288  524288  524288       - 256M rwx--    [ anon ]
C0000000  262144  262144  262144       - 256M rwx--    [ anon ]
E0000000  262144  262144  262144       - 256M rwx--    [ anon ]
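For reference, a launch line for this kind of 256M-page experiment might look like the following (the heap sizes and jar name are illustrative, not taken from the original measurement):

```shell
java -XX:LargePageSizeInBytes=256M -Xms1024m -Xmx1024m -jar app.jar
```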

Let us check the time spent in virtual-to-physical memory translations again.
% grep ttl trapstat.txt
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl       |    332797  0.5       523  0.0 |   2546510  3.8      2856  0.0 | 4.3
ttl       |    289876  0.4       382  0.0 |   1984921  2.7      3226  0.0 | 3.2
ttl       |    255998  0.4       438  0.0 |   2037992  2.9      2350  0.0 | 3.3

Now scroll up a little and compare the %time columns of the 4M and 256M page experiments. There is a noticeable difference in the dtlb-miss rate - more than 8%. That is, merely switching from 4M to 256M pages regains about 8% of CPU time. Since the CPU no longer wastes those cycles on memory translations, it can do more useful work, and the throughput or response time of the JVM improves.

Another example:

Recent versions of Java SE support parallel garbage collection with the JVM switch -XX:+UseParallelGC. When this option is used on the command line, the Java run-time by default starts as many garbage collection threads as there are processors (including virtual processors). A Niagara server acts like a 32-way server (capable of running 32 threads in parallel) - so running a Java process with -XX:+UseParallelGC may start 32 garbage collection threads, which is probably unnecessarily high. Unless the garbage collection thread count is restricted to a sensible number with another JVM switch, -XX:ParallelGCThreads=<gcthreadcount>, customers may see very high system CPU utilization (> 20%) and misinterpret it as a problem with the Niagara server.
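A hypothetical launch line capping the collector threads might look like this (the thread count and jar name are illustrative; pick a value based on measurement, not guesswork):

```shell
# Without -XX:ParallelGCThreads, the parallel collector starts one GC
# thread per virtual processor - 32 on a T2000
java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -jar app.jar
```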

Moral of the story:

Unless you know the auto-tuning policy of the OS, or of the software that runs on top of it, do NOT rely on the auto-tuning capability alone. Measure the run-time performance of the application and tune it accordingly for better performance.



Sunday, November 19, 2006
 
Solaris: Workaround for incorrect LUN size issue

Scenario:
You created a logical drive of capacity x GB and mapped it, so there is a LUN (Logical Unit Number) of size x GB. When you run the format command, Solaris shows an incorrect size for the logical drive.

eg.,
Partition table showing only 409 GB, where it should show 816 GB.
partition> p
Current partition table (original):
Total disk cylinders available: 53233 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 -    16      133.88MB    (17/0/0)     274193
  1       swap    wu      17 -    33      133.88MB    (17/0/0)     274193
  2     backup    wu       0 - 53232      409.41GB    (53233/0/0)  858595057
  3 unassigned    wm       0                0         (0/0/0)      0
  4 unassigned    wm       0                0         (0/0/0)      0
  5 unassigned    wm       0                0         (0/0/0)      0
  6        usr    wm      34 - 53232      409.15GB    (53199/0/0)  858046671
  7 unassigned    wm       0                0         (0/0/0)      0

It could be a Solaris bug. However, the following steps may fix the issue and show the real size of the LUN.

Note:
I'm no storage expert - just outlining the steps that helped me resolve the issue. There may be better or simpler ways to resolve it that I don't know of yet.

Steps:

Run the following commands as root user:

# touch /reconfigure
# reboot

# format

Select the disk and check the partition table. Do you see the configured size? If yes, you are done - stop here. If not, go to the next step.

eg., continued ..
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@0,0
1. c1t1d0
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@1,0
2. c2t216000C0FFD7E5FBd0 <SUN-StorEdge3510-411I cyl 65533 alt 2 hd 64 sec 408>
/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0/ssd@w216000c0ffd7e5fb,0
3. c2t216000C0FFD7E5FBd1 <SUN-StorEdge3510-411I cyl 47588 alt 2 hd 64 sec 255>
/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0/ssd@w216000c0ffd7e5fb,1

Specify disk (enter its number): 2
selecting c2t216000C0FFD7E5FBd0
[disk formatted]
Warning: Current Disk has mounted partitions.


FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit

format> p

PARTITION MENU:
0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
7 - change `7' partition
select - select a predefined table
modify - modify a predefined partition table
name - name the current table
print - display the current table
label - write partition map and label to the disk
!<cmd> - execute <cmd>, then return
quit

partition> p
Current partition table (original):
Total disk cylinders available: 53233 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 -    16      133.88MB    (17/0/0)     274193
  1       swap    wu      17 -    33      133.88MB    (17/0/0)     274193
  2     backup    wu       0 - 53232      409.41GB    (53233/0/0)  858595057
  3 unassigned    wm       0                0         (0/0/0)      0
  4 unassigned    wm       0                0         (0/0/0)      0
  5 unassigned    wm       0                0         (0/0/0)      0
  6        usr    wm      34 - 53232      409.15GB    (53199/0/0)  858046671
  7 unassigned    wm       0                0         (0/0/0)      0

In this example the issue wasn't resolved just by rebooting the server with the /reconfigure file in the root file system.

Now go back one level by quitting the partition screen, then type the word 'type'. When you are shown the available drive types, select the '0. Auto configure' option by typing 0. Usually this step fixes the issue and shows the right LUN size. Label the disk by selecting the 'label' option, and verify the partition table one more time to confirm that it shows the right size.

eg., continued ..
partition> q

format> type

AVAILABLE DRIVE TYPES:
0. Auto configure
1. Quantum ProDrive 80S
2. Quantum ProDrive 105S
3. CDC Wren IV 94171-344
4. SUN0104
5. SUN0207
6. SUN0327
7. SUN0340
8. SUN0424
9. SUN0535
10. SUN0669
11. SUN1.0G
12. SUN1.05
13. SUN1.3G
14. SUN2.1G
15. SUN2.9G
16. Zip 100
17. Zip 250
18. Peerless 10GB
19. SUN72G
20. SUN-StorEdge3510-411I
21. SUN-StorEdge3510-411I
22. SUN-StorEdge3510-411I
23. other

Specify disk type (enter its number)[21]: 0
c2t216000C0FFD7E5FBd0: configured with capacity of 815.96GB
<SUN-StorEdge3510-411I cyl 65533 alt 2 hd 64 sec 408>
selecting c2t216000C0FFD7E5FBd0
[disk formatted]

format> current
Current Disk = c2t216000C0FFD7E5FBd0
<SUN-StorEdge3510-411I cyl 65533 alt 2 hd 64 sec 408>
/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,qlc@1/fp@0,0/ssd@w216000c0ffd7e5fb,0

format> label
Ready to label disk, continue? y

format> p

PARTITION MENU:
0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
7 - change `7' partition
select - select a predefined table
modify - modify a predefined partition table
name - name the current table
print - display the current table
label - write partition map and label to the disk
!<cmd> - execute <cmd>, then return
quit

partition> p
Current partition table (default):
Total disk cylinders available: 65533 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 -    10      140.25MB    (11/0/0)     287232
  1       swap    wu      11 -    21      140.25MB    (11/0/0)     287232
  2     backup    wu       0 - 65532      815.96GB    (65533/0/0)  1711197696
  3 unassigned    wm       0                0         (0/0/0)      0
  4 unassigned    wm       0                0         (0/0/0)      0
  5 unassigned    wm       0                0         (0/0/0)      0
  6        usr    wm      22 - 65532      815.69GB    (65511/0/0)  1710623232
  7 unassigned    wm       0                0         (0/0/0)      0

partition> q

In this particular example, we can see that the issue is resolved. In some other cases these steps may not fix the real issue; in such scenarios, go ahead and file a bug against Solaris storage management at bugs.opensolaris.org.

Acknowledgements:
Thanks to Robert Cohen for the tip in thread: Solaris 'format' not seeing new size of LUN after expansion on SAN.


Thursday, November 16, 2006
 
Solaris: Disabling Out Of The Box (OOB) Large Page Support

Starting with the release of Solaris 10 1/06 (aka Solaris 10 Update 1), the large page OOB feature turns on MPSS (Multiple Page Size Support) automatically for applications' data (heap) and text (libraries).

One obvious advantage of the large page OOB feature is that it improves the performance of userland applications by reducing the CPU cycles wasted in serving iTLB and dTLB misses. For example, if the heap size of a process is 256M, on a Niagara (UltraSPARC-T1) box it will be mapped onto a single 256M page; on a system that doesn't support large pages, it will be mapped onto 32,768 8K pages. Now imagine having all the words of a story on a single large page versus having them spread across 32,500+ small pages. Which one do you prefer?
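The page-count arithmetic above is easy to verify (just the division, as a sketch):

```shell
heap=$((256 * 1024 * 1024))                # 256M heap
echo "8K pages:   $((heap / 8192))"        # 32768 small pages
echo "256M pages: $((heap / 268435456))"   # 1 large page
```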

However, the large page OOB feature may have a negative impact on some applications - e.g., an application may crash due to some wrong assumption(s) it makes about the page size, or virtual memory consumption may increase due to the way the data and libraries are mapped onto larger pages.

Fortunately Solaris provides a bunch of /etc/system tunables to enable/disable large page OOB support.

/etc/system tunables to disable large page OOB feature

Turning off large page OOB support for heap/stack/anon pages on-the-fly

Setting /etc/system parameters requires a reboot to enable or disable large page OOB support. However, it is possible to set the desired page size for heap/stack/anon pages dynamically, as shown below. Note that the system reverts to the default behavior when rebooted. Depending on how long you need large page support turned off, use mdb or the /etc/system parameters at your discretion.

To turn off large page support for heap, stack and anon pages dynamically, set the following under mdb -kw:
Note:
Java sets its own page size with the memcntl() interface - so the /etc/system changes won't impact Java at all. Consider using the JVM option -XX:LargePageSizeInBytes=pagesize[K|M] to set the desired page size for Java process mappings.

How to check whether disabling large page support is really helping?

Compare the outputs of the following (along with application-specific data) before and after the changes:

How to set the maximum large page size?

Run pagesize -a to get the list of supported page sizes for your platform. Then set the page size of your choice as shown below.

% mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl nca lofs ssd logindmux ptm cpc sppp crypto nfs ipc ]
> auto_lpg_maxszc/W <hex_value>

where:
  hex_value = { 0x0 for 8K,
                0x1 for 64K,
                0x2 for 512K,
                0x3 for 4M,
                0x4 for 32M and
                0x5 for 256M }
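The mapping above can be captured in a tiny helper for scripting around mdb (the function name is mine; the szc codes are as documented above for this platform):

```shell
# Translate a page size label into the szc code expected by auto_lpg_maxszc
szc_for() {
  case "$1" in
    8K)   echo 0x0 ;;
    64K)  echo 0x1 ;;
    512K) echo 0x2 ;;
    4M)   echo 0x3 ;;
    32M)  echo 0x4 ;;
    256M) echo 0x5 ;;
    *)    echo "unsupported page size: $1" >&2; return 1 ;;
  esac
}
szc_for 256M   # prints 0x5
```

One could then feed the result to mdb, e.g. something along the lines of `echo "auto_lpg_maxszc/W $(szc_for 256M)" | mdb -kw` (untested sketch; run at your own risk on a non-production box first).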

How to check the maximum page size in use?

Here is an example from a Niagara box (T2000):
% pagesize -a
8192
65536
4194304
268435456


% mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl nca lofs ssd logindmux ptm cpc sppp crypto nfs ipc ]
> auto_lpg_maxszc/X
auto_lpg_maxszc:
auto_lpg_maxszc:5
> ::quit

See Also:
6287398 vm 1.5 dumps core with -d64

Acknowledgements:
Sang-Suan Sam Gam


Monday, November 13, 2006
 
Sun: OpenJDK

Open source JDK, that is. Sun Microsystems did it again -- as promised during the JavaOne event back in May 2006, Sun made its implementation of the Java Platform, Standard Edition, available to the community under the GNU General Public License (GPLv2).

Note that only the HotSpot Virtual Machine and javac (compiler) components of the early builds of JDK version 7 are open sourced as of now. The rest of the components will be open sourced over time; by the end of the first half of 2007, we should have a fully buildable implementation of JDK 7.

OpenJDK web site and download location

OpenJDK home page:
https://openjdk.dev.java.net/

Source code download location:
JDK 7 build 02 source code

Browsable source code

Frequently Asked Questions:
Free and Open Source Java FAQ

Related information

The following list shows Sun's presence in the open source world:
OpenSolaris
OpenSPARC
OpenJDK
OpenOffice (office suite)
NetBeans (IDE)
GlassFish (Java EE 5 Application Server)
Project Looking Glass (3D desktop)
...



Friday, November 10, 2006
 
Oracle: Explain plan & Tracing a particular SQL

Scenario:
You are on a mission to fix the majority of database-related performance issues in a production environment - so you are actively taking snapshots of the database during peak hours and generating AWR reports from the performance data.

Now you have the list of long-running SQLs under the SQL ordered by Elapsed Time section of the report. One of the next steps is to trace such SQLs to see what happens when they are executed. Since we can extract the SQL identifier (SQL ID) from the AWR report for all top SQLs, tracing can be enabled as shown below.
  1. Get the session id (sid) and serial# for the sql_id from active sessions.
    % sqlplus / as sysdba
    SQL> select sid, serial# from v$session where sql_id='<sql_id>';

    If you wish to see the corresponding SQL text, run the following:

    SQL> select sql_text from v$sql where sql_id='<sql_id>';


  2. Enable SQL tracing for any session as follows:

    SQL> exec dbms_system.set_ev(<sid>, <serial#>, 10046, <level>, '');

    Event 10046 generates detailed information on statement parsing, values of bind variables, and wait events that occurred during a particular session.

    Level = 1, 4, 8 or 12. Check Diagnostic event 10046 for more information about these levels.

    To disable tracing:

    SQL> exec dbms_system.set_ev(<sid>, <serial#>, 10046, 0, '');


  3. Check the trace file(s) under udump directory.

Note:
The above steps may not make much sense with short lived sessions. An alternate option is to enable system wide tracing for all sessions as shown here:
% sqlplus / as sysdba
SQL> alter system set events '10046 trace name context forever, level <level>';

To disable:

SQL> alter system set events '10046 trace name context off';

I'm pretty sure there are better ways to collect this information. I'll update this blog entry when I find simpler alternatives.

Generating explain plan for a SQL

An explain plan holds details about Oracle's decisions, such as whether or not to use an index, and which one to use when more than one index is available. Such a plan can be generated as shown here:
SQL> set pages 100
SQL> set lines 132
SQL> select plan_table_output from table(dbms_xplan.display_cursor('<sql_id>',0));

The generated output will be something similar to:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                           |       |       |    10 (100)|          |
|   1 |  SORT ORDER BY                   |                           |     2 |   448 |     9  (56)| 00:00:01 |
|   2 |   UNION-ALL                      |                           |       |       |            |          |
|   3 |    NESTED LOOPS                  |                           |     1 |   191 |     4   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID  | WF_EVENT_SUBSCRIPTIONS    |     1 |   118 |     3   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN            | WF_EVENT_SUBSCRIPTIONS_N1 |     1 |       |     2   (0)| 00:00:01 |
|*  6 |     TABLE ACCESS BY INDEX ROWID  | WF_EVENTS                 |     1 |    73 |     1   (0)| 00:00:01 |
|*  7 |      INDEX UNIQUE SCAN           | WF_EVENTS_U1              |     1 |       |     0   (0)|          |
|   8 |    NESTED LOOPS                  |                           |     1 |   257 |     5   (0)| 00:00:01 |
|   9 |     NESTED LOOPS                 |                           |     1 |   223 |     5   (0)| 00:00:01 |
|  10 |      NESTED LOOPS                |                           |     1 |   191 |     4   (0)| 00:00:01 |
|* 11 |       TABLE ACCESS BY INDEX ROWID| WF_EVENTS                 |     1 |    73 |     2   (0)| 00:00:01 |
|* 12 |        INDEX UNIQUE SCAN         | WF_EVENTS_U2              |     1 |       |     1   (0)| 00:00:01 |
|* 13 |       TABLE ACCESS BY INDEX ROWID| WF_EVENT_SUBSCRIPTIONS    |     1 |   118 |     2   (0)| 00:00:01 |
|* 14 |        INDEX RANGE SCAN          | WF_EVENT_SUBSCRIPTIONS_N1 |     1 |       |     1   (0)| 00:00:01 |
|* 15 |      TABLE ACCESS BY INDEX ROWID | WF_EVENTS                 |     1 |    32 |     1   (0)| 00:00:01 |
|* 16 |       INDEX UNIQUE SCAN          | WF_EVENTS_U1              |     1 |       |     0   (0)|          |
|* 17 |     INDEX UNIQUE SCAN            | WF_EVENT_GROUPS_U1        |     1 |    34 |     0   (0)|          |
------------------------------------------------------------------------------------------------------------

Acknowledgements:
Ahmed Alomari



Wednesday, November 08, 2006
 
Oracle: Snapshots and AWR report

Starting with the Oracle 10g database management system, Oracle offers a set of scripts to extract performance statistics from the Automatic Workload Repository (AWR) and generate a human-readable report. This report is a starting point for finding (and sometimes fixing, with almost no additional effort) the bottlenecks in the database system.

Oracle automatically generates snapshots of the performance data once every hour and stores the statistics in the workload repository. While diagnosing some issues, it might be necessary to take additional snapshots manually. The create_snapshot procedure can be used to create a database snapshot on demand, as shown below:

    % sqlplus / as sysdba
    SQL> exec dbms_workload_repository.create_snapshot();

Note down the date and time of all such snapshots, and use the corresponding snap IDs when generating an AWR report for a given interval.

How to generate an AWR report?

Simply run awrrpti.sql as shown below. The questions the script asks are pretty straightforward to answer.

    sqlplus / as sysdba
    SQL> @$ORACLE_HOME/rdbms/admin/awrrpti.sql

How to interpret an AWR report?

Here are some useful resources (note that statspack interpretation is still applicable to AWR):
For more information:


