In general, we can classify the customers into:
- those who want the latest copy of the application, but do not want to upgrade their OS to a new version
- those who do not want to upgrade the application, but want to upgrade their OS to a new version
- those who do not want to upgrade either the application or the OS version
- those who want to upgrade both the application and the OS version
Decision Making: Worth considering Solaris 10?
Let's start with some facts
- Solaris 10 is FREE, even for commercial use
  - Hoooray! Way to go, Sun! =)
- Solaris 10 is a faster OS with tons of new features
  - Heard a lot about it at blogs.sun.com, and I'm aware of the 2 million downloads too. Really looking forward to using DTrace and (virtual) zones on my Sun Fire v1280
- Solaris 10 is not a supported platform for Siebel 7.7
  - I know that only Solaris 8 & 9 are the supported versions, and I'm pretty sure that I will not be getting any support for issues on Solaris 10. But I have confidence in Sun's promise on ABI and compatibility (see the last item in this list)
- Siebel Enterprise server takes care of the application logic, and all the application-specific data is actually stored in a database
  - This is the most important thing, and it is what drives me to make this bold attempt. Because of this, I know that I'm not going to lose any data; in the worst case, I may not be able to run Siebel 7.7 on Solaris 10. That's fine with me
- Siebel 8.0 will be available on Solaris 10 (or 10U1)
  - Nice; Win-Win-Win situation for Sun-Siebel-Customer. But I can't wait until Siebel 8.0 is available
- Sun always boasts about stable ABI (Application Binary Interface) & binary compatibility
  - Hope Sun and Siebel won't let me down by introducing compatibility issues
Binary compatibility
Binary compatibility of an OS is the ability to run applications that were built for one version of the OS on later versions of the OS, without having to change or rebuild the application; the same application may not run on earlier versions of the operating system, though.
Because of Solaris' binary compatibility (ABI stability), all the applications that were compiled on previous versions of Solaris continue to run on later versions of the OS, unless the application steps outside the standard interfaces and abuses some non-standard or internal interfaces of the OS. All the non-standard, internal stuff is bound to break at any time, and that's the reason why there is so much insistence on using standard interfaces. If the application continues to work on later versions of the OS even though it uses non-standard interfaces, it is mere luck, and there is no guarantee that it is going to work in future releases of the OS.
Okay, okay! I'm convinced about Sun's (Solaris) binary compatibility, but I'm not sure about Siebel's compliance with standard interfaces. Solaris 8 and later versions ship a tool called appcert to examine an application's use of unstable Solaris interfaces.
From the man page of appcert:
appcert checks for:
- Private symbol usage in Solaris libraries
  - These are private symbols, that is, functions or data, that are not intended for developer consumption
- Static linking
  - In particular, this refers to static linking of archives libc.a, libsocket.a, and libnsl.a. Because the semantics of private symbol calls from one Solaris library to another can change from one release to another, it is not a good practice to hardwire library code into your binary objects
- Unbound symbols
  - These are library symbols (that is, functions or data) that the dynamic linker could not resolve when appcert was run. This might be an environment problem (for example, LD_LIBRARY_PATH) or a build problem (for example, not specifying -llib and/or -z defs with compiling). They are flagged to point these problems out and in case a more serious problem is indicated
So I ran the Siebel binaries through the appcert tool. Since appcert has the ability to examine all executables and shared libraries of a product, I just specified the top-level directory of the Siebel installation as the argument to appcert. It took quite a while (nearly 2-3 hrs) to examine all the application binaries, and it finally printed a detailed report. From the report, it appears that most of the warnings are about unbound symbols.
Sample appcert session:
% appcert libsslcshar.so
finding executables and shared libraries to check ...
Shared libraries were found in the application and the
following directories are appended to LD_LIBRARY_PATH:
./.
profiling: libsslcshar.so
determining list of Solaris libraries ...
checking binary objects for unstable practices ...
checking: libsslcshar.so
performing miscellaneous checks ...
------------------------------------------------------------------------
Summary: No binary stability problems detected.
A total of 1 binary objects were examined.
The following (1 of 1) components had no problems detected:
libsslcshar.so
Additional output regarding private symbols usage and other
data is in the directory:
/tmp/appcert.3001
see the appcert documentation for more information.
Since appcert couldn't find any potential issues that may break the application wildly on Solaris 10, I decided to install the application on Solaris 10 in a proper way, so it could handle some of our user traffic. As enhancements to Solaris 10 are available in the form of Solaris Express Nevada builds, I chose Solaris Express 3/05 (the latest one at the time) as my base OS. One noticeable thing is that uname -r returned 5.10.1, instead of 5.10 or 5.11.
Siebel 7.7 installation on Solaris 5.10.1
The installation failed during the system requirements check itself. Though the installer didn't show any specific error message, with a little effort I found that the installer didn't like the extra dot (.) in the version of the OS. It expects something like 5.x, but not 5.x.y. Since the installer refused to make any progress, I had no choice but to install it on a machine running Solaris 8 or 9, with the configuration that I was going to use on my Solaris 10.1 box, and then to complete the installation on Solaris 10.1 with the help of the Siebel tools install_gateway, install_server and install_eappweb. The complete steps are as follows:
- Generate the configuration response files
  Install Siebel 7.7 Gateway, Enterprise Server and Web Server Extension on Solaris 8 or 9. During the installation, enter all the configuration options that you want to use on the Solaris 10.1 machine
- Move the Gateway, Enterprise and Web server extension installation directories to the Solaris 10.1 machine, and make sure the installation directory, host name, web server directory etc. are pointing to the right ones
- Finally, finish the installation on Solaris 10.1 with the following commands:
  - Gateway server
    cd gtwysrvr/install_script/install
    ./install_gateway -S -l enu -r <target-directory>
  - Siebel server
    cd siebsrvr/install_script/install
    ./install_server -S -l enu -o "enu" -r <target-directory>
    cd siebsrvr/bin
    ./siebelmwsslsetting.ksh <target-directory>
    ./apache_3rdparty_link.ksh <target-directory>
  - Web server extension (SWE)
    cd <installation-directory>/install_script/install
    ./install_eappweb -S -l enu -L "enu" -r <target-directory>
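The installer's behavior described above (5.x accepted, 5.x.y rejected) is what a strict two-component version parse would produce. The installer's real check is not known to me, so the following is only a minimal C sketch of such a check; the function name is_two_part_release is hypothetical and is not part of any Siebel tool.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical check (an assumption, not the installer's code):
 * accept only "<digits>.<digits>", e.g. "5.10", and reject anything
 * with more or fewer components, e.g. "5.10.1" or "5".
 * Returns 1 if the string has exactly two numeric components, else 0.
 */
int is_two_part_release(const char *release)
{
    const char *p = release;
    int parts = 0;

    if (p == NULL)
        return 0;

    for (;;) {
        /* each component must start with at least one digit */
        if (!isdigit((unsigned char)*p))
            return 0;
        while (isdigit((unsigned char)*p))
            p++;
        parts++;

        if (*p == '\0')
            break;          /* end of string right after a component */
        if (*p != '.')
            return 0;       /* unexpected character */
        p++;                /* skip the dot and expect another component */
    }
    return parts == 2;
}
```

With a check like this, "5.10" and "5.11" pass while "5.10.1" fails, which matches what I saw from the installer.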
A simpler alternative is to make the installer see a different OS release with the help of dtrace. Since the installer gets the OS version by calling uname -r, I need to run a small dtrace script that has entry and return probes for the uname system call.
static int uname(struct utsname *); is the signature of uname(), and utsname has the following structure:
struct utsname {
char sysname[_SYS_NMLN];
char nodename[_SYS_NMLN];
char release[_SYS_NMLN];
char version[_SYS_NMLN];
char machine[_SYS_NMLN];
};
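As a minimal sketch of how a program can read the release field of this structure (the same field that uname -r prints), something like the following C function could be used. The function name get_os_release is my own, made up for illustration; only uname() and struct utsname come from the system header.

```c
#include <string.h>
#include <sys/utsname.h>

/*
 * Copy the kernel's release string (what `uname -r` prints) into buf.
 * get_os_release is a hypothetical helper name used for illustration.
 * Returns 0 on success, -1 on failure.
 */
int get_os_release(char *buf, size_t buflen)
{
    struct utsname u;

    if (buf == NULL || buflen == 0)
        return -1;
    if (uname(&u) < 0)      /* POSIX: a negative return means failure */
        return -1;

    /* release is a NUL-terminated character array inside struct utsname */
    strncpy(buf, u.release, buflen - 1);
    buf[buflen - 1] = '\0';
    return 0;
}
```

Any program that asks for the OS version this way goes through the uname system call, which is why intercepting that one call is enough to change what the installer sees.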
The utsname structure and the uname() declaration are in the header file utsname.h.
The following little DTrace script is helpful in interposing the system release information on the fly:
% cat sysrelease.d
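#!/usr/sbin/dtrace -Cws
/* -C: run cpp for the #include; -w: allow destructive actions */
#include <sys/utsname.h>
syscall::uname:entry
{
    /* remember the caller's utsname buffer; thread-local, so it survives until the return probe */
    self->in = (struct utsname *)arg0;
}
syscall::uname:return
/self->in/
{
    /* overwrite the release field with the first script argument */
    copyoutstr($$1, (uintptr_t)&self->in->release[0], SYS_NMLN);
    self->in = 0;
}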
This script accepts one argument, the version string of our choice; as long as the script is running, the OS returns our input as the output of uname -r. Once the script exits, the OS returns the true version. Since this approach is simple, I decided to feed 5.11 to this script and finish the installation.
% uname -a
SunOS sunfire4 5.10.1 snv_10 sun4u sparc SUNW,Sun-Fire-1280R
% ./sysrelease.d 5.11
dtrace: script './sysrelease.d' matched 2 probes
dtrace: allowing destructive actions
.. script waits here doing nothing ..
In another window:
% uname -a
SunOS sunfire4 5.11 snv_10 sun4u sparc SUNW,Sun-Fire-1280R
Complete the Siebel installation. Once it is done, go back to the
previous window, and stop the script by pressing Ctrl-C
^C
% uname -a
SunOS sunfire4 5.10.1 snv_10 sun4u sparc SUNW,Sun-Fire-1280R
The installation was successful. Then I spent some time testing one of the Siebel applications manually, and then configured the Siebel Enterprise server to handle a load of 500 users, with the confidence that even if something failed blatantly, there wouldn't be any severe data loss.
So it was in production, and everybody was happy until the server choked when there was a significant increase in the number of concurrent requests from online users. Users who were already connected to the application could continue with their work, but new users who wanted to access applications like Financial Services and eChannel couldn't get in, and consistently got a Server busy message from the Siebel enterprise server. It appeared to be an issue with the OS, as the resource utilization on the machine was very low, yet the server was not able to handle the new requests. It was easy to speculate that the problem lay somewhere in the OS configuration, because the same Siebel binary with a similar configuration works well on Solaris 9. Siebel enterprise logged hundreds of messages with the error:
SBL-SCB-00011: Failed to connect to pipe (SEBL_0_10340) on process 10340
i.e., the Siebel Connection Broker (SCBroker) component was not able to establish a connection with one of the object managers, the one with pid 10340. Due to a design flaw, instead of giving up and sending the requests to the other idle object managers, SCBroker keeps sending new requests to the misbehaving object manager. This behavior is due to the built-in round-robin load balancing mechanism of SCBroker.
Some of the output from truss tracing follows; from the connect() lines in it, it appears that the problem has something to do with AF_UNIX sockets:
10334/1: accept(17, 0xFFBFECEC, 0xFFBFECFC, SOV_DEFAULT) = 20
10334/1: AF_INET name = 192.20.125.41 port = 52624
10334/1: mprotect(0xF6900000, 40960, 0x0007) = 0
10334/1: mprotect(0xF6900000, 40960, 0x0001) = 0
10334/1: pollsys(0xFFBFC460, 1, 0xFFBFE4C8, 0x00000000) = 1
10334/1: fd=20 ev=POLLRDNORM rev=POLLRDNORM
10334/1: timeout: 6.000000000 sec
10334/1: recv(20, 0xFFBFEB80, 256, 2) = 256
10334/1: P O S T h t t p : / / s u n f i r e 4 : 2 3 2 1 / s i
10334/1: e b e l / f i n s o b j m g r _ e n u / r r H T T P / 1 . 1\r
10334/1: \n H o s t : s u n f i r e 4\r\n C o n t e n t - T y p
10334/1: e : A p p l i c a t i o n / o c t e t - s t r e a m\r\n C o n
10334/1: t e n t - L e n g t h : 1 0 4\r\n X - S i e b e l - D i g e s
10334/1: t : b m + A B b R s P D N H t D T C x l j K U M v M y Z E =\r
10334/1: \n\r\n\0\0\0 d\0\0\0\0\0\0\001\0\0\0\0\0\0\0\0\0\0\001\0\0\0\0\0
10334/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\001\0\0\0\f\0\0\0 ,\0
10334/1: so_socket(PF_UNIX, SOCK_STREAM, 0, 0x00000000, SOV_DEFAULT) = 21
10334/1: 0x00000000: ""
10334/1: connect(21, 0xFFBFE0E8, 110, SOV_DEFAULT) Err#146 ECONNREFUSED
10334/1: AF_UNIX name = /export/home/giri/18104/siebsrvr/temp/SEBL_0_10340
10334/1: close(21) = 0
10334/1: time() = 1109897588
10334/1: write(19, 0x0057758C, 160) = 160
10334/1: G e n e r i c L o g\t G e n e r i c E r r o r\t 1\t 0\t 2 0 0 5
10334/1: - 0 3 - 0 3 1 7 : 5 3 : 0 8\t ( s c b c o m p . c p p ( 8 2
10334/1: 2 ) e r r = 7 1 0 0 0 1 1 s y s = 0 ) S B L - S C B - 0 0
10334/1: 0 1 1 : F a i l e d t o c o n n e c t t o p i p e (
10334/1: S E B L _ 0 _ 1 0 3 4 0 ) o n p r o c e s s 1 0 3 4 0 .\n
10334/1: time() = 1109897588
10334/1: write(19, 0x0057758C, 160) = 160
10334/1: G e n e r i c L o g\t G e n e r i c E r r o r\t 1\t 0\t 2 0 0 5
10334/1: - 0 3 - 0 3 1 7 : 5 3 : 0 8\t ( s c b c o m p . c p p ( 4 1
10334/1: 6 ) e r r = 7 1 0 0 0 1 1 s y s = 0 ) S B L - S C B - 0 0
10334/1: 0 1 1 : F a i l e d t o c o n n e c t t o p i p e (
10334/1: S E B L _ 0 _ 1 0 3 4 0 ) o n p r o c e s s 1 0 3 4 0 .\n
10334/1: ioctl(20, 0x8004667E, 0xFFBFE5B0) = 0
write 4 bytes
10334/1: setsockopt(20, SOL_SOCKET, SO_KEEPALIVE, 0xFFBFE5AC, 4, SOV_DEFAULT) = 0
10334/1: getpeername(20, 0xFFBFE54C, 0xFFBFE5B4, SOV_DEFAULT) = 0
10334/1: AF_INET name = 192.20.125.41 port = 52624
10334/1: send(20, 0x0057B650, 172, 0) = 172
10334/1: \0\0\0A8\0\0\0\0\0\0\001\0\0\001\0\0\001\0\0\003\0\0\0\0\0\0\0\0
10334/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\003\0\0\0\f\0\0\0 p\0\0\001
10334/1: \0\0\001\0\0\001\0\00406\0\0\0 X\0 f\0 a\0 i\0 l\0 e\0 d\0 \0 t
10334/1: \0 o\0 \0 t\0 r\0 a\0 n\0 s\0 f\0 e\0 r\0 \0 c\0 o\0 n\0 n\0 e
10334/1: \0 c\0 t\0 i\0 o\0 n\0 \0 t\0 o\0 \0 \0 c\0 o\0 m\0 p\0 o\0 n
10334/1: \0 e\0 n\0 t\0\0\0 l V k
10334/1: shutdown(20, 2, SOV_DEFAULT) = 0
10334/1: close(20) = 0
10334/1: time() = 1109897588
10334/1: write(19, 0x0057758C, 160) = 160
10334/1: G e n e r i c L o g\t G e n e r i c E r r o r\t 1\t 0\t 2 0 0 5
10334/1: - 0 3 - 0 3 1 7 : 5 3 : 0 8\t ( s c b c o m p . c p p ( 2 3
10334/1: 5 ) e r r = 7 1 0 0 0 1 1 s y s = 0 ) S B L - S C B - 0 0
10334/1: 0 1 1 : F a i l e d t o c o n n e c t t o p i p e (
10334/1: S E B L _ 0 _ 1 0 3 4 0 ) o n p r o c e s s 1 0 3 4 0 .\n
10334/1: pollsys(0xFFBFCC08, 1, 0xFFBFEC70, 0x00000000) = 1
10334/1: fd=17 ev=POLLRDNORM rev=POLLRDNORM
10334/1: timeout: 5.000000000 sec
10334/1: accept(17, 0xFFBFECEC, 0xFFBFECFC, SOV_DEFAULT) = 20
10334/1: AF_INET name = 192.20.125.41 port = 52625
The lines in question:
connect(21, 0xFFBFE0E8, 110, SOV_DEFAULT) Err#146 ECONNREFUSED
AF_UNIX name = /export/home/giri/18104/siebsrvr/temp/SEBL_0_10340
So, we opened a case with Sun, and it turned out that a race condition was preventing correct handling of a close() that happens in parallel with accept() (the Siebel client does a series of connect(); write(); close()); to some extent, the low maximum backlog (32) supported by the TL driver of Solaris also contributed. The following bug was logged against the TL driver of Solaris, and the fix was integrated in Nevada build 14 (snv_14):
6249138 Race between accept() and eager close may confuse AF_UNIX socket
Thanks to the OpenSolaris project, we can now see the fixed code at: tl.c
4352289 TL_MAXQLEN needs to be higher has some information on the maximum backlog of the TL (local transport) driver.
Once the system was patched with the fixed tl driver, everything seemed normal, and I didn't encounter any further issues.
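To make the connect(); write(); close() pattern concrete, here is a minimal C sketch of a client that connects to an AF_UNIX listener, writes, and closes before the server ever calls accept(). It is not Siebel's code and it does not reproduce the Solaris bug; the function name eager_close_demo, the socket path and the backlog value are all assumptions picked for illustration, and I'm assuming a system where a pending connection can be queued and written to before accept().

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill in a sockaddr_un for the given filesystem path. */
static int make_addr(struct sockaddr_un *addr, const char *path)
{
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/*
 * Listen on `path`, let a client connect(), write() and close() before the
 * server calls accept(), then accept and read what the client sent.
 * Returns the number of bytes read into buf, or -1 on any error.
 */
ssize_t eager_close_demo(const char *path, int backlog,
                         const char *msg, char *buf, size_t buflen)
{
    struct sockaddr_un addr;
    int lfd = -1, cfd = -1, afd = -1;
    ssize_t n = -1;

    if (make_addr(&addr, path) < 0)
        return -1;
    unlink(path);                       /* remove a stale socket file */

    lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd < 0)
        goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (listen(lfd, backlog) < 0)       /* pending connections queue here */
        goto out;

    /* client side: connect(); write(); close() */
    cfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (cfd < 0)
        goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (write(cfd, msg, strlen(msg)) < 0)
        goto out;
    close(cfd);
    cfd = -1;

    /* server side: accept() runs only after the client has already closed */
    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto out;
    n = read(afd, buf, buflen);

out:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    unlink(path);
    return n;
}
```

The backlog argument of listen() is the queue whose maximum length the TL driver limited; when many clients do this at once, the size of that queue and the handling of connections closed before accept() both come into play.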
Performance
After all this effort, you may ask: is it worth installing enterprise applications on Solaris 10? The simple answer is: yes, it is. One of the major concerns is the performance of the application, and there is a noticeable improvement of ~4% in CPU utilization (just by running it) on Solaris 10, compared to Solaris 9 (keeping the configuration of the machines and the application the same, except for the OS).
Under very high loads on the system, nearly 20% of the CPU time was being spent merely handling TLB/TSB misses, rather than doing some useful work. With the advent of multiple page size support (MPSS) for data, this waste was reduced to 10% by using 4M pages on Solaris 9 & 10. There is still more potential for saving CPU cycles by reducing the iTLB miss rate. And the good news is that large page support for executables, libraries, and files (in short: MPSS for instructions) has been introduced in Solaris Express Nevada build 15 (06/2005).
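A rough way to see why larger pages cut down on TLB misses is to count how many pages (and so how many translations) are needed to map the same amount of memory. The following is just back-of-the-envelope arithmetic in C; the 1 GB region in the usage note is a hypothetical figure, not a measurement from my Siebel setup, and 8 KB is the base page size I'm assuming for the UltraSPARC machine.

```c
#include <stdint.h>

/*
 * Number of pages of size page_bytes needed to map region_bytes,
 * rounded up. Returns 0 for an invalid (zero) page size.
 */
uint64_t pages_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    if (page_bytes == 0)
        return 0;
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

For a hypothetical 1 GB region, pages_needed(1ULL << 30, 8192) is 131072, while pages_needed(1ULL << 30, 4ULL << 20) is 256. Far fewer translations have to fit in the TLB with 4M pages, so far fewer of them miss.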
Acknowledgements:
Horace Lee, Alexander Kolbasov & Chris Gerhard