The following material is a mix of information on binary compatibility, the appcert tool and so on, and illustrates how Siebel 7.7 can be installed and run on Solaris 10, with some troubleshooting steps. Consider this blog post a case study for installing Siebel 7.7 on Solaris 10; hence most of the material talks about the Siebel 7.7/Solaris 10 combination, as we walk through everything from the decision making to the installation of the tl driver patch that lets the combination handle a high user load. However, the underlying methodology of migrating an application to Solaris 10 will be the same for almost all enterprise applications.
In general, we can classify customers into:
- those who want the latest copy of the application, but do not want to upgrade their OS to a new version
- those who do not want to upgrade the application, but want to upgrade their OS to a new version
- those who do not want to upgrade either the application or the OS version
- those who want to upgrade both the application and the OS version
I am going to narrate the whole story of the Siebel 7.7/Solaris 10 migration by imagining myself as a Siebel/Solaris customer with Siebel 7.7 running on Solaris 9 (case 2 in the above classification).
Decision Making: Worth considering Solaris 10?
Let's start with some facts:
- Solaris 10 is FREE, even for commercial use
  - Hoooray! Way to go, Sun! =)
- Solaris 10 is a faster OS with tons of new features
  - Heard a lot about it at blogs.sun.com; and I'm aware of the 2 million downloads too. Really looking forward to using DTrace and (virtual) zones on my Sun Fire v1280
- Solaris 10 is not a supported platform for Siebel 7.7
  - I know that only Solaris 8 & 9 are the supported versions; and I'm pretty sure that I will not get any support for issues on Solaris 10. But I have confidence in Sun's promise on ABI and compatibility (see item 6)
- Siebel Enterprise server takes care of the application logic, and all the application-specific data is actually stored in a database
  - This is the most important thing, driving me to make this bold attempt. Because of this, I know that I'm not going to lose any data; and in the worst case, I may not be able to run Siebel 7.7 on Solaris 10. That's fine with me
- Siebel 8.0 will be available on Solaris 10 (or 10U1)
  - Nice; Win-Win-Win situation for Sun-Siebel-Customer. But I can't wait until Siebel 8.0 is available
- Sun always boasts about stable ABI (Application Binary Interface) & binary compatibility
  - Hope Sun and Siebel won't let me down by introducing compatibility issues
Alright! Now I know for sure that Siebel 7.7 is not certified on Solaris 10. But before I blindly install it and run into trouble, I want to do some preliminary checks to see whether I can run Siebel 7.7 on Solaris 10 at all. Item 6 from the above list gives us some confidence about running binaries compiled for lower OS versions on higher OS versions. But what exactly does it mean?
Binary compatibility
Binary compatibility of an OS is the ability to run applications that were built for one version of the OS on later versions of the OS, without having to change or rebuild the application. The same application may not run on earlier versions of the operating system.
Because of Solaris' binary compatibility (ABI stability), applications that were compiled on previous versions of Solaris continue to run on later versions of the OS, unless the application steps outside the rules and uses non-standard or internal interfaces of the OS. All the non-standard, internal stuff is bound to break at any time; that's the reason why there is so much insistence on using standard interfaces. If an application continues to work on later versions of the OS despite using non-standard interfaces, it is mere luck, and there is no guarantee that it is going to work in future releases of the OS.
Okay, okay! I'm convinced about Sun's (Solaris) binary compatibility; but I'm not sure about Siebel's compliance with standard interfaces. Solaris 8 and later versions ship a tool called appcert to examine an application's use of unstable Solaris interfaces.
appcert
From the man page of appcert, appcert checks for:
- Private symbol usage in Solaris libraries
  - These are private symbols, that is, functions or data, that are not intended for developer consumption
- Static linking
  - In particular, this refers to static linking of archives libc.a, libsocket.a, and libnsl.a. Because the semantics of private symbol calls from one Solaris library to another can change from one release to another, it is not a good practice to hardwire library code into your binary objects
- Unbound symbols
  - These are library symbols (that is, functions or data) that the dynamic linker could not resolve when appcert was run. This might be an environment problem (for example, LD_LIBRARY_PATH) or a build problem (for example, not specifying -llib and/or -z defs with compiling). They are flagged to point these problems out and in case a more serious problem is indicated
This kind of tool is exactly what I was looking for. So, I simply archived the whole Siebel installation on a Solaris 9 machine, and extracted it on a Solaris 10 machine, to look for compatibility issues with the appcert tool. Since appcert can examine all the executables and shared libraries of a product, I just specified the top-level directory of the Siebel installation as the argument to appcert. It took quite a while (nearly 2-3 hours) to examine all the application binaries, and it finally printed a detailed report. From the report, it appears that most of the warnings are about unbound symbols.
Sample appcert session:
% appcert libsslcshar.so
finding executables and shared libraries to check ...
Shared libraries were found in the application and the
following directories are appended to LD_LIBRARY_PATH:
./.
profiling: libsslcshar.so
determining list of Solaris libraries ...
checking binary objects for unstable practices ...
checking: libsslcshar.so
performing miscellaneous checks ...
------------------------------------------------------------------------
Summary: No binary stability problems detected.
A total of 1 binary objects were examined.
The following (1 of 1) components had no problems detected:
libsslcshar.so
Additional output regarding private symbols usage and other
data is in the directory:
/tmp/appcert.3001
see the appcert documentation for more information.
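Because a single run over the whole installation takes hours and produces one large report, it can help to check objects one at a time and note which ones come back clean. The following is only a minimal sketch for a POSIX-style shell, not something used in the original migration: the function names list_shared_objects and appcert_sweep are made up for illustration, only *.so files are considered, and it assumes that a clean run prints the "No binary stability problems detected" summary line seen in the session above.

```shell
# Hypothetical helper: list the shared objects below a directory.
list_shared_objects() {
    find "$1" -type f -name '*.so' | sort
}

# Hypothetical helper: run appcert on each shared object and print
# "clean" or "review" next to its path. ASSUMPTION: a clean run prints
# the summary line matched below; anything else is marked for review.
appcert_sweep() {
    if ! command -v appcert >/dev/null 2>&1; then
        echo "appcert not found in PATH" >&2
        return 1
    fi
    list_shared_objects "$1" | while IFS= read -r obj; do
        if appcert "$obj" 2>&1 |
            grep 'No binary stability problems detected' >/dev/null; then
            echo "clean  $obj"
        else
            echo "review $obj"
        fi
    done
}
```

Calling appcert_sweep with the top-level Siebel directory would then print one line per library; anything marked "review" still needs a look at the full appcert report.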
Since appcert couldn't find any potential issues that might break the application badly on Solaris 10, I decided to install the application on Solaris 10 properly, so it could handle some of our user traffic. As enhancements to Solaris 10 are available in the form of Solaris Express Nevada builds, I chose Solaris Express 3/05 (the latest one at the time) as my base OS. One noticeable thing is that uname -r returned 5.10.1, instead of 5.10 or 5.11.
Siebel 7.7 installation on Solaris 5.10.1
The installation failed during the system requirements check itself. Though the installer didn't show any specific error message, with a little effort I found that the installer didn't like the extra dot (.) in the version of the OS. It expects something like 5.x, but not 5.x.y. Since the installer refuses to make any progress, I have no other choice but to install it on a machine running Solaris 8 or 9, with the configuration that I'm going to use on my Solaris 10.1 box, and to complete the installation on Solaris 10.1 with the help of the Siebel tools install_gateway, install_server and install_eappweb. The complete steps are as follows:
- Generate the configuration response files
  Install Siebel 7.7 Gateway, Enterprise Server and Web Server Extension on Solaris 8 or 9. During the installation, enter all the configuration options that you want to use on the Solaris 10.1 machine
- Move the Gateway, Enterprise Server and Web Server Extension installation directories to the Solaris 10.1 machine, and make sure the installation directory, host name, web server directory etc. point to the right ones
- Finally, finish the installation on Solaris 10.1 with the following commands:
  - Gateway server
    cd gtwysrvr/install_script/install
    ./install_gateway -S -l enu -r <target-directory>
  - Siebel server
    cd siebsrvr/install_script/install
    ./install_server -S -l enu -o "enu" -r <target-directory>
    cd siebsrvr/bin
    ./siebelmwsslsetting.ksh <target-directory>
    ./apache_3rdparty_link.ksh <target-directory>
  - Web server extension (SWE)
    cd <installation-directory>/install_script/install
    ./install_eappweb -S -l enu -L "enu" -r <target-directory>
That's quite a bit of work, and prone to errors. I was reluctant to use this method right away, and saved it as a last resort. So, I looked around for a solution, and finally realized that I'm on Solaris 10 (10.1, actually) and have access to dtrace. Since the installer gets the OS version by calling uname -r, all I need is a small DTrace script with entry and return probes for the uname system call.
static int uname(struct utsname *); is the signature of uname(), and utsname has the following structure:
struct utsname {
char sysname[_SYS_NMLN];
char nodename[_SYS_NMLN];
char release[_SYS_NMLN];
char version[_SYS_NMLN];
char machine[_SYS_NMLN];
};
The utsname structure and the uname() declaration are in the header file sys/utsname.h.
The following little DTrace script is helpful for interposing on the system release information, on the fly:
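% cat sysrelease.d
#!/usr/sbin/dtrace -Cws
#include <sys/utsname.h>

/* Remember the user buffer passed to uname(2); thread-local, so it survives until the return probe. */
syscall::uname:entry
{
        self->in = (struct utsname *)arg0;
}

/* On return, overwrite the release field with the script's first argument. */
syscall::uname:return
/self->in/
{
        copyoutstr($$1, (uintptr_t)&self->in->release[0], SYS_NMLN);
        self->in = 0;
}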
This script accepts one argument, the version number of our choice; so as long as the script is running, the OS returns our input as the output of uname -r. Once the script exits, the OS returns the true version. Since this approach is simple, I decided to feed 5.11 to this script, and finish the installation.
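The installer's actual check is not visible, so the following is only a guess at the kind of test that trips over 5.10.1 but lets 5.11 through. It is a minimal shell sketch; the function name release_is_two_part and the patterns it applies are assumptions made for illustration, not Siebel code.

```shell
# Hypothetical check: accept releases of the form 5.x, reject 5.x.y.
# With no argument, the release reported by uname -r is tested.
release_is_two_part() {
    rel=${1:-$(uname -r)}
    case $rel in
        *.*.*)       return 1 ;;   # three or more components, e.g. 5.10.1
        5.*[!0-9]*)  return 1 ;;   # minor part is not purely numeric
        5.[0-9]*)    return 0 ;;   # e.g. 5.9, 5.10, 5.11
        *)           return 1 ;;   # anything else
    esac
}

# Try a few release strings against the check.
for r in 5.9 5.10 5.10.1 5.11; do
    if release_is_two_part "$r"; then
        echo "$r: accepted"
    else
        echo "$r: rejected"
    fi
done
```

Called without an argument while sysrelease.d is running, the function would see whatever value the script is feeding to uname -r, which is the effect the installer sees too.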
% uname -a
SunOS sunfire4 5.10.1 snv_10 sun4u sparc SUNW,Sun-Fire-1280R
% ./sysrelease.d 5.11
dtrace: script './sysrelease.d' matched 2 probes
dtrace: allowing destructive actions
.. script waits here doing nothing ..
In another window:
% uname -a
SunOS sunfire4 5.11 snv_10 sun4u sparc SUNW,Sun-Fire-1280R
Complete the Siebel installation. Once it is done, go back to the
previous window, and stop the script by pressing Ctrl-C
^C
% uname -a
SunOS sunfire4 5.10.1 snv_10 sun4u sparc SUNW,Sun-Fire-1280R
The installation was successful. Then I spent some time testing one of the Siebel applications manually, and then configured Siebel Enterprise server to handle a load of 500 users, with the confidence that even if something fails blatantly, there won't be any severe data loss.
So it was in production, and everybody was happy until the server choked when there was a significant increase in the number of concurrent requests from online users. Users who were already connected to the application were able to continue with their work; but new users who wanted to access applications like Financial Services and eChannel couldn't get in, and kept getting the Server busy message from the Siebel Enterprise server. It appeared to be an issue with the OS, as the resource utilization on the machine was very low, yet the server was not able to handle the new requests. It is easy to speculate that the problem lies somewhere in the OS or its configuration, because the same Siebel binaries with a similar configuration work well on Solaris 9. The Siebel Enterprise server logged hundreds of messages with the error:
SBL-SCB-00011: Failed to connect to pipe (SEBL_0_10340) on process 10340
i.e., the Siebel Connection Broker (SCBroker) component is not able to establish a connection with one of the object managers, the one with pid 10340. Due to a design flaw, instead of giving up and sending the requests to the other idle object managers, the SCBroker component keeps sending new requests to the misbehaving object manager. This behavior is due to the built-in round-robin load balancing mechanism of SCBroker. Some of the output from truss tracing:
10334/1: accept(17, 0xFFBFECEC, 0xFFBFECFC, SOV_DEFAULT) = 20
10334/1: AF_INET name = 192.20.125.41 port = 52624
10334/1: mprotect(0xF6900000, 40960, 0x0007) = 0
10334/1: mprotect(0xF6900000, 40960, 0x0001) = 0
10334/1: pollsys(0xFFBFC460, 1, 0xFFBFE4C8, 0x00000000) = 1
10334/1: fd=20 ev=POLLRDNORM rev=POLLRDNORM
10334/1: timeout: 6.000000000 sec
10334/1: recv(20, 0xFFBFEB80, 256, 2) = 256
10334/1: P O S T h t t p : / / s u n f i r e 4 : 2 3 2 1 / s i
10334/1: e b e l / f i n s o b j m g r _ e n u / r r H T T P / 1 . 1\r
10334/1: \n H o s t : s u n f i r e 4\r\n C o n t e n t - T y p
10334/1: e : A p p l i c a t i o n / o c t e t - s t r e a m\r\n C o n
10334/1: t e n t - L e n g t h : 1 0 4\r\n X - S i e b e l - D i g e s
10334/1: t : b m + A B b R s P D N H t D T C x l j K U M v M y Z E =\r
10334/1: \n\r\n\0\0\0 d\0\0\0\0\0\0\001\0\0\0\0\0\0\0\0\0\0\001\0\0\0\0\0
10334/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\001\0\0\0\f\0\0\0 ,\0
10334/1: so_socket(PF_UNIX, SOCK_STREAM, 0, 0x00000000, SOV_DEFAULT) = 21
10334/1: 0x00000000: ""
10334/1: connect(21, 0xFFBFE0E8, 110, SOV_DEFAULT) Err#146 ECONNREFUSED
10334/1: AF_UNIX name = /export/home/giri/18104/siebsrvr/temp/SEBL_0_10340
10334/1: close(21) = 0
10334/1: time() = 1109897588
10334/1: write(19, 0x0057758C, 160) = 160
10334/1: G e n e r i c L o g\t G e n e r i c E r r o r\t 1\t 0\t 2 0 0 5
10334/1: - 0 3 - 0 3 1 7 : 5 3 : 0 8\t ( s c b c o m p . c p p ( 8 2
10334/1: 2 ) e r r = 7 1 0 0 0 1 1 s y s = 0 ) S B L - S C B - 0 0
10334/1: 0 1 1 : F a i l e d t o c o n n e c t t o p i p e (
10334/1: S E B L _ 0 _ 1 0 3 4 0 ) o n p r o c e s s 1 0 3 4 0 .\n
10334/1: time() = 1109897588
10334/1: write(19, 0x0057758C, 160) = 160
10334/1: G e n e r i c L o g\t G e n e r i c E r r o r\t 1\t 0\t 2 0 0 5
10334/1: - 0 3 - 0 3 1 7 : 5 3 : 0 8\t ( s c b c o m p . c p p ( 4 1
10334/1: 6 ) e r r = 7 1 0 0 0 1 1 s y s = 0 ) S B L - S C B - 0 0
10334/1: 0 1 1 : F a i l e d t o c o n n e c t t o p i p e (
10334/1: S E B L _ 0 _ 1 0 3 4 0 ) o n p r o c e s s 1 0 3 4 0 .\n
10334/1: ioctl(20, 0x8004667E, 0xFFBFE5B0) = 0
write 4 bytes
10334/1: setsockopt(20, SOL_SOCKET, SO_KEEPALIVE, 0xFFBFE5AC, 4, SOV_DEFAULT) = 0
10334/1: getpeername(20, 0xFFBFE54C, 0xFFBFE5B4, SOV_DEFAULT) = 0
10334/1: AF_INET name = 192.20.125.41 port = 52624
10334/1: send(20, 0x0057B650, 172, 0) = 172
10334/1: \0\0\0A8\0\0\0\0\0\0\001\0\0\001\0\0\001\0\0\003\0\0\0\0\0\0\0\0
10334/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\003\0\0\0\f\0\0\0 p\0\0\001
10334/1: \0\0\001\0\0\001\0\00406\0\0\0 X\0 f\0 a\0 i\0 l\0 e\0 d\0 \0 t
10334/1: \0 o\0 \0 t\0 r\0 a\0 n\0 s\0 f\0 e\0 r\0 \0 c\0 o\0 n\0 n\0 e
10334/1: \0 c\0 t\0 i\0 o\0 n\0 \0 t\0 o\0 \0 \0 c\0 o\0 m\0 p\0 o\0 n
10334/1: \0 e\0 n\0 t\0\0\0 l V k
10334/1: shutdown(20, 2, SOV_DEFAULT) = 0
10334/1: close(20) = 0
10334/1: time() = 1109897588
10334/1: write(19, 0x0057758C, 160) = 160
10334/1: G e n e r i c L o g\t G e n e r i c E r r o r\t 1\t 0\t 2 0 0 5
10334/1: - 0 3 - 0 3 1 7 : 5 3 : 0 8\t ( s c b c o m p . c p p ( 2 3
10334/1: 5 ) e r r = 7 1 0 0 0 1 1 s y s = 0 ) S B L - S C B - 0 0
10334/1: 0 1 1 : F a i l e d t o c o n n e c t t o p i p e (
10334/1: S E B L _ 0 _ 1 0 3 4 0 ) o n p r o c e s s 1 0 3 4 0 .\n
10334/1: pollsys(0xFFBFCC08, 1, 0xFFBFEC70, 0x00000000) = 1
10334/1: fd=17 ev=POLLRDNORM rev=POLLRDNORM
10334/1: timeout: 5.000000000 sec
10334/1: accept(17, 0xFFBFECEC, 0xFFBFECFC, SOV_DEFAULT) = 20
10334/1: AF_INET name = 192.20.125.41 port = 52625
From the following lines, it appears to have something to do with AF_UNIX (UNIX domain) sockets:
connect(21, 0xFFBFE0E8, 110, SOV_DEFAULT) Err#146 ECONNREFUSED
AF_UNIX name = /export/home/giri/18104/siebsrvr/temp/SEBL_0_10340
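To see how many of these failures point at the same object manager, the SBL-SCB-00011 messages can be tallied by process ID. The sketch below is an illustration only: the function name count_pipe_failures is made up, and it assumes the log lines contain the message text exactly as quoted above. Where the Siebel log files live and how they are named is not covered here.

```shell
# Hypothetical helper: count SBL-SCB-00011 errors per target process ID.
# Reads log text from the named files (or from stdin when no file is
# given) and prints "count pid" lines, most frequent first.
count_pipe_failures() {
    grep 'SBL-SCB-00011: Failed to connect to pipe' "$@" |
        sed -n 's/.*on process \([0-9][0-9]*\).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

If nearly all of the failures name a single pid, as they did here with 10340, that supports the observation that the broker kept handing requests to one unreachable object manager.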
So, we opened a case with Sun, and it turned out that a race condition was preventing correct handling of a close() that happens in parallel with accept() (the Siebel client does a series of connect(); write(); close()). To some extent, the low maximum backlog (32) supported by the TL driver of Solaris contributed as well. The following bug was logged against the TL driver of Solaris, and the fix was integrated in Nevada build 14 (snv_14):
6249138 Race between accept() and eager close may confuse AF_UNIX socket
Thanks to the OpenSolaris project, we can now see the fixed code in tl.c.
4352289 TL_MAXQLEN needs to be higher
has some information on the maximum backlog of the TL (local transport) driver.
Once the system was patched with the fixed tl driver, everything seemed normal, and we didn't encounter any further issues.
Performance
After all this effort, you may ask: is it worth installing enterprise applications on Solaris 10? The simple answer is: yes, it is. One of the major concerns is the performance of the application, and there is a noticeable improvement of ~4% in CPU utilization (just by running it) on Solaris 10 compared to Solaris 9, keeping the configuration of the machines and the application the same except for the OS.
Under very high loads on the system, nearly 20% of the CPU time was being spent merely handling TLB/TSB misses rather than doing useful work. With the advent of multiple page size support for data, this wastage was reduced to 10% by using 4M pages on Solaris 9 & 10. There is still more potential for reducing the wasted CPU cycles, by reducing the iTLB miss rate. And the good news is that large page support for executables, libraries, and files (in short: MPSS for instructions) has been introduced in Solaris Express Nevada build 15 (06/2005).
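One way to request 4M pages for an application's heap and stack without rebuilding it is the mpss.so.1 preload library that ships with Solaris 9 and later. Whether the measurements above were taken this way is not stated, so treat the following as a hedged sketch; the wrapper name run_with_large_pages and the LARGE_PAGE_SIZE variable are made up for illustration.

```shell
# Hypothetical wrapper: start a command with mpss.so.1 preloaded so that
# its heap and stack are requested with a larger page size (4M unless
# LARGE_PAGE_SIZE says otherwise). MPSSHEAP and MPSSSTACK are the
# environment variables read by mpss.so.1; the page size is a request
# that the system may or may not be able to honor.
run_with_large_pages() {
    size=${LARGE_PAGE_SIZE:-4M}
    env LD_PRELOAD=mpss.so.1 MPSSHEAP="$size" MPSSSTACK="$size" "$@"
}
```

On Solaris, pmap -s on the resulting process shows the page sizes actually in use, and trapstat -T reports TLB/TSB miss time by page size, so a before/after comparison can be made with the same load.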
Acknowledgements:
Horace Lee, Alexander Kolbasov & Chris Gerhard