Recently at a partner (ISV, short for Independent Software Vendor) site, I was asked to look into an unexpected {huge} performance improvement {compared to Solaris 9} in running their application on an UltraSPARC-based server running Solaris 10. The application had been compiled/built on Solaris 8. Although Solaris 10 has many improvements, the mileage customer applications get varies with the nature of the application being run. The application under discussion is a userland, CPU-intensive financial application, written in C++.
Unfortunately, I didn't have much information about how the application was compiled, or on which platform (hardware & software). So I set up one of the Sun Fire V480s as a dual-boot server, with Solaris 9 and Solaris 10 installed on two partitions. I installed the ISV's application on a third partition, and did a quick test by loading up the machine with virtual users until the average CPU consumption was about 85%. Interestingly, I didn't see the phenomenal improvement the ISV observed, but a decent enough gain (~2.88%) that led me to investigate further.
Here's the trapstat output (the per-interval "ttl" summary lines) from the Solaris 9 environment:
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl | 1339705 7.7 12031 0.5 | 869027 6.3 86899 4.7 |19.2
ttl | 1371385 7.9 12165 0.5 | 931897 6.8 93874 5.1 |20.3
ttl | 1261136 7.2 11227 0.5 | 862982 6.3 86420 4.7 |18.7
ttl | 1334286 7.7 12201 0.5 | 871144 6.4 90464 4.9 |19.4
ttl | 1423610 8.2 14101 0.6 | 957773 7.1 105544 5.7 |21.6
ttl | 1399334 8.1 14120 0.6 | 973754 7.2 110116 6.0 |21.9
ttl | 1478324 8.5 13310 0.6 | 975822 7.2 104689 5.7 |21.9
ttl | 1416840 8.1 12698 0.5 | 962725 7.1 98593 5.3 |21.0
ttl | 1464161 8.4 13149 0.6 | 974467 7.2 105842 5.8 |21.9
ttl | 1412006 8.2 13685 0.6 | 915772 6.9 107461 5.9 |21.5
Average time spent in virtual to physical memory translations: 20.74%
And here's the trapstat output from the Solaris 10 environment:
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl | 1449113 8.7 5045 0.2 | 1015584 7.5 35621 1.7 |18.1
ttl | 1504522 9.0 5771 0.3 | 1056137 7.9 39809 1.9 |19.0
ttl | 1372013 8.2 4824 0.2 | 965968 7.2 33577 1.6 |17.2
ttl | 1366566 8.2 5194 0.2 | 988130 7.3 34719 1.6 |17.3
ttl | 1433062 8.6 5170 0.2 | 1006544 7.4 34607 1.6 |17.9
ttl | 1463364 8.8 5403 0.2 | 1023112 7.6 37313 1.7 |18.3
ttl | 1356094 8.1 4904 0.2 | 979501 7.3 34212 1.6 |17.2
ttl | 1497592 9.0 5816 0.3 | 1060080 7.9 39844 1.9 |19.0
ttl | 1468445 8.8 6166 0.3 | 1079617 8.1 42968 2.0 |19.2
ttl | 1505277 9.0 5737 0.3 | 1062101 7.8 39025 1.8 |18.9
Average time spent in virtual to physical memory translations: 18.21%
In the Solaris 9 environment, the OS spent 2.53% more CPU cycles {compared to Solaris 10} serving TLB/TSB misses. A closer look at the trapstat outputs reveals that Solaris 10 carries much less of the burden of serving TSB misses {for data}: there's about a 3.64% difference between the Solaris 9 and Solaris 10 dTSB-miss %tim. On the other hand, Solaris 10 spent ~0.75% more cycles serving dTLB misses than Solaris 9, which leaves us 2.89% (i.e., 3.64% - 0.75%). Strikingly, this number (2.89%) matches the gain (2.88%) I observed by running the same application on both Solaris 9 & 10.
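Just to spell out the arithmetic, here's a throwaway snippet that recomputes those deltas from the dtlb-miss/dtsb-miss %tim columns transcribed from the two trapstat tables above:

#include <stdio.h>

/* dtlb-miss and dtsb-miss %tim samples, copied from the tables above */
static double s9_dtlb[]  = {6.3, 6.8, 6.3, 6.4, 7.1, 7.2, 7.2, 7.1, 7.2, 6.9};
static double s9_dtsb[]  = {4.7, 5.1, 4.7, 4.9, 5.7, 6.0, 5.7, 5.3, 5.8, 5.9};
static double s10_dtlb[] = {7.5, 7.9, 7.2, 7.3, 7.4, 7.6, 7.3, 7.9, 8.1, 7.8};
static double s10_dtsb[] = {1.7, 1.9, 1.6, 1.6, 1.6, 1.7, 1.6, 1.9, 2.0, 1.8};

static double avg(const double *v, int n) {
        double sum = 0.0;
        int i;
        for (i = 0; i < n; i++)
                sum += v[i];
        return (sum / n);
}

int main(void) {
        double dtsb_win  = avg(s9_dtsb, 10) - avg(s10_dtsb, 10);   /* ~3.64 */
        double dtlb_loss = avg(s10_dtlb, 10) - avg(s9_dtlb, 10);   /* ~0.75 */
        printf("dTSB win: %.2f%%  dTLB loss: %.2f%%  net: %.2f%%\n",
            dtsb_win, dtlb_loss, dtsb_win - dtlb_loss);            /* ~2.89 */
        return (0);
}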
Solaris 10's dynamic TSB support
Since I ran the same application on the same hardware with different versions of Solaris, I can directly attribute the improvement in performance to Solaris 10. It is a no-brainer to conclude that this is the result of the algorithmic changes behind dynamic TSB support in Solaris 10.
On Solaris 9 and prior versions, the system allocates a fixed number of TSBs, each of size 128KB or 512KB, at boot time, depending on the physical memory installed in the machine; and since the number is fixed, all processes have to share those TSBs. Because only two TSB sizes are supported, any process that really needs a TSB of some size in between, say 256KB, either suffers extra misses (e.g., if its translations land in a 128KB TSB) or wastes memory (e.g., if they land in a 512KB TSB).
In other words, prior to version 10, Solaris lacked the flexibility to use the right TSB size for the right process. Recent UltraSPARC chips can support TSBs of eight different sizes (8K, 16K, 32K, 64K, 128K, 256K, 512K and 1024K, i.e., 1M). By sticking to only 128K and 512K TSBs, Solaris 9 and prior versions couldn't take advantage of this hardware capability very efficiently.
Solaris 10 overcomes the drawbacks mentioned above by creating a TSB on the fly, sized to the needs of each process. Here's the corresponding RFE that fixed the issues seen up through Solaris 9:
Integrate support for Dynamic TSBs.
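To illustrate the idea, here's a conceptual sketch, my own illustration and not the actual sfmmu code in the kernel, of what it means to pick a best-fit TSB from the full 8K-1M ladder instead of being stuck with 128K or 512K (the 16-byte entry size is an assumption here):

#include <stdio.h>

#define TSB_MIN_BYTES   (8UL * 1024)      /* smallest supported TSB: 8K */
#define TSB_MAX_BYTES   (1024UL * 1024)   /* largest supported TSB: 1M  */
#define TSB_ENTRY_BYTES 16UL              /* bytes per TSB entry (assumed) */

/*
 * Return the smallest supported TSB size (8K, 16K, ..., 512K, 1M) that
 * can hold the given number of translations. Conceptual only -- the
 * real kernel grows a process's TSB dynamically as its needs change.
 */
static unsigned long pick_tsb_size(unsigned long translations) {
        unsigned long bytes = translations * TSB_ENTRY_BYTES;
        unsigned long tsb;
        for (tsb = TSB_MIN_BYTES; tsb < TSB_MAX_BYTES; tsb <<= 1) {
                if (bytes <= tsb)
                        break;
        }
        return (tsb);
}

int main(void) {
        /* ~12000 translations fit in a 256K TSB -- no 512K waste,
           and no extra misses from squeezing into 128K */
        printf("%lu\n", pick_tsb_size(12000));
        return (0);
}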
Now it makes more sense to mention the 3% reduction in memory footprint per user that I saw in my test runs.
To wrap up the original story of the huge performance difference between Solaris 9 & 10, I gave the ISV a checklist to make sure they were doing an apples-to-apples comparison; but I never heard back from them. Anyway, here's the checklist that I sent:
- On which hardware (UltraSPARC II/III/IV/...) was the application built?
This is extremely important to know, because building on US II and running the binary on later processors (US III, III+, ...) can have a significant impact on the overall performance of the application. For example, I have seen nearly a 4% difference in CPU utilization with an application that was built on US II and run on a US III+ machine under a similar workload. With Sun's compilers, the target is controlled by the -xtarget/-xarch flags; file(1) or dump(1) will show what a binary was built for (see the first sketch after this list).
- Is it the same binary that was run on both Solaris 9 & 10?
- Check for difference(s) in the run-time environments of the two experiments. Were the tests conducted on the same kind of hardware? With the same number of processors? Under the same load? etc.
- Make sure to use the same {application & OS} tunables in both experiments
- Which version (build) of Solaris 10 is being used to test the application?
The later builds (also called Solaris Express builds) enable large pages for data and instructions by default, if the OS decides doing so is beneficial. Large pages for data (MPSS, Multiple Page Size Support) were already introduced in Solaris 9; Solaris 11 extends this to instructions. So, if Solaris Express bits are being used (less likely, but there's a possibility), there is almost a 13 to 15% improvement (based on the trapstat data shown above) in CPU utilization with no effort from the users.
- Make sure large pages are enabled on both platforms (S9 & S10), or disabled on both (see the second sketch after this list for one way to check and request large pages from userland)
- If the application binary is not the same, check for changes in the application that could improve performance significantly. Also check for changes in compiler flags
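For the first item on that list, here's a small libelf sketch of my own (the EF_SPARC_* names are from <sys/elf_SPARC.h>) that prints the UltraSPARC extension hints recorded in a SPARC binary's ELF header; link it with -lelf:

#include <fcntl.h>
#include <gelf.h>
#include <stdio.h>
#include <sys/elf_SPARC.h>
#include <unistd.h>

/* usage: ./elfcheck <binary> */
int main(int argc, char **argv) {
        Elf *elf;
        GElf_Ehdr ehdr;
        int fd;

        if (argc != 2 || elf_version(EV_CURRENT) == EV_NONE)
                return (1);
        if ((fd = open(argv[1], O_RDONLY)) == -1)
                return (1);
        if ((elf = elf_begin(fd, ELF_C_READ, NULL)) != NULL) {
                if (gelf_getehdr(elf, &ehdr) != NULL) {
                        printf("machine: %s\n",
                            ehdr.e_machine == EM_SPARCV9 ?
                            "SPARC V9 (64-bit)" :
                            ehdr.e_machine == EM_SPARC32PLUS ?
                            "SPARC V8+" : "other");
                        if (ehdr.e_flags & EF_SPARC_SUN_US1)
                                printf("UltraSPARC I extensions used\n");
                        if (ehdr.e_flags & EF_SPARC_SUN_US3)
                                printf("UltraSPARC III extensions used\n");
                }
                (void) elf_end(elf);
        }
        (void) close(fd);
        return (0);
}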
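And for the large-page items, a minimal userland sketch using the documented getpagesizes(3C) and memcntl(2) MC_HAT_ADVISE interfaces; advising the largest supported size for the heap is just an illustrative choice, not a recommendation:

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
        int i, n;
        size_t *sizes;
        struct memcntl_mha mha;

        /* list the page sizes this platform/OS combination supports */
        n = getpagesizes(NULL, 0);
        sizes = malloc(n * sizeof (size_t));
        (void) getpagesizes(sizes, n);
        for (i = 0; i < n; i++)
                printf("supported page size: %lu\n",
                    (unsigned long)sizes[i]);

        /* advise the largest size for the heap; sizes[] is assumed
           here to be sorted smallest to largest */
        mha.mha_cmd = MHA_MAPSIZE_BSSBRK;
        mha.mha_flags = 0;
        mha.mha_pagesize = sizes[n - 1];
        /* addr and len must be NULL/0 for MHA_MAPSIZE_BSSBRK */
        if (memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0) == -1)
                perror("memcntl(MC_HAT_ADVISE)");

        free(sizes);
        return (0);
}

pmap -s <pid> shows which page sizes a running process actually got; ppgsz(1) and the mpss.so.1 preload library offer similar knobs without any code changes.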
________________
Technorati tag:
Solaris