Oracle 10g OLTP performance on SPARC chips

A boring ratio
Customers would love to have their performance levels tied directly to their hardware. But more often than you might think, they migrate from System X (designed 10 years ago) to System Y (fresh from the oven) and are surprised by the performance improvements.
In the past two years, we have completed many successful migrations from F15k/E25k servers to the new M9000 Enterprise Servers. Customers have reported great improvements in throughput and response time. But what can you really expect, and what percentage of the improvement is actually due to operating system enhancements?
Can the recent small frequency increase on our SPARC64 VII chipset be at all interesting?
The new SPARC64 VII 2.88 GHz, available on our M8000 and M9000 flagships, brings no architectural change, no additional features and a modest frequency increase from 2.52 GHz to 2.88 GHz, for a ratio of 1.14. We could stop our analysis there and label this change ‘marginal’ or ‘not interesting’. But my initial tests showed a comparative OLTP peak throughput way higher than this frequency-based ratio.
What happened?

A passion for Solaris
Most long-term Sun employees have a passion for Solaris. Solaris is the uncontested Unix leader and includes such a huge number of features that, once you are a Solaris addict, it is difficult to fall in love with another operating system. And Oracle executives made no mistake: Sun has the best UNIX kernel & performance engineers in the world. Without them, Solaris would not scale today to a 512 hardware thread system (M9000-64).
But of course, Solaris is a moving target. Every release brings its truckload of features, bug fixes and other performance improvements. Here are the critical fixes made between Solaris 10 Update 4 and the brand new Solaris 10 Update 8 that influence Oracle performance on the M9000:
In Solaris 10 Update 5 (05/08), we optimized interrupt management (cr=5017144) and math operations (cr=6491717). We also streamlined CPU yield (cr=6495392) and the cache hierarchy (cr=6495401).
In Solaris 10 Update 6 (10/08), we optimized libraries and implemented shared context for Jupiter (cr=6655597 & 6642758).
In Solaris 10 Update 7 (05/09), we enhanced MPxIO as well as the PCI framework (cr=6449810 and others) and improved thread scheduling (cr=6647538). We also enhanced mutex operations (cr=6719447).
Finally, in Solaris 10 Update 8, after long customer escalations, we fixed the single-threaded nature of callout processing (cr=6565503-6311743). [This is critical for all calls made to nanosleep & usleep.] We also improved the throughput & latency of the very common e1000g driver (cr=6335837 + 5 more) and optimized the mpt driver (cr=6784459). We cleaned up interrupt management (cr=6799018) and optimized bcopy and kcopy operations (cr=6292199). Last, we improved some single-threaded operations (cr=6755069).
My initial SPARC64 VII iGenOLTP tests were done with Solaris 10 Update 4. But I could not test the new SPARC64 VII 2.88 GHz with this release because it was not supported! Therefore, I had to compare the new chip's performance to the SPARC64 VII 2.52 GHz running both S10U4 and S10U8. We will see below that most of the improvement comes not from the frequency increase but from Solaris itself.

Chips & Chassis
Here are the key characteristics of the chips we tested:
Chips | UltraSPARC IV+ | SPARC64 VI | SPARC64 VII | SPARC64 VII (+) |
Manufacturing | 90nm | 90nm | 65nm | 65nm |
Die size | 356 sq mm | 421 sq mm | 421 sq mm | 421 sq mm |
Transistors | 295 million | 540 million | 600 million | 600 million |
Cores | 2 | 2 | 4 | 4 |
Threads/core | 1 | 2 | 2 | 2 |
Total threads | 2 | 4 | 8 | 8 |
Frequency | 1.5 GHz | 2.28 GHz | 2.52 GHz | 2.88 GHz |
L1 I-cache | 64 KB | 128 KB/core | 512 KB | 512 KB |
L1 D-cache | 64 KB | 128 KB/core | 512 KB | 512 KB |
On-chip L2 | 2 MB | 6 MB | 6 MB | 6 MB |
Off-chip L3 | 32 MB | None | None | None |
Max Watts | 56 W | 120 W | 135 W | 140 W |
Watts/thread | 28 W | 30 W | 17 W | 17 W |
Note on (+): the new SPARC64 VII is not officially labeled with a plus sign, reflecting the absence of new features.
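The two derived rows of the chip table (total threads and watts per thread) follow directly from the cores, threads/core and max watts rows. Here is a minimal Python sketch of that arithmetic, using only the figures from the table; the helper name `derived_rows` is hypothetical and exists only for this illustration.

```python
# Chip data copied from the table above: (cores, threads per core, max watts).
CHIPS = {
    "UltraSPARC IV+": (2, 1, 56),
    "SPARC64 VI": (2, 2, 120),
    "SPARC64 VII": (4, 2, 135),
    "SPARC64 VII (+)": (4, 2, 140),
}


def derived_rows(cores, threads_per_core, max_watts):
    """Return (total hardware threads, watts per thread) for one chip."""
    total_threads = cores * threads_per_core
    return total_threads, max_watts / total_threads


for name, spec in CHIPS.items():
    threads, watts = derived_rows(*spec)
    print(f"{name:16s} threads={threads}  watts/thread={watts:.1f}")
```

Unrounded, the two SPARC64 VII variants come out at roughly 16.9 W and 17.5 W per thread, which the table reports as 17 W for both.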
Now, here is our hardware list. Note that to avoid the need for a huge client system, we ran this iGenOLTP workload in Console/Server mode: the Java processes sending SQL queries via JDBC run directly on the server under test. While this model was unusual ten years ago in the era of Client/Server, it is more and more commonly found today in new customer deployments.
Servers | E25k | M9000-32 | M9000-32 | M9000-32 |
Chip | UltraSPARC-IV+ | SPARC64 VI | SPARC64 VII | SPARC64 VII+ |
# chips | 8 | 8 | 8 | 8 |
Total hardware threads | 16 | 16 | 32 | 32 |
Frequency | 1.5 GHz | 2.28 GHz | 2.52 GHz | 2.88 GHz |
System Clock | 150 MHz | 960 MHz | 960 MHz | 960 MHz~ |
RAM | 64 GB | 64 GB | 64 GB* | 64 GB* |
Operating System | Solaris 10 Update 4 | Solaris 10 Update 4 | Solaris 10 Update 4 & 8 | Solaris 10 Update 8 |
Console system | X4240 | Opteron quad-core, 2×2.33 GHz |
Storage [shared] | SE9990V | 64 GB cache, 25 TB, 200 Hitachi HDD, 15k RPM, 8x2Gbit/s |
Note on (~): while the system clock has not changed, the new M9000 CMUs are equipped with an optimized Memory Access Controller labeled MAC+. The MAC+ chipset is critical for system reliability, in particular for the memory mirroring and memory patrolling features. We have not identified any performance improvement linked to this new feature.
Note on (*): those domains have 128 GB of total memory. To compare apples to apples, 64 GB of memory is allocated, populated and locked in place with my very own _shmalloc tool.

Chart
The iGenOLTPv4 workload is a lightweight Java-based OLTP database workload. Simulating a classic Order Entry system, it is tested in stream mode (i.e. no wait time between transactions). For this particular exercise, we created a very large database of 8 terabytes in total. This database is stored on the SE9990V using Oracle ASM. We query 100 million customer identifiers on this very large database in order to create an I/O intensive (but not I/O bound) workload similar to the largest OLTP installations in the world (for example, the E25ks running the bulk load of Oracle internal applications). The exact throughput in transactions per second and the average response times are reported and coalesced for each scalability level. For this test, we used Solaris 10 Update 4 & 8, Java version 1.6 build 16, and the Oracle database server 10.2.0.4.

Performance notes:
At peak, the new SPARC64 VII 2.88 GHz produces 1.10x the OLTP throughput of the 2.52 GHz chip on S10U8.
But compared to the 2.52 GHz chips on S10U4, the ratio is 1.54x, and compared to the SPARC64 VI it is 2.38x.
For a customer willing to upgrade an E25k equipped with 1.5 GHz chips, the throughput ratio is 4.125! It means that we can easily replace an 8-board E25k with a 2-board M8000 for better throughput and improved response times.
Average transaction response times at peak are 126 ms on the UltraSPARC IV+ domain, 87 ms on the SPARC64 VI, 82 ms on the SPARC64 VII 2.52 GHz (U4), 77 ms on the SPARC64 VII 2.52 GHz (U8) and 72 ms on the latest chip.
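The ratios above can be split into a chip part and a Solaris part with a little arithmetic. The Python sketch below uses only the figures quoted in these notes. It assumes the overall gain factors cleanly into (chip gain) x (Solaris gain), and that throughput scales linearly with the number of boards at an equal chip count per board. Both are simplifying assumptions, and the helper names are hypothetical.

```python
import math

# Peak throughput of the SPARC64 VII 2.88 GHz on S10U8, relative to each
# baseline (figures quoted in the performance notes).
RATIO_VS = {
    "SPARC64 VII 2.52 GHz, S10U8": 1.10,
    "SPARC64 VII 2.52 GHz, S10U4": 1.54,
    "SPARC64 VI": 2.38,
    "UltraSPARC IV+ (E25k)": 4.125,
}

# What clock speed alone would predict for 2.52 GHz -> 2.88 GHz.
FREQUENCY_RATIO = 2.88 / 2.52


def solaris_gain(ratio_vs_u4, ratio_vs_u8):
    """Gain attributable to S10U4 -> S10U8 on the same 2.52 GHz chip.

    Assumes total gain = chip gain x Solaris gain.
    """
    return ratio_vs_u4 / ratio_vs_u8


def boards_needed(old_boards, throughput_ratio):
    """Boards needed to match the old system's throughput.

    Assumes linear scaling with boards and the same number of chips per
    board on both systems (not measured in this test).
    """
    return math.ceil(old_boards / throughput_ratio)


gain = solaris_gain(RATIO_VS["SPARC64 VII 2.52 GHz, S10U4"],
                    RATIO_VS["SPARC64 VII 2.52 GHz, S10U8"])
print(f"frequency ratio : {FREQUENCY_RATIO:.2f}x")
print(f"Solaris U4 -> U8: {gain:.2f}x")
print(f"boards to replace an 8-board E25k: "
      f"{boards_needed(8, RATIO_VS['UltraSPARC IV+ (E25k)'])}")
```

Under these assumptions, the Solaris update accounts for about 1.40x of the 1.54x total, against 1.10x for the chip, and 8 / 4.125 rounds up to the 2 boards mentioned above.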

Conclusion
As expected, the Oracle OLTP improvements due to the new SPARC64 VII chip are modest when using the latest Solaris 10. However, all customers already in production on a previous release of Solaris 10 will see throughput improvements of up to 1.54x. Most likely, this is enough to motivate a refresh of their systems. And all E25k customers now have a very interesting value proposition with our M8000 and M9000 chassis.
See you next time in the wonderful world of benchmarking…
