Comparative data of ORACLE 10g on SPARC & SOLARIS 10

January 22, 2010

Oracle 10g OLTP performance on SPARC chips

A boring ratio


Customers would love to be able to tie their performance levels directly to their hardware. But more often than you would think, they migrate from System X (designed 10 years ago) to System Y (fresh from the oven) and are surprised by the performance improvements. In the past two years, we have completed many successful migrations from F15k/E25k servers to the new Enterprise M9000 servers. Customers have reported great improvements in throughput and response time. But what can you really expect, and what percentage of the improvement is actually due to operating system enhancements?
Can the recent small frequency increase on our SPARC64 VII chip be at all interesting? The new SPARC64 VII 2.88Ghz available on our M8000 and M9000 flagships brings no architectural change, no additional features and a modest frequency increase from 2.52 Ghz to 2.88 Ghz – a ratio of 1.14. We could stop our analysis there and label this change ‘marginal’ or ‘not interesting’. But my initial tests showed a comparative OLTP peak throughput far higher than this frequency-based ratio.






What happened?



A passion for Solaris


Most long-term Sun employees have a passion for Solaris. Solaris is the uncontested Unix leader and includes such a wealth of features that, once you are a Solaris addict, it is difficult to fall in love with another operating system. And Oracle executives made no mistake: Sun has the best UNIX kernel and performance engineers in the world. Without them, Solaris would not scale today to a 512-hardware-thread system (M9000-64).

But of course, Solaris is a moving target. Every release brings its truckload of features, bug fixes and other performance improvements. Here are the critical fixes made between Solaris 10 Update 4 and the brand new Solaris 10 Update 8 that influence Oracle performance on the M9000:



  • In Solaris 10 Update 5 (05/08), we optimized interrupt management (cr=5017144) and math operations (cr=6491717). We also streamlined CPU yield (cr=6495392) and the cache hierarchy (cr=6495401).

  • In Solaris 10 Update 6 (10/08), we optimized libraries and implemented shared context for Jupiter (cr=6655597 & 6642758).

  • In Solaris 10 Update 7 (05/09), we enhanced MPxIO as well as the PCI framework (cr=6449810 and others) and improved thread scheduling (cr=6647538). We also enhanced mutex operations (cr=6719447).

  • Finally, in Solaris 10 Update 8, after long customer escalations, we fixed the single-threaded nature of callout processing (cr=6565503-6311743). [This is critical for all calls made to nanosleep & usleep.] We also improved the throughput and latency of the very common e1000g driver (cr=6335837 + 5 more) and optimized the mpt driver (cr=6784459). We cleaned up interrupt management (cr=6799018) and optimized bcopy and kcopy operations (cr=6292199). Finally, we improved some single-threaded operations (cr=6755069).



My initial SPARC64 VII iGenOLTP tests were done with Solaris 10 Update 4. But I could not test the new SPARC64 VII 2.88Ghz with this release because it was not supported! Therefore, I had to compare the new chip's performance to the SPARC64 VII 2.52Ghz using both S10U4 and S10U8. We will see below that most of the improvement comes not from the frequency increase but from Solaris itself.











Chips & Chassis

Please find below the key characteristics of the chips we have tested:

Chips            UltraSPARC IV+   SPARC64 VI    SPARC64 VII   SPARC64 VII (+)
Manufacturing    90nm             90nm          65nm          65nm
Die size         356 sq mm        421 sq mm     421 sq mm     421 sq mm
Transistors      295 million      540 million   600 million   600 million
Cores            2                2             4             4
Threads/core     1                2             2             2
Total threads    2                4             8             8
Frequency        1.5 Ghz          2.28 Ghz      2.5 Ghz       2.88 Ghz
L1 I-cache       64 KB            128 KB/core   512 KB        512 KB
L1 D-cache       64 KB            128 KB/core   512 KB        512 KB
On-chip L2       2 MB             6 MB          6 MB          6 MB
Off-chip L3      32 MB            None          None          None
Max Watts        56 W             120 W         135 W         140 W
Watts/thread     28 W             30 W          17 W          17 W







Note on (+): The new SPARC64 VII is not officially labeled with a plus sign, reflecting the absence of new features.








Now, here is our hardware list. Note that, to avoid the need for a huge client system, we ran this iGenOLTP workload in a Console/Server mode. This means that the Java processes sending SQL queries via JDBC run directly on the server under test. While this model was unusual ten years ago in the era of Client/Server, it is increasingly common in new customer deployments.

Servers                  E25k                  M9000-32              M9000-32                  M9000-32
Chip                     UltraSPARC-IV+        SPARC64 VI            SPARC64 VII               SPARC64 VII+
# chips                  8                     8                     8                         8
Total hardware threads   16                    16                    32                        32
Frequency                1.5 Ghz               2.28 Ghz              2.52 Ghz                  2.88 Ghz
System Clock             150 Mhz               960 Mhz               960 Mhz                   960 Mhz~
RAM                      64 GB                 64 GB                 64 GB*                    64 GB*
Operating System         Solaris 10 Update 4   Solaris 10 Update 4   Solaris 10 Update 4 & 8   Solaris 10 Update 8

Storage: SE9990V [shared] – 64 GB cache, 25 TB, 200 Hitachi 15k RPM HDDs, 8x2Gbit/s
Console system: X4240 – Opteron quad-core, 2×2.33Ghz











Note on (~): While the system clock has not changed, the new M9000 CMUs are equipped with an optimized Memory Access Controller labeled MAC+. The MAC+ chipset is critical for system reliability, in particular for the memory mirroring and memory patrolling features. We have not identified performance improvements linked to this new feature.

Note on (*): Those domains have 128GB of total memory. To compare apples to apples, 64GB of memory is allocated, populated and locked in place with my very own _shmalloc tool.











Chart

The iGenOLTP v4 workload is a lightweight, Java-based OLTP database workload. Simulating a classic Order Entry system, it is tested in stream mode (i.e. no wait time between transactions). For this particular exercise, we have created a very large database of 8 Terabytes in total. This database is stored on the SE9990V using Oracle ASM. We query 100 million customer identifiers on this very large database in order to create an I/O-intensive (but not I/O-bound) workload similar to the largest OLTP installations in the world (for example, the E25ks running the bulk of Oracle's internal application load). The exact throughput in transactions per second and the average response times are reported and coalesced for each scalability level. For this test, we used Solaris 10 Update 4 & 8, Java version 1.6 build 16, and the Oracle database server 10.2.0.4.







Performance notes:

  • In peak, the new SPARC64 VII 2.88Ghz produces 1.10x the OLTP throughput of the 2.52Ghz on S10U8.

  • But compared to the 2.52Ghz chips on S10U4, the ratio is 1.54x, and compared to the SPARC64 VI it is 2.38x.

  • For a customer willing to upgrade an E25k equipped with 1.5Ghz chips, the throughput ratio is 4.125! It means that we can easily replace an 8-board E25k with a 2-board M8000 for better throughput and improved response times.

  • Average transaction response times in peak are 126 ms on the UltraSPARC IV+ domain, 87 ms on the SPARC64 VI, 82 ms on the SPARC64 VII 2.52Ghz (U4), 77 ms on the SPARC64 VII 2.52Ghz (U8) and 72 ms on the latest chip.











Conclusion

As expected, the Oracle OLTP improvements due to the new SPARC64 VII chip are modest when using the latest Solaris 10. However, all the customers already in production on previous releases of Solaris 10 will see throughput improvements of up to 1.54x. Most likely, this is enough to motivate a refresh of their systems. And all E25k customers now have a very interesting value proposition with our M8000 and M9000 chassis.

See you next time in the wonderful world of benchmarking….





Inside the Sun Oracle Database machine : The F20 PCie cards

December 9, 2009

We are living through a drastic change, something close to a revolution, with the new Sun Oracle Database Machine. Why? With the critical use of enterprise flash memory, this architecture is no longer reserved for data warehouses but is very well suited to Online Transaction Processing. We are preparing benchmark results on this platform and actively shipping systems to customers. In the meantime, in a series of short entries, I will describe the key innovations at the heart of this environment.

Let’s start with the Sun Flash Accelerator F20 PCIe card.


Each Exadata cell (or Sun Oracle Exadata Storage Server) includes 384 GB of flash storage in total, producing up to 75,000 IOPS [8k block]. This capacity is obtained via four 96GB F20 PCIe cards, detailed below. A full rack configuration comes equipped with 5.3TB of flash storage and can produce an amazing 1 million IOPS [8k block]. This huge cache is not only used smartly and automatically by the Database Machine but can also be user-managed via the ALTER TABLE STORAGE command and the CELL_FLASH_CACHE argument.
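To illustrate the user-managed side, here is a minimal sketch of how a JDBC client could ask the storage cells to keep a table in flash cache. The table name, connection string and credentials are illustrative assumptions, not values from this article; the CELL_FLASH_CACHE keywords follow the standard Exadata storage clause.

// Minimal sketch: pinning a (hypothetical) ORDERS table in the Exadata flash
// cache from a plain JDBC client. Connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FlashCachePin {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbm01:1521/ORCL", "scott", "tiger");
             Statement st = con.createStatement()) {
            // Ask the cells to keep this table's blocks in flash cache.
            st.execute("ALTER TABLE orders STORAGE (CELL_FLASH_CACHE KEEP)");
            // Revert to the default, automatically managed behavior:
            // st.execute("ALTER TABLE orders STORAGE (CELL_FLASH_CACHE DEFAULT)");
        }
    }
}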

Here is the detailed architecture of the F20 PCIe card : 


As you can see, we obtain the total 96GB capacity via four Disk on Modules (DOMs). Each DOM contains four 4GB SLC NAND components on the front side and four on the back side. This gives a 32GB capacity, of which 24GB is addressable. To further accelerate flash performance, 64MB of DDR-400 DRAM per DOM provides a local buffer cache.

Finally, each DOM needs to manage all of its components, track faulty blocks, handle load balancing and communicate with the outside world using standard SATA protocols. This is achieved with a Marvell SATA2 flash memory controller.

Outside of the four DOMs, a supercapacitor module provides enough backup power to flush data from DRAM to the non-volatile flash devices, therefore maintaining data integrity during a power outage. With a 5-year lifespan in a well-cooled chassis, these modules are superior to classic batteries. Finally, an LSI eight-port SAS controller connects the four DOMs to a 12-port SAS expander for external connectivity.

We measured in the Sun labs a 16.5W power consumption per F20 card. We were able to produce 1GB/s in sequential read (1MB I/O) and 100,110 IOPS (4k random read) for each card. In addition, each card can replace about two hundred latest-generation 15k RPM HDDs, for a power consumption of 0.165 milliwatts per 4K I/O and an estimated MTBF of 227 years. Amazing!

See you next time in the wonderful world of benchmarking. 


What processor will fuel your first private Cloud : INTEL Nehalem or AMD Istanbul ?

July 6, 2009


Where IT is going…

You may have observed the big trend of the moment: take your old slide decks, banners and marketing brochures and try to plug in the word cloud as many times as possible. A Google search of the words Cloud Computing yields today more than 31 million results! Even if you search only on Cloud (getting 175 million+ results), the first entry in the list (discounting the sponsored results) is this one. Amazing fashion of the moment!


As we recently described in this white paper, there is not one cloud but many. I had recent conversations on this topic with customers in our Menlo Park Executive Briefing Center. While they all say that they will not be able to host their entire IT department in a Public Cloud, they are interested in the notion of combining a Public cloud service with multiple Private Clouds – this is the notion of the Hybrid Cloud.















Private clouds

The Sun Solution Centers and Sun Professional Services are now starting to build the first private cloud architectures based on Sun Open Source products. The most common building block for those is the versatile Sun Blade 6000. Why? Because of the capacity of this chassis to host many different types of CPUs (x86 & SPARC) and operating systems (Windows, Linux, OpenSolaris, Solaris or even VMware vSphere). At the same time, INTEL and AMD have released two exceptional chips: the INTEL XEON 5500 (code name Nehalem) and the six-core AMD Opteron (code name Istanbul). I had the opportunity to test these chips recently and will give you here a few data points.










Cloud benchmarks

We may not have any Cloud-related standard benchmarks today. However, if I look at the different software components of a private cloud, it seems that computing capabilities (in integer and floating point) and memory performance are the two key dimensions to explore. You may argue that your cloud needs a database component… but improved caching mechanisms (memcached for example) and the commoditization of Solid State Disks (see this market analysis and also here) are moving database performance profiles toward memory- or CPU-intensive workloads. Additionally, the exceptional power of 10-Gbit-based hybrid storage appliances (like the Sun Storage 7410 Unified Storage System) makes us less concerned about I/O- and network-bound situations. It is good to know that these new storage appliances are a key element of our public cloud infrastructure.















Nehalem & Istanbul Executive summary

Both AMD & INTEL had customer investments in mind, as their new chips use the same sockets as before… so they can be used in previously released chassis. What you will typically have to do after upgrading to the new processors is to download the latest platform BIOS. Another good idea is to check your OS level… the latest OS releases include upgraded libraries and drivers. Those are critical if performance is near the top of your shopping list. See here for example.

For other features, please refer to the key characteristics below:

Feature              INTEL Xeon X5500 (Nehalem)             AMD Opteron 2435 (Istanbul)
Release date         March 29, 2009                         June 1st, 2009
Manufacturing        45 nm                                  45 nm
Frequency (tested)   2.8Ghz                                 2.6Ghz
Cores                4                                      6
Strands/core         2 [if NUMA on]                         1
Total #strands       8                                      6
L1 cache             256 KB [32KB I. + 32KB D. per core]    768 KB [128 KB per core]
L2 cache             1 MB [256KB per core]                  3 MB [512KB per core]
L3 cache             2 MB shared                            6 MB shared
Memory type          DDR3 1333Mhz max. *                    DDR2 800 Mhz
Nom. Power           95 W                                   75 W
Major Innovations    Second-level branch predictor & TLB    Power savings and HW virtualization

Note (*): For this test, we used DDR3 1066Mhz.


Now, here is our hardware list:

Role                     Model    Blade   Sockets@freq   RAM
AMD Opteron ‘Istanbul’   SB6000   X6260   2@2.6Ghz       24 GB
INTEL XEON ‘Nehalem’     SB6000   X6270   2@2.8Ghz       24 GB
Console                  X4150    N/A     2@2.8Ghz       16 GB








Calculation performance: iGenCPU

iGenCPU is a calculation benchmark written in Java. It calculates Benoit Mandelbrot’s fractals using a custom imaginary-numbers library. The main benefit of this workload is that it naturally creates a 50% floating-point and 50% integer calculation mix. As the number of floating-point operations produced by commercial software increases every year, this type of performance profile is getting closer and closer to what modern web servers (like Apache) and application servers (like Glassfish) produce.
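For readers who want to picture the kind of kernel this describes, here is a minimal sketch of a Mandelbrot-style loop mixing integer and floating-point work. It is only an illustration of the workload shape under my own assumptions, not the actual iGenCPU source.

// Sketch of a Mandelbrot-style kernel mixing integer and floating-point work,
// in the spirit of iGenCPU (not the real benchmark code).
public class MandelSketch {
    // Iteration count for point (cr, ci); the integer loop counter and the
    // double arithmetic give a roughly balanced int/FP mix.
    static int iterations(double cr, double ci, int maxIter) {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < maxIter && zr * zr + zi * zi <= 4.0) {
            double tmp = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = tmp;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long points = 0;
        long start = System.nanoTime();
        // Sweep a small grid of the complex plane; each point is one unit of work.
        for (double x = -2.0; x <= 1.0; x += 0.002) {
            for (double y = -1.5; y <= 1.5; y += 0.002) {
                iterations(x, y, 1000);
                points++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d points in %.2f s (%.0f points/s)%n",
                points, seconds, points / seconds);
    }
}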




Here are the results
(AMD Istanbul in Blue, INTEL Nehalem in Red) :







Observations:

  1. Very similar peak throughput (984 fractals/s on INTEL, 1008 fractals/s on AMD).

  2. The AMD chip produces superior throughput at any level of concurrency. At 8 threads, which is a very common scalability limit for commercial virtualization products, it produces 28% more throughput than Nehalem.

  3. This shows the superiority of the Opteron calculation co-processors, as we had already observed on the previous quad-core generation.

  4. For calculation it is more important to have a larger L1/L2 cache than a faster L1/L2 cache. The Opteron micro-architecture is naturally a better fit for this workload.








Memory performance: iGenRAM

It is a classic brain exercise when you cannot sleep: imagine what you would do with $94 million in your bank account. The iGenRAM benchmark was initially developed in C to produce an accurate simulation of the California Lotto winner determination. It is highly memory intensive, using 1 Gigabyte of memory per thread. Memory allocation time as well as memory search performance produce a combined throughput number plotted below:




Observations:

  1. The faster DDR3 memory and the higher frequency of the INTEL chip make it a better fit for memory-intensive workloads. In peak, the Nehalem-based system produces 23% more throughput than its competitor.

  2. For a small number of threads (1 to 4), both systems produce very similar numbers.

  3. On this repetitive workload, the second-level branch predictor most likely helps the Nehalem-based system improve the slope of its scalability curve past four threads.

  4. As noted, we used DDR3 1066Mhz for this Nehalem test. DDR3 1333Mhz is also available and would increase the INTEL chip's advantage on this workload.
















Conclusion

To a complex question, a complex answer… As you have noted, these benchmarks show the AMD Istanbul better suited to calculation-intensive workloads, but they also show the better memory performance of the INTEL Nehalem. Therefore, different layers within your private cloud will need to be profiled if you want to determine which is your best choice. And guess which operating system comes equipped with the right set of tools (i.e. Dynamic Tracing) to make that determination: Solaris or OpenSolaris.

[Last-minute note: I also performed Oracle 10g database benchmarks on these blades. Maybe for another article…]










See you next time
in the wonderful world of benchmarking….





Running your Oracle database on internal Solid State Disks : a good idea ?

May 11, 2009

Scaling MySQL and ZFS on T5440






Solid State Disks: a 2009 fashion

This technology is not new: it originates in 1874, when a German physicist named Karl Braun (pictured above) discovered that he could rectify alternating current with a point-contact semiconductor. He later built the first CRT oscilloscope and the first prototype of a Cat's-whisker diode, later optimized by G. Marconi and G. Pickard. In 1909, K. Braun shared the Nobel Prize for physics with G. Marconi.


The Cat's-whisker diodes are considered the first solid-state devices. But it is only in the 1970s that solid-state storage appeared in high-end mainframes produced by Amdahl and Cray Research. However, its high cost of fabrication limited its industrialization. Several companies later attempted to introduce the technology to the mass market, including StorageTek, Sharp and M-Systems. But the market was not ready.


Nowadays, SSDs are based on one of two technologies: DRAM volatile memory or NAND-flash non-volatile memory. Key recent announcements from Sun (Amber Road and ZFS), HP (IO Accelerator) and Texas Memory Systems (RamSan-620), as well as lower fabrication costs and larger capacities, are making the NAND-based technology a must-try for every company this year.


This article looks at the Oracle database performance of our new 32GB SSDs OEM'd from Intel. These new devices have improved I/O capacity and MTBF thanks to an architecture featuring 10 parallel NAND flash channels. See this announcement for more.

If you dig a little bit into the question, you will find this whitepaper. However, the 35% boost in performance that they measured seems insufficient to justify trashing HDDs for SSDs. In addition, as they compare a different number of HDDs and SSDs, it is very hard to determine the impact of a one-to-one replacement. Let's make our own observations.






Here is a picture of the SSD tested – thanks to Emie for the shot!










Goals

As any DBA knows, it is very difficult to characterize a database workload in general. We are all very familiar with the famous “Your mileage may vary” or “All customer database workloads are different”. And we cannot trust Marketing departments on SSD performance claims, because nobody runs a synthetic I/O generator for a living. What we need to determine is the impact for end users (response time, anyone?) and how capacity planners can benefit from the technology (how about peak throughput?).

My plan is to perform two tests on a Sun Blade X6270 (Nehalem-based) equipped with two Xeon chips and 32GB of RAM, on one SSD and one HDD – with different expectations:

  1. Create a 16 Gigabyte database that will be entirely cached in the Oracle SGA. Will we observe any difference?

  2. Create a 50 Gigabyte database of which only about 50% can be cached. We expect a significant performance impact. But how much?






SLAMD and iGenOLTP

The SLAMD Distributed Load Generation Engine (SLAMD) is a Java-based application designed for stress testing and performance analysis of network-based applications. It was originally developed by Sun Microsystems, Inc., but it has been released as an open source application under the Sun Public License, which is an OSI-approved open source license. The main site for obtaining information about SLAMD is http://www.slamd.com/. It is also available as a java.net project.

iGenOLTP is a multi-processed and multi-threaded database benchmark. As a custom Java class for SLAMD, it is a lightweight workload composed of four select statements, one insert and one delete. It produces a 90% read / 10% write workload simulating a global order system.





Software and Hardware summary

This study uses Solaris 10 Update 6 (released October 31st, 2008), Java 1.7 build 38 (released October 23rd, 2008), SLAMD 1.8.2, iGenOLTP v4 for Oracle and Oracle 10.2.0.2. The hardware tested is a Sun Blade X6270 with 2x INTEL XEON X5560 2.8Ghz and 32 GB of DDR3 RAM. This blade has four standard 2.5-inch disk slots, in which we install one 32 GB Sun/Intel SSD and one 146GB 10k RPM SEAGATE ST914602SS drive with read cache and write cache enabled.


Test 1 – Database mostly in memory

We are creating a 16 Gigabyte database (4k block size) on one Solid State Disk and on one Seagate HDD, each configured in one ZFS pool with the default block size. We limit the ZFS buffer cache to 1 Gigabyte and allow an Oracle SGA of 24 Gigabytes. The entire database will be cached. We will feel the SSD impact only on random writes (about 10% of the I/O operations) and sequential writes (Oracle redo log). The test becomes CPU bound as we increase concurrency. We are testing from 1 to 20 client threads (i.e. database connections) in streams.

In this case, and for throughput [in transactions per second], the difference between HDD and SSD evolves from significant to modest as concurrency increases. Interestingly, it is in the midrange of the scalability curve that we observe the peak difference of 71% more throughput on the SSD (at 4 threads). At 20 threads we are mostly CPU bound, therefore the impact of the storage type is minimal and the SSD advantage on throughput is only 9%.













For response times [in milliseconds], the difference is slightly smaller, with 42% better response times at 4 threads and 8% better at 20 threads.













Test 2 – Database mostly on disk

This time, we are creating a 50 Gigabyte database on one SSD and on one HDD, each configured in its own dedicated ZFS pool. Memory usage will be sliced the same way as in test 1 but will not be able to cache more than 50% of the entire database. As a result, we become I/O bound before we become CPU bound. Please remember that the X6270 is equipped with two eight-thread X5560s – a very decent 16-way database server!

Here are the results:

The largest difference is observed at 12 threads, with more than twice the transactional throughput on the SSD. In response times (below), we observe the SSD to be 57% faster in peak and 59% faster at 8 threads.









In a nutshell

My intent with this test was to show you, for a classic lightweight Oracle OLTP workload, the good news:

  • When I/O bound, we can replace two Seagate 10k RPM HDDs with one INTEL/SUN SSD for similar throughput and twice faster response times.

  • On a one-for-one basis, the response time difference by itself (up to 67%) will make your end users love you instantly!

  • Peak throughput in memory compared to the SSD is very close: in peak, we observed 821 TPS (24ms RT) in memory and 685 TPS (30ms RT) on the SSD. Very nice!

…and the bad news:

  • When the workload is CPU bound, the impact of replacing your HDD with an SSD is moderate, while you lose a lot of capacity.

  • The cost per gigabyte needs to be carefully calculated to justify the investment. Ask your Sales rep for more…





See you next time in the wonderful world of benchmarking….

Sun Blade X6270 & INTEL XEON X5560 on OpenSolaris create the ultimate Directory Server

April 17, 2009

DSEE6.3 on X6270






Sun Blade 6000 Modular System

As you can see in this video, the ten-rack-unit Sun Blade 6000 system is the way to provide a very dense 10-blade environment which can run a mix of Solaris, Linux and Windows on a unique choice of SPARC, AMD or INTEL processors. You get up to double the memory and I/O capacity of competing blades, using industry-standard PCIe ExpressModules. In a few days we are announcing the most powerful of all blades: the Sun Blade X6270, based on the new INTEL XEON X5560 processor (codename Nehalem).












Sun Blade X6270

The blade that I tested came equipped with two INTEL XEON X5560 processors (code name Nehalem) running at 2.8Ghz and 24 GB of 1066Mhz DDR3 memory. As I wanted to get the lowest memory latency possible, I borrowed from engineering six 1333Mhz DDR3 4GB X5870A DIMMs. By placing them in strategic slots (banks 2, 5 and 8 of each socket), I guaranteed they would effectively run at 1333Mhz, producing an ideal Directory Server environment. (You could rightly observe that I would get a small additional boost in performance by upgrading to the XEON X5570, most likely between 2 and 4%.)










Here are the details of the configuration:

System Configuration: SUN MICROSYSTEMS SUN BLADE X6270 SERVER MODULE
BIOS Configuration: American Megatrends Inc.
BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================
Version                                Location Tag
--------------------------------       --------------------------
Intel(R) Xeon(R) CPU X5560 @ 2.80GHz   CPU 1
Intel(R) Xeon(R) CPU X5560 @ 2.80GHz   CPU 2

==== Memory Device Sockets ================================
Type    Status   Set   Device Locator   Bank Locator
-----   ------   ---   --------------   ------------
other   in use   0     D2               BANK2
other   in use   0     D5               BANK5
other   in use   0     D8               BANK8
other   in use   0     D2               BANK2
other   in use   0     D5               BANK5
other   in use   0     D8               BANK8
FLASH   in use   0

==== On-Board Devices =====================================
Zoar 2x GbE.
Zoar 2x GbE.

==== Upgradeable Slots ====================================
ID   Status      Type          Description
--   ---------   -----------   -----------
0    in use      PCI Express   PCIE0
1    available   PCI Express   PCIE1
2    available   PCI Express   PCIE2
3    available   PCI Express   PCIE3
4    available   PCI Express   PCIE4










INTEL Xeon X5500 processors

This new family of XEON processors is based on the new Intel processor microarchitecture (see diagram below). Using a 45nm manufacturing process, each 263 sq. mm quad-core, dual-thread-per-core chip has 781 million transistors, 256KB of L1 cache, 1 MB of L2 cache and 8MB of L3 cache. The DDR3-1333 memory controller is key to obtaining extreme performance from memory-intensive applications. I recently tested, for an undisclosed customer in Silicon Valley, an in-memory database showing on this chip more than 30 times the throughput of any relational database software.





Note on BIOS settings: Overall, the INTEL-recommended BIOS settings for the TPC-E benchmarks were used for this test. In a nutshell, the following parameters were enabled: NUMA, HyperThreading, MLC Spatial & Streamer prefetchers and DCU IP & Streamer prefetchers. RTID was kept at the default value of 24-16-24.






Sun DSEE 6.3.1

The Sun Java System Directory Server Enterprise Edition provides a central repository for storing and managing identity profiles and access information. Leading the directory market, it is a secure, highly available and scalable product, just updated with release 6.3.1.

This latest update provides fixes for replication issues in mixed DS 5.2 and 6.x topologies; on Directory Proxy Server it improves support for virtualization and includes additional performance-related enhancements. Furthermore, this patch release improves the overall quality and robustness of deployments. More information is in the Release Notes located here.











OpenSolaris

The OpenSolaris Operating System, a single distribution for desktop, server and HPC deployments, is based on the Solaris kernel and created through community collaboration at openSolaris.org. It combines Solaris technologies and tools with modern desktop features and applications developed by open source communities such as GNOME, Mozilla and the Free Software Foundation. LiveCD installation and the new network-based OpenSolaris Image Packaging System (IPS) simplify and speed up installation and integration with third-party applications. OpenSolaris is fully supported, with OpenSolaris Subscriptions available from Sun ranging from email support to 24/7 production support.

For this test, we are using OpenSolaris build 109 – which includes some of the engineering work done by Sun and INTEL to optimize Solaris environments on the INTEL XEON X5500:


# cat /etc/release
                  Solaris Express Community Edition snv_109 X86
        Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
                     Use is subject to license terms.
                        Assembled 23 February 2009





Benchmarketing a directory server

That's right, this is not a typo. Benchmarketing is different from benchmarking! You can find today on the Internet various claims of high LDAP search performance. Most of those performance numbers are obtained on very small directories. It is not uncommon for white paper authors to test an LDAP directory as small as 50,000 entries and consider it relevant. See the "Measurement and Analysis of LDAP Performance" white paper for an example. Also, engineers use common tricks to increase LDAP performance, including disabling the directory logs (or writing them in memory), returning only one or no attributes, and/or querying only a portion of the index tree. (Note that I could not resist trying this, and under these conditions I was able to get more than 55,000 LDAP searches/s on the X6270.)

What's happening on our customer sites is very different. An average Sun DSEE directory has around 10 million entries accessed with a 10% ratio (i.e. 1 million user ids). And of course, we cannot use any of the previously detailed performance tricks in a 24×7 production environment. The following benchmarking results have been obtained using production-ready tunables.






iGenLDAPs – An LDAP search benchmark

The iGenLDAPs benchmark is based on SLAMD – a load simulation framework initially developed at Sun and now available as a java.net project. SLAMD is multi-client, multi-process and multi-threaded, making it the most scalable LDAP load simulator on the market. As mentioned, we are querying a 10-million-entry “ou=People” directory with a 10% access ratio, using DSEE 6.3.1. The directory is configured to use a maximum of 20 Gigabytes of RAM – which is enough to cache the entire index+data. All 12 attributes are returned to the client (a fully loaded Sun Fire X4450 server) connected to the blade via a private 1 Gbit network. A Sun StorageTek 6140 array hosts the directory on five 15000 rpm FC disks (RAID 1+0), with one RAID controller with 1 GB of cache and one 4 Gbit/second link.
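For readers who want to reproduce this kind of client load, here is a minimal, hypothetical sketch of a single LDAP search issued from Java with JNDI. The host name, base DN and filter are illustrative assumptions; the actual iGenLDAPs job is a SLAMD class and is not shown here.

// Hypothetical single LDAP search with JNDI; host, base DN and filter are
// placeholders, not the actual iGenLDAPs configuration.
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class LdapSearchSketch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://x6270-blade:389");   // assumed host:port

        InitialDirContext ctx = new InitialDirContext(env);
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);   // all attributes returned by default

        long start = System.nanoTime();
        NamingEnumeration<SearchResult> results =
            ctx.search("ou=People,dc=example,dc=com", "(uid=user.1000042)", sc);
        while (results.hasMore()) {
            results.next();                                // consume the returned entry
        }
        System.out.printf("search took %.3f ms%n", (System.nanoTime() - start) / 1e6);
        ctx.close();
    }
}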


Here are the results:

iGenLDAPs – 1800 seconds (30m 0s)

Searches Performed
  Count      Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  49647440   34239.614    1711980.069    2713.15   0.040

Exceptions Caught
  Count   Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  0       0.000        0.000          0.000     0.000

Entries Returned
  Total      Avg Value   Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  49647450   1.000       34239.621    1711980.103    0.000     0.000

Search Time (ms)
  Total Duration   Total Count   Avg Duration   Avg Count/Interval   Std Dev   Corr Coeff
  2873937          4964747       0.579          171198.172           0.05      0.048






Under similar conditions, the fastest I had ever obtained before this test was about 23,000 searches/s! Please try it at home and see if you can beat the X6270 numbers. And if you can, let us know!

But you can argue (rightfully) that a standard directory deployment does not process only LDAP searches but a mix of various LDAP calls. This is why I also provide you with the iGenLDAPsm numbers.






iGenLDAPsm – An LDAP SiteMinder simulation

The iGenLDAPsm simulation provides a mechanism for simulating the load that Netegrity SiteMinder places on a directory server when it is using that server to authenticate users. In particular, this job simulates the requests that SiteMinder issues to the directory server when password services are enabled. While the performance of Modify or Authenticate operations in isolation is interesting, iGenLDAPsm provides a very realistic way of determining your REAL LDAP capacity. This load generates 11 LDAP operations per transaction: 1 Authentication, 1 Bind, 1 Modify and 8 Searches. As you can see below, we obtained a peak of 4180 iGenLDAPsm transactions per second, corresponding to 45,980 LDAP operations per second! On a single X6270 blade….






iGenLDAPsm – 1800 seconds (30m 0s)

Authentication Attempts
  Count     Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  4807641   4180.557     20902.783      57.34     -0.062

Successful Authentications
  Count     Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  4805644   4178.817     20894.087      54.89     -0.061

Failed Authentications
  Count   Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  0       0.000        0.000          0.000     0.000

Authentication Time (ms)
  Total Duration   Total Count   Avg Duration   Avg Count/Interval   Std Dev   Corr Coeff
  228998500        480564        47.652         20894.087            1.62      0.321

Bind Operations Performed
  Count     Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  4807032   4180.026     20900.130      56.66     -0.062

Bind Time (ms)
  Total Duration   Total Count   Avg Duration   Avg Count/Interval   Std Dev   Corr Coeff
  2721396          480700        0.566          20900.000            0.016     0.060

Modify Operations Performed
  Count     Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  4806293   4179.383     20896.913      55.65     -0.061

Modify Time (ms)
  Total Duration   Total Count   Avg Duration   Avg Count/Interval   Std Dev   Corr Coeff
  29812733         480600        6.203          20895.652            0.19      -0.068

Search Operations Performed
  Count      Avg/Second   Avg/Interval   Std Dev   Corr Coeff
  38452462   33436.922    167184.609     447.43    -0.061

Initial Search Time (ms)
  Total Duration   Total Count   Avg Duration   Avg Count/Interval   Std Dev   Corr Coeff
  5000859          4807211       10.403         20900.913            0.375     0.015

Subsequent Search Time (ms)
  Total Duration   Total Count   Avg Duration   Avg Count/Interval   Std Dev   Corr Coeff
  146366814        33643572      4.351          146276.391           0.189     0.171



 










Conclusion

A Sun Blade 6000 can host as many as ten blades; it therefore gives us a potential 459,800 LDAP operations per second on a fully loaded modular system.

And it is easy to federate ten instances of the Directory Server: you can use the Data Distribution feature of the Directory Proxy Server. Amazing technology!



See you next time in the wonderful world of
benchmarking….





Creating a new blog : MrCloud

March 24, 2009

As a new medium to publish around Sun and my personal activities around the Sun Cloud initiative, I have today created a new blog: MrCloud.


See this entry


I saw two clouds at morning


Tinged by the rising sun,


And in the dawn they floated on


And mingled into one.


John
Brainard

Improving MySQL scalability blueprint

March 17, 2009

My previous blog entry on MySQL scalability on the T5440 is now complemented by a Sun BluePrint that you can find here.




See
you next time in the wonderful world of benchmarking….




Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

November 10, 2008



A new era

In the past few years, I have published many articles using Oracle as the database server. As a former Sybase system administrator and former Informix employee, it was obviously not a matter of personal choice. It was just because the large majority of Sun's customers running databases were also Oracle customers.

This summer, in our 26 Sun Solution Centers worldwide, I observed a shift. Yes, we were still seeing older solutions based on DB2, Oracle, Sybase or Informix being evaluated on new Sun hardware. But every customer project manager, every partner, every software engineer working on a new information system design asked us: can we architect this solution with MySQL?

In many cases, if you dared to reply YES to this question, the next question would be about the scalability of the MySQL engine.

This is why I decided to write this article.






Goals

Please find below my initial goals:

  1. Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440.

  2. Do it 21st-century style, i.e. with MySQL and ZFS, not 20th-century style, i.e. with OraSybInf… and VxFS.

  3. Do it with minimal tuning, i.e. as close as possible to out-of-the-box.

This article describes how I achieved these goals. It has two main parts: a short description of the technologies used, then a presentation of the results obtained.








Sun SPARC Enterprise T5440 server

The T5440 server is the first quad-socket server offering 256 hardware threads in just four rack units. Each socket hosts an UltraSPARC T2 Plus processor, which offers eight cores and 64 simultaneous threads in a single piece of silicon. While a lot of customers are interested in the capacity of this system to be divided into 128 two-way domains, this article explores the database capacity of a single 256-way Solaris 10 domain.







The Zettabyte file system

Announced in 2004 and introduced as part of OpenSolaris build 27 in November 2005, ZFS is the one-and-only 128-bit file system. It includes many innovative features like a copy-on-write transactional model, snapshots and clones, dynamic striping and variable block sizes. Since July 2006, ZFS has also been a key part of the Solaris operating system. A key difference between UFS and ZFS is the usage of the ARC [Adaptive Replacement Cache] instead of the traditional virtual memory page cache. To obtain the performance level shown in this article, we only had to tune the size of the ARC cache and turn off atime management on the file systems to optimize ZIL I/O latency. The default ZFS recordsize is commonly changed for database workloads; for this study, we kept the default value of 128k.











MySQL 5.1

The MySQL database server is the leading Open Source database for Web 2.0 environments. MySQL was introduced in May 1995 and has never stopped being enriched with features. The 5.1 release is an important milestone as it introduces support for partitioning, event scheduling, XML functions and row-based replication. While Sun is actively working on a highly scalable single-instance storage engine, this article shows how one can reach a very high level of SQL query throughput using MySQL 5.1.29 64-bit on a 256-way server.












SLAMD and iGenOLTP

The SLAMD Distributed Load Generation Engine (SLAMD) is a Java-based application designed for stress testing and performance analysis of network-based applications. It was originally developed by Sun Microsystems, Inc., but it has been released as an open source application under the Sun Public License, which is an OSI-approved open source license. The main site for obtaining information about SLAMD is http://www.slamd.com/. It is also available as a java.net project.

iGenOLTP is a multi-processed and multi-threaded database benchmark. As a custom Java class for SLAMD, it is a lightweight workload composed of four select statements, one insert and one delete. It produces a 90% read/10% write workload simulating a global order system. For this exercise, we are using a maximum of 24 million customers and 240 million orders in the databases. The database is divided, or “sharded”, into as many pieces as there are MySQL instances on the system. [See this great article on database sharding.] For example, with 24 database instances, database 1 stores customers 1 to 1 million, database 2 stores customers 1 million to 2 million, and so on. The Java threads simulating the workload are aware of the database partitioning scheme and simulate the traffic accordingly, as sketched below.

This approach can be called “application partitioning” as opposed to “database partitioning”. Because it is based on a shared-nothing architecture, it is natively more scalable than a shared-everything approach (as in Oracle RAC).
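To make the application-partitioning idea concrete, here is a minimal, hypothetical sketch of the shard-routing logic such a client can use. The one-million-customers-per-shard split follows the description above; the host name, ports, database names and credentials are illustrative assumptions, not the actual iGenOltpMysql code.

// Hypothetical shard router for an application-partitioned workload:
// customer ids are split into ranges of one million per MySQL instance.
import java.sql.Connection;
import java.sql.DriverManager;

public class ShardRouter {
    private static final long CUSTOMERS_PER_SHARD = 1000000L;
    private final String[] shardUrls;   // one JDBC URL per MySQL instance

    public ShardRouter(String[] shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Customer ids are 1-based: ids 1..1,000,000 live in shard 0, and so on.
    int shardFor(long customerId) {
        return (int) ((customerId - 1) / CUSTOMERS_PER_SHARD);
    }

    Connection connectionFor(long customerId) throws Exception {
        // Placeholder credentials; a real client would pool these connections.
        return DriverManager.getConnection(shardUrls[shardFor(customerId)], "bench", "bench");
    }

    public static void main(String[] args) {
        String[] urls = new String[28];
        for (int i = 0; i < urls.length; i++) {
            // Assumed naming: one mysqld per port, one database per instance.
            urls[i] = "jdbc:mysql://t5440:" + (3306 + i) + "/igen" + i;
        }
        ShardRouter router = new ShardRouter(urls);
        System.out.println("customer 15300042 -> shard " + router.shardFor(15300042L));
    }
}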







Java Platform Standard Edition 7

Initially released in 1995, the Java programming language started a revolution in computer languages because the concept of the Java Virtual Machine brought instant portability across computer architectures. While the 1.7 JVM is still in beta release, it is the base of my iGenOltpMysql Java class performing the workload shown in this article. The key enhancement of the 1.6 JVM was the introduction of native DTrace probes. The 1.7 JDK is an update packed with performance-related enhancements, including an improved adaptive compiler, optimized rapid memory allocation, finely tuned garbage collector algorithms and, finally, lighter thread synchronization for better scalability. For this article we used JDK 7 build 38.



Software and Hardware summary

This study uses Solaris 10 Update 6 (released October 31st, 2008), Java 1.7 build 38 (released October 23rd, 2008), SLAMD 1.8.2, iGenOLTP v4.2 for MySQL and MySQL 5.1.29. The hardware tested is a T5440 with 4x UltraSPARC T2 Plus 1.2Ghz and 64 GB of RAM. A Sun Blade 8000 with 10 blades, each with 2x AMD Opteron 8220 2.8Ghz and 8GB RAM, is used as the client system. Finally, a Sun ST6140 storage array [with 10x146GB 15k RPM drives] is configured in RAID-1 [2 HS], with two physical volumes, and connected to the T5440 with two 4Gb/s controllers.










Scaling vertically first

This is a matter of methodology. The first step is to determine the peak throughput of a single MySQL instance with iGenOLTP using InnoDB, then to use approximately 75% of this throughput as the basis for the horizontal scalability test. Current ZFS and MySQL best practices guided the choice of all the tunables used [available upon request]. The test is done under stabilized load, with each simulation thread executing 10 transactions per second. Please find below the throughput and response time scalability curves:











Note that the peak throughput is 725 transactions per second, which corresponds to 4350 SQL statements per second. We are caching the entire 1 GB database. The only I/Os happening are due to the delete/insert statements, the MySQL log and the ZFS Intent Log. We will be using 75% of the peak workload simulation as the base workload per instance for the horizontal scalability exercise. Why 75%? Our preliminary tests showed that it was the best compromise to reach maximum multi-instance throughput.






Scaling horizontally

The next step was to increase the number of instances while proportionally increasing the database size (number of customer ids). We will have the same 600 TPS workload requested on each instance, but querying a different range within the global data set. The beauty of the setup is that we do not have to reinstall the MySQL binaries multiple times: we could just use soft links. The main thing to do was to configure 32 ZFS file systems on our ZFS pool and then to create and load the databases. This was easily automated with ksh scripts. Finally, we had to customize the Java workload to query all the database instances accurately…

Here are the results:











As you can see, we were able to reach a peak of more than 79,000 SQL queries per second on a single 4 RU server. The transaction throughput is still increasing after 28 instances, but this is the sweet spot for this benchmark on the T5440, as guided by the average transaction response time. At 28 instances, we observed less than 30ms average response time. However, at 32 instances, response times jumped to an average of 95ms.






The main trick to achieve horizontal scalability: optimize thread scheduling

Solaris uses the timeshare class as the default scheduling class. The scheduler always needs to make sure that thread priorities are adequately balanced. For this test, we are running thousands of threads executing this workload, and we can get critical CPU user time back by avoiding unnecessary work by the scheduler. To achieve this, we run the MySQL engines and the Java processes in the Fixed Priority class. This is achieved easily using the Solaris priocntl command.



Conclusion

As I mentioned in the introduction, an architecture shift is happening. Database sharding and application partitioning are the foundation of future information systems, as pioneered by companies like Facebook [see this interesting blog entry]. This article proves that Sun Microsystems servers with CoolThreads technology are an exceptional foundation for this change. And they will also considerably lower your Total Cost of Ownership, as illustrated in this customer success story.






A very special thank you to the following experts who helped in the process or reviewed this article : Huon Sok, Allan Packer, Phil Morris, Mark Mulligan, Linda Kateley, Kevin Figiel and Patrick Cyril.


See you next time in the wonderful world of benchmarking….





The Hare and the Tortoise [X6250 vs T6320] or [INTEL XEON E5410 vs SUN UltraSPARC-T2 ]

May 20, 2008

The Hare and The Tortoise



“To win a race the swiftness of a dart … Availeth not without a timely start”
 


The tree on yonder hill we spy [Sun Blade 6000 Modular Systems]

The Sun Blade 6000 chassis supports up to ten blades in ten rack units and is extremely popular due to its versatility. In fact, you can test your application today on four different chips within the same chassis (UltraSPARC-T1 [T6300], UltraSPARC-T2 [T6320], AMD Opteron dual-core [X6220] and INTEL Xeon dual-core and quad-core [X6250]). While the Opteron and T1 blades have performance characteristics that are well defined by now, I was really curious to see how the new T2 blade would perform compared to the Xeon quad-core.

A grain or two of hellebore [Chips & Systems]

In terms of chip details, the T2 and the Xeon diverge. The three key differences are the total number of strands [16 times more for the T2], the CPU frequency [1.66 times more for the Xeon] and the L2 cache size [3 times more for the Xeon].

This simple table illustrates their key characteristics:

Feature          INTEL Xeon E5410    SUN UltraSPARC-T2
Process          45 nm               65 nm
Transistors      820 million         500 million
Cores            4                   8
Strands/core     1                   8
Total #strands   4                   64
Frequency        2.33Ghz             1.4Ghz
L1 cache         16KB I. + 16KB D.   16KB I. + 8KB D.
L2 cache         12 MB               4 MB
Nominal Power    80 W                95 W


This table makes it clear that predicting the response time or throughput delta between these two chips is a risky endeavor!

Following the two pictures above [X6250 and T6320], here is our hardware list:

Role         Model   System clock   Sockets@freq   RAM
T2 blade     T6320   N/A            1@1.4Ghz       32 GB
Xeon blade   X6250   1333 Mhz       2@2.33Ghz      32 GB
Console      X4200   1000 Mhz       2@2.4Ghz       8 GB

I dare you to the wager still [Benchmarks]

I ran several benchmarks (including Oracle workloads) on all types of blades, but for the purpose of this article I will present only the two simple micro-benchmarks iGenCPU and iGenRAM.

The iGenCPU benchmark is a Java™-based CPU micro-benchmark used to compare the CPU performance of different systems. Based on a customized Java complex-number library, the code computes Benoit Mandelbrot’s highly dense fractal structure using integer and floating-point calculations (50%/50%). The simplicity of the code as well as its non-recursive nature allows very scalable behavior using less than 128 KB of memory per thread. The exact throughput in number of fractals per second and the average response times are reported and coalesced for each scalability level.

The iGenRAM benchmark is based on the California lotto requirements. The main purpose of this workload is to measure multi-threaded memory allocation and multi-threaded memory searches in Java. The first step of the benchmark is for each thread to allocate 512 Megabytes of memory in 3-dimensional integer arrays. The second step is to search through this memory to determine the winning tickets. The exact throughput in lotto tickets per millisecond as well as the average allocation and search times are reported and coalesced for each scalability level.
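Here is a minimal, hypothetical sketch of what one iGenRAM-style worker does: allocate 512 MB as a 3-dimensional int array, then scan it for matches. It illustrates the two measured phases only; the array shape and the "winning ticket" test are my own assumptions, not the real benchmark code.

// Hypothetical iGenRAM-style worker: times a 512 MB allocation into a
// 3-dimensional int array, then a full search pass. Run with -Xmx1g or larger.
public class RamWorkerSketch {
    // 128 x 1024 x 1024 ints = 134,217,728 ints = 512 MB of integer data.
    static final int DIM1 = 128, DIM2 = 1024, DIM3 = 1024;
    static final int WINNING = 777;   // assumed "winning ticket" value

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int[][][] tickets = new int[DIM1][DIM2][DIM3];
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < DIM1; i++)
            for (int j = 0; j < DIM2; j++)
                for (int k = 0; k < DIM3; k++)
                    tickets[i][j][k] = rnd.nextInt(1000);
        long t1 = System.nanoTime();

        long winners = 0;
        for (int i = 0; i < DIM1; i++)
            for (int j = 0; j < DIM2; j++)
                for (int k = 0; k < DIM3; k++)
                    if (tickets[i][j][k] == WINNING) winners++;
        long t2 = System.nanoTime();

        System.out.printf("alloc+fill: %d ms, search: %d ms, winners: %d%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000, winners);
    }
}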

 For this test, we used
Solaris 10 Update 4 and Java version 1.6.1.

And list which way the zephyr blows [Results]

Here are the iGenCPU throughput & response time results:

Notes:

1- The Hare [X6250] starts very fast but gets tired at 8 threads and really slows down at 12 threads.
2- The Tortoise [T6320] reaches more than twice the throughput of the Hare at 60 threads.
3- The single-threaded average transaction response time is two times better on the Hare.

Now let’s look at the iGenRAM results:

Notes:

1- Phenomenal memory throughput of the Hare [X6250] at low thread counts. But in peak, the Tortoise [T6320] achieves 11% more throughput.
2- When the Hare is giving up (~7 threads), the Tortoise is just warming up, reaching its peak throughput at about 40 threads.
3- Single-threaded, it takes 9 ms to allocate 512 MB on the Hare and 33 ms to do the same thing on the Tortoise.
4- Single-threaded, it takes 5 ms to search through 512 MB on the Hare and 34 ms to do the same thing on the Tortoise.

Conclusion

The race is by the tortoise won.

Cries she, “My senses do I lack ?

What boots your boasted swiftness now ?

You’re beat ! and yet you must allow,

I bore my house upon my back.”


See you next time in the wonderful world of benchmarking….
Special thanks to Mr Jean De La Fontaine [1621-1695]
