Thursday, June 4, 2009

From Istanbul with LOVE? AMD Six-Core Opteron Review

source: anandtech.com
Despite the fact that the 45 nm quad-core Opteron was the best server CPU at launch, a few months later AMD’s success was washed away by a tsunami called “Nehalem”. The Nehalem architecture combined subtle tweaks to an already superior integer engine with brute force tactics such as a triple-channel integrated memory controller (IMC). The IMC delivered low latency and massive amounts of bandwidth thanks to the highest clocked DDR3 DIMMs. But that was not enough for the ambitious Intel engineers. They added Simultaneous Multithreading (SMT), and this was the final blow to any competition left standing in the server market. SMT, or Hyperthreading as Intel calls it, boosted performance by 30% and more in key applications such as SAP, Oracle and MS SQL Server. The end result is that the current Xeon outperforms AMD’s best CPUs by 60 to 85%! This is historic: Intel has never had such a commanding lead since AMD entered the market with its Athlon MP.

One could start debating some of the details of these benchmarks, but that would mostly be splitting hairs. Yes, these scores were obtained with DDR3-1333, while the vast majority of X55xx servers are equipped with DDR3-1066. And yes, power consumption of the fastest Xeons is about 20W higher per CPU than on the “Shanghai” Opterons. So in order to compare in the same power range, you should compare with the E5540 at 2.53 GHz. But even with DDR3-1066 and at 2.53 GHz, the latest Xeon would, roughly estimated, outperform the best AMD quad-cores by 40 to 70%. The lead is even higher in bandwidth-intensive applications. Only in the fairly rare dense matrix applications, with Linpack being the most popular benchmark, could AMD still make a point. AMD can deliver the same amount of Gigaflops at lower power consumption and a lower price. Nice, but we are talking about 1% of the applications on the market. The other ray of hope for AMD was the competitive performance that the Opteron 2389 2.9 GHz delivered on ESX 3.5 in our virtualized benchmark vApus Mark I. But with ESX 4.0, the new Xeon “Nehalem” should widen the gap again thanks to better Hyperthreading support and the fact that EPT is fully supported in the latest ESX hypervisor. AMD’s next generation CPU is scheduled to appear in 2012, so it looks like AMD will have to leave the high-end and midrange server CPU market to Intel. Unless…

Ever since the introduction of the 45 nm CPUs, AMD has been executing very well. So well, even, that it reminds us of the K75 times. You might remember how in October 1999, AMD introduced the “K75” in 250 nm and sped up the “x86-Alpha” to 1 GHz in March 2000, only 5 months later. It has indeed been 10 years since AMD has executed so well. Only six months after the successful launch of their 45 nm quad-core, AMD rolls out their hex-core “Istanbul” at 2.6 GHz, well ahead of schedule. It is basically a “Shanghai” Opteron with 2 extra cores and a slightly tweaked memory controller. What is more impressive, though, is that AMD is capable of launching a hex-core at 2.6 GHz today, a CPU that consumes only a few watts more than the six-month-older quad-core at 2.7 GHz. Well done, AMD. But should the IT professional care about AMD’s new six-core? In which applications does it make sense to consider an “Istanbul” based server? Are two extra cores enough to bring AMD’s Opteron back onto the spec sheet of your next high performance server?

Do Six Cores Make Sense?

The question is not theoretical. When Intel launched their hex-core “Dunnington”, quite a few applications did not make good use of it. The quad-socket “Istanbul”-based servers will face the same problems as “Dunnington”: some server applications prefer “2^n cores” (a power of two), a few will not scale above eight cores and many will not get past 16 very successfully. Yes, even in the server world, quite a few applications do not scale well beyond 8-16 cores. Mailservers, webservers and even some databases may be in that situation. If your database gets a lot of locks on the same data, locking contention will kill off your performance once you get beyond a certain number of cores. Rendering applications are another group that start to show diminishing returns with more than 8 cores. It is pretty likely that clustering dual-socket quad-cores makes more sense than adding more cores to the same machine.

But the six-core “Istanbul” CPU has advantages too. The Nehalem Xeon offers 8 logical cores, but the two threads on each core have to share the 32 KB L1 and the tiny 256 KB L2. Istanbul can work with “only” 6 threads, but each thread gets a 64 KB L1 and, in comparison, a copious 512 KB of L2. In a nutshell, it is clear that the new AMD “Istanbul” Opteron targets a specific market: a few compute-intensive HPC applications, large databases and, most importantly, “heavy” virtualized workloads. The reason why we say “heavy” is that the six-core is a drop-in replacement for the current quad-core Opterons. That means that the memory capacity of the servers based on the new six-core will probably be the same. If you are consolidating lots of light loads together, you are likely to run into memory limits before you run into processing power limits.



Istanbul's Improvements

The cores inside “Istanbul” are not different from those found in Shanghai. Istanbul introduces only a few improvements: HT assist, slightly higher HT speeds, APML and x8 ECC.

X8 ECC: Each DRAM chip on a DIMM provides either 4 bits or 8 bits of a 64-bit data word. Chips that provide 4 bits are called x4 (by 4), and chips that provide 8 bits are called x8 (by 8). It takes eight x8 chips or sixteen x4 chips to make a 64-bit word, so at least eight chips are located on one or both sides of a DIMM. Istanbul’s memory controller now supports error correction for both x4 and x8 DIMMs.

APML Remote Power Management Interface: APML provides an interface that allows you to monitor and control platform power consumption via P-state limits. You need to have a CPU and BMC (management processor) that support APML on the server and you need to have some type of software (OS or management software) that supports APML and allows you to monitor power and make changes to power management parameters. Both hardware and software are in development, so this won’t be available on the servers that will be launched this month. APML is interesting as it would allow you to cap power without going into the BIOS. AMD’s PowerCap Manager allows you to limit power to a certain amount by making sure the CPU’s clock never goes beyond a certain limit, effectively underclocking the CPU. This is very useful in a datacenter that is cooling or power limited. Of course, BIOS options are not that handy in a datacenter with hundreds of servers. That is where APML could make the difference.

Higher HT Speeds: The later versions of the “Shanghai” Opteron support HyperTransport 3.0 or HT3. HT3 allows much higher clockspeeds than the HyperTransport links that all the older Opterons have been using so far (1 GHz). The clockspeed was boosted to 2.2 GHz DDR, good for 8.8 GB/s in each direction. Istanbul pushes the clock of the HyperTransport up to 2.4 GHz DDR, good for 9.6 GB/s in each direction, or as fast as the QPI links which can be found on the slower “Nehalem” Xeons. Since the new Fiorano platform is not ready, we still have to test with an older NVIDIA MCP55 platform. But that does not matter; the CPU interconnect speed is handled by the CPUs, not the board or chipset. You can see this clearly in the BIOS screenshot.
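The link bandwidth figures quoted above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming a 16-bit wide HyperTransport link per direction (the link width is not stated in this article, and the function name is ours); "DDR" means two transfers per clock.

```python
def ht_bandwidth_gb_s(clock_ghz: float, link_width_bits: int = 16) -> float:
    """One-direction bandwidth of a HyperTransport link in GB/s.

    Assumption: a 16-bit link, which is not stated in the article.
    """
    transfers_per_clock = 2                  # DDR: data on both clock edges
    bytes_per_transfer = link_width_bits / 8
    return clock_ghz * transfers_per_clock * bytes_per_transfer


for clock in (2.2, 2.4):
    print(f"{clock} GHz DDR -> {ht_bandwidth_gb_s(clock):.1f} GB/s per direction")
```

With these assumptions the sketch gives 8.8 GB/s at 2.2 GHz and 9.6 GB/s at 2.4 GHz, matching the numbers in the text.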

The last improvement is HT Assist. We will discuss this feature in more detail.

HT Assist: Only for the Quad-Socket

HT Assist is a probe or snoop filter that AMD implemented. First, let us look at a quad Shanghai system. CPU 3 needs a cacheline which CPU 1 has access to. The most recent data, however, is in CPU 2’s L2 cache.

Start at CPU 3 and follow the sequence of operations:

1. CPU 3 requests information from CPU 1 (blue “data request” arrow in diagram)
2. CPU 1 broadcasts to see if another CPU has more recent data (three red “probe request” arrows in diagram)
3. CPU 3 sits idle while these probes are resolved (four red & white “probe response” arrows in diagram)
4. The requested data is sent from CPU 2 to CPU 3 (two blue and white “data response” arrows in diagram)

There are two serious problems with this broadcasting approach. Firstly, it wastes a lot of bandwidth, as 10 transactions are needed to perform a relatively simple action. Secondly, those 10 transactions add a lot of latency to the instruction on CPU 3 that needs the piece of data (which CPU 3 requested from CPU 1).

The solution is a directory-based system that AMD calls HT Assist. HT Assist reserves a 1 MB portion of each CPU’s L3 cache to act as a directory. This directory tracks where that CPU’s cache lines are used elsewhere in the system. In other words, the L3 caches are only 5 MB large, but a lot of probe or snoop traffic is eliminated. To understand this, look at the picture below:

Let us see what happens. Start again with CPU 3:

1. CPU 3 requests information from CPU 1 (blue line)
2. CPU 1 checks its L3 directory cache to locate the requested data (Fat red line)
3. The read from CPU 1’s L3 directory cache indicates that CPU 2 has the most recent copy and directly probes CPU 2 (Dark red line)
4. The requested data is sent from CPU 2 to CPU 3 (blue and white lines)

Instead of 10 transactions, we have only 4 this time. A considerable reduction in latency and wasted bandwidth is the result. Probe “broadcasting” can be eliminated in 8 of 11 typical CPU-to-CPU transactions. Stream measurements show that 4-way memory bandwidth improves by roughly 60%: 41.5 GB/s with HT Assist versus 25.5 GB/s without HT Assist.
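To make the comparison concrete, the sketch below tallies the arrows listed in the broadcast walk-through and compares them with the four transactions quoted for HT Assist, then works out the Stream gain from the two bandwidth figures. The dictionary and variable names are ours; the counts and bandwidths come straight from the text.

```python
# Arrow counts from the broadcast walk-through (steps 1 to 4)
broadcast_messages = {
    "data request": 1,
    "probe request": 3,
    "probe response": 4,
    "data response": 2,
}
broadcast_total = sum(broadcast_messages.values())
directory_total = 4  # total quoted in the text for the HT Assist case

saved = broadcast_total - directory_total
print(f"broadcast: {broadcast_total}, HT Assist: {directory_total}, saved: {saved}")

# Stream bandwidth in GB/s on the 4-way system, as quoted in the text
stream_without, stream_with = 25.5, 41.5
stream_gain = stream_with / stream_without - 1
print(f"Stream gain with HT Assist: {stream_gain:.0%}")
```

The ratio works out to about 63%, which the text rounds to 60%.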

But it must be clear that HT Assist is only useful in a quad-socket system, and of the utmost importance in octal CPU systems. In a dual system, a broadcast is the same as a unicast, as there is only one other CPU. HT Assist also lowers the hit rate of the L3 caches (5 MB instead of 6), so it should be disabled on 2P systems. If you look in the BIOS...

...you get 3 options next to probe filter: “auto”, “disabled” and “MP”. In automatic mode the probe filter or HT Assist will be turned off for 2P systems. You can force “HT assist” by setting “MP”, indicating there are more than 2 processors.



What Intel and AMD are Offering

Before we dive into the benchmarks, it is good to see how the vendors position their CPUs. First, a quick spec sheet of the most important AMD and Intel CPUs.

| Model | Speed (GHz) | Max. clock 4 cores busy (GHz) | L2 Cache | L3 Cache | Interconnect Bandwidth in One Direction |
|---|---|---|---|---|---|
| Intel Xeon X5570 | 2.93 | 3.2 | 4 x 256 KB | 8 MB | 12.3 GB/s |
| Intel Xeon X5560 | 2.80 | 3.066 | 4 x 256 KB | 8 MB | 12.3 GB/s |
| Intel Xeon X5550 | 2.66 | 2.93 | 4 x 256 KB | 8 MB | 12.3 GB/s |
| AMD Opteron 2435 | 2.6 | 2.6 | 6 x 512 KB | 6 MB | 9.8 GB/s |
| Intel Xeon E5540 | 2.53 | 2.66 | 4 x 256 KB | 8 MB | 11.7 GB/s |
| AMD Opteron 2431 | 2.4 | 2.4 | 6 x 512 KB | 6 MB | 8.8 GB/s |
| AMD Opteron 2389 | 2.9 | 2.9 | 4 x 512 KB | 6 MB | 8.8 GB/s |
| Intel Xeon E5530 | 2.4 | 2.53 | 4 x 256 KB | 8 MB | 11.7 GB/s |
| Intel Xeon E5430 | 2.66 | 2.66 | 2 x 6 MB | N/A | Via FSB |
| AMD Opteron 2427 | 2.2 | 2.2 | 6 x 512 KB | 6 MB | 8.8 GB/s |
| AMD Opteron 2384 | 2.7 | 2.7 | 4 x 512 KB | 6 MB | 4 GB/s |
| Intel Xeon E5520 | 2.26 | 2.33 | 4 x 256 KB | 8 MB | 11.7 GB/s |
| Intel Xeon E5506 | 2.13 | 2.13 | 4 x 256 KB | 4 MB | 9.8 GB/s |
| AMD Opteron 2378 | 2.4 | 2.4 | 4 x 512 KB | 6 MB | 4 GB/s |

What do you get for your money? AMD’s six-cores are the Opteron 2435, 2431 and 2427.

| Intel Xeon Model | Speed (GHz) / TDP | Price | AMD Opteron Model | Speed (GHz) / ACP-TDP | Price |
|---|---|---|---|---|---|
| X5570 | 2.93 / 95W | $1386 | | | |
| X5560 | 2.80 / 95W | $1172 | | | |
| X5550 | 2.66 / 95W | $958 | 2435 | 2.6 / 75-115W | $989 |
| E5540 | 2.53 / 80W | $744 | 2431 | 2.4 / 75-115W | $698 |
| | | | 2389 | 2.9 / 75-115W | $698 |
| E5530 | 2.4 / 80W | $530 | 2387 | 2.8 / 75-115W | $523 |
| L5520 | 2.26 / 60W | $530 | 2376 HE | 2.3 / 55-79W | $575 |
| L5510 | 2.13 / 60W | $423 | 2374 HE | 2.2 / 55-79W | $450 |
| E5520 | 2.26 | $373 | 2427 | 2.2 / 75-115W | $455 |
| E5506 | 2.13 | $266 | 2382 | 2.6 / 75-115W | $316 |
| E5504 | 2.00 | $224 | | | |
| E5502 | 1.86 | $188 | 2378 | 2.4 / 75-115W | $174 |

AMD has clearly recognized that it cannot beat the best Xeon X55xx when it comes to raw performance. The two top models, the X5570 and X5560, stay out of reach. AMD is basically saying that, with the right application, the new six-core Opteron should be able to keep up with equally clocked X55xx Xeons. In the case of the 2435, you get lower power consumption as a bonus. Notice also that the best quad-core Opterons have become significantly cheaper. The 2.9 GHz 2389 “Shanghai”, which used to be positioned against the 2.66 GHz X5550, is now competing with the E5540. The 2.9 GHz Shanghai is still no match for the 2.53 GHz Xeon E5540, but it is important to look at the complete server price. 32 GB of registered DDR3-1066 still costs about $1200, whereas 32 GB of DDR2-800 costs around $850. It is outside the scope of this article, but it is clear that even if the CPUs cost the same, the AMD-based server will be less costly. The Xeon X55xx is, after all, a very new platform.
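As a rough illustration of the platform-cost argument, the sketch below adds up two CPUs and 32 GB of memory for the E5540 and the Opteron 2389, using the list prices from the table and the memory prices quoted above. It deliberately ignores the motherboard, chassis, disks and everything else, so it is only a partial comparison; the function name is ours.

```python
def cpu_plus_memory_cost(cpu_price: int, sockets: int, memory_price: int) -> int:
    """CPU and memory cost only; all other server components are ignored."""
    return cpu_price * sockets + memory_price


xeon_e5540 = cpu_plus_memory_cost(cpu_price=744, sockets=2, memory_price=1200)
opteron_2389 = cpu_plus_memory_cost(cpu_price=698, sockets=2, memory_price=850)

print(f"Dual Xeon E5540 + 32 GB DDR3-1066:   ${xeon_e5540}")
print(f"Dual Opteron 2389 + 32 GB DDR2-800:  ${opteron_2389}")
print(f"Difference: ${xeon_e5540 - opteron_2389}")
```

Under these assumptions the memory price gap accounts for most of the difference, which is the point the paragraph above makes.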

For those who love stats, the die size and transistor count table:

| CPU | Transistor Count (Million) | Process | Die Size | Cores |
|---|---|---|---|---|
| Intel Dunnington (Xeon 74xx) | 1900 | 45 nm | 504 mm² | 6 |
| Intel Gainestown (Xeon 55xx) | 731 | 45 nm | 265 mm² | 4 |
| AMD Istanbul (Opteron 24xx) | 904 | 45 nm | 346 mm² | 6 |
| AMD Shanghai (Opteron >237x) | 705 | 45 nm | 263 mm² | 4 |
| AMD Barcelona (Opteron 23xx) | 463 | 65 nm | 283 mm² | 4 |
| Intel Tigerton (Xeon 73xx) | 2 x 291 = 582 | 65 nm | 2 x 143 mm² | 4 |
| Intel Harpertown (Xeon 54xx) | 2 x 410 = 820 | 45 nm | 2 x 107 mm² | 4 |

AMD’s Istanbul is quite a large chip, but not as expensive as “Barcelona” to produce. The champion is the Harpertown when it comes to the lowest production costs.



Our Benchmark Methods and Choices

As is traditional now with AMD CPU launches, we got very little time to perform our benchmarks. By the time we were running with the right BIOS and had figured out that our Adaptec RAID cards absolutely refused to work with this new BIOS, we had less than a week left to do our server benchmarks, which take at least a few hours per setup. So we had to make some choices. Without our Adaptec card, we had to cancel the most disk-intensive test we have used so far: the transactional DVD Store test. For all other tests, our four local SLC SSDs kept disk queues more than low enough.

Despite timing constraints, we tried to stay as faithful as we could to our new benchmark methodology. Remember that instead of throwing in every software box we happen to have on the shelf, we decided that the “buyers” should dictate our benchmark mix. Basically, every software type that is really important should have at least one and preferably two representatives in the benchmark suite. In the table below you can find an overview of the software types servers are bought for and the benchmarks you may expect in this review. We added the “relevance” column, as “Istanbul” only targets a part of this market. Very few people will buy a hex-core for print servers, domain controllers or mailservers.

| Server Software | Market Importance | Benchmarks Used | Relevance (Six-Core) |
|---|---|---|---|
| ERP, OLTP | 10-14% | SAP SD 2-tier (industry standard benchmark) | High, but not yet published |
| | | Oracle Charbench (freely available benchmark) | High |
| Reporting, OLAP | 10-17% | MS SQL Server (Real world + vApus) | Very high |
| Collaborative | 14-18% | MS Exchange Loadgen (TBD) | Medium |
| Software Dev. | 7% | Not yet | Medium |
| e-mail, DC, file/print | 32-37% | MS Exchange Loadgen (TBD) | Very low (not CPU intensive) |
| Web | 10-14% | MCS eFMS (Real world + vApus) | Low |
| HPC | 4-6% | TBD | Only specific dense matrix apps are relevant |
| Other | 2%? | 3ds Max (our own bench) | Medium |
| Virtualization | 33-50% | VMmark (industry standard), vApus Mark I | Very high |

Due to time constraints, we decided to postpone the Exchange and Linpack benchmarking. Their relevance for evaluating “Istanbul” is low anyway. SAP benchmarks were not available at the time that we wrote this.

Benchmark Configuration

None of our benchmarks required more than 20 GB. Database files were placed on a 3 drive RAID-0 Intel X25-E SLC 32 GB SSD, log files on one Intel X25-E SLC 32 GB.

Xeon Server 1: ASUS RS700-E6/RS4 barebone
Dual Intel Xeon "Gainestown" X5570 2.93GHz
ASUS Z8PS-D12-1U
6x4GB (24GB) ECC Registered DDR3-1333
NIC: Intel 82574L PCI-E Gbit LAN

Xeon Server 2: Intel "Stoakley platform" server
Dual Intel Xeon E5450 "Harpertown" at 3GHz
Supermicro X7DWE+/X7DWN+
24GB (12x2GB) Crucial Registered FB-DIMM DDR2-667 CL5 ECC
NIC: Dual Intel PRO/1000 Server NIC

Xeon Server 3: Intel "Bensley platform" server
Dual Intel Xeon X5365 "Clovertown" 3GHz
Dual Intel Xeon L5320 at 1.86GHz
Dual Intel Xeon 5080 "Dempsey" at 3.73GHz
Supermicro X7DBE+
24GB (12x2GB) Crucial Registered FB-DIMM DDR2-667 CL5 ECC
NIC: Dual Intel PRO/1000 Server NIC

Opteron Server: Supermicro SC828TQ-R1200LPB 2U Chassis
Dual AMD Opteron 2435 at 2.6GHz
Dual AMD Opteron 8384 at 2.7GHz
Dual AMD Opteron 2222 at 3.0GHz
Dual AMD Opteron 8356 at 2.3GHz
Supermicro H8QMi-2+
24GB (12x2GB) DDR2-800
NIC: Dual Intel PRO/1000 Server NIC

vApus/Oracle Calling Circle Client Configuration
Intel Core 2 Quad Q6600 2.4GHz
Foxconn P35AX-S
4GB (2x2GB) Kingston DDR2-667
NIC: Intel PRO/1000



OLTP benchmark: Oracle Charbench “Calling Circle”

Operating System: Windows 2008 Enterprise RTM (64 bit)
Software: Oracle 10g Release 2 (10.2) for 64 bit Windows
Benchmark software: Swingbench/Charbench 2.2
Database Size: 9 GB
Typical error margin: 2-2.5%

Calling Circle is an Oracle OLTP benchmark. We test with a database size of 9 GB. To reduce the pressure on our storage system, we increased the SGA size (Oracle buffer in RAM) to 10 GB, and the PGA size was set at 1.6 GB. A Calling Circle test consists of 83% selects, 7% inserts and 10% updates. The “Calling Circle” test is run for 10 minutes. A run is repeated 6 times and the results of the first run are discarded. The reason is that the disk queue length (DQL) is sometimes close to 1 during the first run, while the second and later runs have a DQL of 0.2 or lower. In this case, it was rather easy to run the CPUs at 99% load. Since DQLs were very similar, we could keep our results from the “Nehalem” article. All configurations use 2 sockets.

Oracle Calling Circle

The score of the 2.7 GHz Opteron 2384 tells us that a 2.6 GHz Opteron would score about 231. The two extra cores of the Opteron “Istanbul” 2435 thus add about 27% performance. That is less than what Hyperthreading adds to the score of the Xeon X5570, clearly demonstrating what a valuable weapon Hyperthreading is in these low-IPC database workloads. The Xeon X5570 performs 50% faster than AMD’s latest six-core. Even if we take into account that the Opteron 2435 competes with the 10% lower clocked X5550, it is clear that the Xeon X55xx series outperforms the best AMD CPUs by a large margin.



Decision Support benchmark: Nieuws.be

Operating System: Windows 2008 Enterprise RTM (64 bit)
Software: SQL Server 2008 Enterprise x64 (64 bit)
Benchmark software: vApus + realworld “Nieuws.be” Database
Database Size: > 100 GB
Typical error margin: 1-2%

The Nieuws.be site sits on top of a pretty large database: more than 100 GB and growing. This database consists of a few hundred separate tables, which have been carefully optimized by our lab (the Sizing Servers Lab). We have described our testing methods here in more detail. As some of our readers suggested, we upgraded from SQL Server 2005 SP3 to SQL Server 2008. This gave a boost of 29 to 38% (!) to the performance of our decision support database. All configurations use 2 sockets. Take a look at the MS SQL Server 2005 numbers:

Nieuws.be MS SQL Server 2005

And compare with the numbers on SQL Server 2008.

Nieuws.be MS SQL Server 2008

SQL Server 2008 is clearly better optimized for the complex queries that an OLAP database has to absorb.

Back to the hardware. In both cases, you can clearly see that this workload depends less on “uncore” factors (caches, memory bandwidth) than the OLTP test. The Opteron 2435 is a healthy 41% faster than the 2.7 GHz Opteron 2384. A 2.6 GHz quad-core Opteron would score about 385, which means that the scaling from 4 to 6 cores is excellent: 46%. That kind of scaling is very close to the theoretical maximum of 50%, but it is not enough to beat the newest Xeon, which sees its advantage shrink to a 16% lead. However, as the Opteron 2435 competes with the 2.66 GHz Xeon and not the 2.93 GHz Xeon, this is the first benchmark where “Istanbul” is competitive. That is in sharp contrast with the quad-core “Shanghai”, which does not stand a chance against the Xeon X55xx armada.
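The 46% figure follows from the measured 41% gain once the clock difference is taken out. Here is a minimal sketch of that normalization, assuming performance scales linearly with clockspeed (the same simplification the estimate above relies on); the function name is ours.

```python
def clock_normalized_scaling(speedup: float, old_clock_ghz: float,
                             new_clock_ghz: float) -> float:
    """Speedup attributable to the extra cores alone.

    Assumption: performance scales linearly with clockspeed.
    """
    return speedup * old_clock_ghz / new_clock_ghz


# Opteron 2435 (6 cores, 2.6 GHz) is 41% faster than the 2384 (4 cores, 2.7 GHz)
core_scaling = clock_normalized_scaling(1.41, old_clock_ghz=2.7, new_clock_ghz=2.6)
ideal_scaling = 6 / 4

print(f"Gain from two extra cores: {core_scaling - 1:.0%}")
print(f"Share of the ideal 50% gain: {(core_scaling - 1) / (ideal_scaling - 1):.0%}")
```

This reproduces the 46% in the text, or roughly 93% of what two extra cores could deliver in the ideal case.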



Website: MCS eFMS (Windows 2003 32 bit EE)

Operating System: Windows 2003 R2 – 32 bit
Software: MCS eFMS 9.2
Benchmark software: vApus + realworld “MCS” PHP site
Typical error margin: 1-2%

The modular MCS Enterprise Facility Management Software (MCS eFMS), developed by MCS, is one of the heavier web applications. We have described this application in more detail here. The objective of eFMS is to integrate the management of space usage (buildings), assets and equipment (such as furniture, beamers etc.), cabling infrastructure and others while keeping track of costs. MCS eFMS stores all information in a central Oracle database.

MCS eFMS integrates three key technologies: A web-based frontend that integrates CAD drawings and gets its information from a rather complex, ERP-like Oracle database. Building overview trees of all rooms available and their reservations in a certain building, drilling down using the CAD drawing to get more detail: MCS eFMS is one of the most demanding web applications we have encountered so far. MCS eFMS uses the following software:

- Microsoft IIS 6.0 (Windows 2003 Server Standard Edition R2)
- PHP 4.4.0
- FastCGI
- Oracle 9.2

The results are below:

MCS eFMS 9.2 website

When we profiled the benchmark, we noticed that the PHP website did not scale past 8 cores. So it is an inaccurate benchmark for any system with more than 8 cores, but it does show what happens in the real world. The results clearly demonstrate the issues we talked about in the introduction of this article: many server applications do not scale well beyond 8 or 16 cores. Remember, just 4 to 5 years ago, 8-core machines were very expensive. In less than 5 years we have gone from 2 cores to 12 cores in a server. It is only natural that in many cases software cannot use, or simply does not need, all that processing power. The six-core Opteron runs at only 60% CPU usage and is outperformed by its quad-core brother and of course the latest Xeon. Both the Dual Opteron 2435 and Dual Xeon X5570 (HT enabled) run at 50-60% usage. A single 8-thread Xeon X55xx is by far the best choice here.



Rendering: 3ds Max 2008

Operating System: Windows 2008 Enterprise RTM (64 bit)
Software: 3ds Max 2008
Benchmark software: Built-in timer
Typical error margin: 1-2%

We used the "architecture" scene which is included in the SPEC APC 3DS Max test. All tests were done with 3ds Max's default scanline renderer, with SSE enabled, and we rendered at HD 720p (1280x720) resolution. We measured the time it takes to render 10 frames (frames 20 to 29). We recorded the time and then calculated (3600 seconds * 10 frames / time recorded) how many frames a certain CPU configuration could render in one hour. All results are reported as rendered images per hour, so higher is better.
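The conversion from render time to frames per hour is simple enough to write down directly. In this sketch the 450-second render time is a made-up example value, not a measurement from this review.

```python
def frames_per_hour(render_seconds: float, frames: int = 10) -> float:
    """Frames per hour, given the time needed to render `frames` frames."""
    return 3600 * frames / render_seconds


example_seconds = 450.0  # hypothetical render time for the 10 frames
print(f"{frames_per_hour(example_seconds):.1f} frames per hour")
```

A configuration that needs 450 seconds for the 10 frames would thus be reported as 80 frames per hour.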

We used the 32-bit version of 3ds Max 2008 on 64-bit Windows 2008 RTM. The 64-bit version of 3ds Max was a bit slower (especially when we used the Scanline Renderer). All CPU configurations are dual, unless indicated otherwise.

3ds Max 2008 32 bit - architecture scene

We wrote “Some applications do not like core counts that are not a power of 2”. Well, here you have another example. The extra cores of AMD’s “Istanbul” are close to useless. The Xeon X55xx series outperforms the hex-core by almost 50%. The 3ds Max scanline renderer simply does not know what to do with the 12 cores presented. CPU usage went from 50 to 80%:

We are sure that there are probably more efficient render engines out there, but it is simply not a market the AMD six-core should cater to. Nehalem-based Xeons are simply way too powerful for this kind of application. Render engines scale almost perfectly with clockspeed. So if cost is your main concern, consider the Xeon E5520 at 2.26 GHz, the cheapest CPU that still supports HT. We will test this one soon, but we expect it to deliver 67 frames per hour, which is still more than 20% better than any Opteron.



Virtualization: To Be or Not to Be

Let there be no misunderstanding: how well a new Server CPU handles virtualization determines whether it is a wallflower or a blockbuster. Thanks to the superb feedback we got on our first attempt, we have continued to refine vApus Mark I. We are very happy that despite the insane timing, we managed to pull off both 4 VM and 8 VM on ESX 3.5 update 4 and ESX 4.0 (vSphere 4 build 164009). Since it is by far the most important market for the new six-core, we decided to spend most of our time and energy here.

We have two benchmarks for you: VMmark and vApus Mark I. VMmark – which we discussed in great detail here - tries to measure a typical consolidation workload: a combination of a light mailserver, database, fileserver, website with a somewhat heavier java application. One VM is just sitting idle, representative of workloads that have to be online but which perform very little work (for example, a domain controller). In short, VMmark goes for the scenario where you want to consolidate lots and lots of smaller apps on one physical server.

There are no official VMmark scores yet, but a backup slide of AMD’s presentation talked about 41% performance increase over the Opteron 2384, a “Shanghai” Opteron at 2.7 GHz. We can use that number to get a first, very unofficial idea where a dual six-core 2435 at 2.6 GHz will land. The best score for the quad-core Opteron is 11.28, so times 1.41 gives us 15.9.

VMware VMmark

A 2.6 GHz quad-core would, roughly estimated, get a score of 10.9, which means that adding two cores results in a performance increase of 46%. That is almost perfect scaling, and it underlines that it is not hard to make a virtualized server scale with more cores, as long as you have enough memory capacity. According to our sources at different OEMs, Hyperthreading is good for about 30%. So that means that, in contrast with our previous benchmarks, the approach of adding extra cores has paid off more than adding Hyperthreading. The Xeon X5570 reigns supreme, however, when it comes to VMmark. The best Xeon is still a very significant 50% faster than the best Opteron.
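The whole estimate chain fits in a few lines. The sketch below starts from the published 11.28 score and AMD's claimed 41% gain, and, like the text, assumes that VMmark scales linearly with clockspeed; none of these are measured six-core results.

```python
shanghai_2384_score = 11.28   # best quad-core Opteron VMmark score (2.7 GHz)
amd_claimed_gain = 0.41       # from AMD's backup slide

# Unofficial estimate for the dual six-core 2435 at 2.6 GHz
istanbul_estimate = shanghai_2384_score * (1 + amd_claimed_gain)

# Hypothetical quad-core at 2.6 GHz, assuming linear scaling with clock
quad_at_2_6_ghz = shanghai_2384_score * 2.6 / 2.7

core_scaling = istanbul_estimate / quad_at_2_6_ghz - 1

print(f"Estimated Opteron 2435 score:  {istanbul_estimate:.1f}")
print(f"Estimated 2.6 GHz quad-core:   {quad_at_2_6_ghz:.1f}")
print(f"Gain from two extra cores:     {core_scaling:.0%}")
```

This reproduces the 15.9, 10.9 and 46% quoted above.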



vApus Mark I: Performance-Critical Applications Virtualized

If you virtualized your datacenter a while ago, chances are that the light loads are already virtualized. What is next? Well, if you have been following the virtualization scene, you’ll know that the virtualization vendors are very actively promoting that you should virtualize your performance-critical applications. vSphere 4 allows you to use up to 8 vCPUs and up to 255 GB of RAM, XenServer 8 vCPUs and 32 GB RAM. Hyper-V is still lagging with only 4 vCPUs and a maximum of 16 CPUs (24 with the “Dunnington” hotfix) per host. But that will change in Hyper-V R2. The bottom line is that it is getting attractive to virtualize “heavy duty” applications too, if only to be able to migrate them (“Vmotion”, “Xenmotion”, “Live Migration”) or manage them more easily.

That is where vApus Mark I comes in: one OLAP database, one OLTP database and two heavy websites are combined in one tile. These are the kind of demanding applications that still got their own dedicated and natively running machine a year ago. vApus Mark I shows what will happen if you virtualize them. If you want to fully understand our benchmark methodology: vApus Mark I has been described in great detail here. We have changed only one thing compared to our previous benchmarking: we used large pages, as this is generally considered a best practice (with RVI, EPT). This increases performance by 4 to 5%.

Our other choices remain the same:

* RVI and EPT are enabled on all VMs if possible
* HT-Assist is off, unless indicated otherwise

vApus Mark I uses four VMs with four server applications:

- A SQL Server 2008 x64 database running on Windows 2008 64-bit, stress tested by our in-house developed vApus software.
- Two heavy-duty MCS eFMS portals running PHP, IIS on Windows 2003 R2, stress tested by our in-house developed vApus software.
- One OLTP database, based on the Oracle 10G Calling Circle benchmark of Dominic Giles.

The beauty is that vApus (stresstesting software developed by the Sizing Servers Lab) uses actions made by real people (as can be seen in logs) to stresstest the VMs, not some benchmarking algorithm. First we look at the results in ESX 3.5 Update 4, at the moment the most popular hypervisor.

Sizing Servers vAPUS Mark I  - ESX 3.5

If you just plug Istanbul into your virtualized server, you can't tell if you're running with a six-core or a quad-core. You might remember from our previous article that a 2.9 GHz 2389 scored 203. It is pretty disappointing that six cores at 2.6 GHz equal 4 cores at 2.9 GHz. What went wrong? By default, the VMware ESX 3.5 scheduler logically partitions the available cores into groups of four, called “cells”. The objective is to always schedule a VM on the same cell, thereby making sure that the VM stays in the same node and socket. This should make sure that the VM always uses local memory (instead of needing remote memory of another node) and, more importantly, that the caches stay “warm”. If you use the default cell size of 4 cores, one or more VMs will be split among two sockets, with lots of traffic going back and forth. Once we increase the cell size from 4 to 6 (see VMware’s knowledge base), the ugly duckling becomes a swan. The six-core Opteron keeps up with the best Xeons available!
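A toy model shows why a cell size of 4 hurts on a dual six-core. The sketch below assumes that cores are numbered contiguously per socket and that cells are simply consecutive groups of cores; the real ESX scheduler is more involved, so treat this as an illustration only (the function name is ours).

```python
def cells_spanning_sockets(sockets: int, cores_per_socket: int, cell_size: int):
    """Return the cells whose cores sit on more than one socket.

    Assumptions (not taken from VMware documentation): cores are numbered
    contiguously per socket, and cells are consecutive groups of cores.
    """
    total_cores = sockets * cores_per_socket
    spanning = []
    for start in range(0, total_cores, cell_size):
        cores = list(range(start, min(start + cell_size, total_cores)))
        socket_ids = {core // cores_per_socket for core in cores}
        if len(socket_ids) > 1:
            spanning.append(cores)
    return spanning


print(cells_spanning_sockets(sockets=2, cores_per_socket=6, cell_size=4))
print(cells_spanning_sockets(sockets=2, cores_per_socket=6, cell_size=6))
print(cells_spanning_sockets(sockets=2, cores_per_socket=4, cell_size=4))
```

In this model, a cell size of 4 on two six-core sockets leaves the middle cell (cores 4 to 7) straddling both sockets, while a cell size of 6, or quad-core CPUs with the default size, keeps every cell on one socket.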

The Xeon X55xx is however somewhat crippled in this case, as ESX 3.5 Update 4 does not support EPT and does not make optimal use of Hyperthreading. You can see from our measurements above that Hyperthreading improves the score by about 17%. According to our OEM sources, VMmark improves by up to 30% on ESX 4.0. This shows that ESX 4.0 makes better use of Hyperthreading. So let us see some ESX 4.0 numbers!

| Server System Based On | OLAP VM | Webportal VM2 | Webportal VM3 | OLTP VM |
|---|---|---|---|---|
| Reference | 175.3 | 45.8 | 45.8 | 155.3 |
| Dual Xeon X5570 2.93 | 103% | 50% | 51% | 95% |
| Dual Opteron 2435 2.6 | 91% | 43% | 43% | 90% |
| Dual Opteron 2377 2.3 | 82% | 36% | 35% | 53% |

Sizing Servers vAPUS Mark I  - ESX 4.0

The Nehalem-based Xeon moves forward, but does not make a huge jump. Performance of the six-core Opteron decreased by 2%, which is inside the error margin of this benchmark. It is still an excellent result for the latest Opteron: this result means it will have no trouble competing with the 2.66 GHz Xeon X5550. VMmark tells us that the latest Xeon “Nehalem” starts to shine when you dump huge numbers of VMs on top of the server. So we decided to test with 8 VMs. It is very unlikely that you will consolidate more than 10 performance-critical applications on top of one physical server, so we feel that 8 VMs should tell the whole story. We changed only one thing: we decreased the amount of memory for the webportals from 4 to 2 GB, to make sure that the benchmark fits within the maximum of 24 GB that we had on the Xeon X5570. To keep things readable, we have averaged each pair of identical VMs (so OLAP VM = (OLAP VM1 + OLAP VM5)/2).

| Server System Based On | OLAP VM | Webportal VM2 | Webportal VM3 | OLTP VM |
|---|---|---|---|---|
| Reference | 175.3 | 45.8 | 45.8 | 155.3 |
| Dual Xeon X5570 2.93 | 79% | 34% | 32% | 47% |
| Dual Opteron 2435 2.6 | 71% | 23% | 23% | 38% |
| Dual Opteron 2377 2.3 | 76% | 19% | 19% | 28% |

vAPUS Mark I 2 tile test  - ESX 4.0

Notice that HT Assist is a performance killer in 2P configurations: you remove two times 1 MB of L3 cache, which is a bad idea with 8 VMs hitting your two CPUs. It is interesting to see that the Xeon X5570 starts to break away as we increase the number of VMs. The Xeon X5570 is about 30% faster than the Dual Opteron 2435. It gives us a clue why the VMmark scores are so extreme: the huge number of VMs might overemphasize world switch times, for example. But even with light loads, it is very rare to find more than 20 VMs on top of a DP (dual-processor) server.

There is more. In the 2-tile test the ESX scheduler has to divide 16 logical CPUs among 32 vCPUs on the Xeon. That is a lot easier than dividing 12 physical cores among 32 vCPUs, which might create coscheduling issues on the six-core Opteron.

So our 2-tile test was somewhat “biased” towards the Xeon X5570.
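The scheduling argument can be made concrete by counting how many vCPUs each schedulable CPU has to carry. This is a minimal sketch in Python: the function name is ours, and treating each of the Xeon's 16 logical CPUs as a full scheduling slot and assuming strict coscheduling of all 4 vCPUs of a VM are simplifications, not a description of how the ESX scheduler actually works.

```python
from fractions import Fraction

def vcpus_per_cpu(total_vcpus: int, schedulable_cpus: int) -> Fraction:
    """Overcommit ratio: vCPUs per CPU the scheduler can place work on."""
    return Fraction(total_vcpus, schedulable_cpus)

VCPUS_PER_VM = 4
total_vcpus = 8 * VCPUS_PER_VM  # 2-tile test: 8 VMs with 4 vCPUs each

xeon_ratio = vcpus_per_cpu(total_vcpus, 16)     # 2 sockets x 4 cores x 2 threads
opteron_ratio = vcpus_per_cpu(total_vcpus, 12)  # 2 sockets x 6 cores

for name, cpus, ratio in [("Xeon X5570", 16, xeon_ratio),
                          ("Opteron 2435", 12, opteron_ratio)]:
    # Under strict coscheduling, a 4-vCPU VM needs 4 free CPUs at once
    vms_at_once = cpus // VCPUS_PER_VM
    print(f"{name}: {float(ratio):.2f} vCPUs per CPU, "
          f"{vms_at_once} four-vCPU VMs can run at the same time")
```

With 16 logical CPUs the eight VMs split into two equal groups of four; with 12 cores only three VMs fit at once, so eight VMs cannot be divided evenly.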

We reduced the number of vCPUs on the webportal VMs from 4 to 2. That means that we have:

- Two times 4 vCPUs for the OLAP test
- Two times 4 vCPUs for the OLTP test
- Four times 2 vCPUs for the webportal test

Or a total of 24 vCPUs. This test is thus biased towards the “Istanbul” processor, as 24 vCPUs divide evenly over its 12 cores. Remember that our reference score was based on a 4 CPU “native” score. So we adjusted the reference score of the webportals to one that was obtained with 2 native CPUs. The reference scores for the OLTP and OLAP tests remained unchanged. The results below are not comparable with the ones you have seen so far; this is an experiment to understand our scores better. To keep things readable, we have averaged each pair of identical VMs (so OLAP VM = (OLAP VM1 + OLAP VM5)/2).
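The vCPU tally is easy to check. The sketch below assumes the 2-tile layout of two OLAP VMs, two OLTP VMs and four webportal VMs (two per tile), which is the layout that yields the stated total of 24. The variable names are ours.

```python
# (workload, number of VMs, vCPUs per VM) for the 24-vCPU variant of the 2-tile test
vm_layout = [
    ("OLAP", 2, 4),
    ("OLTP", 2, 4),
    ("webportal", 4, 2),  # reduced from 4 to 2 vCPUs per VM
]

total_vcpus = sum(count * vcpus for _, count, vcpus in vm_layout)
print(f"Total vCPUs: {total_vcpus}")

# Compare the load per schedulable CPU on both platforms
for name, cpus in [("Opteron 2435 (12 cores)", 12),
                   ("Xeon X5570 (16 logical CPUs)", 16)]:
    print(f"{name}: {total_vcpus / cpus:.2f} vCPUs per CPU")
```

On the Opteron this works out to exactly two vCPUs per core, which is why this configuration suits the six-core design.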

Server System Based On | OLAP VM | Webportal VM2 | Webportal VM3 | OLTP VM
Reference | 175.3 | 45.8 | 45.8 | 155.3
Dual Xeon X5570 2.93 | 82% | 53% | 53% | 43%
Dual Opteron 2435 2.6 | 81% | 38% | 38% | 44%

vApus Mark I 2 tile test - 24 vCPUs - ESX 4.0

The result is that the Xeon Nehalem is once again only 11% faster. So keep in mind that the relation between the number of vCPUs and the cell size matters a great deal when you are dealing with MP virtual machines. We expect that the number of VMs with more than one vCPU will increase as time goes by.
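This section does not spell out how the per-VM percentages are combined into a single figure. As a hedged illustration, one aggregation that lands close to the 11% quoted here is a geometric mean over the three workload classes, with the two webportal VMs averaged first. That choice is our assumption, not a confirmed description of the vApus Mark I scoring. The numbers come from the 24-vCPU table above.

```python
from math import prod

# Percent-of-reference scores from the 24-vCPU table
scores = {
    "Xeon X5570":   {"OLAP": 82, "Webportal VM2": 53, "Webportal VM3": 53, "OLTP": 43},
    "Opteron 2435": {"OLAP": 81, "Webportal VM2": 38, "Webportal VM3": 38, "OLTP": 44},
}

def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    return prod(values) ** (1 / len(values))

def by_workload(row):
    """Collapse the two webportal VMs into one workload class (our assumption)."""
    web = (row["Webportal VM2"] + row["Webportal VM3"]) / 2
    return [row["OLAP"], web, row["OLTP"]]

xeon = geomean(by_workload(scores["Xeon X5570"]))
opteron = geomean(by_workload(scores["Opteron 2435"]))
lead = xeon / opteron - 1
print(f"Xeon lead over Opteron: {lead:.1%}")
```

The per-VM numbers show where the gap comes from: OLAP and OLTP are practically tied, and the whole lead is built up in the webportal VMs.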



Power Consumption

Our power consumption data is preliminary; we still have to double-check all of it. Very roughly, we find that the Opteron 2435 machine consumes about 35-45W less than the Xeon X5570. On a total of slightly more than 300W, that is about 10 to 15%. Idle power seems to be slightly in favor of the Xeon “Nehalem”. We’ll update this data in our next article.

Market Analysis

As always, we base our analysis on what the servers are bought for. There are quite a few fields that we have not covered in this article, but with the exception of the ERP benchmarks, those markets are hardly relevant. HT assist might improve bandwidth in a quad-socket configuration, but must be disabled in a 2P configuration. As a result the six-core has less bandwidth per core, which means that most HPC applications will not perform better. The infrastructure market is looking for as much memory as possible, not for more processing power with the same amount of memory.

So there is only one piece really missing from the puzzle: the ERP results. The SAP benchmarks are not that hard to predict: the six-core Opteron will probably improve the SAP score by 25 to 35% over a 2.7 GHz quad-core Opteron 2389. This will not threaten the dominant position of the Nehalem Xeons, which are up to 81% faster than the latter.

Server Software Market | Importance | Benchmarks Used | Effect of 2 extra cores (Istanbul vs Shanghai) | Intel Xeon X5570 2.93 vs Opteron 2435 2.6
ERP, OLTP | 10-14% | SAP SD 2-tier (industry standard benchmark) | Not known yet | Not known yet
ERP, OLTP | 10-14% | Oracle Charbench (freely available benchmark) | +27% | 50%
Reporting, OLAP | 10-17% | MS SQL Server (real-world vApus benchmark) | +46% | 16%
Collaborative / email, DC, file/print | 14-18% / 32-37% | MS Exchange Loadgen (Microsoft's own load generator for MS Exchange) | Unknown / Unknown | Unknown / Unknown
Software Dev. | 7% | None | Unknown | Unknown
Web | 10-14% | MCS eFMS (real-world vApus benchmark) | -3% | 14%
HPC | 4-6% | LS-DYNA (industry standard) | Unknown | Unknown
Other | 2%? | 3DSMax (our own bench) | +5% | 50%
Virtualization | 50% | VMmark on ESX 4.0 (industry standard) | +41% | +/- 51%
Virtualization | 50% | vApus Mark I on ESX 3.5 | +37% | 0.7%
Virtualization | 50% | vApus Mark I on ESX 4 | +35% | 11-30%

The OLTP market is also firmly in Intel's grasp. Things look better in our website benchmark, until you remember that a single Xeon X5570 performs just as well as a dual six-core Opteron. That leaves two markets: decision support databases and servers bought for virtualization. But that last one is incredibly important…



Conclusion

The six-core Opteron is not an alternative to the mighty Xeons in every application. The Xeons are more versatile thanks to their higher clock speeds, higher IPC, Hyperthreading and higher memory bandwidth. The Xeon 55xx series is clearly the better choice in OLTP, ERP, web serving and rendering, and there is little doubt that it will continue to reign in the bandwidth-intensive HPC workloads. There are two types of applications where we feel that the AMD six-core deserves your attention: decision support databases and virtualization.

Since the launch of ESX 3.5, VMware has said more than once that performance-critical applications such as OLTP and Decision Support Databases will perform well on top of their hypervisor. Several enhancements make the newly launched vSphere 4 an even more attractive platform for such "heavy duty" applications. Hyper-V R2 and Xen 3.4 are clearly gearing up for the same task. So it is interesting that companies are now looking into virtualizing those performance-critical applications, the applications that still got their own dedicated server a few months ago. The motivation is that virtualizing these applications would allow the complete datacenter to be managed with the same flexibility as the light, already consolidated, applications. VMotion (Xenmotion, Live Migration) can then for example be used to migrate these applications faster and much more easily.

Of course, performance-critical applications are by definition more demanding when it comes to processing power. That is exactly what vApus Mark I measures: how well do performance-critical applications perform when they are virtualized? This is a relatively “new” market where the Opteron 2435 shines. The new Opteron 2435 at 2.6 GHz was a pleasant surprise on vApus Mark I: it keeps up with more expensive Xeons on ESX 3.5 update 4 while consuming less power, and offers competitive performance/watt and performance/price ratios on vSphere 4. The six-core Opteron is about 11 to 30% slower on vSphere 4 than the 2.93 GHz Xeon X5570, but the overall cost of the Istanbul platform is significantly lower (DDR-2 versus DDR-3) and the 2.6 GHz 2435 consumes less power in a virtualized environment (*). On the condition that you tune your hypervisor to take advantage of the six cores (cell size, for example, is one critical optimization), we feel that the six-core Opteron is a worthy opponent for the Xeon “Nehalem” in this market. We tested only the 2435 against the X55xx series; the Xeon E5540 2.53 GHz versus the Opteron 2431 2.4 GHz may show a slightly different picture. Since the six-core Opteron and the Xeon are both very competitive in this area, factors other than performance, price and power might decide the matter. There is no clear winner in this part of the market, but the big news is of course that AMD offers a worthy alternative.

VMmark tells us that the Xeon X55xx handles large numbers of VMs much better. With “light VMs”, the amount of memory you can place in a server in many cases plays a more important role than the CPU. In that case you might be better off with a low-power quad-core instead of a six-core or high-clocked quad-core.

Lastly, the six-core Opteron will be a formidable competitor in the 4P market segment. But that is for a later article.

(*) Virtualized servers do not run idle very often.

A big thanks to Tijl Deneut for sacrificing his weekend to keep testing and checking together with me. Anand and Liz helped to get this article online, thanks!