ONE STOP IT News, Rumour & Review: Intel

Thursday, May 6, 2010

Intel Moorestown and Atom Z600, The Fastest Smartphone Platform

by: Anand Lal Shimpi
When I wrote my first article on Intel's Atom architecture I called it The Journey Begins. I did so because while Atom has made a nice home in netbooks over the years, it was Intel's smartphone aspirations that would make or break the product. And the version of Atom that was suitable for smartphone use was two years away.

Time sure does fly. Today Intel is finally unveiling its first Atom processors for smartphones and tablets. Welcome to Moorestown.

Craig & Paul’s Excellent Adventure

Six years ago Intel’s management canned a project called Tejas. It was destined to be another multi-GHz screamer, but concerns over power consumption kept it from coming to fruition. Intel instead focused on its new Core architecture that eventually led to the CPUs we know and love today (Nehalem, Lynnfield, Arrandale, Gulftown, etc...).

When a project gets cancelled, it wreaks havoc on the design team. They live and breathe that architecture for years of their lives. To not see it through to fruition is depressing. But Intel’s teams are usually resilient, as is evidenced by another team that worked on a canceled T-project.

The Tejas team in, er, Texas was quickly tasked with coming up with the exact opposite of the chip they had just worked on: an extremely low power core for use in some sort of a mobile device (it actually started as a low power core as a part of a many core x86 CPU, but the many core project got moved elsewhere before the end of 2004). A small group of engineers were first asked to find out whether or not Intel could reuse any existing architectures in the design of this ultra low power mobile CPU. The answer quickly came back as a no and work began on what was known as the Bonnell core.

No one knew what the Bonnell core would be used in, just that it was going to be portable. Remember this was 2004 and back then the smartphone revolution was far from taking over. Intel’s management felt that people were either going to carry around some sort of mobile internet device or an evolution of the smartphone. Given the somewhat conflicting design goals of those two devices, the design team in Austin had to focus on only one for the first implementation of the Bonnell core.

In 2005, Intel directed the team to go after mobile internet devices first. The smartphone version would follow. Many would argue that it was the wrong choice; after all, when was the last time you bought a MID? Hindsight is 20/20 and back then the future wasn’t so clear. Not to mention that shooting for a mobile extension of the PC was a far safer bet for a PC microprocessor company than going after the smartphone space. Add in the fact that Intel already had a smartphone application processor division (XScale) at the time and going the MID route made a lot of sense.

The team had to make an ultra low power chip for use in handheld PCs by 2008. The power target? Roughly 0.5W.

Climbing Bonnell

An existing design wouldn’t suffice, so the Austin team led by Belli Kuttanna (former Sun and Motorola chip designer) started with the most basic of architectures: a single-issue, in-order core. The team iterated from there, increasing performance and power consumption until their internal targets were met.

In order architectures, as you may remember, have to execute instructions in the order they’re decoded. This works fine for low latency math operations, but instructions that need data from memory will stall the pipeline and severely reduce performance. It’s like not being able to drive around a stopped car. Out of order architectures let you schedule around memory dependent operations so you can mask some of the latency to memory and generally improve performance. Regardless of the order in which you execute instructions, they must all complete in the program’s intended order. Dealing with this complexity costs additional die area and power. It’s worth it in the long run as we’ve seen. All Intel CPUs since the Pentium Pro have been wide (3 - 4 issue), out of order cores, but they also have had much higher power budgets.
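
To make the stopped-car analogy concrete, here is a toy model. It is not a description of Bonnell or any real pipeline: it assumes a single-issue machine, made-up latencies, and instructions described only by a name, a latency and the results they depend on. The in-order variant may only issue the oldest waiting instruction, while the out-of-order variant may issue the oldest instruction whose inputs are ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int       # cycles until the result is available
    deps: tuple = ()   # names of instructions whose results are needed


def run(program, out_of_order):
    """Return the cycle at which the last result becomes available.

    Toy single-issue model: at most one instruction issues per cycle.
    In-order: only the oldest un-issued instruction may issue.
    Out-of-order: the oldest instruction with ready inputs may issue.
    """
    done_at = {}                 # name -> cycle its result is ready
    pending = list(program)
    cycle = 0
    while pending:
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            if all(d in done_at and done_at[d] <= cycle for d in ins.deps):
                done_at[ins.name] = cycle + ins.latency
                pending.remove(ins)
                break            # one issue per cycle
        cycle += 1
    return max(done_at.values())


program = [
    Instr("load", 10),             # hypothetical 10-cycle memory access
    Instr("use", 1, ("load",)),    # needs the loaded value
    Instr("add1", 1),              # independent work
    Instr("add2", 1),              # independent work
]

print("in-order:    ", run(program, out_of_order=False), "cycles")
print("out-of-order:", run(program, out_of_order=True), "cycles")
```

In this tiny program the in-order machine sits behind the load and only gets to the two independent adds afterwards, while the out-of-order machine slips them in while the load is outstanding. The gap grows with memory latency and with the amount of independent work available.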

As I mentioned in my original Atom article in 2008, Intel was committed to using in order cores for this family for the next 5 years. It’s safe to assume that at some point, when transistor geometries get small enough, we’ll see Intel revisit this fundamental architectural decision. In fact, ARM has already gone out of order with its Cortex A9 CPU.

The Bonnell design was the first to implement Intel’s 2 for 1 rule. Any feature included in the core had to increase performance by 2% for every 1% increase in power consumption. That design philosophy has since been embraced by the entire company. Nehalem was the first to implement the 2 for 1 rule on the desktop.
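
As a rough illustration, the rule boils down to a ratio test. The function name and the handling of features that cost no extra power are my own assumptions; Intel didn't describe how it treats those cases.

```python
def passes_two_for_one(perf_gain_pct, power_cost_pct):
    """True if a feature buys at least 2% performance per 1% of added power.

    Assumption: a feature that adds no power (or saves some) passes as long
    as it doesn't reduce performance.
    """
    if power_cost_pct <= 0:
        return perf_gain_pct >= 0
    return perf_gain_pct / power_cost_pct >= 2.0


print(passes_two_for_one(4, 2))   # 4% faster for 2% more power: meets the bar
print(passes_two_for_one(3, 2))   # 3% faster for 2% more power: rejected
```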

What emerged was a dual issue, in-order architecture. The first of its kind from Intel since the original Pentium microprocessor. Intel has learned a great deal since 1993, so reinventing the Pentium came with some obvious enhancements.

The easiest was SMT, or as most know it: Hyper Threading. Five years ago we were still arguing about the merits of single vs. dual core processors; today virtually all workloads are at least somewhat multithreaded. SMT vastly improves efficiency if you have multithreaded code, so Hyper Threading was a definite shoo-in.

Other enhancements include Safe Instruction Recognition (SIR) and macro-op execution. SIR allows conditional out of order execution depending on whether the right group of instructions appears. Macro-op execution, on the other hand, fuses x86 instructions that perform related ops (e.g. load-op-store, load-op-execute) so they go down the pipeline together rather than independently. This increases the effective width of the machine and improves performance (as well as power efficiency).

Features like hardware prefetchers are present in Bonnell but absent from the original Pentium. And the caches are highly power optimized.

Bonnell refers to the core itself, but when paired with an L2 cache and FSB interface it became Silverthorne - the CPU in the original Atom. For more detail on the Atom architecture be sure to look at my original article.

The World Changes, MIDs Ahead of Their Time

Silverthorne lacked integration, which wasn’t a problem for MIDs and netbooks, but it kept the chip out of smartphones. Between 2004 and Atom’s introduction in 2008 the iPhone happened. All of a sudden the clunky MIDs we were reluctantly waiting for stopped being interesting. What we wanted were more iPhones, and iPhone clones. Then came Android and the rest is history. While Atom had tremendous success in netbooks, Intel’s decision to pursue a discrete route first kept it out of smartphones.

Luckily, next on the list after the first Atom was a more integrated one with the goal of dropping power consumption. We saw this with Pine Trail, the netbook Atom that brought the memory controller and GPU on-die. Performance didn’t improve because unlike most integrated memory controllers, this one still connected to the CPU via a FSB.

Intel Atom "Diamondville" Platform 2008	Intel Atom "Pine Trail" Platform 2009-2010

Pine Trail still has all of the bells and whistles of a PC platform however. Take the PCI bus for example. Every 12 microseconds it wakes up and polls every IO on the platform. That kills idle battery life, especially when you’ve got a tiny smartphone battery. Pine Trail is useless for smartphones, and that’s where Moorestown comes in.
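
A quick back-of-the-envelope calculation shows why that polling is so damaging. The sketch below uses the 12 microsecond interval quoted above and the roughly 1ms S0i1 exit time Intel quotes for Moorestown later in this article. The function names are my own and purely illustrative.

```python
def wakeups_per_second(poll_interval_us):
    """How often per second a bus that polls every poll_interval_us wakes the platform."""
    return 1_000_000 / poll_interval_us


def idle_state_usable(poll_interval_us, exit_latency_us):
    """An idle state only pays off if the platform can stay in it
    for longer than it takes to get back out."""
    return poll_interval_us > exit_latency_us


print(round(wakeups_per_second(12)))   # tens of thousands of wakeups every second
print(idle_state_usable(12, 1000))     # a 1 ms exit never fits in a 12 us gap
```

With a wakeup every 12 microseconds there is simply no window long enough to enter a deep idle state, let alone benefit from one.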

If you thought this was the netbook Atom squeezed into a smartphone, you’re very wrong. It’s got a completely different memory controller, a true smartphone GPU (the same core, but clocked higher than what’s in the iPhone 3GS) and a ton of power optimizations that just don’t exist in the netbook version. The chipset is also very different. The PCI bus is gone as is anything that could ruin power consumption. Intel did a lot of optimization and a lot of cutting here. What resulted is something that looks a lot like a smartphone hardware platform and nothing like what we’re used to seeing from Intel.

This is Moorestown.

Moorestown: The Two Chip Solution That Uses Five Chips

Intel calls Moorestown a two-chip solution. That’s the Lincroft SoC and the Langwell IO Hub. Intel says no architectural limitation forced the split; it was just a way of minimizing risk. You put the bulk of the 3rd party technologies in the Langwell IO Hub and keep the important, mostly Intel controlled components in Lincroft. This is still the first SoC that Intel is going to market with, so splitting the design into two chips makes sense. The follow-on to Moorestown, codenamed Medfield, will integrate these two once Intel is comfortable.

The 45nm, 140M transistor Lincroft die

Lincroft houses the CPU, GPU and memory controller and is built on Intel’s 45nm process. This isn’t the same 45nm process used in other Intel CPUs, instead it’s a special low power version that trades 6 - 8% performance for a 60% reduction in leakage. The tradeoff makes sense since the bulk of these chips will run at or below 1.5GHz. And by the way, it’s now called the Atom Z600 series.

Transistor Comparison
	Intel Atom Z5xx Series	Intel Atom Z6xx Series	NVIDIA Tegra 2
Manufacturing Process	45nm	45nm	40nm
Transistor Count	47M	140M	260M*

*Tegra 2 is a single chip solution, Intel hasn't provided specs for Langwell

Langwell, now known as the Intel Platform Controller Hub (PCH) MP20, holds virtually everything else. It’s got an image processing core that supports two cameras (1 x 5MP and 1 x VGA), USB 2.0 controller, HDMI output (1080p) and a NAND controller that can support speeds of up to 80MB/s. The whole chip is managed by a 32-bit RISC core.

Langwell is a 65nm chip built at TSMC. TSMC has existing relationships with all of the IP providers for the blocks inside Langwell, so making it at TSMC is a sensible move (a temporary one though, with Medfield Intel will integrate all of this).

Langwell (left) and Lincroft (right)

While Lincr, err, Atom Z600 and the Intel PCH MP20 are enough for a traditional system, they are not enough for a smartphone. You need wireless radios, that’s one chip for WiFi and one for 3G support. You need something to handle things like power management, charging the battery and controlling the touch screen. That’s an additional chip, called Briertown.

We’re up to four chips at this point, but you need at least one more. While modern day smartphone SoCs ship with on-package memory, Intel doesn’t yet support that. Obviously it’s not impossible to do, Marvell, TI, Qualcomm and Samsung do it with all of their SoCs. Look inside Apple’s iPad and you won’t see any DRAM chips, just a Samsung part number on the application processor package. Intel doesn’t have the same experience in building SoCs and definitely not in integrating memory so it’s not a surprise we don’t have that with Moorestown. Unfortunately this means a smartphone manufacturer will need as many as five discrete chips to support Moorestown.

Platform Size
	Moorestown
CPU + Chipset	387 mm²
Total Platform Area	4200 mm²
SoC Package Size	13.8 mm x 13.8 mm x 1.0 mm
PCH Package Size	14 mm x 14 mm x 1.33 mm

And now we know why Intel has been showing off its extremely long form factor prototype all this time:

Aava to the Rescue: An iPhone Sized Moorestown Platform

Aava Mobile is a smartphone platform manufacturer. It does for smartphones what Pegatron (formerly ASUS) does for notebooks. Aava builds the motherboard and chassis, while the customer adds customization, software and apps.

Aava showed us its Moorestown platform which is about the size of an iPhone 3GS, but a bit narrower and thinner (although longer):

Aava’s reference platform has a 3.7” 800 x 480 OLED display (or an optional 3.8” 864 x 480 TFT display). It weighs 125g, offers 285 hours of standby battery life, 8.5 hours GSM talk time, 5.4 hours of 3G talk time and 5.2 hours of web browsing time using its 1500 mAh battery. Up to 16GB of NAND flash is supported on board.

Aava Mobile Moorestown Reference Platform
	Specifications
Dimensions with Battery	118 mm x 56 mm x 11 mm
Weight	125g
Standby Battery Life	285 hours
GSM Talk Time	8.5 hours
3G Talk Time	5.4 hours
Web Browsing Battery Life	5.2 hours
Battery Capacity	1500 mAh
Display	3.7" OLED 800 x 480 or 3.8" 864 x 480 TFT
Multitouch	Capacitive
Storage	up to 16GB NAND, micro SD card
Camera	5MP or 8MP main, 2MP secondary
Wireless	WiFi, Bluetooth 2.1 + EDR

It’s got a capacitive multi-touch display and supports AGPS, digital compass, accelerometer, proximity sensor, 5MP or 8MP main camera (driven by a separate image processor), 2MP secondary camera, LED flash, FM RDS radio, stereo speaker, stereo mic, stereo headset with answer button and Bluetooth 2.1 + EDR.

It’s a pretty full featured reference platform that would allow companies to deliver a pretty powerful iPhone competitor. As for the OS...

Moblin/MeeGo: The Fastest Smartphone OS?

PC game developers often criticize Intel for holding back the whole industry by not shipping faster integrated graphics. Game developers have to target the least common denominator of graphics hardware, which happens to be Intel’s integrated graphics. So nearly all PC games suffer as a result.

Moorestown is a good bit faster than any ARM based SoC on the market today. Memory bandwidth limitations aside, if you look at our recent Apple A4 vs. Atom performance comparison you’ll see what sort of gap exists between what you get today in a smartphone and what Intel is trying to deliver:

Unfortunately for Intel, all smartphone OSes are optimized for the least common denominator in SoC performance. That is 400MHz - 1GHz ARM11 or Cortex A8 class hardware. Smartphone OS vendors need to make sure their OSes run on the majority of hardware, which just isn’t Moorestown. Intel needs something to take advantage of its added performance, so Intel had to go off and do some software work. Irony is hilarious.

Moorestown is useless if it doesn’t offer significantly better performance or user experience (or both) than its competitors. To ensure this, Intel did two things.

First, Intel bought a company called Wind River. Wind River was a $400M company prior to acquisition; Intel snagged it back in July of 2009. Their mission statement? To take open source software and make it commercially viable.

Whether it’s stress testing or adding new features, Wind River takes open source software and improves it to the point where you can now sell it as a commercial product. This is similar to what Apple did with the base of much of OS X. You take some good open source projects and pay people to polish and harden the last 10 - 20% of them.

Wind River has a platform for Android. It incorporates Atom optimizations into Android, hardens the software stack and prepares it for use in Moorestown devices. Google has little incentive to dedicate a lot of support to Moorestown, so Intel had to internalize that.

The second thing Intel did to ensure Moorestown’s performance wouldn’t go to waste was the development of Moblin. A smartphone/tablet targeted Linux based OS, Moblin has been lurking in our minds for well over a year now. I never really got why Intel felt the need to support the development of a mobile OS until now.

Moblin running on Moorestown

Moblin will be the highest performance OS for Moorestown to run on top of. Until a company like Apple or Google decides to embrace Moorestown, Intel needed a way to guarantee an optimized software stack for Moorestown. Moblin is that guarantee. It’s designed from the ground up to be Atom optimized, it’ll be faster than any other OS running on Moorestown and will also do the best integration of power management for Moorestown. Intel knows the architectures of its chips best, and Moblin effectively knows whatever Intel knows.

A Moorestown specific OS could also evolve to include more CPU intensive UIs and features that just wouldn’t work well on the majority of ARM devices out there, which would in turn give Moorestown a tangible feature advantage in the smartphone market.

Earlier this year Intel and Nokia announced their cooperative efforts on an OS called MeeGo. Take one part Moblin and one part Maemo and you get MeeGo. The idea is to take Moblin and expand it to more platforms (particularly ARM based devices). Moblin will eventually go away and there will only be MeeGo, however there are currently smartphones and tablets based on both Moblin and MeeGo in development.

While Moblin and MeeGo are the best platforms for Moorestown, there’s a lot of reinventing the wheel that needs to be done. Thus the first Moorestown based smartphones will likely run Android.

The Neutral Role

Carriers aren’t very happy with Apple and Google. They’ve effectively wrestled power away from the carriers and left them as nothing but network providers. In my eyes this isn’t a bad thing. Over the past several years the major carriers have shown us nothing other than they can’t be trusted with too much power. Where there is frustration, there’s money to be made.

Intel wants to capitalize on that frustration by offering the carriers an alternative. Moblin won’t be branded, carriers could customize their own builds and do whatever they want with them. The carriers would ultimately limit what could run on their phones, much like Apple does today. It puts power back in the hands of the carrier, which is something they obviously like.

Whether or not that’s a good thing for the consumer is another question entirely. Intel tells me that the carriers have learned a lot from watching Apple and Google, and that they have no interest in making the same mistakes twice. I’m not sure I believe that just yet.

More OS Support if Needed

Intel made it clear that while it’s only focusing on Android, Moblin and MeeGo at the start, if a vendor were to express interest in doing a custom design around Moorestown the answer wouldn’t be no. In other words, if Apple wanted to move iPhone OS to Moorestown, Intel would make it happen.

Intel also mentioned that Moblin is an enabling necessity for Moorestown. If that need ever goes away, it has no issues handing the market over to Apple, Google or whoever wants to carry the torch. Intel doesn’t want to be in the mobile OS business, it’s simply participating because it is compelled to in order to build the best environment for Moorestown to succeed. If Intel’s plan works out, then all smartphones would eventually use some Moorestown derivative and they would be optimized for a much higher performance CPU right off the bat. We’re not there today, so Moblin has a role to play.

There's also the question of Windows 7 support. Without a PCI bus, Moorestown can't run the popular desktop OS. However if Intel were to deliver a version of Moorestown with PCI support, that could solve that problem...

Intel Takes a Stand: No Windows Phone 7 Support

Apparently someone at Microsoft must’ve peed in Intel’s Cheerios because Moorestown won’t be found in any Windows Phone 7 devices. According to Intel it’s more than just a spat over breakfast: Intel claims that Windows Phone 7 is still optimized for very low end ARM SoCs. Intel went on to say that despite the advances in the OS, Windows Phone 7 isn’t progressing fast enough from an architecture standpoint and that it is an “old OS with many of the warts we’re trying to get away from”. Apparently Windows Phone 8 falls into the same category and it too will not be supported by Moorestown.

Windows Phone 7, Not Supported by Moorestown

The same goes for Symbian (obviously).

Intel says that these OSes aren’t on a steep enough roadmap to make the Moorestown investment. It’s difficult to say whether or not Intel is right; we’ll have to wait to see how Windows Phone 7 scales with performance once the first devices hit later this year.

x86 Everywhere: Two Years Later

In my original Atom architecture article I spoke about the benefits of having a platform that could run existing applications, in this case x86 applications. Developers don’t like porting to new hardware, which is one reason GPU computing hasn’t really taken off yet.

Since then we’ve seen a major change: the introduction of platform specific App stores. Starting with the iPhone App store and extending to most smartphone platforms (Android Marketplace, Palm App Store), these stores gave developers a simple way to sell their apps, and we’ve seen a completely new group of developers emerge specifically targeting smartphones. These aren’t your traditional developers. Companies like Adobe and Microsoft are effectively absent from any of the app stores. Instead what you find are smaller development houses putting forward smaller but very useful applications and games for use on these smartphones.

The scariest part for Intel is that none of these apps run on x86 hardware. While there are still more x86 applications than iPhone or Android apps, there are more smartphone friendly apps running on ARM architectures than x86. The advantage of being able to run existing code without lengthy port times just isn’t an advantage today. In fact, you could consider the move to x86 a disadvantage from the perspective of a company like Apple or Google. While it’d be simple to offer x86 versions of apps through a closed store system, it means extra work for the developer and for Apple with little benefit today. By aiming at the netbook first, Intel may have squandered one of its major potential advantages in the smartphone.

All isn’t lost however. There’s still the argument that the applications and algorithms that have yet to be moved to smartphones still exist in x86 form. As smartphones grow more powerful, so will the types of things we try to do on them.

The Memory Controller: 32-bit LPDDR1

The Lincroft SoC (or Atom Z600) measures 13.8 mm x 13.8 mm x 1.0 mm. That’s smartphone SoC sized. In order to hit the small package size and in order to keep power consumption down, the single channel DDR2 memory controller from the netbook Atom is gone. What we have instead is a 32-bit wide LPDDR1 memory bus capable of supporting up to 1GB of memory. At 400MHz that’s about the amount of memory bandwidth we had on PCs 10 years ago.
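
The arithmetic behind that comparison is simple. This sketch assumes the 400MHz figure is the effective transfer rate (LPDDR1-400) and, for the PC of ten years ago, a single 64-bit DDR-200 (PC1600) channel. Both readings are my assumptions rather than anything Intel stated.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


lincroft = peak_bandwidth_gb_s(32, 400)   # 32-bit LPDDR1 at 400 MT/s
pc1600 = peak_bandwidth_gb_s(64, 200)     # 64-bit DDR-200, circa 2000
print(f"Lincroft: {lincroft:.1f} GB/s, PC1600: {pc1600:.1f} GB/s")
```

Both work out to 1.6GB/s of theoretical peak bandwidth: half the bus width at twice the transfer rate.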

Intel claims that the majority of workloads on smartphones are compute and not memory bandwidth bound so the reduction in memory bandwidth isn’t going to be an issue. Lincroft's caches are the same as Silverthorne before it (24/32KB L1 + 512KB L2).

Compared to smartphone SoCs today, Intel isn’t really outgunned:

2010 Application Processor Comparison
	Memory Interface
Apple A4	32-bit LPDDR1/LPDDR2 (?)
Intel Atom Z600	32-bit LPDDR1
TI OMAP 3430	32-bit LPDDR1
TI OMAP 4430	2 x 32-bit LPDDR2
NVIDIA Tegra 2	32-bit LPDDR2
Qualcomm Snapdragon QSD8250	32-bit LPDDR1

It’s only next year, when products based on TI’s OMAP 4430 chip arrive, that we’ll see a real ramp in memory bandwidth. Intel will offer a version of Lincroft for tablets with a 32-bit DDR2-800 interface. It can support a maximum of 2GB of memory.

The Lincroft memory controller has less bandwidth than the netbook version, but it's more efficient as a result. Intel included a lot of optimizations, particularly for graphics to improve bandwidth utilization.

Clock Speeds: 1.2GHz - 1.5GHz for Smartphones, 1.9GHz for Tablets

Intel isn't announcing individual Atom Z600 SKUs just yet, but we do know that all versions of the chip will support Hyper Threading (likely to maintain a performance advantage compared to upcoming dual-core ARM offerings). There will be two versions of the Atom Z600 chips, one for smartphones and one for tablets.

The smartphone SKUs will run between 1.2GHz and 1.5GHz, while the tablet version of the Z600 will run at up to 1.9GHz.

Power Management: Clock Down or Turbo Up

Eleven years ago Intel demoed a technology it called Geyserville for mobile CPUs. The technology simply ran the CPU at a lower frequency when running on battery power and a higher frequency when plugged in. Intel eventually called this SpeedStep.

Four years later we got EIST, Enhanced Intel SpeedStep Technology. This allowed a mobile (and eventually desktop) CPU to run at any frequency depending on the performance demanded by the OS and the running applications.

On today’s Atom processors this usually means the chip will run as low as 600MHz when idle and at 133MHz increments all the way up to 1.66GHz under load. You don’t normally drop below 600MHz because that falls into the inefficient range of CPU performance scaling for a netbook/nettop. In a smartphone though, the majority of time your CPU isn’t being used. The SoC and accessory processors have enough custom logic to offload a lot, even when your phone isn’t idle.

Lincroft, or the Atom Z600 series, supports even lower frequency modes. The CPU can clock itself down well below 600MHz.

When you need performance however Lincroft has something similar to Turbo Boost on Intel’s desktop CPUs. On the Atom Z600 series it’s called Burst Mode and unlike Turbo, it is more tightly integrated with the OS.

EIST and other dynamic clocking technologies rely on OS P-states to determine what frequency the chip should run at. If an OS requests P0, the CPU simply runs at its highest frequency.

On the Core i5 and i7, if the OS requests the CPU be in P0, then as long as the chip doesn’t violate any current or TDP limitations it will run at a higher turbo frequency instead of the default maximum clock speed the OS is requesting. P0 will always return the highest possible frequency given the thermal conditions of the chip.

The Atom Z600 doesn’t work like this. All potential burst mode frequencies are enumerated as P-states by the BIOS. An OS with proper support for Moorestown will be able to request any specific clock frequency, even burst frequencies. Loading a web page for example might result in the OS asking for the highest possible burst mode frequency, but while you’re reading the page the OS might request a slower P-state. The chip will run at whatever the OS requests, but it will exit burst mode if the chip’s temperature gets too high.
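
Here is a minimal sketch of that behavior. The P-state table, the temperature limit and the fallback policy (dropping to the fastest non-burst state) are all hypothetical; Intel hasn't published per-state frequencies or said what the chip falls back to when it leaves burst mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    mhz: int
    burst: bool


# Hypothetical table; the frequencies are placeholders, not Intel figures.
P_STATES = [
    PState(1500, burst=True),
    PState(1200, burst=False),
    PState(600, burst=False),
]


def granted_state(requested, temp_c, burst_limit_c=85):
    """Return the P-state the chip actually runs at.

    The OS picks the state, burst states included. The hardware only
    overrides the request when a burst state is asked for while too hot.
    """
    if requested.burst and temp_c >= burst_limit_c:
        # Assumed fallback: the fastest state that isn't a burst state.
        return max((p for p in P_STATES if not p.burst), key=lambda p: p.mhz)
    return requested


print(granted_state(P_STATES[0], temp_c=60).mhz)   # cool chip: burst request honored
print(granted_state(P_STATES[0], temp_c=90).mhz)   # hot chip: burst request denied
```

The point of the design is visible in the function signature: the requested state comes from the OS, not from a guess made inside the chip.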

The FSB speed also scales with clock frequency. Once you reach a certain clock speed threshold, the Atom Z600 will automatically double its FSB frequency to help feed the CPU faster. The goal isn’t just to deliver peak performance but also to complete tasks faster so that the SoC can return to an idle state as soon as possible. The hurry up and go idle approach to mobile CPU performance has been one of Intel’s basic tenets for well over a decade now. And it does work. This is the reason we’ve generally seen an increase in battery life from each subsequent version of the Centrino platform.
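
The hurry up and go idle argument is easy to check with a little arithmetic. In the sketch below only the idle figure comes from Intel (the 21 - 23mW platform idle number quoted later in this article). The two operating points and the 10 second window are made-up numbers chosen purely to illustrate the tradeoff.

```python
def window_energy_j(active_w, active_s, idle_w, window_s):
    """Energy used over a fixed window: busy for active_s, idle for the rest."""
    if active_s > window_s:
        raise ValueError("task does not fit in the window")
    return active_w * active_s + idle_w * (window_s - active_s)


IDLE_W = 0.022    # Intel's quoted platform idle power, roughly
WINDOW_S = 10.0   # hypothetical: one burst of work every 10 seconds

# Hypothetical operating points: the fast one draws more power but finishes sooner.
fast = window_energy_j(active_w=1.2, active_s=1.0, idle_w=IDLE_W, window_s=WINDOW_S)
slow = window_energy_j(active_w=0.8, active_s=2.0, idle_w=IDLE_W, window_s=WINDOW_S)
print(f"fast: {fast:.2f} J, slow: {slow:.2f} J")
```

With these numbers the faster operating point uses less total energy, because the extra time spent in a very low power idle state more than pays for the higher active power. The conclusion flips if running faster costs too much extra power, which is exactly why the tuning matters.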

The software management of burst mode puts more emphasis on the OS and platform vendors to properly tune their devices for the best balance of performance/power consumption. You can see why Wind River’s Android platform and Moblin are necessary to get the most out of Moorestown.

Power Gating

Until the Atom Z600 series, the only Intel CPUs to power gate were the Nehalem/Westmere derived chips. In Moorestown, everything is both power and clock gated.

The CPU itself has its usual power states; C0 implies full power, full performance, and C6 is a deep sleep state where power is shut off to the entire CPU and state is saved in a small amount of active SRAM. There’s finer grained clock and power gating in Lincroft than in Intel’s Core i7.

Moorestown is a SoC platform however, so we need some new power states. Intel calls them S0i1 and S0i3. As with CPU power states, the higher the number, the more that’s shut off.

Virtually all blocks in Lincroft and Langwell are clock and power gated. In S0i1, everything from the CPU and GPU to interconnects are power gated. From the sounds of it, S0i1 is where you’d find your smartphone if you just left it on the table for a few seconds. The display would shut off and all internal components would be power gated. Pick it back up, hit a button and you get pretty quick recovery.

S0i3 however completely powers down virtually all components and keeps a small amount of SRAM active with state data. This is the phone locked and in your pocket state.

Getting out of these power states is relatively quick. S0i3 takes around 3ms while S0i1 takes 1ms.
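
Those exit times suggest a simple selection policy: use the deepest state you can afford to wake up from. The sketch below is my own illustration of that idea using Intel's two latency figures. Intel hasn't said how the OS actually chooses between the states.

```python
# Exit latencies from Intel's figures above, ordered shallowest to deepest.
EXIT_LATENCY_MS = {"S0i1": 1.0, "S0i3": 3.0}


def deepest_idle_state(max_wake_latency_ms):
    """Pick the deepest state that can wake within the latency budget,
    or None if the platform should simply stay on."""
    choice = None
    for state, latency in EXIT_LATENCY_MS.items():
        if latency <= max_wake_latency_ms:
            choice = state
    return choice


print(deepest_idle_state(2))   # only S0i1 wakes fast enough
print(deepest_idle_state(5))   # S0i3 is affordable
```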

The impact of these idle states is huge. On a reference Moorestown platform (this includes the Moorestown chips, display, 3G radio, basically a fully functional phone), Intel measured total platform power in S0i3 at 21 - 23mW. That’s a ~50x reduction compared to Menlow, and that’s what makes Moorestown suitable for use in a smartphone.

OS Driven Power Management

When Intel introduced Nehalem and the Core i7, we saw a new generation of power engineering in microprocessors. In the past, the OS would request a particular performance state from the CPU and the chip would respond by changing its clock speed. Nehalem’s Power Control Unit (PCU) instead dedicated enough transistors to build a 486 to monitoring the power and performance demands on the chip. Based on those demands and what the OS was doing, the PCU would power up or down individual cores, as well as move clock speeds up or down. The PCU would guess at what the OS was trying to do and respond accordingly.

Nehalem and its successors were massive chips, eating up to 130W of power under load and idling down in the 6 - 10W range. Lincroft has to be sub-1W under load and 6mW at idle. With even more stringent power demands and a much smaller die, Intel couldn’t blow a sizeable percentage of the Lincroft transistor budget on power management.

Instead of guessing at what the OS wants, the Moorestown platform uses OS Driven Power Management (OSPM) to tell Lincroft and Langwell what to do. OSPM is supported in Moblin and presumably the Wind River build of Android.

The OSPM process tells the hardware what apps it’s running and to shut down what it doesn’t need. There are well defined operating modes - standby, internet browsing, MP3 playback, video playback, voice call, video capture, etc... Based on the profile, the hardware doesn’t have to guess at what it should turn off/on, it just does it right away.

The OSPM driver communicates directly with the two power management units in Moorestown - one in Lincroft and one in Langwell. It instructs those PMUs to shut off various blocks, and in turn they tell Briertown to gate and cut voltage to the parts of the chip that aren’t needed.
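
Conceptually this is a lookup table rather than a prediction engine. The sketch below shows the idea with the operating modes Intel listed. The block names and which blocks each mode needs are my own guesses for illustration, not Intel's actual OSPM tables.

```python
# Hypothetical block list; the real Lincroft/Langwell partitioning is finer grained.
ALL_BLOCKS = frozenset(
    {"cpu", "gpu", "video_decoder", "audio", "display", "camera_isp", "modem"}
)

# Operating modes from Intel's list; the blocks assigned to each are assumptions.
PROFILES = {
    "standby": frozenset(),
    "mp3_playback": frozenset({"audio"}),
    "video_playback": frozenset({"video_decoder", "audio", "display"}),
    "internet_browsing": frozenset({"cpu", "gpu", "display", "modem"}),
    "voice_call": frozenset({"audio", "modem"}),
    "video_capture": frozenset({"cpu", "camera_isp", "display"}),
}


def blocks_to_gate(mode):
    """Everything the current operating mode doesn't need gets shut off."""
    return sorted(ALL_BLOCKS - PROFILES[mode])


print(blocks_to_gate("mp3_playback"))
```

Because the OS declares the mode up front, the hardware can gate everything else immediately instead of waiting to observe that a block has gone quiet.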

I wondered if this couldn’t be done in hardware, but it seems that given current die constraints and the sort of accuracy of information it needs, Intel must implement at least some of the power management control in software. Toolkits will be available for developers to control the OSPM.

Putting Power in Perspective: Estimated Battery Life of a Moorestown Phone

I wanted to get Moorestown hardware in time for the launch but unfortunately nothing is quite ready yet, so we’ll have to rely on Intel’s data.

As I just mentioned, Intel expects a Moorestown phone to idle at 21 - 23mW. Paired with a 1500mAh battery that’s 10 days of standby time. Intel claims that Snapdragon phones idle at 25mW. If that’s true then Moorestown is competitive.

Audio playback is expected to consume around 120mW of power (for the entire platform, not just the silicon). Intel estimates that’ll get you around 48 hours of continuous music playback. Intel was quick to add that this is better audio playback battery life than anyone else on the market today, although both TI and NVIDIA are promising better battery life than that with their next-generation SoCs (OMAP 4430 and Tegra 2).

Moorestown Battery Life (Figures by Intel)
	Total Phone Power Consumption
Idle	21 - 23 mW
Audio Playback	120 mW
1080p Video Playback	1.1W+
Web Browsing (WiFi)	1.1W
2G Phone Call	550 mW
3G Phone Call	1.2W

Intel’s video playback estimates are lower than the competition: Moorestown is expected to provide only 5 hours of continuous HD video playback compared to 10 hours on an iPhone 3GS. That comes from 1.1W+ platform power consumption during video playback.

Intel estimates that Moorestown based devices will last about 5 hours when browsing the web on WiFi. Talk times are expected in the 4 - 5 hour range over 3G, and 8 - 10 hours on 2G.
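
These battery life figures follow from simple energy arithmetic. Here is a minimal sketch, assuming a nominal cell voltage of 3.7V for the 1500mAh battery (a typical Li-ion value, not something Intel stated) and a constant power draw in each scenario:

```python
# Rough battery life estimate from platform power draw.
# ASSUMPTION: nominal cell voltage of 3.7 V (typical Li-ion, not from Intel's data).
NOMINAL_VOLTAGE_V = 3.7


def battery_hours(capacity_mah, power_mw, voltage_v=NOMINAL_VOLTAGE_V):
    """Hours of runtime for a battery of capacity_mah at a constant power_mw draw."""
    energy_mwh = capacity_mah * voltage_v  # mAh * V = mWh
    return energy_mwh / power_mw


# Platform power figures quoted by Intel, in milliwatts.
intel_figures_mw = {
    "Idle": 22,  # midpoint of the 21 - 23 mW range
    "Audio playback": 120,
    "1080p video playback": 1100,
    "Web browsing (WiFi)": 1100,
    "2G phone call": 550,
    "3G phone call": 1200,
}

for scenario, power_mw in intel_figures_mw.items():
    hours = battery_hours(1500, power_mw)
    print(f"{scenario}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions idle works out to a little over 10 days and video playback to roughly 5 hours, which lines up with Intel's estimates. Audio playback comes out slightly below the quoted 48 hours, so Intel may have assumed a somewhat different battery energy.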

If these numbers hold true in shipping Moorestown devices, I’d expect to see anywhere from iPhone to iPhone 3GS levels of battery life. Audio decoding seems good, while other aspects like video playback aren’t so great. Web browsing power consumption really varies based on the test. I measured power consumption on my iPhone 3GS and saw 1.1 - 1.3W while loading the AnandTech front page. That would imply Moorestown platform battery life could be competitive.

As soon as I can get my hands on some actual hardware I plan on verifying all of this data myself. Intel claims that the top 5 handset manufacturers see power consumption in the 750mW - 1.5W range, so Moorestown should find itself right in the middle of all of them.

The Intel GMA 600 by Imagination Technologies

The iPhone 3GS, iPad, Motorola DROID and Palm Pre all use Imagination Technologies’ PowerVR SGX mobile GPU. The SGX 535 running at 200MHz was used in Poulsbo, the North Bridge used in the very first Atom MID platform (Menlow). That was a 130nm chip. Intel called it the GMA 500.

Moving the GPU core on-die shrunk it considerably. At 45nm it should occupy roughly 1/8 - 1/10 the space of the GPU at 130nm. The PowerVR SGX 535 in Lincroft can also run at up to 400MHz, although it’s up to the handset vendors themselves to pick the right balance of clock speed vs. power consumption. It’s also possible that different versions of the Atom Z6xx line will have different GPU clocks. The new GPU is called the Intel GMA 600.
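
The 1/8 - 1/10 figure is about what ideal process scaling predicts. A quick sketch, assuming area shrinks with the square of the feature size (real designs rarely scale this cleanly, so treat it as a ballpark only):

```python
def ideal_area_ratio(old_nm, new_nm):
    """Area of the shrunk block relative to the original, assuming ideal scaling."""
    return (new_nm / old_nm) ** 2


ratio = ideal_area_ratio(130, 45)
print(f"45nm area is about {ratio:.3f} of the 130nm area (roughly 1/{1 / ratio:.1f})")
```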

To the best of my knowledge all current smartphone implementations of the PowerVR SGX 535 run at 200MHz. This should give Intel the leg up in graphics performance should a vendor choose to run the GPU at such a high clock rate. It’s difficult to tell what impact we’ll see on battery life.

The Display

Lincroft only supports two display interfaces: 1024 x 600 over MIPI (lower power display interface) or 1366 x 768 over LVDS (for tablets/smartbooks/netbooks). 1080p HDMI out is supported by Langwell.

Video Decoding Support: H.264 High Profile at up to 20Mbps

Imagination Technologies is also on tap to produce the video decoding hardware used in Lincroft. The PowerVR VXD is also used in the iPhone 3GS and the iPad, and it’s here in Moorestown as well.

The implementation in Moorestown, combined with Intel’s caches and memory controller, can apparently support 1080p H.264 base, main and high profile content at up to 20Mbps. At 1.1W platform power during video playback, that’s pretty impressive.

Video encoding is supported for the first time, also using ImgTec IP (PowerVR VXE). You get up to 720p30 H.264 base profile L3 video encode with Moorestown. You won’t see 1080p encode support until Medfield.

CPU Performance: Moorestown Rocks?

In my iPad review I pointed out the huge gap between the performance of today’s 1GHz smartphone SoCs and an Atom powered netbook.

It’s impossible to estimate the performance of Moorestown without functional hardware, but we can assume that it’s somewhere in between the ARM based SoCs and the netbook in the chart below.

Intel provided some SPECint numbers comparing Moorestown to various smartphone application processors shipping in 2010 (either now or later). Keep in mind that SPECint is just as much of a compiler benchmark as it is a hardware benchmark, so real world performance could very well differ. We won’t know how well Moorestown stacks up until we can evaluate it ourselves. But if Intel’s numbers are even remotely accurate, this is the sort of leap in performance we honestly need in the smartphone space:

The slowest member of the Atom Z600 series will run at 1.2GHz, the fastest (for smartphones) will run at 1.5GHz. While multithreaded performance on a dual Cortex A9 at 1GHz approaches that of a 1.2GHz Moorestown, nothing can touch the 1.5GHz part. Single threaded performance is just as impressive.

The Sunspider score is super impressive as well. Intel is posting a sub-2s Sunspider score; the best we've seen thus far on an ARM based platform is ~10 seconds on the iPad:

These numbers are from Intel so we have to take them with a grain of salt. And as I already mentioned, we’re looking at pure CPU/compiler performance here - real world application performance is a different story entirely. But it’s a safe bet to assume that Moorestown will at least be faster than any application processor on the market today.

A high clocked dual Cortex A9 could give it a healthy challenge though.

GPU Performance: Moorestown Rules?

Intel provided three numbers to instill confidence in Moorestown’s graphics capabilities. The first was a claim of over 100 fps running Quake 3. I saw this in person so I can confirm that you can actually run a timedemo of Quake 3 at above 100 fps on Moorestown. NVIDIA claims over 40 fps on the Tegra 2 at 720p with AA, but it’s unclear how comparable these numbers are.

The next two numbers are from 3DMark Mobile ES 2.0 using the Taiji and Hoverjet benchmarks:

This is comparing the performance of Moorestown to the lower clocked SGX 535 among other GPUs. The performance improvement is more than 2x.

Again, these came from Intel directly so we can’t vouch for their applicability to the real world.

Availability and Medfield

We got Menlow in 2008. Intel promised Moorestown in 2009/2010. The chips are done, but you won’t see products until the second half of this year. We’ve actually seen Moorestown reference designs at this point so it’s safe to say that we’ll see some devices before the end of the year, but perhaps the most exciting ones won’t appear until later.

In 2011 we’ll meet Medfield, a 32nm shrink of Moorestown that combines Lincroft and Langwell into a single SoC. Medfield will double graphics performance, triple imaging capability (higher MP cameras) and bring full HD encode/decode (Blu-ray on my phone?). A reduction in chip count will mean even smaller form factors, while the move to a single 32nm SoC (rather than 45nm + 65nm) should give us longer battery life for idle, video and web browsing. Things like talk time are more a function of the modem than anything else. When you’re on a call the majority of Intel’s components are almost completely powered down; it’s just the modem and its friends that are sipping power.

Medfield is apparently on track: it’ll be in production next year, and Intel told me not to expect any more updates on Medfield until the second half of 2010.

Final Words

This isn’t your netbook’s Atom. Thanks to an incredible amount of integration, power management and efficiency Moorestown has the potential to be the most exciting thing to hit the smartphone market since the iPhone. If Intel can deliver a platform that offers greater than 2x the performance of existing smartphones in the same power envelope it has a real chance of winning the market.

The problems are obvious. Intel is the underdog here: it has no foothold in the market and the established OSes are currently very ARM optimized. Not only does Intel have to get Moorestown off the ground, it also needs a win on the software side. MeeGo has to, er, go somewhere if this is going to work out. The one thing I will say is that the expected rarely pans out. The smartphone market in 5 years won't look like an extension of what we see today. Apple and Google dominating the market and running ARM processors is where we are today; I'm not convinced that's where we'll be tomorrow.

Intel's Sean Maloney, heir apparent to the CEO throne, said that to succeed Intel needs 3 out of the top 5 handset guys and a bunch of alternative players. He added, "we feel like we're in good shape for that." In the next 12 - 18 months we should see that come to fruition.

For me however it's more about software and design wins. Intel needs to be in an iPhone, or at least something equally emotionally captivating. It needs a halo product. I believe Intel has the right approach here with Moorestown. To be honest, I've seen the roadmap beyond it and it's very strong. The technology is there. We just need someone to put it to use and that's the part that isn't guaranteed.

Tuesday, March 30, 2010

AMD's Opteron 6174 12 cores "Magny-Cours"

source: anandtech.com
If the Westmere Xeon EP were a car engine, it would've been made by Porsche. With "only" six cores, each core in the new Xeon offers almost twice the performance of the competition. A 32nm CPU that occupies only 248 mm², the Westmere Xeon EP embodies pure refinement and intelligent performance, both Porsche traits. It's just made in Portland, not Zuffenhausen.

AMD's offering today is very different. Magny-Cours is the CPU version of the American muscle car. It's a brutally large 12-core CPU: two dies, each measuring 346 mm², connected by a massive 24-link HyperTransport pipe. AMD's Magny-Cours Opteron has almost two billion transistors and 19.6MB of cache on-die.

12 cores, 692 mm² die, 19.6MB of cache on-die

It's not all raw horsepower though. At 2.2GHz this 12-core monster is supposed to be content with only 80 precious watts, and 115W at most. HT assist also makes an appearance to keep CPU-CPU accesses to a necessary minimum, a problem that could get out of hand with 12 cores otherwise. AMD originally added HT assist with its first 6-core Opterons. So Magny-Cours is like a hybrid V12 Dodge Viper with traction control. Will this cocktail of raw core muscle and energy savings be enough to beat the competitor from Portland?

For once we could not resist the temptations of car analogies. As interesting as we found the Xeon Westmere EP, something was missing: a challenger, a competitor to make things more exciting. In the last review, we just knew that the Xeon X5670 would crush the competition. This time it is going to be close. AMD still won’t have a chance if your application does not scale well with extra cores. In that case you are better off with the higher clocked and better per-core performance of the Intel CPUs. But it is unclear if Intel will prevail in truly multi-threaded software now that a grim and determined AMD is willing to offer two CPUs for the price of one just to win the race.

Magny-Cours

You probably heard by now that the new Opteron 6100 is in fact two 6-core Istanbul CPUs bolted together. That is not too far from the truth if you look at the micro architecture: little has changed inside the core. It is the “uncore” that has changed significantly: the memory controller now supports DDR-1333, and a lot of time has been invested in keeping cache coherency traffic under control. The 1944-pin (!) organic Land Grid Array (LGA) Multi Chip Module (MCM) is pictured below.

The red lines are memory channels, the blue lines internal cache coherent HT connections. The gray lines are external cache coherent HT connections, while the green line is a simple non coherent I/O HT connection.

Each CPU has two DDR-3 channels (red lines). That is exactly the strongest point of this MCM: four fast memory channels that can use DDR-1333, good for a theoretical bandwidth peak of 42.7 GB/s. But that kind of bandwidth is not attainable, not even in theory, because the next link in the chain, the Northbridge, only runs at 1.8GHz. We have two 64-bit Northbridges both working at 1.8 GHz, limiting the maximum bandwidth to 28.8 GB/s. That is the price AMD’s engineers had to pay to keep the maximum power consumption of a 45nm 2.2 GHz CPU below 115W (TDP).

Adding more cores makes the amount of snoop traffic explode, which can easily result in very poor scaling. It can get worse to the point where extra cores reduce performance. The key technology is HT assist, which we described here. By eliminating unnecessary probes, local memory latency is significantly reduced and bandwidth is saved. It costs Magny-Cours 1MB of L3 cache per die (2MB total), but the amount of bandwidth increases by 100% (!) and the latency is reduced to 60% of what it would be without HT assist.
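
Both bandwidth figures can be reproduced with a few lines of arithmetic. This is a sketch, assuming 64-bit wide DDR3 channels and one 64-bit transfer per Northbridge clock; the function name is just for illustration:

```python
def peak_bandwidth_gbs(links, width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s of `links` parallel links of the given width and rate."""
    bytes_per_transfer = width_bits / 8
    return links * bytes_per_transfer * mega_transfers_per_s / 1000


# Four DDR3-1333 channels per package (two per die).
dram = peak_bandwidth_gbs(links=4, width_bits=64, mega_transfers_per_s=1333)

# Two 64-bit Northbridges at 1.8 GHz, ASSUMING one transfer per clock.
northbridge = peak_bandwidth_gbs(links=2, width_bits=64, mega_transfers_per_s=1800)

# The slower stage in the chain sets the ceiling.
ceiling = min(dram, northbridge)
print(f"DRAM: {dram:.1f} GB/s, Northbridge: {northbridge:.1f} GB/s, ceiling: {ceiling:.1f} GB/s")
```

The memory channels could deliver about 42.7 GB/s, but the Northbridges cap the package at 28.8 GB/s, roughly two thirds of the DRAM peak.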

Even with HT-assist, a lot of probe activity is going on. As HT-assist allows the cores to perform directed snoops, it is good to reach each core quickly. Ideally each Magny-cours MCM would have six HT3 ports. One for I/O with a chipset, 2 per CPU node to communicate with the nodes that are off-package and 2 to communicate very quickly between the CPU nodes inside the package. But at 1944 pins Magny-Cours probably already blew the pin budget, so AMD's engineers limited themselves to 4 HT links.

One of the links is reserved for non coherent communication with a possible x16 GPU. One x16 coherent port communicates with the CPU that is the closest, but not on the same package. One port is split in two x8 ports. The first x8 port communicates with the CPU that is the farthest away: for example between CPU node 0 and CPU node 3. The remaining x16 and x8 ports are used to make communication on the MCM as fast as possible. Those 24 links connect the two CPU nodes on the package.

The end result is that a 2P configuration allows fast communication between the four CPU nodes. Each CPU node is connected directly (one hop) with each of the others. However, bandwidth between CPU node 0 and node 2 is twice that between node 0 and node 3.

While it looks like two Istanbuls bolted together, what we're looking at is the hard work of AMD's engineers. They invested quite a bit of time to make sure that this 12 piston muscle car does not spin its wheels all the time. Of course if the road is wet (badly threaded software), that will still be the case. And that'll be the end of our car analogies...we promise :)

The SKUs

Let's see what Intel and AMD are offering.

Intel Xeon Model	Cores	TDP	Clock Speed	Price	AMD Opteron Model	Cores	TDP	GHz	Price
Intel Xeon W5680	6	130W	3.3GHz	$1663	AMD Opteron 6176 SE	12	105/137W	2.3 GHz	$1386
Intel Xeon X5670	6	95W	2.93GHz	$1440
Intel Xeon X5660	6	95W	2.80GHz	$1219	AMD Opteron 6174	12	80/115W	2.2 GHz	$1165
Intel Xeon X5650	6	95W	2.66GHz	$996	AMD Opteron 6172	12	80/115W	2.1 GHz	$989

Intel Xeon X5677	4	130W	3.46GHz	$1663	AMD Opteron 2439SE	6	105/137W	2.8 GHz	?
Intel Xeon X5667	4	95W	3.06GHz	$1440
					AMD Opteron 6168	12	80/115W	1.9 GHz	$744
Intel Xeon E5640	4	80W	2.66GHz	$744	AMD Opteron 6136	8	80/115W	2.4 GHz	$744
Intel Xeon E5630	4	80W	2.53GHz	$551	AMD Opteron 6134	8	80/115W	2.3 GHz	$523
Intel Xeon E5620	4	80W	2.40GHz	$387	AMD Opteron 6128	8	80/115W	2.0 GHz	$266

Intel Xeon L5640	6	60W	2.26GHz	$996	AMD Opteron 6164 HE	12	65/? W	1.7 GHz	$744
Intel Xeon L5630	4	40W	2.13GHz	$551	AMD Opteron 6128 HE	8	65/? W	2.0 GHz	$523
Intel Xeon L5620	4	40W	1.86GHz	$440	AMD Opteron 6124 HE	8	65/? W	1.8 GHz	$455

The 6176 looks a bit ridiculous as it delivers only 4% more performance at 30% higher power and 20% higher prices. The real reason behind this CPU is to battle another tanker, the Nehalem EX that Intel is going to launch tomorrow. The TDP and clockspeeds of that huge chip are very similar. If your application scales poorly and you don't care about power consumption, the X5677 is your champion; it is probably the fastest chip on the market for applications with low thread counts.
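
Those percentages come straight from the SKU table when the 6176 SE is compared with the 6174. A small sketch, assuming clock speed is a fair stand-in for performance (both parts have 12 cores) and using the first of the two power figures listed for each chip:

```python
def percent_increase(base, new):
    """Relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100


# Values from the SKU table: Opteron 6174 vs. Opteron 6176 SE.
clock_gain = percent_increase(2.2, 2.3)    # GHz, used here as a performance proxy
power_gain = percent_increase(80, 105)     # watts, first power figure in the table
price_gain = percent_increase(1165, 1386)  # dollars

print(f"clock +{clock_gain:.1f}%, power +{power_gain:.1f}%, price +{price_gain:.1f}%")
```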

The most interesting parts that AMD offers are the dodeca-core 6174 (2.2GHz), the octal-core 6136 (2.4GHz) and the octal-core low power 6128 (2.0GHz). The 6174 targets those with well scaling multi-threaded applications such as huge databases and virtualized loads. The 8-core 6136 might even be better as most schedulers find it easier to distribute threads and processes over a power of 2 cores. Lots of applications also don't scale beyond 16 cores and the chip comes with a 200MHz clockspeed bonus and a very reasonable price.

The 6128 HE is also an interesting one: it might be a good way to reconcile low response times with low power, but we'll have to find that out later.

Benchmark Methods and Systems

First of all, I would like to offer my thanks to my colleague Tijl Deneut who helped me out with the complex virtualization benchmarks.

None of our benchmarks required more than 20 GB of RAM. Database files were placed on a two drive RAID-0 Intel X25-E SLC 32GB SSD, with log files on one Intel X25-E SLC 32GB. Adding more drives improved performance by only 1%, so we are confident that storage is not our bottleneck.

Xeon Server 1: ASUS RS700-E6/RS4 barebone
Dual Intel Xeon "Gainestown" X5570 2.93GHz, Dual Intel Xeon “Westmere” X5670 2.93 GHz
ASUS Z8PS-D12-1U
6x4GB (24GB) ECC Registered DDR3-1333
NIC: Intel 82574L PCI-E Gbit LAN
PSU: Delta Electronics DPS-770 AB 770W

Opteron Server 1 (Dual CPU): AMD Magny-Cours Reference system (desktop case)

Dual AMD Opteron 6174 2.2 GHz
AMD Dinar motherboard with AMD SR5690 Chipset & SB750 Southbridge
8x 4 GB (32 GB) ECC Registered DDR3-1333
NIC: Broadcom Corporation NetXtreme II BCM5709 Gigabit
PSU: 1200W PSU

Opteron Server 2 (Dual CPU): Supermicro A+ Server 1021M-UR+V
Dual Opteron 2435 "Istanbul" 2.6GHz
Dual Opteron 2389 2.9GHz
Supermicro H8DMU+
32GB (8x4GB) DDR2-800
PSU: 650W Cold Watt HE Power Solutions CWA2-0650-10-SM01-1

vApus/Oracle Calling Circle Client Configuration

First client (Tile one)
Intel Core 2 Quad Q9550 2.83 GHz
Foxconn P35AX-S
4GB (2x2GB) Kingston DDR2-667
NIC: Intel PRO/1000

Second client (Tile two)
Single Xeon X3470 2.93GHz
S3420GPLC
Intel 3420 chipset
8GB (4 x 2GB) 1066MHz DDR3

Understanding the Performance Numbers

As Intel and AMD are adding more and more cores to their CPUs, we encounter two main challenges to keep these CPUs scaling. Cache coherency messages can add a lot of latency and absorb a lot of bandwidth, and at the same time all those cores require more and more bandwidth. So the memory subsystem plays an important role. We still use our older stream binary. This binary was compiled by Alf Birger Rustad using v2.4 of Pathscale's C-compiler. It is a multi-threaded, 64-bit Linux Stream binary. The following compiler switches were used:

-Ofast -lm -static -mp

We ran the stream benchmark on SUSE SLES 11. The stream benchmark produces 4 numbers: copy, scale, add, triad. Triad is the most relevant in our opinion, as it is a mix of the other three.
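
For readers who have not seen them, the four STREAM kernels are simple vector operations. The sketch below writes them out in plain Python purely for illustration. It is not the binary we used, and pure Python is far too slow to say anything about memory bandwidth; the function names are ours.

```python
# The four STREAM kernels, written out for illustration only.
# The real benchmark is compiled code running over arrays much larger than the caches.

BYTES_PER_DOUBLE = 8


def stream_kernels(a, b, c, q):
    """Run copy, scale, add and triad once, in STREAM's order."""
    n = len(a)
    c = [a[i] for i in range(n)]             # copy:  c = a
    b = [q * c[i] for i in range(n)]         # scale: b = q * c
    c = [a[i] + b[i] for i in range(n)]      # add:   c = a + b
    a = [b[i] + q * c[i] for i in range(n)]  # triad: a = b + q * c
    return a, b, c


def triad_bandwidth_gbs(n, seconds):
    """Bandwidth credited to one triad pass: two arrays read, one written."""
    return 3 * BYTES_PER_DOUBLE * n / seconds / 1e9


n = 1000
a, b, c = stream_kernels([1.0] * n, [2.0] * n, [0.0] * n, q=3.0)
print(a[0], b[0], c[0])
```

Triad touches three arrays per element and combines a multiply with an add, which is why it stands in well for the other three.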

Stream TRIAD on 64 bit linux - maximum threads

The new DDR3 memory controller gives the Opteron 6100 series wings. Compared to the Opteron 2435 which uses DDR-2 800, bandwidth has increased by 130%. Each core gets more bandwidth, which should help a lot of HPC applications. It is a pity of course that the 1.8 GHz Northbridge is limiting the memory subsystem. It would be interesting to see 8-core versions with higher clocked northbridges for the HPC market.

Also notice that the new Xeon 5600 handles DDR3-1333 a lot more efficiently. We measured 15% higher bandwidth from exactly the same DDR3-1333 DIMMs compared to the older Xeon 5570.

The other important metric for the memory subsystem is latency. Most of our older latency benchmarks (such as the latency test of CPUID) are no longer valid. So we turned to the latency test of Sisoft Sandra 2010.

	Speed (GHz)	L1 (Clocks)	L2 (Clocks)	L3 (Clocks)	Memory (ns)
Intel Xeon X5670	2.93GHz	4	10	56	87
Intel Xeon X5570	2.93GHz	4	9	47	81
AMD Opteron 6174	2.20GHz	3	16	57	98
AMD Opteron 2435	2.60GHz	3	16	56	113

With Nehalem, Intel increased the latency of the L1 cache from 3 cycles to 4. The tradeoff was meant to allow for future scaling as the basic architecture evolves. The Xeons have the smallest (256 KB) but the fastest L2-cache. The L3-cache of the Xeon 5570 is the fastest, but the latency advantage has disappeared on the Xeon X5670 as the cache size increased from 8 to 12 MB.
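
Because the CPUs run at different clock speeds, cache latencies in clocks are easier to compare once converted to nanoseconds. A short sketch, assuming the clock counts in the table are core clock cycles at the listed speed:

```python
def clocks_to_ns(clocks, ghz):
    """Convert a latency in clock cycles to nanoseconds at the given clock speed."""
    return clocks / ghz


# L3 latency (clocks) and clock speed (GHz) taken from the table above.
l3_latency = {
    "Intel Xeon X5670": (56, 2.93),
    "AMD Opteron 6174": (57, 2.20),
    "AMD Opteron 2435": (56, 2.60),
}

for cpu, (clocks, ghz) in l3_latency.items():
    print(f"{cpu}: {clocks} clocks = {clocks_to_ns(clocks, ghz):.1f} ns")
```

Seen this way, nearly identical clock counts hide a real gap: the lower clocked Opteron 6174 waits noticeably longer in absolute time for an L3 hit than the Xeon X5670.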

It is also interesting that the move from DDR2-800 to DDR3-1333 has decreased the latency to the memory system by about 15%. There's nothing but good news for the 12-core Opteron here: more bandwidth and lower latency access per core.

Rendering: Cinebench 11.5

The old Cinebench 10 benchmark was limited to 16 threads. Luckily, the new Cinebench 11.5 does not have that limitation. Cinebench only represents a very small part of the 3D Animation market, but the advantage is that this is a benchmark that you can perform at home too.

Cinebench 11.5

Although the Opteron 6174 manages to stay close to the newest Xeon, the Xeon is the CPU to get. The reason is that the performance difference will grow as you are rendering smaller and less complex scenes. In those cases, the percentage of the serial code will increase. And Amdahl’s law is unrelenting: in that case the CPU with the highest single threaded performance will win. You also get the benefit of higher single threaded performance when you are modeling.
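
Amdahl's law makes the trade-off easy to see. The sketch below compares two HYPOTHETICAL machines, one with 24 slower cores and one with 12 cores that are each 1.5x faster. The 1.5x figure is an illustrative assumption, not a measurement of these CPUs.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_throughput(per_core_speed, cores, parallel_fraction):
    """Throughput relative to a single core of speed 1.0."""
    return per_core_speed * amdahl_speedup(parallel_fraction, cores)


# HYPOTHETICAL machines: core counts and per-core speeds are illustrative only.
many_slow_cores = {"per_core_speed": 1.0, "cores": 24}
few_fast_cores = {"per_core_speed": 1.5, "cores": 12}

for p in (0.99, 0.95, 0.90):
    slow = relative_throughput(parallel_fraction=p, **many_slow_cores)
    fast = relative_throughput(parallel_fraction=p, **few_fast_cores)
    print(f"parallel fraction {p:.2f}: 24 slow cores = {slow:.1f}, 12 fast cores = {fast:.1f}")
```

With 99% parallel code the 24-core machine is ahead, but once the serial share reaches 10% the machine with fewer, faster cores wins, which is the effect described above for smaller and simpler scenes.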

Rendering: Blender 2.5 Alpha 2

Blender 2.5 Alpha 2
Operating System	Windows 2008 Enterprise R2 (64-bit)
Software	Blender 2.5 Alpha 2
Benchmark software	Built-in render engine

3dsmax 2010 crashed on almost all our servers. Granted, it is not meant to be run on a server but on a workstation. We’ll try some tests with Backburner later when the 2011 version is available. In the meantime, it is time for something less bloated and especially less expensive: Blender.

Blender has been getting a lot of positive attention and judging by its very fast growing community it is on its way to become one of the most popular 3D animation packages out there. The current stable version 2.49 can only render up to 8 threads. Blender 2.5 alpha 2 can go up to 64. To our surprise, the software was pretty stable, so we went ahead and started testing.

If you like, you can perform this benchmark very easily too. We used the “metallic robot”, a scene with rather complex lighting (reflections!) and raytracing. To make the benchmark more repeatable, we changed the following parameters:

The resolution was set to 2560 x 1600
Anti-alias was set to 16
We disabled compositing in post processing
Tiles were set to 8x8 (X=8, Y=8)
Threads was set to auto (one thread per CPU).

Let us first check out the results on Windows 2008 R2:

Blender 2.5 Alpha 2 Windows

At first the Opteron 6174 results were simply horrible: 44.6 seconds, slower than the dual Opteron six-core!

Ivan Paulos Tomé, the official maintainer of the Brazilian Blender 3D Wiki, gave us some interesting advice. The default number of tiles is apparently set to 5x5. This results in a short period of 100% CPU load on the Opteron 6174 and a long period where the CPU load drops below 30%. We first assumed that 8x6, two times as many tiles as the number of cores, would be best. After some experimenting, we found that 8x8 is the best for all machines. The Xeons and six-core Opterons gained 10%, while the 12-core Opteron became 40% (!) faster. This underlines that the more cores you have, the harder they are to make good use of.
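
A toy model shows why the tile count matters so much on a 24-core machine. It assumes every tile takes the same time to render and that each core renders one tile at a time. That is a simplification of our own, not how Blender actually schedules work.

```python
import math


def tile_utilization(tiles_x, tiles_y, cores):
    """Average core utilization if all tiles cost the same and render in waves."""
    tiles = tiles_x * tiles_y
    waves = math.ceil(tiles / cores)  # rounds of rendering needed
    return tiles / (waves * cores)


for tiles_x, tiles_y in ((5, 5), (8, 6), (8, 8)):
    util = tile_utilization(tiles_x, tiles_y, cores=24)
    print(f"{tiles_x}x{tiles_y} tiles on 24 cores: {util:.0%} average utilization")
```

With 5x5 tiles, 24 cores finish the first 24 tiles together and then 23 of them sit idle while the last tile renders, so average utilization is barely above 50%. In this idealized model 8x6 looks perfect, which is what we first assumed too. In practice tiles differ in complexity, so having more, smaller tiles (8x8) gives the cores more chances to even out the load.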

Blender can be run on several operating systems, so let us see what happens under 64 bit Linux (Suse SLES 11).

Rendering: Blender 2.5 Alpha 2 on SLES 11

Blender 2.5 Alpha 2
Operating System	SUSE SLES 11, Linux Kernel 2.6.27.19-5-default SMP
Software	Blender 2.5 Alpha 2
Benchmark software	Built-in render engine

Blender 2.5 Alpha 2 Linux

What happened here? Not only is Blender 50 to 70% faster on Linux, the tables have turned. As the software is still in Alpha 2 phase, it is good to take the results with a grain of salt, but still. For some reason, the Linux version is capable of keeping the cores fed much longer. On Windows, the first half of the benchmark is spent at 100% CPU load, and then it quickly goes down to 75, 50 and even 25% CPU load. In Linux, the CPU load, especially on the Opteron 6174 stays at 99-100% for much longer.

So is the Opteron 6174 the one to get? We are not sure. If these benchmarks are still accurate when we test with the final 2.5 version, there is a good chance that the octal-core 6136 2.4 GHz will be the Blender champion. It has a much lower price and slightly higher performance per core for less complex rendering work. We hope to follow up with new benchmarks. It is pretty amazing what Blender does with a massive number of cores. At the same time, we imagine Intel's engineers will quickly find out why the Blender engine fails to make good use of the dual Xeon X5670's 24 logical cores. This is far from over yet…

OLTP benchmark Oracle Charbench “Calling Circle”

Oracle Charbench Calling Circle
Operating System	Windows 2008 Enterprise Edition (64-bit)
Software	Oracle 10g Release 2 (10.2) for 64-bit Windows
Benchmark software	Swingbench/Charbench 2.2
Database Size:	9GB

Calling Circle is an Oracle OLTP benchmark. We test with a database size of 9 GB. To reduce the pressure on our storage system, we increased the SGA size (Oracle buffer in RAM) to 10 GB and the PGA size was set at 1.6 GB. A calling circle test consists of 83% selects, 7% inserts and 10% updates. The “calling circle” test is run for 10 minutes. A run is repeated 6 times and the results of the first run are discarded. The reason is that the disk queue length is sometimes close to 1, while the subsequent runs have a DQL (Disk Queue Length) of 0.2 or lower. In this case it was rather easy to run the CPUs at 99% load. Since DQLs were very similar, we could keep our results from the Nehalem article.

Oracle Calling Circle

As we noted in our previous article, we work with a relatively small database. The result is that the benchmark doesn't scale well beyond 16 cores. The Opteron 6174 has a 10MB L3 cache for 12 cores, while the Opteron 2435 has 6MB L3 for 6 cores. The amount of cache might explain why the Intel Xeons scale a lot better in this benchmark. For this kind of OLTP workload the Opteron 6174 is not the right choice. To go back to the car analogy earlier: the muscle car is burning rubber while spinning its wheels, but is not making much progress.

SAP S&D 2-Tier

SAP S&D 2-Tier
Operating System	Windows 2008 Enterprise Edition
Software	SAP ERP 6.0 Enhancement package 4
Benchmark software	Industry Standard benchmark version 2009
Typical error margin	Very low

The SAP SD (sales and distribution, 2-tier internet configuration) benchmark is an interesting benchmark as it is a real world client-server application. We decided to take a look at SAP's benchmark database. The results below all run on Windows 2003 Enterprise Edition and MS SQL Server 2005 database (both 64-bit). Every 2-tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 enhancement package 4. These results are NOT comparable with any benchmark performed before 2009. The new 2009 version of the benchmark produces scores that are 25% lower. We analyzed the SAP Benchmark in-depth in one of our earlier articles. The profile of the benchmark has remained the same:

Very parallel resulting in excellent scaling
Low to medium IPC, mostly due to "branchy" code
Somewhat limited by memory bandwidth
Likes large caches (memory latency!)
Very sensitive to sync ("cache coherency") latency

SAP Sales & Distribution 2 Tier benchmark
(*) Estimate

The last time we discussed the SAP S&D 2-tier benchmark, we had to estimate the Xeon X5670 results. Since then HP has benchmarked its latest G6 servers, giving us results for the X5670. The performance is nothing short of astonishing. The dual Xeon X5670 outperforms a quad Opteron 8435 at 2.6 GHz. The Magny-Cours Opteron can only compete based on its somewhat lower price. We doubt that the SAP buyers care about a few hundred dollars though. A quad Opteron 6174 might have a chance against the Nehalem EX performance wise, but the SAP market will probably prefer the extensive RAS list of the Xeon Nehalem EX. The ERP market is most likely going to be dominated by Intel based servers.

Decision Support benchmark: Nieuws.be

Decision Support benchmark Nieuws.be
Operating System	Windows 2008 Enterprise RTM (64 bit)
Software	SQL Server 2008 Enterprise x64 (64 bit)
Benchmark software	vApus + real-world "Nieuws.be" Database
Database Size	> 100GB
Typical error margin	1-2%

The Flemish/Dutch Nieuws.be site is one of the newest web 2.0 websites, launched in 2008. It gathers news from many different sources and allows the reader to completely personalize his view on all this news. Needless to say, the Nieuws.be site is sitting on top of a pretty large database, more than 100 GB and growing. This database consists of a few hundred separate tables, which have been carefully optimized by our lab (the Sizing Server Lab).

Almost all of the load on the database consists of selects (99%), about 5% of them stored procedures. Network traffic averages 6.5MB/s and peaks at 14MB/s, so our Gigabit network connection still has a lot of headroom. Disk Queue Length (DQL) is at 2 in the first round of tests, but we only report the results of the subsequent rounds where the database is in a steady state. We measured a DQL close to 0 during these tests, so there is no tangible intervention of the harddisks.

We now use a new, even heavier log. As the Nieuws.be application became more popular and more complex, the database has grown and queries have become more complex too. The results are no longer comparable to previous results. They are similar, but much lower.

Nieuws.be MS SQL Server 2008 - New Heavy log!

Pretty amazing performance here. And while AMD gets a pat on the back, it is the hard working people of Microsoft SQL Server team we should send our kudos to. Our calculations show that SQL Server adds about 80% of performance when adding an extra 12 cores, which is simply awesome scaling. The result of this scaling is that for once, you can notice which CPUs have real cores vs. ones that have virtual (Hyper Threading) cores: the 12-core Opteron 6174 outperforms the best Xeon by 20%. The people with transaction databases should go for the Intel CPUs, while the data miners should consider the latest Opteron. The architectures that AMD and Intel have chosen are complete opposites, and the result is that the differences between the different software categories are very dramatic. Profile your software before you make a choice! It has never been so important.

Virtualization & Consolidation

VMmark - which we discussed in great detail here - tries to measure typical consolidation workloads: a combination of a light mail server, database, fileserver, and website with a somewhat heavier java application. One VM is just sitting idle, representative of workloads that have to be online but which perform very little work (for example, a domain controller). In short, VMmark goes for the scenario where you want to consolidate lots and lots of smaller apps on one physical server.

VMWare VMmark
(*) preliminary benchmark data

Cisco has produced the first VMmark score for the Xeon X5600 series. The Cisco server with two X5680s at 3.3GHz achieved an impressive 35.83 score with 26 VMmark tiles. Twenty-six tiles, that is good for 156 VMs! Based on this number we can estimate where the Xeon X5670 will land. The 6174 numbers are based on AMD’s own preliminary data.
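
The VM count follows from the tile definition: each VMmark tile holds the six workloads listed above (mail server, database, file server, web site, Java application and one idle VM). A quick check, with the per-tile score added as our own derived figure:

```python
# Six VMs per VMmark tile: mail, database, file server, web, Java app, idle.
VMS_PER_TILE = 6


def total_vms(tiles):
    """Number of virtual machines running for a given tile count."""
    return tiles * VMS_PER_TILE


def score_per_tile(score, tiles):
    """Average contribution of each tile to the overall VMmark score."""
    return score / tiles


print(total_vms(26), round(score_per_tile(35.83, 26), 2))
```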

VMmark is a clear victory for the Intel CPUs. Contrary to the SAP market, AMD can play the pricing card here. As long as you do not require dynamic resource scheduling, the software license costs are nowhere near those of typical ERP projects. So the pricing of the hardware matters more. Also, contrary to other applications, there is no bonus for single threaded performance. The usage models of databases, 3D animation software and others all include scenarios where a number of cores will be idling while the others are working very hard. In a virtualization scenario where you are running tons of VMs, single threaded performance does not matter. So while Intel is clearly winning here, servers based on the newest Opteron might still be on the shortlist of those looking for good performance per dollar.

vApus Mark I: Performance-Critical Applications Virtualized

Our vApus Mark I benchmark is not a VMmark replacement. It is meant to be complementary: while VMmark runs 60 to 120 light loads, vApus Mark I runs 8 heavy VMs on 24 virtual CPUs (vCPUs). Our current vApus Stressclient is being improved to scale to a much higher number of vCPUs, but currently we limit the benchmark to 24 virtual CPUs.

A vApus Mark I tile combines one OLTP database, one OLAP database and two heavy websites. These are the kind of demanding applications that still got their own dedicated and natively running machine a year ago. vApus Mark I shows what will happen if you virtualize them. If you want to fully understand our benchmark methodology: vApus Mark I has been described in great detail here. We have changed only one thing compared to our original benchmarking: we used large pages, as this is generally considered a best practice (with RVI, EPT).

The current vApus Mark I uses two tiles. Per tile we have 4 VMs with 4 server applications:

A SQL Server 2008 x64 database running on Windows 2008 64-bit, stress tested by our in-house developed vApus test (4 vCPUs).
Two heavy duty MCS eFMS portals running PHP, IIS on Windows 2003 R2, stress tested by our in-house developed vApus test (each 2 vCPUs).
One OLTP database, based on the Oracle 10g Calling Circle benchmark of Dominic Giles (4 vCPUs).
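To see how the list above adds up to 8 VMs on 24 vCPUs, here is a small Python sketch. The VM counts and vCPU figures come from the list; the tuple layout and labels are only an illustration.

```python
# (workload label, VMs of this type per tile, vCPUs per VM)
TILE = [
    ("SQL Server 2008 database", 1, 4),
    ("MCS eFMS web portal", 2, 2),
    ("Oracle Calling Circle OLTP database", 1, 4),
]
TILES = 2  # the current vApus Mark I uses two tiles

vms_per_tile = sum(count for _, count, _ in TILE)
vcpus_per_tile = sum(count * vcpus for _, count, vcpus in TILE)

total_vms = vms_per_tile * TILES
total_vcpus = vcpus_per_tile * TILES
print(total_vms, total_vcpus)  # 8 24
```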

The beauty is that vApus (stress testing software developed by the Sizing Servers Lab) uses actions made by real people (as can be seen in logs) to stress test the VMs, not some benchmarking algorithm.

Update: we have noticed that the CPU load of Magny-Cours is at 70-85%, while the six-core "Istanbul" is running at 80-95%. As we have noted before, 24 cores is at the limit of our current benchmark until we launch vApus Mark 2. We have reason to believe that the Opteron 6174 has quite a bit of headroom left. The results above are not wrong, but do not show the full potential of the 6174. We are checking the CPU load numbers of the six-core Xeon X5670 as we speak. Expect an update in the coming days.

vApus Mark I 2 tile test - 24 vCPUs - ESX 4.0

The AMD Opteron 6174 performs well here, but disappoints a bit at the same time. vApus Mark I does not scale as well as VMmark. The reason is simple: as we used 4 virtual CPUs for both the OLTP and the OLAP virtual machines, scaling depends more on the individual applications. One VM with 4 virtual CPUs will not scale as well as 16 VMs sharing the same 4 virtual CPUs. Also, we use heavy database applications that typically like a decent amount of cache. The difference with the Xeon X5670 is small though. Servers based on both CPUs will make excellent virtualization platforms.

Next, the same test with Hyper-V, the hypervisor beneath Windows 2008 R2. We are testing with Hyper-V R2 6.1.7600.16385 (21st of July 2009).

vApus Mark I 2 tile test - 24 vCPUs - Hyper-V

Based on the excellent results of the Dual Opteron 2435 we expected AMD to take the crown in this benchmark, but that did not happen. We only had one week to get all of the Opteron testing done (AMD didn't have any hardware until the last minute), so we could not analyze this in depth. For some reason, the Opteron 6174 does not scale very well in our vApus benchmark. Compared to a 2.2GHz six-core, we only see a 30% increase in performance, about the same as Intel gets out of adding 2 extra cores to their Xeon. Part of the reason might be our benchmark: at the moment we are limited to 24 vCPUs. We’ll investigate this in more detail in the coming quarter when vApus v2 is available.

The difference with the Xeon X5670 is small though, and the slightly lower price of the Opteron makes up for the slightly lower performance.

HPC and Encryption Benchmarks

Just a few days prior to today's launch, we were able to get access to the benchmark numbers that Intel and AMD produced for LS Dyna (crash simulation) and Fluent (fluid dynamics) from Ansys. The first benchmark is the Ansys Fluent Truck_14m benchmark.

Ansys Fluent Truck_14m

The next one is LS Dyna “Neon refined revised”.

LS Dyna Neon refined revised

In both cases, the mix of four memory channels and 12 cores per CPU seems to pay off: AMD can beat Intel again in the HPC benchmarks, although the advantage is small.

Next we ran Sisoft Sandra 2010's encryption benchmark. Do remember that this is a completely synthetic benchmark. A 100% encryption performance advantage might translate into a very small performance advantage in a real world application. For example, the code run on a website might include only a small amount of encryption code.
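Amdahl's law makes this dilution concrete. The Python sketch below is only an illustration: the encryption shares (5%, 25%, 100% of run time) are hypothetical values chosen for the example, not measurements, and a "100% advantage" is modelled as a 2x speedup of the encryption part.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole workload when `fraction` of its
    run time is accelerated by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical shares of run time spent in encryption code.
for share in (0.05, 0.25, 1.0):
    gain = overall_speedup(share, 2.0)
    print(f"{share:.0%} encryption -> {gain:.3f}x overall")
```

With only 5% of the run time spent on encryption, doubling encryption speed improves the application as a whole by less than 3%.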

Sisoft Sandra Encryption benchmark: AES

Sisoft Sandra Encryption benchmark: SHA

Once the Xeon X5670's AES instructions can do their work, encryption is lightning fast. Here the new Xeon is 19 times faster than its older brother and 9 times faster than the best Opteron. Encryption can easily be broken up into smaller parts, so it scales extremely well. The result is that the CPUs with the most threads, the Xeon X5670 and Opteron 6174, easily outperform their older brothers in cryptographic hash functions.

Power Consumption

The Magny-Cours Opteron arrived one week ago, which is barely enough time to do virtualization benchmarking. So we have to postpone extensive power testing to a later date. The Opteron 6174 came in a desktop reference system which is in no way comparable to our Xeon X5670 1U server. We do have a six-core Opteron based system which is very similar to the Opteron 6174 reference system: the motherboard is also equipped with the new AMD SR5670 chipset and housed in the same desktop system. We can tell you that the idle power of the Opteron 6174 is a few watts lower than the six-core Opteron 2435. Both throttle back to 800 MHz, but the Opteron 6100 series gets a real C1E mode.

C1E mode can only be entered if all CPUs are idle. In a dual socket system, both CPUs enter C1E or they don’t. C1E mode is entered only after longer periods of inactivity. All cores flush their L1 and L2 caches to the L3-cache. Then all cores are clockgated (C1). Once that happens, the HyperTransport links are put in a lower power state. This allows the chipset to enter a lower power state as well. Only when all of these steps are done are both sockets in C1E. DMA events will make the sockets go out of the C1E state. So C1E probably won't happen much on server systems. The C1E state is only entered if absolutely no processing is happening at all.

The C1E mode can reduce power quite a bit:

Core clocks are turned off (Clockgate C1 state)
L3, North Bridge, and memory controller all divide their clock frequencies (but are not clockgated!)
All HyperTransport links transition to LS2 low power state (LDT_STOP_L)
DRAM DLLs disabled
Memory transitions from precharge power down mode to self refresh mode (low power)
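The all-or-nothing entry condition described above can be modelled in a few lines. This Python sketch is a toy model for illustration only, not AMD's actual power management logic: the `Socket` class and `can_enter_c1e` function are hypothetical names, and the "longer period of inactivity" timer is left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Socket:
    core_busy: List[bool]  # one flag per core, True while the core has work

    def idle(self) -> bool:
        """A socket is idle only when none of its cores has work."""
        return not any(self.core_busy)


def can_enter_c1e(sockets: List[Socket], dma_pending: bool) -> bool:
    """C1E is all-or-nothing: every core in every socket must be idle,
    and no DMA activity may be pending."""
    return not dma_pending and all(s.idle() for s in sockets)


all_idle = [Socket([False] * 12), Socket([False] * 12)]
one_busy = [Socket([False] * 12), Socket([True] + [False] * 11)]

print(can_enter_c1e(all_idle, dma_pending=False))  # True
print(can_enter_c1e(one_busy, dma_pending=False))  # False
print(can_enter_c1e(all_idle, dma_pending=True))   # False
```

A single busy core in either socket, or any DMA event, keeps the whole system out of C1E, which is why the state is rarely reached on a loaded server.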

According to AMD, at full load a 1.7GHz 65W ACP Opteron 6164 HE would consume about 4% more power than a 2.1 GHz 55W ACP 6-core Opteron 2425 HE. AMD measured 225W for the former, 215W for the latter. We measured 263W on the same system at full load with an Opteron 6174. That's 48W more, or about 24W per CPU. Assuming that the low power CPUs were running at their ACP (65W), we can conclude that the 2.2 GHz Magny-Cours needs about 89W. While the new twelve-core Opteron clearly needs a bit more power than the six-core Opteron, it's not a dramatic increase.
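The estimate is a back-of-the-envelope calculation, and it can be reproduced with the short Python sketch below. It assumes the system-level delta is caused entirely by the two CPUs and is split evenly between them; the function name is illustrative. Note that the result is sensitive to the baseline: the 48W delta is relative to the 215W figure, while taking the 225W figure as the baseline instead would give a somewhat lower number.

```python
def per_cpu_estimate(measured_w, baseline_w, baseline_acp_w, sockets=2):
    """Split the system-level power delta evenly over the sockets and add
    it to the ACP of the baseline CPU.
    Returns (system delta, delta per CPU, estimated power per CPU)."""
    delta = measured_w - baseline_w
    per_cpu = delta / sockets
    return delta, per_cpu, baseline_acp_w + per_cpu


# As computed in the text: 263W measured, 215W baseline, 65W ACP.
print(per_cpu_estimate(263, 215, 65))  # (48, 24.0, 89.0)

# Same method with the 225W figure as the baseline.
print(per_cpu_estimate(263, 225, 65))  # (38, 19.0, 84.0)
```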

Final Words

The beancounters will probably point out that AMD’s strategy of bolting two CPU dies at 346 mm² together is quite costly. But this is the server CPU market, where margins are quite a bit higher. Let AMD worry about the issue of margins. If AMD is willing to sell us - IT professionals - two CPUs for the price of one, we will not complain. It means that the fiercely competitive market is favoring the customer. The bottom line is: is this twelve-core Opteron a good deal? For users waiting to use it in a workstation we have our doubts. You’ll benefit from the extra cores when rendering complex scenes, but in all other scenarios (quick simple rendering, modeling) the higher clocked and higher IPC Xeon X5600 series is simply the better choice.

Applications based on transactional databases (OLTP and ERP) are also better off with the new Xeon. SAP and our own Oracle Calling Circle benchmark both point in the same direction: Intel has a tangible performance advantage in both benchmarks.

Data mining applications clearly benefit from having “real” instead of “logical” cores. For data mining, we believe the 12-core Opteron is the clear winner. It offers 20% better performance at 20% lower prices, a good deal if you ask us. Intel’s relatively high prices for its six-core are challenged. The increased competition turns this into a buyer's market again.

And then there is the most important segment: the virtualization market. We estimate that the new Opteron 6174 is about 20% slower than the Xeon X5670 in virtualized servers with very high VM counts. The difference is a lot smaller in the opposite scenario: a virtualized server with a few very heavy VMs. Here the choice is less clear. At this point, we believe both server CPUs consume about the same power, so that does not help us make up our minds either. It will depend on how the OEMs price their servers. The Opteron 6100 series offers up to 24 DIMM slots, while the Xeon is “limited” to 18. In many cases this allows the server buyer to reach a higher memory capacity at a lower cost. You can go for 96 GB of memory with affordable 4 GB DIMMs, while the Intel server is limited to 72 GB there. That is a small bonus for the AMD server.
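The memory capacity figures follow directly from the slot counts. A minimal Python sketch, assuming every slot is populated with the same 4 GB DIMM (the helper name is illustrative):

```python
def max_memory_gb(dimm_slots: int, dimm_size_gb: int) -> int:
    """Capacity with every slot populated by identical DIMMs."""
    return dimm_slots * dimm_size_gb


opteron_gb = max_memory_gb(24, 4)  # Opteron 6100 platform: up to 24 slots
xeon_gb = max_memory_gb(18, 4)     # Xeon platform: 18 slots
print(opteron_gb, xeon_gb)  # 96 72
```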

The HPC market seems to favor AMD once again. AMD holds only a small performance advantage, and this market is very cost sensitive. The lower price will probably convince the HPC people to go for the AMD based servers.

All in all, this is good news for the IT professional who is also a hardware enthusiast. Profiling your application and matching it to the right server CPU pays off, and that is exactly what sets us apart from the average IT professional.
