Tuesday, March 31, 2009

Intel Nehalem proves its server mettle


source: theinquirer.net, by: Charlie Demerjian

IN ABOUT THREE weeks, we will hit the sixth anniversary of the Opteron launch, quite a long time for Intel to have been without several key technologies pioneered by that chip. With the launch of the Nehalem EP, Intel has finally closed that gap, adopted all of those technologies, and taken the lead in just about every category.


The new cores, great as they may be, are somewhat overshadowed by the uncore parts of the new CPUs. When coupled, the two make a package that wins almost every benchmark out there, very often by large margins. Let's take a look at what this new platform brings to the table.

The first thing people notice is that it is a native quad-core CPU, not a dual-die MCM like many of its predecessors. This may seem a trivial point, but it has massive implications for how memory is seen and how caches are shared, a common bottleneck in modern server CPUs. Each Nehalem core has 64K of L1 cache, split equally between data and instructions, and 256K of L2. On top of that, there is 8M of inclusive L3 shared between all the cores.
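
As a quick sanity check on those numbers, the on-die cache per socket adds up like this (illustrative arithmetic only, using the figures quoted above):

```python
# Rough per-socket cache arithmetic for a quad-core Nehalem EP,
# using the figures quoted above (illustrative only).
cores = 4
l1_per_core_kb = 64        # 32K data + 32K instruction per core
l2_per_core_kb = 256       # private per core
l3_shared_kb = 8 * 1024    # inclusive, shared by all cores

total_kb = cores * (l1_per_core_kb + l2_per_core_kb) + l3_shared_kb
print(f"Total on-die cache per socket: {total_kb} KB (~{total_kb / 1024:.2f} MB)")
# -> 4 * 320 KB + 8192 KB = 9472 KB, about 9.25 MB per socket
```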

Nehalem sports SSE4.2 instructions, an integrated three-channel DDR3 memory controller, and two QPI links. If that isn't enough for you, simultaneous multi-threading (SMT/HT) makes a comeback and virtualisation is heavily updated. All of this is done with 731 million transistors packed into 263mm².

The biggest bang for buck - or in this case million transistors - is undoubtedly QPI, the technology formerly known as CSI. QPI is a point-to-point link like AMD's Hypertransport, but much newer and faster. The idea is that, instead of connecting each CPU to a single chipset, they connect to each other and share data directly. No more bottlenecks, no more shared FSB, much lower latency and, in general, better everything.

If you recall, the Core i7 introduced in 1S consumer markets late last year had many of the same features, but not the inter-CPU QPI. The chipset, called Tylersburg, is connected through a similar, but different, interconnect. It all looks like this, with nothing much having changed in the two years since we first printed the diagrams. QPI runs at up to 6.4 gigatransfers per second (25.6GBps full duplex), so you don't have to wait long for packets.
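
If you want to see where that headline bandwidth figure comes from, it is simple arithmetic - a quick sketch, assuming the standard full-width QPI link where 16 of the 20 lanes per direction carry data payload:

```python
# How the 25.6 GB/s figure falls out of a 6.4 GT/s QPI link.
# Each direction of a full-width link moves 2 bytes of payload per transfer
# (16 data lanes), and the link is full duplex, so both directions count.
transfers_per_sec = 6.4e9      # 6.4 GT/s
bytes_per_transfer = 2         # 16-bit payload per direction per transfer

per_direction = transfers_per_sec * bytes_per_transfer   # 12.8 GB/s
both_directions = per_direction * 2                      # 25.6 GB/s
print(f"{per_direction / 1e9:.1f} GB/s each way, {both_directions / 1e9:.1f} GB/s total")
```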

Almost as important is the integrated memory controller. Nehalem brings that to the table in style, with up to three channels of DDR3-1333 supported. Doing one socket this way isn't a trick, two is hard, four gets downright messy.

Intel has pulled it off, and latency is pretty darn good, with worst-case numbers showing that a remote memory access with DDR3-1067 is still a hair faster than Harpertown with FBD-1600. A local access takes only 60 per cent of that time. Remember, Harpertown had a shared memory controller, and there were far fewer cache coherency problems to deal with. To distribute the load and make it faster at the same time is a huge feat.
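
If you want to see the local-versus-remote split on a dual-socket Nehalem box for yourself, Linux exposes the NUMA layout through sysfs. A minimal sketch, assuming a NUMA-aware kernel with the standard node files:

```python
# Minimal sketch: peek at the NUMA layout a dual-socket box exposes to Linux,
# so you can see which cores sit next to which memory and the relative cost
# of reaching each node. Assumes standard sysfs paths on a NUMA-aware kernel.
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(node)
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    with open(os.path.join(node, "distance")) as f:
        distances = f.read().split()   # ACPI distances: local node is lowest
    print(f"{name}: cpus {cpus}, relative access cost to each node: {distances}")
```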

One of the bigger gains comes from the 'new way' of doing SMT, and this part is the same as the older i7. If you recall, the much-loved-by-none HT in the P4 generation had the CPU switch gears by pausing one thread and picking up another. There was little efficiency to be gained unless the threads were paused for a long time. That old way is on the left.

The new way is not to switch tasks, but to pull individual ops from each task as functional units open up. This gets a lot more done in far less time than either the old way or no SMT at all. Whereas the old version sometimes showed slight benefits, the new way is a clear win, with tens of per cent better performance as the norm. If you are going to cast off a stigma, you might as well do it right.
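
With SMT enabled, each physical core shows up as two logical CPUs, and Linux will tell you which pairs share a core. A small sketch, assuming the standard sysfs topology files:

```python
# Minimal sketch: list which logical CPUs Linux pairs up as SMT siblings,
# i.e. two hardware threads sharing one physical core.
# Assumes a Linux kernel exposing the standard cpu topology sysfs files.
import glob

pairs = set()
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    with open(path) as f:
        pairs.add(f.read().strip())

for p in sorted(pairs):
    print("SMT siblings:", p)   # e.g. "0,8" means logical CPUs 0 and 8 share a core
```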

Last up, we have turbo mode. There has been a bunch written about this, and much of it is wrong. The simple story is that, based on several factors - the two most notable being heat and power draw - the cores can overclock themselves. Each core is independent, and can go up to three bins of 133MHz above where it started. A 3.2GHz CPU will hit 3.6GHz on one or all cores, ambient conditions and load permitting.
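
The bin arithmetic is trivial, but worth spelling out (illustrative only, using the figures above):

```python
# Turbo arithmetic as described above: up to three 133 MHz bins on top
# of the base clock, conditions permitting (illustrative only).
base_ghz = 3.2
bin_ghz = 0.133

for bins in range(4):
    print(f"{bins} bin(s): {base_ghz + bins * bin_ghz:.2f} GHz")
# Three bins lands at roughly 3.60 GHz, matching the example in the text.
```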

Those are the big bangs, but there are enough detail changes like deeper buffers, more usable pipelines, and unaligned cache access to add up to a hugely fast and efficient CPU. VT-d, an enhanced hardware virtualisation method, not only brings the IMC into play for the first time, but also adds some I/O features into the mix. IOAT speeds up network access as well, freeing up more CPU resources to work on things other than packet twiddling.

The chipset is called Tylersburg, a new northbridge - though that is mostly a dead term now that the memory controller is on the CPU die - coupled with the older ICH9 or ICH10. It has 42 PCIe lanes, 36 PCIe2 and six PCIe1. If that isn't enough, you can add a second Tylersburg and nearly double that count. Should 78 PCIe lanes not be enough, you can't add any more now, sorry.
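
The lane bookkeeping, working backwards from the figures above - the six PCIe1 lanes are not duplicated when you add a second IOH, which is how 42 becomes 78 rather than 84:

```python
# Lane-count bookkeeping for one and two Tylersburg IOHs (illustrative,
# derived from the 42- and 78-lane figures quoted above).
pcie2_per_ioh = 36
pcie1_lanes = 6                                  # not doubled with a second IOH

single = pcie2_per_ioh + pcie1_lanes             # 42 lanes
dual = 2 * pcie2_per_ioh + pcie1_lanes           # 78 lanes
print(f"single IOH: {single} lanes, dual IOH: {dual} lanes")
```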

The usual ICH9/10 features are all there as well, including six SATA ports with software RAID5 support. Memory controllers are now on the CPU, but since we are used to talking about them on the chipset side, we won't break tradition now. The three DDR3 channels can support up to DDR3-1333, with up to three dual-rank DIMMs per channel. This means 144GB of memory in a dual-CPU configuration.
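
Where the 144GB comes from, working back from the numbers above - which implies 8GB dual-rank DDR3 DIMMs in every slot:

```python
# Memory capacity arithmetic implied by the figures above (illustrative).
sockets = 2
channels_per_socket = 3
dimms_per_channel = 3
dimm_gb = 8          # implied DIMM size: 144GB / 18 slots = 8GB each

slots = sockets * channels_per_socket * dimms_per_channel
total_gb = slots * dimm_gb
print(f"{total_gb} GB across {slots} DIMM slots")   # 2 * 3 * 3 * 8 = 144 GB
```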

So, what do they end up looking like? There are 17 models that you can now buy, from 1S workstation chips to low voltage parts. Fifteen are quad core, two dual core, and the speeds range from 1.86 to 3.2GHz. Not all of them have the full feature set enabled, but most have the big ones available even if some are scaled back. QPI scales down on the slower models, and only the X- prefixed parts allow for the full three bins of turbo.

The gory details

In case you are wondering, X stands for Extreme, W for Workstation, E for Mainstream, and L for Low Voltage. Prices start out at $188 at the low end and go up to a much more reasonable $1600. Reasonable, that is, if you own lots of Intel stock. TDPs go from a measly 38W to 130W, but 38W is for a dual-core-only part; quads are 60W minimum.

Performance is about what you would expect - stinking fast. How fast? We put a dual-socket, quad-core Nehalem EP running at 2.93GHz up against its immediate predecessor, a dual-socket Harpertown at 3.0GHz. The Nehalem had 24GB of RAM in six sticks of 4GB DDR3-1333, the Harpertown had 8GB of FBD-800. None of the tests we ran are memory bound at 8GB, so the disparity in amounts should not matter much. We don't have multiple sticks of 4GB and 8GB RAM kicking around the lab, so we can't even boot with other configs.

Both of the servers - a Supermicro SYS-6025W-URB for the Harpertowns and an Asus RS700-E6/RS4 for the Nehalems - used an Intel X25 32GB SSD. With this, you can take hard drive performance out of the mix; one X25 is faster than most RAID configs built from magnetic drives.

One note: we had originally intended to put Barcelona and Shanghai scores in the mix, but the systems available all used the MCP55Pro chipset. Nvidia chipsets are not only buggier than the mattress of a third-world prostitute, but their Linux drivers are barely functional if you can find them. If we can ever get the damnable things running on Ubuntu, we will get those scores to you.

All tests were run on Ubuntu Linux 9.04 Beta, so numbers may differ a little when the final release comes out in three weeks. For the tests, we used Phoronix Test Suite v1.8 beta 1; the final version should also be out in a few weeks, but the tests themselves are not going to change, just the control program.

The tests run were the Universe-CLI suite, basically all the ones that didn't need X for graphics. You can download and run the suite yourself, but given the beta state, you might want to wait for v1.8 to be released if you are not all that familiar with Linux. A description of the tests can be found here.
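
If you would rather kick the runs off from a script than by hand, something like the following should do it - a minimal sketch, assuming phoronix-test-suite is installed and on your path, and that the suite name on your copy matches the Universe-CLI suite mentioned above (names can differ between versions, so check the suite listing on your install first):

```python
# Minimal sketch: launch the benchmark suite from a script.
# Assumes phoronix-test-suite is installed and in PATH; the suite name
# "universe-cli" is an assumption based on the article, so verify it
# against the suites shipped with your version before relying on it.
import subprocess

subprocess.run(["phoronix-test-suite", "benchmark", "universe-cli"], check=True)
```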

Preamble aside, how did the Nehalem fare? Crushingly well, beating the Harpertown in 34 of 35 tests, and the one that it lost, Example Network Job, isn't all that CPU-centric. The first two columns are Nehalem, the second two are Harpertown.

Scores
This is what a beating looks like

You will see that in this broad range of tests, Nehalem lost once. When it won, it was often by a lot, with tens of per cent not uncommon, and several times it hit almost 100 per cent faster. When it came to the RAM and bandwidth scores, well, things got abusive, with scores of 3 and 4x Harpertown in places. Since most tests are not simply core vs core, but system vs system, the advantages of the Nehalem core tend to get muted. Still...

While the bandwidth scores were not unexpected, the sheer overwhelming number of wins shows that Nehalem is better than the 45nm Core 2s in almost every way. The general consensus is that AMD's new Shanghai CPUs are a little faster than Harpertown, but would lose in most ways to Nehalem. We hope to bring that to you soon, NV drivers willing.

In the end, what can you say? Nehalem isn't a little better, it is much better than the older CPUs. It fixes all the architectural deficiencies, and improves on a good thing everywhere. I can't really see a performance downside here, and that is what the game is all about. µ

Gripe
When Intel launches a product, it tends to throw a party and webcast the event. This time was no exception. But, in a blinding display of stupidity, they webcast it in a proprietary Windows-only format. Not only did they cut out Linux users AGAIN, but they also cut out Mac users. You know, Apple, their flagship partner. (Insert sound of head beating against wall here.)

So, we can't tell you about what was said there, other than the fact that it was likely interesting, but we will never know until Intel gets a clue that open standards are a good thing. Bad Intel, no cookie. Again. This is about the seventh time.
