AMD Steamroller cores are a good architecture--just not for desktops

Quizzical Member Legendary Posts: 25,351

Earlier this week, AMD launched Kaveri, its first part with Steamroller cores.  People mostly weren't impressed, and while the commentary here was muted, the part could have justified scathing reviews if you only looked at the top bin.  People mostly focused on what happens if you clock the CPU around 4 GHz for a TDP in the neighborhood of 100 W.  And there, Kaveri is simply mediocre.

Kaveri isn't good at clocking high for strong single-threaded performance, which is what a good desktop CPU needs to do.  But it was never meant to be.  Most of us didn't realize what AMD was up to, but now I think I do.

As you probably know, AMD is a much smaller company than Intel.  They don't have Intel's money and they don't have Intel's fabs.  If Intel wants to be good at something, and AMD wants to be good at the same thing, Intel can throw vastly more money at it than AMD can, and Intel will win.  For AMD to be somewhat worse than Intel at everything would mean being relegated, always and forever, to the budget option in every category.  That's not a good way to make money.

But Intel can't be good at everything.  They can't make 20 different CPU architectures to have something ideally optimized for every market imaginable.  For many years, Intel essentially had one architecture.  Atom made it two, and Quark kind of makes it three, if you think Quark matters, which I don't.  But if AMD can build chips that are good at the things that Haswell and Atom aren't, then AMD could be the premium vendor in some markets, and make a lot of money that way.

And that's exactly what AMD has been trying to do.  One area that AMD has focused on is graphics, especially integrated graphics.  That's why AMD bought ATI:  so that they could have vastly better integrated graphics than Intel in a future in which most people used graphics integrated into the same chip as the CPU.

Another area where Intel was deficient: there was a gaping chasm between Atom and Intel's high-end architecture.  Bobcat and now Jaguar cores were designed to fill that gap, and did so admirably.  For about 2 1/2 years, just about the only reason to buy an Atom chip was not knowing any better, though with Silvermont cores, Atom is finally competitive.

This year, AMD is trying to crush Intel with Steamroller cores about the same way that the Radeon HD 4870 crushed Nvidia in its day.  Given comparably good architectures, more cores clocked lower beat fewer cores clocked higher if your workload scales well to more cores.  Graphics scales well to as many shaders as you can throw at it, which is why the Radeon HD 4870 kicked off an era in which AMD handily beat Nvidia in the various efficiency metrics for three generations before Nvidia's Kepler was forced to adopt AMD's approach of more shaders clocked lower.
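
To put rough numbers on the more-cores-clocked-lower principle, here's a toy model.  The cubic power curve is the standard first-order approximation (dynamic power scales with frequency times voltage squared, and voltage has to rise roughly with frequency near the top of the curve), but the exact configurations and figures below are made up for illustration:

    # Toy model: dynamic power ~ frequency * voltage^2, and voltage rises
    # roughly linearly with frequency, so power grows roughly as clock^3.
    # The configurations and constants are illustrative, not measured data.

    def relative_power(freq_ghz, cores, base_freq=3.0):
        """Power of `cores` cores at freq_ghz, relative to one core at base_freq."""
        return cores * (freq_ghz / base_freq) ** 3

    def relative_throughput(freq_ghz, cores):
        """Throughput of a perfectly parallel workload: cores times clock."""
        return cores * freq_ghz

    for freq, cores in ((4.0, 4), (3.0, 8)):  # few fast cores vs many slow ones
        print(f"{cores} cores @ {freq} GHz: "
              f"throughput {relative_throughput(freq, cores):.1f}, "
              f"power {relative_power(freq, cores):.1f}")

Under that admittedly crude model, eight cores at 3 GHz deliver 1.5 times the throughput of four cores at 4 GHz while drawing less power.  That's the whole bet.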

But CPU workloads don't scale well to many cores, you say?  That's not quite true; some CPU workloads don't scale well.  AMD wasn't going to be able to beat Intel in those workloads anyway, so they decided to focus on where they could:  CPU workloads that do scale to many cores.
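
Amdahl's law makes the distinction concrete: if a fraction p of a program can run in parallel, the speedup on n cores is 1 / ((1 - p) + p/n).  A quick sketch of the standard formula (nothing AMD-specific about it):

    # Amdahl's law: the serial fraction of a program limits how much
    # adding cores can help, no matter how many cores you add.

    def amdahl_speedup(parallel_fraction, cores):
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    for p in (0.50, 0.95, 0.999):
        print(f"parallel fraction {p}: "
              f"4 cores -> {amdahl_speedup(p, 4):.2f}x, "
              f"16 cores -> {amdahl_speedup(p, 16):.2f}x")

A program that's half serial barely notices extra cores, while one that's 99.9% parallel--think a server juggling independent requests--scales almost linearly.  Those are the workloads AMD is aiming at.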

Mind you, AMD has been trying to do this for years.  Barcelona (Phenom) was the first chip with four x86 cores on a single die.  Magny-Cours had 12 Phenom II class cores in one socket.  Interlagos had 16 Bulldozer cores.  Abu Dhabi has 16 Piledriver cores.  The last three of those were all dual-die multi-chip modules.  Magny-Cours fared okay, but the others were crippled by a bad underlying CPU architecture.  If the CPU cores are bad, a good die configuration of bad cores can't save the chip.  Abu Dhabi was roughly competitive with Intel's Xeon E5 chips in workloads optimally designed to favor AMD--and got completely killed in most others.

With Steamroller cores, AMD finally has the architecture to deliver the product they wanted.  AMD pushed Kaveri reviews in this direction, though we didn't notice it at the time.  Usually CPU reviews cover the top bin to showcase top-end performance.  AMD pushed for more emphasis on the lower power A8-7600--to the degree that some sites only looked at that bin and not the higher power A10-7850K.  Furthermore, AMD pushed for sites to show what the chip could do at both 65 W and 45 W.  At 95 W, it's not that much faster than at 45 W--which is why at 95 W it's an unimpressive chip, even though at 45 W it's quite nice.

The problem with this is that in a desktop, the difference between 45 W and 95 W doesn't matter much.  But what AMD can--and probably will--do is to make another chip that strips out the GPU and has about 16 or so Steamroller cores in the neighborhood of 3 GHz and a TDP around 125 W.  That would be a compelling chip, and vastly faster than Abu Dhabi in the same TDP.

To be fair, Intel can make a CPU with a lot of cores, too.  But have you seen what Intel charges for them?  The cheapest Xeon on Newegg with more than four cores is $919, and that's for a low bin.  There's plenty of room for AMD to undercut that and make a lot of money.  AMD doesn't need to win 80% of the server market.  If Steamroller cores can get AMD 20% of the server market, that's enough to declare victory.

But servers aren't the only market that would like to clock CPUs closer to 3 GHz than 4 GHz in order to save on power.  There are also laptops.  Laptop workloads don't scale to many CPU cores any better than desktop workloads do, but you're not going to put 16 cores in a single laptop socket anyway.  Or 8, for that matter.  There might not be many desktop or laptop programs that can get much use out of 16 CPU cores, but 3 or 4 is a different matter.  If AMD offers four cores for the price of Intel's two, and AMD's four cores can clock respectably, that could be a nifty product.  With Kaveri, they will.

You could clock Piledriver cores around 3 GHz, too.  But they still used a lot of power, without having the performance to justify it.

You may laugh at AMD's approach of more CPU cores clocked lower, as it's really not very useful in desktops.  But consider that it has already paid off for AMD in some cases.  Note, for example, the Xbox One and the PlayStation 4.  This wasn't just a case of AMD winning purely because they had better graphics to pair with an inferior CPU.  8 Jaguar cores can offer more performance than two Haswell cores in the same low, console-friendly TDP.  Nor is it clear that even four Haswell cores could have offered more performance at the same average CPU power usage that the PS4 and Xbone have.

Comments

  • Quizzical Member Legendary Posts: 25,351

    I might as well link this, which claims that AMD will make a chip with 16 Steamroller cores:

    http://www.xbitlabs.com/news/cpu/display/20140117215942_AMD_Develops_Behemoth_Chip_with_Sixteen_Cores_on_Single_Die.html

    Note that a Steamroller core takes vastly less space than a Piledriver core.  If AMD does make that chip, they'll probably make an FX-branded desktop version with fewer cores--possibly from salvage parts of that die.  That's presumably not coming until 2015, though.

  • IGaveUp Member Posts: 273

    Any information on the price positioning?

  • plat0nic Member Posts: 301
    wasn't impressed with APUs when they first came out, hoping they are better now


  • zevian Member Uncommon Posts: 403
    APUs are great for integrated graphics: starter gaming rigs, kids' computers, or laptops. Everyone is so Intel this and Intel that that they overlook the better solution for integrated graphics.
  • Ridelynn Member Epic Posts: 7,383

    I never had any doubt that Steamroller had a niche.

    The only reason for my muted reaction was that we were expecting something better than Richland on the CPU side. And it wasn't.

    Kaveri APUs are awesome for laptops/netbooks/HTPC - anything where power and heat are real constraints. They make a stop-gap solution for a desktop on a severe budget.

    The problem is, this is where all the APUs since Llano have really been aimed, and while we've seen some really good advances on the graphics side--and jumping to full GCN is great in Kaveri--we haven't seen a whole lot on the CPU front, even looking over the entire history clear back to the K10 architecture.

    I don't think anyone was expecting a Haswell-killer on the CPU front, but I think we were expecting something. APUs aren't all about the CPU or the GPU, but to date, all we've really seen are advances in the GPU, while the CPU just gets a bit more energy efficient and otherwise stays stagnant.

    True, a lot of the loads where you're going to see APUs used aren't CPU-limited, even with the slower cores. But there are more than a handful of cases where it's noticeable, and it does impact day-to-day use. The millions of people yawning isn't from not seeing a huge performance jump while overclocking; it's that even overclocked to a desktop-only TDP, we still aren't seeing anything really exciting from the CPU side of these APUs, and the GPUs are nice, but not quite enough to get us excited on their own. I don't see APUs, because of Kaveri, suddenly seeing use outside of where we have seen them used in the past--and that is what we were hoping Piledriver would let us do: not beat out 4770Ks, but at least make something more out of APUs than mid-tier laptop and bargain-basement PC fodder.

    It doesn't really matter how much money they have to throw at it versus Intel. What matters is what's available on the market, and what its price and performance metrics are compared to what's on the market. Kaveri doesn't do anything to appreciably change any of that, aside from costing a little bit more to move the graphics needle a little bit further in the niche that APUs have always excelled in.

    To be fair about it, Intel has had the same problem; we've seen very little advance on the CPU side of the house from Sandy->Ivy->Haswell, just some energy efficiency and that's about it (and Intel touting how much better their integrated graphics have gotten). It's just that Intel is so much farther ahead in the arms race that they can afford to be stagnant and no one really notices, because their only competition (besides themselves) is so far behind. Energy efficiency is really important, especially in the niches APUs intend to fill, so getting "just the same" CPU performance in a chip at half the TDP matters; it's just that "just the same" CPU performance, while good enough for most purposes, is significantly behind what is otherwise available.

  • grndzro Member Uncommon Posts: 1,162

    Kaveri (Steamroller) overclocks to close to 5 GHz and can overcome the IPC deficit. What many reviewers are finding out is that Kaveri needs a lot of bandwidth--even more so when overclocked.

    When Kaveri is paired with high-speed RAM (DDR3-2400 or better), performance scales with it. Generic benchmarks don't always show this, but game benchmarks do.
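
    The theoretical numbers back that up. A quick sketch of the standard dual-channel DDR3 arithmetic (peak figures only; real-world bandwidth comes in lower):

        # Peak theoretical bandwidth of a dual-channel DDR3 setup:
        # transfer rate (MT/s) * 8 bytes per transfer * number of channels.
        def peak_bandwidth_gbs(mt_per_s, channels=2, bytes_per_transfer=8):
            return mt_per_s * bytes_per_transfer * channels / 1000.0

        for speed in (1600, 1866, 2133, 2400):
            print(f"DDR3-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s")

    Going from DDR3-1600 (25.6 GB/s) to DDR3-2400 (38.4 GB/s) is a 50% jump in peak bandwidth, and since the iGPU shares that bandwidth with the CPU cores, it's no surprise game benchmarks scale with it.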

    What is really interesting with Kaveri is AMD's Mantle. With hUMA and a dedicated GPU, Mantle's asynchronous crossfire will allow many operations that were normally performed on the CPU or the main GPU thread to be split off and run on the iGPU, because Mantle is completely multithreaded and optimized to run multiple code paths depending on what is available in the system. Threads that need low latency can run on the iGPU while the dGPU runs normal GPU tasks. Toss in Mantle's claimed 45% performance advantage over DX and you get a very powerful system that matches or exceeds Intel in Mantle titles.

    With 5 developers, 3 engines, and 20 games so far, a case can be made that a high-end Kaveri system for playing Mantle titles makes more sense, even if you have the money to splurge on a high-end Intel system.

  • sacredfool Member Uncommon Posts: 849

    Kaveri will get outdated before hUMA and Mantle are used widely.

    Quizzical, I am not sure about the use for servers. I mean, they could strip the GPU and add more CPU cores, but as far as I can see they are not heading in that direction at all. Not only is "adding more cores" not simple, but it goes against the whole ideology of the hUMA architecture.

    I am more inclined to believe that they'll try to make the GPUs useful for servers rather than try to beat Xeons. I don't see how they'd achieve integrating GPUs into servers, but perhaps in 2-3 years....

    I agree that Kaveri will be useful for laptops; I've said that ever since I saw AMD's plans. I am actually puzzled as hell why they didn't release a mobile version first. Kaveri just doesn't make sense for desktops.


  • grndzro Member Uncommon Posts: 1,162
    Originally posted by sacredfool

    Kaveri will get outdated before hUMA and Mantle are used widely.

     

    hUMA is used by Mantle. Mantle will be used in over 30 games next year.

  • Quizzical Member Legendary Posts: 25,351
    Originally posted by sacredfool

    Kaveri will get outdated before hUMA and Mantle are used widely.

    Quizzical, I am not sure about the use for servers. I mean, they could strip the GPU and add more CPU cores, but as far as I can see they are not heading in that direction at all. Not only is "adding more cores" not simple, but it goes against the whole ideology of the hUMA architecture.

    I am more inclined to believe that they'll try to make the GPUs useful for servers rather than try to beat Xeons. I don't see how they'd achieve integrating GPUs into servers, but perhaps in 2-3 years....

    I agree that Kaveri will be useful for laptops; I've said that ever since I saw AMD's plans. I am actually puzzled as hell why they didn't release a mobile version first. Kaveri just doesn't make sense for desktops.

    In order to take good advantage of a GPU, you need a workload that:

    1)  is embarrassingly parallel,

    2)  is very SIMD-heavy,

    3)  involves little to no branching, and

    4)  involves predominantly 32-bit floating point computations.

    Both AMD and Nvidia have extended this somewhat to handle 64-bit floating point computations in their top end cards.  AMD has also worked to allow 32-bit integer computations to perform passably, though Nvidia has not.  (This is why AMD cards completely slaughter Nvidia in bitcoin mining.)  But the first three requirements make GPUs a complete non-starter for most workloads.

    In contrast, to scale to many CPU cores, all you need is a lot of parallelism.  The code can have massive branching and doesn't need to involve any SIMD at all (though using SIMD helps if you can, as CPUs have SSE).  And the parallelism needed to scale to 16 CPU cores is orders of magnitude less than what's needed to use a GPU efficiently.
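
    For instance, something as branch-heavy as trial-division prime counting scales happily across CPU cores, even though its unpredictable branching would make it a terrible fit for a GPU.  A generic sketch in plain Python multiprocessing (nothing vendor-specific, just an illustration):

        # An embarrassingly parallel CPU workload with heavy branching:
        # each number is tested independently, so the work splits cleanly
        # across cores, but the data-dependent branches would hurt on a GPU.
        from multiprocessing import Pool

        def is_prime(n):
            if n < 2:
                return False
            d = 2
            while d * d <= n:
                if n % d == 0:   # branch taken at unpredictable points
                    return False
                d += 1
            return True

        if __name__ == "__main__":
            with Pool() as pool:   # one worker per CPU core by default
                count = sum(pool.map(is_prime, range(2, 200_000), chunksize=1000))
            print(count, "primes below 200,000")

    A couple hundred chunks of independent work is plenty of parallelism to keep 16 cores busy--nowhere near the tens of thousands of threads a GPU wants in flight.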

    As for it being hard to make a 16-core CPU, yeah, it is.  But AMD has done so before:

    http://www.newegg.com/Product/Product.aspx?Item=N82E16819113320

    The story I linked above says it's going to be 16 cores on a single die, as opposed to an MCM with two chips having eight cores each like the chip I just linked.

  • bhug Member Uncommon Posts: 944

    140119
    Intel Skylake 1Q16 (14nm, low-power APU for tablet use, PCIe 4, DDR4, SATA Express; successor to Broadwell, the die shrink of the 22nm Haswell of 2Q13).

    Intel's lines: Xeon; Atom (low power, cell phones [bursty traffic workloads]); Quark (last year's 32-bit system-on-a-chip for wearable devices, and this year's Edison for WiFi/Bluetooth wireless networking); Core (the performance line: Sandy Bridge -> Ivy Bridge -> Haswell).

    AMD's history: K8 ('03, a K7 rebuild with the 64-bit extension and an integrated memory controller); K9/AMD64 (dual-core Athlon 64 X2); K10 4Q07 (Barcelona/Phenom, four x86 cores, 65nm); Magny-Cours/Phenom II (12-core Opteron, 45nm, could max DDR3-1333 at 42 GB/s in the wPrime memory bandwidth test; '08, 3x the L2 cache);
    Bobcat '10 (low power, thin netbook ambitions).
    Given an infinite budget across all vectors, one could eliminate all performance/power bottlenecks, but you would likely take an infinite amount of time to complete a microarchitecture design; taking all of those realities into account usually means making tradeoffs, even when improving a design.
    For example, we can see a clear tradeoff where AMD stuck with a 2-issue front end for 28nm Jaguar vs Intel's Atom. Not including a decoded micro-op cache and opting for a simpler loop buffer instead is another. AMD likely noticed a lot of power being wasted during loops, and the addition of a loop buffer was probably the best balance of complexity, power savings, and cost.
    AMD also improved the instruction cache prefetcher, not because of an overabundance of bandwidth but by revisiting the Bobcat design and spending some more time on the implementation in Jaguar. Those IC prefetcher improvements are simply AMD doing things better in Jaguar, without the pressure to introduce a brand new architecture that there was with Bobcat. The instruction buffer between the instruction cache and the decoders grew in size with Jaguar--a sort of half step toward the more heavily decoupled fetch/decode stages in Bulldozer.


    New high-performance microarchitectures: Interlagos/Bulldozer (eight 2-core modules, 32nm, 3Q11, server chip); Abu Dhabi/Piledriver (Thomas Seifert's "Project Win": the minimum possible job at the lowest cost; 2nd-gen Bulldozer, 32nm, 4-16 cores, +10% clock, +15% performance), which against Intel's Romley two-socket Sandy Bridge EP server line is competitive with the top-end Xeon on performance, performance per watt, and performance per dollar; Kaveri/Steamroller (3rd gen, 1Q14, 28nm, parallelism-focused, 856 Gflops), with Steamroller cores seeing up to a 20% increase in instructions per cycle (IPC) over Piledriver cores; Excavator/Carrizo (4th gen, '15, 20nm, DDR4).

    Memory bandwidth can be a huge limit to scaling processor graphics performance, especially since the GPU has to share its limited bandwidth to main memory with a handful of CPU cores. Intel's workaround with Haswell was to pair it with 128MB of on-package eDRAM. AMD has typically shied away from more exotic solutions, leaving Kaveri looking pretty normal on the memory bandwidth front.
    AMD is focusing on powerful graphics capabilities: the Mantle-capable Kaveri APU (Accelerated Processing Unit) has 2.4 billion transistors, and 47 percent of them are aimed at better, high-end graphics. A-Series chips will range in price from $119 to $173, power consumption will range from 45 watts to 95 watts, and CPU frequency ranges from 3.1 GHz to 4 GHz.
    Jaguar/Bobcat at 28nm vs Intel's Silvermont: "little/incremental advance in CPU other than efficiency, high single-threaded performance" on the desktop (tablet, netbook, smartphone)...


  • drbaltazar Member Uncommon Posts: 7,856
    At OP: is AMD stupid? I know the ATI part inside AMD isn't. Why do I ask? AMD has the combined know-how of MS and Sony, and on top of that a huge number of tricks from IBM! I mean, if AMD were just to copy and paste MS's tricks onto one of their old six-cores from the IBM era, even Intel couldn't touch their performance. Shrink everything to 22nm--no redesign, nothing aside from shrink optimization--and voila, AMD beats Intel at a cheap price. Nope: by the time they do this, Intel will have done it already, and you know what, Intel isn't upgrading much because they're busy on the GPU side. Hell, AMD could just copy and paste MS's ideas onto their FX line and call it a day; they would gain a lot of everything. It is true that there is a lot of room in Windows on the software side, but most of it is compatibility compromises, so those won't stay around very long; MS won't lose their reputation in the name of backward compatibility. It looks like the next OS will ask the user (via a web selector) what components they use--no more autodetect issues! So Mantle isn't going to be as big a deal as AMD thinks. Today's Mantle numbers don't show the actual real max capability of the OS. Example? I ping most games in the 14 to 45 ms range, and this is all with tweaks already in the W8.1 OS. So nope, once MS has stopped doing th