AMD Ryzen's SenseMI defeats the CPU-Z benchmark

Quizzical Member EpicPosts: 18,151
What happens if a CPU benchmark finds that a stock Ryzen beats even an overclocked Core i7-7700K at single-threaded performance?

http://www.cpuid.com/news/51-cpu-z-1-79-new-benchmark-new-scores.html

I'm reading between the lines here, so this isn't the official explanation and could plausibly be wrong.  But I think it's likely that this is what happened.

CPU-Z has a CPU benchmark that, at one point, did something to intentionally create a delay in executing code.  Higher IPC is all about structuring things so that you have something ready to execute more often, getting more instructions done with fewer stalls while you wait on something or other that you need.  Well, that and being able to execute more instructions at once at peak throughput when everything is ready.

CPUs have gotten good at rearranging instructions to avoid having to stop and wait.  Modern CPUs don't just wait until they have everything ready to execute the next instruction.  If they don't have the data they need for the next instruction, they'll see if they're ready to execute the one after it, and the one after that.  This "out of order execution" will intentionally rearrange instructions that don't depend on each other so that they can process whatever is ready without having to wait for something "before" it.
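To make that concrete, here's a toy sketch of my own (an illustration, not how real silicon is built): model out-of-order issue as a scheduler that, each "cycle", executes every pending instruction whose inputs are ready, instead of stalling in strict program order.  The instruction names and the `schedule` function are entirely hypothetical.

```python
def schedule(instructions):
    """Toy out-of-order scheduler.

    instructions: list of (name, set_of_dependency_names).
    Returns a list of "cycles", each holding the instructions
    that could issue together because their inputs were ready.
    """
    done, order = set(), []
    pending = list(instructions)
    while pending:
        # Issue every instruction whose dependencies have all completed.
        ready = [(name, deps) for (name, deps) in pending if deps <= done]
        if not ready:
            raise RuntimeError("circular dependency; nothing can issue")
        for name, _ in ready:
            done.add(name)
        order.append([name for name, _ in ready])
        pending = [(name, deps) for (name, deps) in pending if name not in done]
    return order

# 'mul2' depends on a slow 'load', but 'add1' and 'add2' don't,
# so they can issue without waiting for the load's result.
program = [
    ("load", set()),
    ("mul2", {"load"}),
    ("add1", set()),
    ("add2", {"add1"}),
]
print(schedule(program))  # [['load', 'add1'], ['mul2', 'add2']]
```

A real core does this in hardware over a window of a couple hundred instructions, with register renaming to strip out false dependencies; the point is just that a stalled load doesn't stop independent instructions behind it from issuing.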

AMD claimed that SenseMI in Ryzen used a full machine learning algorithm to do a more sophisticated job of this than ever before.  But it's hard to see what such a mechanism is doing in big programs: a lot of things are going on, and a weighted average of a million different things hides the outliers.  If traditional out of order execution orders things badly on one pass through a for loop, it's likely to order them badly in the same way, for the same reasons, on the next thousand passes through the loop.  SenseMI promised to learn from that: even if the first few passes did things badly, it would eventually figure out a better way.

Enter synthetic benchmarks.  Authors of synthetics don't want to write a hundred thousand lines of code.  Rather, it's common to have a small amount of code looped an enormous number of times.  That structure isn't limited to synthetics, but it is more common there.  This is pretty much the best possible situation for SenseMI, as its cache is very small and can't keep track of the best way to order things scattered across a million lines of code.
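The shape being described looks something like this sketch (entirely hypothetical code of mine, not CPU-Z's actual benchmark): a tiny kernel with one short dependency chain, looped an enormous number of times.

```python
import time

def synthetic_kernel(x):
    # A deliberately tiny kernel: one short chain of integer math,
    # repeated.  The whole "program" fits easily in any cache.
    for _ in range(1000):
        x = (x * 31 + 7) & 0xFFFFFFFF
    return x

def run_benchmark(iterations=2000):
    # Loop the small kernel a huge number of times and time the whole run.
    start = time.perf_counter()
    checksum = 0
    for i in range(iterations):
        checksum ^= synthetic_kernel(i)
    return checksum, time.perf_counter() - start
```

After a handful of trips through a loop like this, any mechanism that learns from repeated passes has already seen everything the benchmark will ever do, which is exactly the situation described above.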

The CPU-Z benchmark tried to intentionally force a CPU to screw up and stall.  I'm not sure why they did this; they may have been trying to more directly incorporate the latency of some particular cache.  Ryzen figured out how to fix the problem on the fly and did so.  The result was IPC about 30% higher than Sky Lake.

It's important to understand that this is not at all similar to the stories of cheating at GPU benchmarks over the years.  GPU drivers have the source code, can recognize particular benchmarks, and compile things differently for those benchmarks.  CPUs don't get that; an AMD x86 CPU gets exactly the same compiled binary as an Intel x86 CPU, and it's only a question of how fast it can run that binary.

But let's return to the original question:  what do you do if your benchmark shows unexpected results?  And then let's immediately leave that question again for an analogy.

Let's suppose that you're a political pollster.  90% of the people you contact won't take the time to answer your questions, and the 10% who do aren't representative of the 90% who don't.  People who will vote don't have the same opinions as people who won't vote, and people who answer pollsters don't have the same opinions as people who hang up.  You've got your own secret sauce to adjust for this, as do all other pollsters.

So you're polling some race and by your standard methodology, the race looks like it's about tied.  So you look around and see that all of the other pollsters show candidate A up by about 5% over candidate B.  What do you do?

What a lot of pollsters will do is to look through their crosstabs and say, well, I've overweighted this and underweighted that, and that unfairly benefited candidate B.  If I fix that, I'll have candidate A up by 5% like everyone else.  And then maybe you do exactly that.  If everyone else had candidate B up by 5%, you could have done something just as legitimate to show candidate B up by 5%.  You don't trust the method you chose before seeing the data, but change it to show the expected result.  And so you get a variation in results between pollsters smaller than theoretically expected from random noise if they all used exactly the same methodology--which they don't.
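The herding effect is easy to simulate (a sketch of mine with made-up numbers, not real polling data): honest polls scatter with sampling noise, while polls nudged toward a published consensus cluster unnaturally tightly around that consensus.

```python
import random
import statistics

def poll(true_share, n):
    # One honest poll: sample n voters, report candidate A's share.
    return sum(random.random() < true_share for _ in range(n)) / n

def herded_poll(true_share, n, consensus, weight=0.7):
    # The same poll, then pulled most of the way toward the consensus
    # number via an after-the-fact "methodology adjustment".
    return (1 - weight) * poll(true_share, n) + weight * consensus

random.seed(1)  # deterministic for the comparison below
honest = [poll(0.50, 500) for _ in range(200)]
herded = [herded_poll(0.50, 500, consensus=0.55) for _ in range(200)]

# The herded polls spread far less than raw sampling noise would allow,
# and they center on the consensus rather than on the true 50%.
print(statistics.stdev(honest), statistics.stdev(herded))
print(statistics.mean(honest), statistics.mean(herded))
```

The spread of the herded polls is smaller than what pure sampling error predicts, which is exactly the tell that poll-watchers use to detect herding.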

Well, that's what CPU-Z just did.  They changed their benchmark to get the expected result of Sky Lake having higher IPC than Ryzen.  Can't have a different conclusion from everyone else, after all.

Benchmarking is hard, and absolutely has a lot of ways that you can tweak things this way or that to skew the results however you want.  It's perfectly legitimate for one benchmark to show product A as being much faster than product B while another shows product B as much faster than product A.  Getting a variety of benchmarks like that gives you good information about their relative strengths and weaknesses.

But there is a temptation toward wanting a single, unified benchmark that gives "typical" overall results, even though there is no such thing as a typical or average workload.  Or perhaps in the CPU world, one for typical single-threaded results and another for typical highly threaded results.  But there is no such thing as typical for either of those.

The advantage that political pollsters have is that they occasionally get to see ground truth.  Elections actually happen now and then.  If everyone else had candidate A winning handily while you had the race essentially tied, and the election is so close that they have recounts, you won that round.

Hardware benchmarkers don't get that.  If everyone else has hardware A faster than hardware B, but you have hardware B faster, people think you're "wrong", and there's no election coming that can possibly vindicate you.  And so you get benchmark herding worse than the poll herding.  The solution is not to rely on any one particular benchmark as representative of the rest, but look at different benchmarks that tell you different things.  Good reviews do exactly that.  The outliers aren't necessarily wrong unless you can find some good theoretical reason why they'd be wrong even if they weren't outliers.

Comments

  • Torval Member LegendaryPosts: 14,757
    Nice insights. I actually went and read the article before I read your post. I've learned doing so makes your points easier to digest and research.

    A warning alarm went off when I read the observation and their first inclination to fix it so the score made sense. I appreciate that they took the time to understand why and how the test falls short, but it concerned me that the answer lay in making the scores comfortable.

    It also bothered me that they designed the test that way in the first place. I get why they do it, simulating real-world events, but I don't like it. This sort of thing happens too often in software development, where a few smart people try to write an algorithm with the idea that they're going to outsmart all the other smart people. It never ends well.

    I feel like that's what happened here and the lesson wasn't learned because of the solution path they've chosen.

    When looking at synthetics I take them with a huge grain of salt. They just aren't representative of real world problems, almost ever, at least not mine. What I would like to see is benchmark suites that can be configured to execute certain types of workloads.
    The artist or album content may be offensive or controversial.
    Avatar Artist: The Plugz, The Burning Sensations
    Album: Repo Man Soundtrack
    Featured Tracks: Hombre Secreto [Plugz], Pablo Picasso [Burning Sensations]
  • Quizzical Member EpicPosts: 18,151
    I prefer that my synthetic benchmarks be really synthetic.  No effort whatsoever at being typical in any sense, but find out how good the hardware is at one particular thing.  Want to know how good the hardware is at something else not measured by that one benchmark?  Then run a different benchmark that does measure it.  Knowing that particular hardware is good at this and bad at that can help you optimize code.
  • Torval Member LegendaryPosts: 14,757
    edited May 5
    Finding out what hardware is better at a thing is good for comparison but hard to translate into useful information.

    What does it matter if CPU A is 15% faster than CPU B at a task that has zero real world relevance? What matters to me is whether I need to do massive real world processing of some type and which CPU will let me get that done faster, or at the least will never be the bottleneck. Benchmarks are mostly not very useful for that.

    I do appreciate the idea of real pure synthetics that do nothing but accurately compare the ability to run the test algorithm. It's just not very helpful information. On the other hand a benchmark developer trying to anticipate and interpret what "real world" means to everyone else isn't very helpful to me either. That's why I find benchmarks fun discussion material, but mostly useless in practice.
    Post edited by Torval on
  • Vrika Finland Member RarePosts: 4,197
    edited May 5
    Quizzical said:
    Well, that's what CPU-Z just did.  They changed their benchmark to get the expected result of Sky Lake having higher IPC than Ryzen.  Can't have a different conclusion from everyone else, after all.

    Aren't you now forgetting 99.9% of us? That is only true for the small minority who optimize code and look at benchmarks from that point of view.

    CPU-Z is clearly trying to serve those of us who can never optimize their code, and instead are looking for the best processor to run programs we can't change. If they run into a situation where their test results aren't representative of any real-life situation, then they failed in their aim to simulate real use and need to correct the error they made.
    Post edited by Vrika on
  • Kyleran Paradise City, FL Member LegendaryPosts: 26,667
    I don't understand any of this. My next gaming laptop will likely have Intel inside like the last 4 before it.

    The simple life is best some times. ;)

    "I need to finish" - Christian Wolff: The Accountant

    On hiatus from EVE Online since Dec 2016 - CCP continues to wander aimlessly

    In my day MMORPG's were so hard we fought our way through dungeons in the snow, uphill both ways.

    Don't just play games, inhabit virtual worlds™
    "This is the most intelligent, well qualified and articulate response to a post I have ever seen on these forums. It's a shame most people here won't have the attention span to read past the second line." - Anon




  • cheyane Earth Member EpicPosts: 4,864

    Kyleran said:
    I don't understand any of this. My next gaming laptop will likely have Intel inside like the last 4 before it.
    The simple life is best some times. ;)

    Watch out for Malabooga, he will come and chide you... who am I kidding, chide is so not his style.
  • Ozmodan Hilliard, OH Member RarePosts: 8,720

    Kyleran said:
    I don't understand any of this. My next gaming laptop will likely have Intel inside like the last 4 before it.
    The simple life is best some times. ;)

    Well unless it has a dedicated GPU in it, it probably won't.  A Ryzen/Vega chip will own anything from Intel.
  • laserit Vancouver, BC Member EpicPosts: 5,009

    Ozmodan said:
    Kyleran said:
    I don't understand any of this. My next gaming laptop will likely have Intel inside like the last 4 before it.
    The simple life is best some times. ;)
    Well unless it has a dedicated GPU in it, it probably won't.  A Ryzen/Vega chip will own anything from Intel.

    Why would we want anything less? (besides price of course)

    If I'm going to buy a laptop (these days) it's because I want the power, either for gaming or for CAD, and only because of the portability. For anything else I'd just get a tablet or a Surface, and that's where integrated graphics matter most (no room for dedicated graphics).

    "Be water my friend" - Bruce Lee

  • DMKano Gamercentral, AK Member LegendaryPosts: 16,970

    cheyane said:
    Kyleran said:
    I don't understand any of this. My next gaming laptop will likely have Intel inside like the last 4 before it.
    The simple life is best some times. ;)
    Watch out for Malabooga he will come and chide you....whom am I kidding chide is so not his style.

    Malabooga got perma banned months ago, so no need to worry
  • DMKano Gamercentral, AK Member LegendaryPosts: 16,970

    Kyleran said:
    I don't understand any of this. My next gaming laptop will likely have Intel inside like the last 4 before it.
    The simple life is best some times. ;)

    No need to, really.  If you go with either CPU the OP mentioned, you will be fine.

    What is worrying for AMD is that their stock has gone down since Ryzen's release.

    Yeah.
  • Torval Member LegendaryPosts: 14,757

    Vrika said:
    Quizzical said:
    Well, that's what CPU-Z just did.  They changed their benchmark to get the expected result of Sky Lake having higher IPC than Ryzen.  Can't have a different conclusion from everyone else, after all.
    Aren't you now forgetting 99.9% of us? That is only true for the small minority who optimize code, and look at benchmarks from that point of view.

    CPU-Z is clearly trying to serve those of us who can never optimize their code, and instead are looking for best processor to run the code on programs we can't change. If they run into a situation where their test results aren't representative of any real-life situation, then they failed in their aim to simulate real use and need to correct the error they made.

    I don't disagree with your sentiment, but how is that even realistic? How can they anticipate or how can I relate my workload to their benchmark? They're making a bunch of assumptions about what I'm trying to do.

    Most programs are different enough that it's not helpful to lump them into a generic "not optimized category". What weird stuff are we each trying to do? My weird issues aren't likely to be yours and neither are likely to be represented by their guesses.

    For me at least, that is why benches have very little use. One of my favorite benchmarks is the "PDF from hell test" that Anand uses to test single threaded performance in a rare/fringe real world scenario. Benches make good conversation material though.
  • Jean-Luc_Picard La Barre Member EpicPosts: 6,606
    edited May 5
    Nowadays, some of the very few who optimize their code at the assembly language level are those who write synthetic benchmarks. Talk about reflecting the "real world".
    Post edited by Jean-Luc_Picard on
    "The ability to speak doesn't make you intelligent" - Qui-gon Jinn in Star Wars.
    After many years of reading Internet forums, there's no doubt that neither does the ability to write.
    CPU: Core I7 7700k (4.80ghz) - GPU: Gigabyte GTX 980 Ti G1 Gaming - RAM: 16GB Kingston HyperX Savage DDR4 3000 - Motherboard: Gigabyte GA-Z270X-UltraGaming - PSU: Antec TruePower New 750W - Storage: Kingston KC1000 NVMe 960gb SSD and 2x1TB WD Velociraptor HDDs (Raid 0) - Main display: Philips 40PUK6809 4K 3D TV - Second display: Philips 273v 27" gaming monitor - VR: Pimax 4K headset and Razer Hydra controllers - Soundcard: Pioneer VSX-322 AV Receiver HDMI linked with the GPU and the TV, with Jamo S 426 HS 3 5.0 speakers and Pioneer S-21W subwoofer - OS: Windows 10 Pro 64 bits.