What happens if a CPU benchmark finds that a stock Ryzen beats even an overclocked Core i7-7700K at single-threaded performance?

http://www.cpuid.com/news/51-cpu-z-1-79-new-benchmark-new-scores.html
I'm reading between the lines here, so this isn't the official explanation and could plausibly be wrong. But I think it's likely that this is what happened.
CPU-Z has a CPU benchmark that, at one point, did something to intentionally create a delay in executing code. Higher IPC is all about structuring things so that instructions are ready to execute more often, giving you more instructions done and fewer stalls while you wait on something or other that you need. Well, that, and being able to execute more instructions at once at peak throughput when everything is ready.
CPUs have gotten good at rearranging instructions to avoid having to stop and wait. Modern CPUs don't just wait until they have everything ready to execute the next instruction. If they don't have the data they need for the next instruction, they'll see if they're ready to execute the one after it, or the one after that. This "out of order execution" intentionally rearranges instructions that don't depend on each other, so the core can process whatever is ready without having to wait for something "before" it.
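A toy illustration of what a dependency chain means for this (my own sketch, not anything from CPU-Z): both functions below compute the same array sum, but the first is one long chain of dependent adds, while the second splits the work into four independent chains that an out-of-order core can overlap.

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous add,
 * so the out-of-order engine has almost no independent work to reorder. */
long sum_serial(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four add chains don't depend on each other,
 * so the core can execute them in parallel and hide each add's latency. */
long sum_parallel(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both return the same answer; the difference is only in how much ready-to-execute work is visible to the hardware at any instant.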
AMD claimed that SenseMI in Ryzen had a full machine learning algorithm to do a more sophisticated job of this than ever before. But it's hard to see what such a feature is doing in big programs: with so much going on, a weighted average of a million different things hides the outliers. If traditional out of order execution orders things badly on one pass through a for loop, it's likely to order them badly in the same way, for the same reasons, on the next thousand passes through the loop. SenseMI promised to learn from that, so that even if the first few passes went badly, it would eventually figure out a better ordering.
Enter synthetic benchmarks. The authors of synthetics don't want to write a hundred thousand lines of code. Rather, it's common to loop a small amount of code an enormous number of times. That structure isn't limited to synthetics, but it is more common there. This is pretty much the best possible situation for SenseMI: its cache is very small, and it can't keep track of the best way to order things scattered across a million lines of code.
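The shape of a synthetic benchmark is roughly this (a generic sketch, not CPU-Z's actual kernel): a few lines of arithmetic repeated an enormous number of times, with a tiny code and data footprint that a learning predictor can lock onto almost immediately.

```c
/* A few dependent integer ops standing in for "the workload".
 * The multiplier is a Knuth-style multiplicative hash constant,
 * chosen arbitrarily for illustration. */
unsigned kernel(unsigned x) {
    x = x * 2654435761u;
    x ^= x >> 15;
    return x;
}

/* The whole "benchmark": the same tiny kernel, looped.
 * Returning the accumulator keeps the compiler from deleting the loop. */
unsigned run_benchmark(unsigned iters) {
    unsigned acc = 1;
    for (unsigned i = 0; i < iters; i++)
        acc = kernel(acc);
    return acc;
}
```

A real synthetic would time that loop over millions of iterations; the point here is just how little code the CPU ever sees.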
The CPU-Z benchmark tried to intentionally force the CPU to screw up and stall. I'm not sure why they did this; they may have been trying to more directly incorporate the latency of some particular cache. Ryzen figured out how to fix the problem on the fly and did so. The result was IPC about 30% higher than Skylake's.
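I don't know what CPU-Z actually did to force the stall, but the classic way to deliberately expose cache latency is a dependent-load chain, often called pointer chasing: each load's address comes from the previous load's result, so in principle no amount of reordering can overlap them.

```c
#include <stddef.h>

/* Chase through an index array: idx = next[idx], repeated.
 * Every load depends on the load before it, so each iteration
 * pays the full latency of whatever cache level holds next[]. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t idx = start;
    for (size_t i = 0; i < steps; i++)
        idx = next[idx];   /* address of the next load depends on this one */
    return idx;
}
```

A latency benchmark would fill next[] as a random permutation sized to a particular cache level and time the chase; if the hardware somehow finds a way to overlap or predict those loads, the measured "latency" collapses.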
It's important to understand that this is not at all similar to the stories of cheating at GPU benchmarks over the years. GPU drivers have the source code, can recognize particular benchmarks, and compile things differently for those benchmarks. CPUs don't get that; an AMD x86 CPU gets exactly the same compiled binary as an Intel x86 CPU, and it's only a question of how fast it can run that binary.
But let's return to the original question: what do you do if your benchmark shows unexpected results? And then let's immediately leave that question again for an analogy.
Let's suppose that you're a political pollster. 90% of the people you contact won't take the time to answer your questions, and the 10% who do aren't representative of the 90% who don't. People who will vote don't have the same opinions as people who won't. You've got your own secret sauce to adjust for this, as do all the other pollsters.
So you're polling some race, and by your standard methodology, it looks about tied. Then you look around and see that all of the other pollsters show candidate A up by about 5% over candidate B. What do you do?
What a lot of pollsters will do is look through their crosstabs and say: well, I've overweighted this and underweighted that, and that unfairly benefited candidate B. If I fix that, I'll have candidate A up by 5%, like everyone else. And then maybe you do exactly that. The trouble is, if everyone else had candidate B up by 5%, you could have made an equally legitimate adjustment to show candidate B up by 5%. You didn't trust the methodology you chose before seeing the data; you changed it to show the expected result. And so the variation in results between pollsters ends up smaller than the random noise you'd theoretically expect if they all used exactly the same methodology--which they don't.
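The arithmetic behind that maneuver is simple enough to show with made-up numbers (this is a toy, not anyone's actual methodology): with two demographic groups that split 60/40 and 40/60 for candidate A, the headline number is entirely a function of how you weight the groups.

```c
/* Hypothetical two-group poll. Group 1 splits 60/40 for A,
 * group 2 splits 40/60 for A. w_group1 is the turnout weight
 * the pollster assigns to group 1; the rest goes to group 2. */
double headline_for_A(double w_group1) {
    double a_group1 = 0.60;   /* A's share within group 1 */
    double a_group2 = 0.40;   /* A's share within group 2 */
    double w_group2 = 1.0 - w_group1;
    return w_group1 * a_group1 + w_group2 * a_group2;
}
```

Weight the groups equally (w_group1 = 0.5) and the race is exactly tied at 50%. Decide that group 1 is really 62.5% of the electorate and candidate A is suddenly at 52.5%, up by 5 points. Both weightings can be defended; the data never changed.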
Well, that's what CPU-Z just did. They changed their benchmark to get the expected result of Skylake having higher IPC than Ryzen. Can't have a different conclusion from everyone else, after all.
Benchmarking is hard, and absolutely has a lot of ways that you can tweak things this way or that to skew the results however you want. It's perfectly legitimate for one benchmark to show product A as being much faster than product B while another shows product B as much faster than product A. Getting a variety of benchmarks like that gives you good information about their relative strengths and weaknesses.
But there is a temptation toward wanting a single, unified benchmark that gives "typical" overall results, even though there is no such thing as a typical or average workload. Or perhaps in the CPU world, one for typical single-threaded results and another for typical highly threaded results. But there is no such thing as typical for either of those.
The advantage that political pollsters have is that they occasionally get to see ground truth. Elections actually happen now and then. If everyone else had candidate A winning handily while you had the race essentially tied, and the election is so close that there are recounts, you won that round.
Hardware benchmarkers don't get that. If everyone else has hardware A faster than hardware B, but you have hardware B faster, people think you're "wrong", and there's no election coming that can vindicate you. And so you get benchmark herding even worse than poll herding. The solution is not to rely on any one benchmark as representative of the rest, but to look at different benchmarks that tell you different things. Good reviews do exactly that. An outlier isn't necessarily wrong unless you can find some good theoretical reason why it would be wrong even if it weren't an outlier.