Everyone who follows GPUs knows that AMD had a far more efficient GPU architecture than Nvidia during Nvidia's Fermi generation. That is, comparing the Radeon HD 5000/6000 series to the GeForce 400/500 series. Then with Kepler, Nvidia got a slight lead, comparing the Radeon HD 7000/Radeon R 200 series to the GeForce 600/700 series. With Maxwell, Nvidia has really pulled away, with the GeForce 900 series massively more efficient than the Radeon R 300 series. That's true by whatever metrics you prefer, whether performance per mm^2, performance per watt, performance per $ to build the cards, or whatever. Right?
Wrong. That's because you're only considering graphics. Up through Nvidia's Tesla generation, their GPUs were really only built for graphics, and if you wanted to try to use them for anything else, good luck. With Fermi, Nvidia tried to make an architecture more suited to general-purpose compute. Compared to previous Nvidia GPUs or even the contemporary AMD GPUs, they succeeded. AMD, meanwhile, would keep their GPUs focused purely on graphics for another two years. While AMD GPUs could be used for non-graphical things, their VLIW architectures were very restrictive in what they could handle well, so non-graphical workloads were at minimum a pain to code for, and performance would often be dismal.
With GCN, the situation flipped. Now AMD and Nvidia were both trying to make their GPUs work both for graphics and non-graphical compute. But Nvidia was focused more heavily on graphics, while AMD put more non-graphical stuff in. The extra stuff AMD put in came at a cost, and made Kepler slightly more efficient than GCN at graphics. With Maxwell, Nvidia made an architecture focused purely on graphics, and the non-graphical stuff was out entirely.
So what is this non-graphical stuff? The most publicized items are double-precision (64-bit floating point) arithmetic and ECC memory, but they're hardly the only things. FirePro versions of AMD's Hawaii chip (Radeon R9 290/290X/390/390X) absolutely slaughter every other chip ever made at double-precision arithmetic, whether GeForce, Quadro, Tesla, Xeon, Opteron, Xeon Phi, POWER, ARM, Cell, FPGAs, or anything else you can think of. They beat a GeForce GTX Titan X by about a factor of 14. Seriously, fourteen. Getting a Quadro version doesn't help the Titan X, either, and there is no Tesla version. They beat AMD's own Fiji chip by about a factor of five. The nearest competitor is Nvidia's best Tesla chip, which the FirePro beats by about 83%.
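If you want to sanity-check those ratios, the back-of-the-envelope math is just shader count, times two ops per fused multiply-add, times clock, times the chip's FP64 rate. Here's a rough sketch using commonly quoted specs; I'm assuming the FirePro W9100 as the Hawaii card and the Tesla K40 at base clock as the "best Tesla chip" the ~83% figure refers to, so treat these as ballpark inputs rather than exact numbers.

```python
# Rough peak FP64 estimate: shaders x 2 ops per FMA x clock (GHz) x FP64 rate.
def peak_fp64_tflops(shaders, clock_ghz, fp64_rate):
    """Theoretical peak FP64 TFLOPS, assuming one FMA (2 flops) per shader per cycle."""
    return shaders * 2 * clock_ghz * fp64_rate / 1000.0

# Approximate public specs; clocks and rates are assumptions, not measurements.
hawaii_firepro = peak_fp64_tflops(2816, 0.930, 1 / 2)   # FirePro W9100:    ~2.6 TFLOPS
titan_x        = peak_fp64_tflops(3072, 1.000, 1 / 32)  # GTX Titan X:      ~0.19 TFLOPS
fiji           = peak_fp64_tflops(4096, 1.050, 1 / 16)  # Radeon R9 Fury X: ~0.54 TFLOPS
tesla_k40      = peak_fp64_tflops(2880, 0.745, 1 / 3)   # Tesla K40 (base): ~1.4 TFLOPS

print(f"Hawaii FirePro vs Titan X: {hawaii_firepro / titan_x:.0f}x")    # ~14x
print(f"Hawaii FirePro vs Fiji:    {hawaii_firepro / fiji:.1f}x")       # ~5x
print(f"Hawaii FirePro vs K40:     {hawaii_firepro / tesla_k40:.2f}x")  # ~1.8x, i.e. ~80% faster
```

The exact percentages move around depending on which card and which clock you plug in, but the factor-of-14 gap over the Titan X doesn't.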
It takes a lot of dedicated silicon to offer that sort of world-beating double-precision arithmetic performance. And the silicon to do that is completely disabled in Radeon cards. Not that it would be used at all for graphics even if it weren't disabled. Think that has an impact on making the chip less efficient for graphics? Fiji doesn't have it, which is part of what allows Fiji to be so much more efficient than Hawaii.
Now, double-precision arithmetic tends to be present only in the top-end chip of a GPU generation. AMD and Nvidia have figured out that it's not that expensive to design two versions of their shaders for a generation rather than one: one with the double-precision units and one without. But some things that are there primarily for non-graphical compute aren't so easy to cut out; instead, they filter all the way down the line.
For example, let's consider register space. For the Tesla K80, Nvidia used the GK210 chip, which is basically a GK110 with double the register space per SMX, but fewer SMXes enabled to compensate. With 6.5 MB of registers, GK210 has more register space than any other chip Nvidia has ever made. That's still considerably less register space than AMD's Tonga, let alone the higher-end chips, Hawaii and Fiji. It's a similar story with local memory capacity and bandwidth, where AMD put in massively more than Nvidia, and far more than is plausibly useful for games.
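To put numbers on that: the commonly published figures are 512 KB of registers per SMX on GK210 (double GK110's 256 KB) and 256 KB of vector registers per GCN compute unit, so the totals fall out of simple multiplication. A quick sketch, assuming those per-block figures and the shipping unit counts (13 SMXes enabled per GK210 in the K80):

```python
# Back-of-the-envelope register file totals from per-block capacity x block count.
def total_registers_mb(blocks, kb_per_block):
    """Total register file capacity in MB, given block count and KB per block."""
    return blocks * kb_per_block / 1024

print(f"GK210 (Tesla K80, per GPU): {total_registers_mb(13, 512):.1f} MB")  # 6.5 MB
print(f"Tonga  (32 CUs):            {total_registers_mb(32, 256):.1f} MB")  # 8 MB
print(f"Hawaii (44 CUs):            {total_registers_mb(44, 256):.1f} MB")  # 11 MB
print(f"Fiji   (64 CUs):            {total_registers_mb(64, 256):.1f} MB")  # 16 MB
```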
Not that long ago, Phoronix reviewed the Radeon R9 Fury X on Linux, and noted that it was substantially slower than the GeForce GTX Titan X at games, but also substantially faster at compute. Their conclusion was that compute works well, but AMD needs to put more work into drivers to get gaming performance up to the same standard as its compute performance. Their conclusion was mistaken, however. They didn't realize it, but the difference they were measuring was in silicon, not drivers. While Fiji isn't intended as a compute chip, it has the stuff that all GCN chips have that was put in for non-graphical reasons.
It's also important to understand that this is something that can change instantly from one generation to the next. If, in the next generation, one GPU vendor decides to focus purely on graphics, while the other puts a ton of stuff in for non-graphical compute, the former GPU vendor will predictably be better at graphics and the latter at compute. Either GPU vendor could make either decision (or land somewhere in between), and can make that decision independently with every new architecture.
That said, with subsequent die shrinks, there may well be less of a need for GPU vendors to pick their trade-offs here. Performance is increasingly limited by power consumption rather than die space. A "typical" full node die shrink can increase your power use per mm^2 by about 40%, as it doubles the transistor count while only reducing power per transistor by about 30%. An ARM bigwig a while back publicly raised the possibility of "dark silicon" on future chips, that is to say, parts of the chip left completely unused, because if you simply made the chip smaller instead, you couldn't fit enough pads to get data to and from it.
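To make that arithmetic concrete, here's the same scaling spelled out, with the ~2x density and ~30% per-transistor power reduction treated as rough assumptions rather than exact process figures:

```python
# Toy illustration of the die-shrink arithmetic described above.
density_scaling = 2.0       # assume ~2x transistors per mm^2 after a full node shrink
power_per_transistor = 0.7  # assume each transistor burns ~30% less power

power_per_area = density_scaling * power_per_transistor
print(f"Power per mm^2 changes by a factor of {power_per_area:.1f}")  # ~1.4, i.e. +40%
```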
That may make it possible for GPU vendors to put in all of the compute stuff they want, but power gate it off on GeForce and Radeon cards so as not to waste power, while still including all of the graphics stuff that they want. Larger register files and local memories, like those mentioned above, don't burn much power when they sit idle. That may be a waste of die space on GeForce and Radeon cards, but if the alternative is dark silicon, so what?