Either the rumored specs were off by a factor of two or else the presentation was extremely misleading to try to make it sound like the cards are twice as fast as they are. There isn't really any in-between here. Either the factor of two is there or it isn't, and either way is rather shocking.
I should explain. GPUs have a lot of different pieces, and there isn't just one number that you can point to and say, this is how fast the card theoretically should be across the board. There are a lot of things that you can compute from the paper specs, though. Arguably the most important is the card's computational power, the theoretical TFLOPS, or trillions of floating-point operations per second. It tends to be pretty strongly correlated with memory bandwidth, among other things, as if you took a balanced card and gave it four times as much memory bandwidth but no additional compute, you wouldn't make it much faster in most games, as the memory controllers would mostly be sitting there idle on such an unbalanced card. Similarly, if you gave it four times as much compute but no more memory bandwidth, the shaders would mostly sit there idle waiting on memory.
The leaked specs on the GeForce RTX 3090, from the box art of a new Gainward card, are 5248 shaders and a 1.725 GHz boost clock. You can compute the theoretical TFLOPS from that as (number of shaders) * (clock speed) * 2, or 5248 * (1.725 * 10^9) * 2 ~ 18.1 * 10^12, i.e., about 18.1 TFLOPS. So here's Nvidia's slide from their presentation: https://images.anandtech.com/doci/16060/20200901173540.jpg
36 "shader-TFLOPS". That's double what you'd get from the computations. Card specs do sometimes change late. They don't change at the last minute by a factor of two. Rumors from random people making things up sometimes are off by a factor of two (remember the rumors that the mid-range GF104 was going to be a lot faster than the high-end GF100?), but it doesn't seem plausible that that is the case here.
So where did that factor of two come from? These aren't the only Ampere cards. Nvidia has already announced more specs for their A100 compute card of the same architecture. It has 6912 shaders at about 1.41 GHz, for 19.5 TFLOPS. I don't see where Nvidia explicitly states the shader count, but the TFLOPS number is here: https://www.nvidia.com/en-us/data-center/a100/
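To make the arithmetic concrete, here's a minimal sketch of the standard peak-throughput calculation, treating the RTX 3090 numbers as the leaked, unconfirmed ones:

    def peak_tflops(shaders, clock_ghz, flops_per_shader_per_clock=2):
        # Conventional peak estimate: each shader retires one fma
        # (counted as two floating-point operations) per clock.
        return shaders * clock_ghz * flops_per_shader_per_clock / 1000.0

    # Leaked GeForce RTX 3090 specs: 5248 shaders at a 1.725 GHz boost clock.
    print(peak_tflops(5248, 1.725))  # ~18.1 TFLOPS

    # Nvidia's A100: 6912 shaders at about 1.41 GHz.
    print(peak_tflops(6912, 1.41))   # ~19.5 TFLOPS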
It seems implausible that the rumored shader count is off by a factor of two, as that would mean that Nvidia packed massively more shaders into a GeForce card than into their A100 compute card, which is the largest GPU ever built. Maybe I could believe that if the GeForce card chopped out a ton of compute stuff, but their presentation explicitly said that it includes the full tensor cores. The RTX 3080 is listed as 28 billion transistors, or only a little over half of what the A100 uses.
It also seems implausible that the clock speed is off by a factor of two. GPUs tend to clock a little under 2 GHz these days for reasons of physics, and if they suddenly tried to clock a huge GPU at 3.5 GHz, that's not going to end well.
Most likely, the factor of two is that instead of a * 2 in the computation, they did a * 4. But why?
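Redo that arithmetic with a 4 and you land almost exactly on the slide's number:

    # Same leaked specs, but counting 4 operations per shader per clock.
    print(5248 * 1.725 * 4 / 1000)  # ~36.2, matching the slide's 36 "shader-TFLOPS"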
At this point, it's very reasonable to ask, why should there be a * 2 at all? The answer to that is that all shaders can do one fma per clock, where fma(a, b, c) = a * b + c, all as 32-bit floats. An add is one operation and a multiply is another, so being able to do both at once per clock counts as two. Yeah, that's fishy, but all modern GPU vendors do it, so it at least gives you comparable numbers.
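As a toy illustration of that counting convention (nothing GPU-specific here):

    def fma(a, b, c):
        # One fused multiply-add instruction, conventionally counted as
        # two floating-point operations: a multiply plus an add.
        return a * b + c

    # A dot product of length n issues n fmas, so it counts as 2*n FLOPs.
    # That's the factor of 2 in the peak-TFLOPS formula above.
    def dot(xs, ys):
        acc = 0.0
        for x, y in zip(xs, ys):
            acc = fma(x, y, acc)
        return acc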
Did Nvidia add some other operation that they think should count as four? The most obvious candidate would be packed half math. That is, rather than putting a 32-bit float in a 32-bit register, you pack two 16-bit halfs in, with a 16-bit number in the high 16 bits and a completely independent 16-bit number in the low 16 bits. Then you have an operation that can do an fma on the high 16 bits and another fma on the low 16 bits at the same time. So that counts as four. But it's not four 32-bit operations. It's four 16-bit operations. This is why GPUs will commonly cite one number of TFLOPS at 32-bit precision, and another number at 16-bit. See here, for example: https://www.amd.com/en/products/professional-graphics/instinct-mi50
6.6 TFLOPS double precision
13.3 TFLOPS single precision
26.6 TFLOPS half precision
53 TOPS int8 precision
It takes two 32-bit registers to hold a 64-bit number, but a single 32-bit register can hold a full 32-bit float, two 16-bit halfs, or four 8-bit integers. It's no coincidence that halving the data size doubles the throughput, at least if the card has the logic for it, as AMD's Vega 20 does.
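Here's a rough software model of that register layout, purely as an illustration (it doesn't correspond to any particular GPU's instruction set; it just shows why one packed instruction counts as multiple operations):

    import struct

    def pack_halfs(hi, lo):
        # Pack two independent float16 values into one 32-bit "register":
        # hi goes in the high 16 bits, lo in the low 16 bits.
        (hi_bits,) = struct.unpack("<H", struct.pack("<e", hi))
        (lo_bits,) = struct.unpack("<H", struct.pack("<e", lo))
        return (hi_bits << 16) | lo_bits

    def unpack_halfs(reg):
        hi = struct.unpack("<e", struct.pack("<H", (reg >> 16) & 0xFFFF))[0]
        lo = struct.unpack("<e", struct.pack("<H", reg & 0xFFFF))[0]
        return hi, lo

    def packed_fma(reg_a, reg_b, reg_c):
        # One "instruction" that does an independent fma on each half,
        # which is why it gets counted as four 16-bit operations.
        a_hi, a_lo = unpack_halfs(reg_a)
        b_hi, b_lo = unpack_halfs(reg_b)
        c_hi, c_lo = unpack_halfs(reg_c)
        return pack_halfs(a_hi * b_hi + c_hi, a_lo * b_lo + c_lo)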
But if some other operation is the culprit, why didn't they count it in their A100? Furthermore, the ratio of various operations is dictated by the architecture. Double the number of compute units and you double the throughput of everything. Increase the clock speed by 10% and you get 10% more throughput of everything. The A100 cites a peak of 312 TFLOPS half precision from its tensor cores (= 19.5 TFLOPS single precision * 16), with the 16 being a factor of 2 from half precision times a factor of 8 because the tensor cores use 8x8 matrices.
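Restating that last ratio as arithmetic, with the numbers straight from Nvidia's A100 page:

    a100_fp32_tflops = 19.5          # peak single precision from the shaders
    a100_tensor_fp16_tflops = 312.0  # peak half precision from the tensor cores

    print(a100_tensor_fp16_tflops / a100_fp32_tflops)  # 16.0
    print(2 * 8)  # 2 from half precision times 8 from the tensor cores' 8x8 matrices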