I just read that Nvidia has problems supporting DX12 on Maxwell while AMD does not. Async compute, whatever that is, but it supposedly gives AMD an edge. I just read a little about it this morning. Supposedly the developers of Ashes of the Singularity have been posting and blogging about it, and there's a huge post on it over at overclock.net. Has this been talked about here?

If one GPU vendor can do something and the other can't, and that something isn't obviously incredibly useful, then it's probably not going to be used in games at all outside of titles sponsored by the vendor that can do it. Both DirectX and OpenGL have had compute shaders for several years, allowing you to do whatever computations you need for graphics wherever you want them. I'm not aware off-hand of any games that actually use them, though I could certainly believe that there are such games out there and I'm simply not aware of them.

Indeed, it's a tribute to the increasing compute versatility of GPUs from both vendors that "older" GPUs mostly support DirectX 12 at launch, rather than requiring radical changes in silicon. That typically didn't happen with earlier versions of the APIs, at least for major versions rather than minor steps.

All AAA games nowadays are console ports. If console games start utilizing async shaders for the performance boost, then PC ports will need to use them too, or else they will run like crap. Current games that don't use them already run like crap; it will only get worse once async gets used. None of the Nvidia cards have the hardware for async; they can only emulate it through the driver, and they do it badly.

It's important to distinguish between: 1) This particular implementation of an algorithm doesn't run well on Nvidia hardware, and 2) This particular algorithm cannot be made to run well on Nvidia hardware.

The former is going to happen all the time, as code optimized for one architecture isn't necessarily good code for another. That's less of a problem with graphics where what you do is more standardized and predictable, but far more of a problem if you move away from graphics and into compute tasks that the GPU vendors didn't anticipate.

The latter will happen a lot, too, with code that just doesn't make sense to run on GPUs--whether AMD, Nvidia, or anything else. (For example, anything single-threaded.) And there are certainly some algorithms that run far better on one GPU vendor than the other. If you're leaning heavily enough on local memory, for example, even an aging Radeon HD 7970 will have a pretty good shot at beating a GeForce GTX Titan X outright, and a Radeon R9 Fury X will be well over double that.

But a lot of times (1) happens without it being a case of (2). For example, let's consider integer multiplication. On Kepler, only 1/6 of the shaders can do integer multiplication. If you need all integer multiplication all the time, Kepler is going to choke. But if you only need a little bit of it here and there, Kepler will handle it well.

Maxwell, meanwhile, doesn't have an integer multiplication operation at all. Rather, it has to chain together multiple operations to do a simple integer multiply. But if you need all integer multiplication all the time, all the shaders can do it, so Maxwell will beat out Kepler.
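To make the "chain of operations" concrete, here's a rough host-side C sketch of how a 32-bit multiply can be decomposed into 16x16-bit partial products. This is an illustration of the general technique, not Maxwell's actual instruction sequence:

```c
#include <assert.h>
#include <stdint.h>

/* Decompose a 32-bit multiply into 16x16-bit partial products:
 * a*b mod 2^32 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 16).
 * The a_hi*b_hi term contributes only above bit 32, so it drops out. */
static uint32_t mul32_via_16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;
    uint32_t result = a_lo * b_lo;      /* low partial product */
    result += (a_lo * b_hi) << 16;      /* cross terms land in */
    result += (a_hi * b_lo) << 16;      /* the high half       */
    return result;
}
```

Three dependent operations where Kepler's (scarce) multiplier units need one, which is why neither architecture wins outright: it depends on how multiplication-heavy the workload is.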

Meanwhile, if you want 32-bit integer multiplication on GCN, that takes four passes, so essentially a chain of four operations. That's going to lose badly to Nvidia. Unless, that is, you happen to know that you only need 24-bit integer multiplication and not 32-bit. You probably don't have a 24-bit data type, but if you happen to know that the high eight bits of a 32-bit integer are all zero, you can use mul24. In that case, AMD has that operation in all of the shaders as a full-speed operation, and so GCN handily destroys both Kepler and Maxwell.

But if what you really need fits in 24-bit integer multiplication and you ask for 32-bit instead, you make your multiplication slower on GCN by a factor of four. There's no reason to use a mul24 operation on Nvidia GPUs, as that doesn't map to anything they have in silicon. So if you take Nvidia-optimized code that is integer multiplication heavy and could have fit mul24 but doesn't ask for it and run it on AMD, it looks like Nvidia wins by a huge margin. That flips around into a huge win for AMD as soon as you fix the code to use mul24.
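For reference, mul24-style builtins (e.g. CUDA's __umul24, OpenCL's mul24) take the low 24 bits of each 32-bit operand and return the low 32 bits of the product. A host-side C emulation of that contract, just to pin down the semantics:

```c
#include <assert.h>
#include <stdint.h>

/* Emulate unsigned mul24 semantics: only the low 24 bits of each
 * operand participate; the result is the low 32 bits of the product.
 * When both operands already fit in 24 bits, this matches a plain
 * 32-bit multiply, which is exactly the case where it's safe to use. */
static uint32_t emulated_mul24(uint32_t x, uint32_t y)
{
    return (x & 0x00FFFFFFu) * (y & 0x00FFFFFFu);
}
```

The point being that the substitution is only valid when you can prove the high bits are zero; the builtin silently discards them rather than checking.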

This doesn't necessarily mean that you have to write completely independent code for every architecture, or even every GPU vendor. A few #ifdef statements in sections where you know that different architectures strongly prefer that an algorithm be implemented in different ways can often suffice.
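A minimal sketch of that #ifdef pattern, with a host-side stand-in for the builtin so it compiles anywhere (GPU_VENDOR_AMD is a hypothetical macro your build system would define when targeting GCN, not a real predefined symbol):

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for a mul24 builtin; on GCN the real builtin
 * maps to a full-speed instruction in every shader. */
static uint32_t mul24(uint32_t x, uint32_t y)
{
    return (x & 0x00FFFFFFu) * (y & 0x00FFFFFFu);
}

/* One algorithm, two multiply strategies, chosen at compile time. */
#ifdef GPU_VENDOR_AMD
#  define MUL_SMALL(x, y) mul24((x), (y))   /* 4x faster on GCN      */
#else
#  define MUL_SMALL(x, y) ((x) * (y))       /* native 32-bit path    */
#endif

/* Typical use: index math where operands are known to fit 24 bits. */
uint32_t flat_index(uint32_t row, uint32_t stride, uint32_t col)
{
    return MUL_SMALL(row, stride) + col;
}
```

The bulk of the algorithm stays shared; only the hot operation is switched per target.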

You might ask, why do different architectures handle integer multiplication so differently? Because it's not used much for graphics, so the GPU vendors aren't saying, we have to make this fast or else games won't perform well. If you need floating-point multiplication rather than integer, for example, all shaders in all remotely modern GPU architectures can do 32-bit floating-point FMA as a full-speed operation. If something isn't useful for graphics, whether GPU vendors will decide to put it in for compute purposes is less predictable.
