Perhaps we misunderstand the primary use of hyperthreading in games

Quizzical Member Legendary Posts: 25,351

To be clear, by "hyperthreading", I'm talking about the feature in many recent Intel processors in which each core has extra scheduling resources so that a single core can have two threads active at once.  If one thread doesn't have anything ready to execute (e.g., because it's waiting to get data from system memory), then the other thread can use the core briefly.  This lets a CPU core bounce back and forth between two threads on a nanosecond scale to fill small gaps when a core would otherwise be idle.  This is much faster than an OS can switch them back and forth.

Having one core with hyperthreading isn't nearly as good as having two real cores, of course.  While hyperthreading lets a single core juggle two threads, it can never have both threads execute something at the same time, as can easily be done with two separate cores.

In gaming benchmarks, a quad core with hyperthreading typically has little to no advantage over a quad core without hyperthreading.  The reasons for this are pretty simple:  many games don't scale to more than four cores, and even those that do are likely to be video card limited rather than processor limited on a system with four fast cores.

But if you look at dual cores, it's a different picture entirely.  Intel's Pentium and Celeron branded processors tend to offer miserable gaming performance.  They'll still make many games playable, but they'd be a huge step down from a budget AMD quad core.

Meanwhile, a Core i3 that is nearly the same thing except with hyperthreading often fares much better.  Yes, the Core i3 is clocked higher, and has more L3 cache, but often it beats a Pentium dual core of the same architecture by far more than you'd predict from that.  Sometimes it beats out a comparably-priced AMD quad core even when a Pentium dual core loses badly.  That wouldn't happen in a game that scaled well to four cores.  Yet hyperthreading is useless in programs that can't put more than two cores to good use.

For quite a while, I found this rather puzzling.  But now that I've been programming a game, I think I have the answer.

Games typically have only one CPU thread that communicates with the video card, because if two threads try to talk to the video card at the same time, they'll trip over each other and break everything.  For technical reasons, my game actually has two threads that communicate with the video card:  one initializes some things and then dies, and the other isn't allowed to talk to the video card until the first is done.  For complicated reasons, this makes the game load faster.  Regardless, only one thread can communicate with the video card at a time.  While DirectX 11 offers multithreaded rendering to get around this, there are compelling reasons not to use it unless you know that the client has many CPU cores.  We'll get there shortly.

Since only one thread can handle the video card, the approach is to have other threads handle everything else, while the one rendering thread mostly just passes along the data that other threads have prepared, doing a little bit of work to keep things organized.  The advantage of this approach is that other work doesn't have to wait on passing data to the video card; it can proceed as soon as it's ready.  That means you're not bottlenecked by having to do a large fraction of the work in the rendering thread, and thus unable to scale well to many CPU cores.
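
Here's a minimal sketch in Java of that division of labor.  The class and method names (PreparedSurface, prepareNextVisibleSurface, issueDrawCommands) are made up for illustration rather than taken from my game; the point is only that the workers never touch the video card and the render loop is the one thread that does.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RenderQueueSketch {

    // Data a worker prepares for one surface; the render thread only reads it.
    static final class PreparedSurface {
        final int programId;      // which shader program this surface needs
        final float[] uniforms;   // per-surface data to hand to the video card
        PreparedSurface(int programId, float[] uniforms) {
            this.programId = programId;
            this.uniforms = uniforms;
        }
    }

    // Bounded, so workers can't run arbitrarily far ahead of the render thread.
    static final BlockingQueue<PreparedSurface> queue = new ArrayBlockingQueue<>(1024);

    // Worker threads: cull, transform, and enqueue; they never issue GL calls.
    static void workerLoop() throws InterruptedException {
        while (true) {
            PreparedSurface s = prepareNextVisibleSurface();  // the CPU-heavy part
            queue.put(s);                                     // blocks if the queue is full
        }
    }

    // Render thread: the only thread that talks to the video card.
    static void renderLoop() throws InterruptedException {
        while (true) {
            PreparedSurface s = queue.take();  // waits until a surface is ready
            issueDrawCommands(s);              // set uniforms, issue the draw call, etc.
        }
    }

    static PreparedSurface prepareNextVisibleSurface() {
        return new PreparedSurface(0, new float[0]);  // placeholder for the real per-surface work
    }

    static void issueDrawCommands(PreparedSurface s) {
        // The actual glUniform*/glDrawArrays calls would go here.
    }
}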

The "problem" comes when the processor can pass data and commands along to the video card faster than the video card can process them.  That's an inevitable result of a single drawing command easily being able to cause hundreds of thousands of shader invocations on the video card.  What I think happens is that the video drivers just put some things in a queue that the video card will handle when it's ready.

If the queue gets big enough, however, the video card basically tells the rendering thread to stop and wait for the video card to process some data before continuing.  This is a good thing: if a video card let you have 20 frames' worth of rendering commands sitting in a queue, the resulting display latency would probably make the game unplayable.

The video card might well be ready again a few microseconds later, so you don't want to put the rendering thread to sleep for a few milliseconds and potentially leave the video card idle for a while after that.  Instead, the rendering thread remains "active" and uses up a core for those few microseconds while it is waiting for the go-ahead from the video card to continue.

If a game is very much GPU-bound, the rendering thread could easily spend 2/3 of its time waiting on the video card.  Yet it uses up a CPU core that entire time, even though it's commonly doing nothing useful other than waiting.

And yes, I'm pretty sure that this does happen.  In my game, I've tried settings that should be very GPU-heavy with not much CPU load, and I get CPU usage of one core plus a small fraction.  Cut the per-frame GPU load in half (say, by turning off SSAA) without changing the per-frame CPU load and the frame rate doubles, while CPU usage becomes one core plus a slightly larger fraction--even though the CPU is now doing twice as much work as before, to prepare twice as many frames.

So what does this have to do with hyperthreading?  Having a high-priority thread that is using a core while not executing that many instructions is a tailor-made scenario for hyperthreading to shine.  Put two threads on a core and the rendering thread mostly leaves gaps that the other thread can fill, so the other thread on the same core can have performance that is a large fraction of what it would have if it had the core to itself.

That leads to a strange conclusion:  hyperthreading is likely to be particularly useful in games that are largely GPU-bound.  But that's not as strange as you might think.  On a second to second basis, the relative amount of work that a GPU does as compared to a CPU typically doesn't fluctuate that wildly for a given game at given settings on given hardware.  That's what you see if you try to measure CPU or GPU load by using Windows Task Manager, CPU-Z, GPU-Z, Catalyst Control Center, or whatever.

But on a millisecond scale, the relative load can fluctuate wildly.  At the start of a new frame, the CPU knows that there are a bunch of objects that the game might potentially want to draw, so it can have a bunch of threads process those objects at once to get them ready to draw and stick them in a queue for the rendering thread.  If you're GPU-bound, the CPU (other than the rendering thread) may be done with its work for that frame while the video card is only 1/3 done, so the CPU then gets to sit there and wait.  It can easily happen that CPU usage is nearly 100% for a few milliseconds, then nothing is active but the rendering thread for the rest of that frame, with the system bouncing back and forth between those two extremes over the course of most frames.
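
Here's a rough sketch of that per-frame burst in Java, using a thread pool.  The names (GameObject, mightBeOnCamera, enqueueForRenderThread) are hypothetical, and a real engine would batch objects rather than submit one task apiece, but the shape is the same: every worker core is busy for a few milliseconds, then the pool goes quiet until the next frame.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FrameBurstSketch {
    interface GameObject { boolean mightBeOnCamera(); }

    // Leave one hardware thread free for the rendering thread itself.
    static final ExecutorService pool = Executors.newFixedThreadPool(
            Math.max(1, Runtime.getRuntime().availableProcessors() - 1));

    static void prepareFrame(List<GameObject> candidates) throws InterruptedException {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (GameObject obj : candidates) {
            tasks.add(() -> {
                if (obj.mightBeOnCamera()) {      // most objects fail this test
                    enqueueForRenderThread(obj);  // hand the prepared data to the render thread
                }
                return null;
            });
        }
        pool.invokeAll(tasks);  // near-100% CPU usage for a few milliseconds...
        // ...then these threads sit idle while the video card chews through the frame.
    }

    static void enqueueForRenderThread(GameObject obj) {
        // e.g., put the prepared data on the bounded queue from the earlier sketch
    }
}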

Now, that's not a bad thing, really.  Remember where I said above that the rendering thread tries to organize things?  That's a lot easier to do if you have everything for the frame available to be organized than if you only have a few surfaces at a time.  Switching shader programs in particular is expensive, so if you can sort a bunch of surfaces to draw by program, you can draw many surfaces between the times that you have to switch programs.  For example, switch to the program to draw ellipsoids, then draw all of the ellipsoids for the entire frame at once, then switch to the program to draw tree branches, then draw all of the tree branches for the entire frame, and so forth.  If you're mostly waiting on the CPU, by contrast, the rendering thread may get one surface, send it along to the video card, then switch programs to whatever the next surface with data ready needs, and so forth--and end up switching programs nearly every single time it draws anything.
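
A minimal sketch of that sorting step, again with made-up names (QueuedSurface, switchProgram, draw are illustrations, not my game's actual code):

import java.util.Comparator;
import java.util.List;

public class DrawOrderSketch {
    static final class QueuedSurface {
        int programId;   // which shader program draws this surface
        int textureId;   // which texture it needs
        // ...plus whatever per-surface uniform data the worker threads prepared
    }

    static void drawFrame(List<QueuedSurface> frame) {
        // Group surfaces so that program (and texture) switches are rare.
        frame.sort(Comparator
                .comparingInt((QueuedSurface s) -> s.programId)
                .thenComparingInt(s -> s.textureId));

        int activeProgram = -1;
        for (QueuedSurface s : frame) {
            if (s.programId != activeProgram) {
                switchProgram(s.programId);  // the expensive part: glUseProgram and friends
                activeProgram = s.programId;
            }
            draw(s);  // set uniforms and issue the draw call
        }
    }

    static void switchProgram(int programId) { /* GL calls would go here */ }
    static void draw(QueuedSurface s) { /* GL calls would go here */ }
}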

Hyperthreading means that the rendering thread doesn't have to waste a core when it's just waiting for the video card to be ready.  For a dual core, that can mean one core available to do "real" work versus two, and yes, that's a huge difference.  That can account for the chasm in performance between a Core i3 and a Pentium dual core even in games that don't scale well to four cores.  Meanwhile, being unable to use four real cores means that the extra cores of an AMD quad core can't be used to their full effect, and so it may well also lose to the Core i3.  Four slower cores don't beat two faster cores if you can't put more than two to good use.

And yet, the real problem here is that you're mostly GPU bound, which is why the games where this happens commonly show a bunch of CPUs bunched near the top, with a Core i3 maybe beating, say, an FX-4300 by several frames per second, but only losing to a Core i7-3770K by about that same margin.  Get a faster video card (or just turn down graphical settings that don't put much load on the CPU) and the relative CPU results could easily change.

The upshot is that hyperthreading matters a lot for gaming on a dual core.  But if you're buying a new processor today with gaming in mind, you don't want a dual core.  Which is almost to say that hyperthreading doesn't actually matter much for gaming at all.  But not for the reasons you thought.

Comments

  • Barbarbar Member Uncommon Posts: 271
    This scenario where only one core deals with the GPU would seem to imply that we are stuck with games only utilising a few cores: one core to deal with the game, and the rest to run errands, more or less.
  • Quizzical Member Legendary Posts: 25,351

    No.  What you describe as "running errands" is most of the work that the CPU has to do.

    The rendering thread mostly just passes data and rendering commands to the video card.  Other threads can determine what data needs to be passed.

    For example, in every single frame, you have to check everything that is near enough to be loaded to see if it will appear on camera.  Most things won't, as they'll be off to the side or behind the camera or some such.  So for most of the objects that one could potentially draw, the CPU has to do some work to decide not to draw them.  (A rough sketch of one such visibility test appears at the end of this post.)

    Even if you do decide to draw something, you have to determine exactly where it is relative to the camera (which changes every frame if the camera moves, even if the object does not), how the object is oriented in your local coordinate system, and various other things.  The CPU can run through hundreds of lines of code setting up one surface to be drawn.

    What the rendering thread has to do to draw the surface is far less.  Here's my exact source code for the contents of the loop to draw one particular type of surface:

    SwitchTexture(theSurface.texID);
    gl.glUniform2fv(shinyUnif[14], 1, theSurface.shininess);
    gl.glUniform2fv(axesUnif[14], 1, theSurface.axes);
    gl.glUniform3fv(moveVecUnif[14], 1, theSurface.moveVector);
    gl.glUniformMatrix3fv(objMatUnif[14], 1, false, theSurface.rotMatrix);
    gl.glDrawArrays(GL4.GL_PATCHES, 0, 4);

    The contents of "theSurface" are created by another thread and not modified by the rendering thread.  The second through fifth lines basically each tell the video card "use this value for this variable".  The first line calls a function I wrote, but it's a very simple function:

    static void SwitchTexture(int newTexture) {
        if (newTexture != texActive) {
            gla.glBindTexture(GL3.GL_TEXTURE_2D, textureList.get(newTexture));
            texActive = newTexture;
        }
    }

    It basically says, check to see if the texture we need this time is the same as the one we needed last time.  If not, then tell the video card which texture to use for the new surface.  And if it is the same texture as the previous one, then do nothing, because the video card already knows which texture to use.

    The last line basically tells the video card, "I've already passed all of the data you need, so now draw it".

    That's not a lot of CPU-side work, and it's far less than the CPU-side work to set up the data that needs to be passed.  It could, however, easily lead to a lot of GPU-side work, as that "now draw something" command at the end can easily lead to many thousands of shader invocations on the video card.  That's why Catalyst Control Center reports GPU activity as being steady around 99% and the rendering thread is mostly waiting for permission from video drivers to pass more data along.
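
    For anyone curious what that per-object "decide not to draw it" test can look like, here's a rough sketch of one common approach: test an object's bounding sphere against the six planes of the camera's view frustum.  The details in my game differ, and the class and field names here are made up for illustration, but it gives the flavor of the CPU-side work the other threads do for every candidate object, every frame.

    public class CullingSketch {
        // A plane in the form ax + by + cz + d = 0, with (a, b, c) pointing into the frustum.
        static final class Plane {
            float a, b, c, d;
            float signedDistance(float x, float y, float z) {
                return a * x + b * y + c * z + d;
            }
        }

        // Returns true if a bounding sphere centered at (x, y, z) with the given radius
        // might be visible; false means the CPU can skip setting the object up entirely.
        static boolean mightBeVisible(Plane[] frustum, float x, float y, float z, float radius) {
            for (Plane p : frustum) {
                if (p.signedDistance(x, y, z) < -radius) {
                    return false;  // entirely outside this plane: cull it
                }
            }
            return true;           // possibly visible: worth preparing for the render thread
        }
    }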

  • Barbarbar Member Uncommon Posts: 271
    Okay, I will stick to the "Perhaps" in your title then.
  • Cleffy Member Rare Posts: 6,412
    I think this will be changing quite soon as a result of GPGPU.  If it's possible to do both GPGPU functions and rendering on the same piece of hardware, it will need more communication with the CPU, which we might see in the next generation of consoles.
  • KenFisher Member Uncommon Posts: 5,035

    Cool info.  I hadn't heard of multi-cores implementing hyperthreading.  I figured it died out after the bad publicity it got regarding cache thrashing.

     


  • waynejr2 Member Epic Posts: 7,769
    You can use a priority queue on the CPU cores to do the work needed, then hand it off to the GPU.
    http://www.youhaventlived.com/qblog/2010/QBlog190810A.html  
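
    A minimal sketch of that idea in Java, assuming a hypothetical WorkItem type ordered so that the most urgent work (say, objects nearest the camera) gets prepared and handed to the GPU first:

    import java.util.concurrent.PriorityBlockingQueue;

    public class PriorityWorkSketch {
        static final class WorkItem implements Comparable<WorkItem> {
            final float distanceToCamera;  // lower = more urgent
            WorkItem(float distanceToCamera) { this.distanceToCamera = distanceToCamera; }
            public int compareTo(WorkItem other) {
                return Float.compare(this.distanceToCamera, other.distanceToCamera);
            }
        }

        // Worker threads add items; the rendering thread takes the most urgent one first.
        static final PriorityBlockingQueue<WorkItem> queue = new PriorityBlockingQueue<>();
    }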


  • rounner Member Uncommon Posts: 725

    Thread safety and intelligent loading of textures is rendering 101. Not sure why we needed a wall of text on that.

    Your entire premise is flawed anyhow (you are basically saying a graphics card only has one GPU and it sits asynchronously waiting for PCI-Express data transfers to complete before doing anything).

    Think about what SLI is and how it fits into your post.

    Don't know why you need to keep posting "I'm an expert, look at me" posts when you are just misinforming everyone.

     

  • Quizzical Member Legendary Posts: 25,351
    Originally posted by Cleffy
    I think this will be changing quite soon as a result of GPGPU.  If it's possible to do both GPGPU functions and rendering on the same piece of hardware, it will need more communication with the CPU, which we might see in the next generation of consoles.

    DirectX 11 and OpenGL 4.3 both offer compute shaders.  I haven't looked into exactly how they work, but I have the impression that you can stick whatever computations you want wherever you want in the pipeline.  Well, not absolutely wherever, but at least you have several choices.

    If you try to do GPGPU work for a game using OpenCL or CUDA or some such, then that creates the problem of how to pass data back and forth between your GPGPU program and your rendering program.  With GPU PhysX, the answer to that seems to be "with a huge performance hit".

    But I'm not sure how useful compute shaders or other GPGPU computations will be for gaming.  Geometry shaders and tessellation have already added vastly more versatility than we had several years ago.  Geometry shaders read in one primitive (typically a triangle) at a time, but can output as many or as few primitives as you want--and they don't have to be the same type of primitive as was read in.  Want to read in a triangle and output two?  Or twelve?  Have at it.  Want to read in points and output triangles, or the other way around?  Go ahead.

    Tessellation, meanwhile, is really just procedurally generated vertex data.  While it's ideally suited for the geometrically intuitive approach of topological subdivisions of simplicial manifolds with boundary, there's a lot of other stuff that it can do.  For particle effects in my game, the CPU basically says, give me this many particles with this distribution, and then creating the primitives is done in tessellation.

    As another example, to transition between the higher-detail ground drawn near a player and the lower-detail ground off in the distance, I use the tessellation stages to determine, on a patch-by-patch basis, which portions of the more distant ground will be drawn as the higher-detail nearer ground type and should be discarded from the more distant ground program, and which ones still need to be drawn.  That can be done without tessellation, too, but tessellation lets you check one patch and keep or discard the entire thing at once, rather than reading it in as 32 separate triangles and deciding separately for each of them in geometry shaders whether it needs to be kept or discarded.  If a patch is going to be discarded, doing so in tessellation control shaders before the vertices are processed also saves a lot of work.

    Now, more options of how to do things is better, of course.  Maybe compute shaders will provide more efficient ways to do things that you could already do in the normal pipeline.  And the versatility of GPGPU is absolutely critical if you want to go way off the beaten path and make a game based on voxels or raytracing or some such.  But I don't think it's obvious whether GPGPU will ever be useful to more than a relative handful of games.

  • Quizzical Member Legendary Posts: 25,351
    Originally posted by XAPKen

    Cool info.  I hadn't heard of multi-cores implementing hyperthreading.  I figured it died out after the bad publicity it got regarding cache thrashing.

    Intel brought back hyperthreading with Bloomfield in late 2008 and has used it in their architectures ever since.  They also use it in Atom, though I'm not entirely sure when that started or if Atom had it right from the start.

    Windows 7 helped a lot:  if you have two threads doing most of the work and you have multiple cores plus hyperthreading, you want to put those threads on different physical cores.  Vista and earlier were liable to stick both threads on the same physical core, not realizing that two of the logical cores were really the same core.  Windows 7 and later know which logical cores correspond to the same physical core, and won't double up threads on one core via hyperthreading until there are more threads that need to run than there are physical cores.

  • Quizzical Member Legendary Posts: 25,351
    Originally posted by rounner

    Thread safety and intelligent loading of textures is rendering 101. Not sure why we needed a wall of text on that.

    Your entire premise is flawed anyhow (you are basically saying a graphics card only has one GPU and it sits asynchronously waiting for PCI-Express data transfers to complete before doing anything).

    Think about what SLI is and how it fits into your post.

    Don't know why you need to keep posting "I'm an expert, look at me" posts when you are just misinforming everyone.

    Since when is this thread about thread safety?  It's primarily about hyperthreading as used in recent Intel processors, as the thread title states.  And more to the point, why hyperthreading seems to offer huge benefits on a dual core in some situations of games that don't scale well to that many CPU cores, while hyperthreading seems to offer little benefit at all for games on a quad core.

    When rendering a game, a video card does have to wait for commands to come in from elsewhere.  It starts working on commands when it gets them, and lets the CPU keep sending more for a while, as a GPU can have a bunch of things going on at once.  Still, if the queue gets too full, it makes the CPU rendering thread stop and wait.  I'm not really sure what you're trying to argue there.

    In most situations, a computer does have only one GPU in active use.  SLI and CrossFire are small niches, and not what this thread is about, anyway.  I'm not sure how SLI and CrossFire work internally.  I do know that OpenGL doesn't have built-in SLI or CrossFire capabilities in the core profile.  That means that it's either done in extensions (which I doubt) or video drivers.  A cursory check of Google seems to point toward the latter.

    My best guess is that video drivers send some commands to both video cards (e.g., buffering textures) and send others only to whichever card is drawing the frame that the CPU is working on at the time (e.g., rendering commands), while letting the CPU queue up vastly more commands, so that the CPU thinks one frame is done and starts on the next to keep one GPU busy while the other continues working on the previous frame.  That's the only way I can think of that it could be done in video drivers, unless the CPU is expected to juggle two frames at once--which would be such a pain to write into game engines that it would probably relegate SLI to working about as often as GPU PhysX:  basically, only when Nvidia pays for it.

  • Ridelynn Member Epic Posts: 7,383

    So, if a game isn't scaling to 4 cores, and Windows 7+ will keep the HT logical cores parked in favor of actual cores, I don't see the HT cores even coming into play.

    Game starts up, requests 40 threads across 2 cores.
    Windows scheduler gives you 2 actual cores.
    HT doesn't come into play unless you force affinity to a HT core.

    I don't think you can make many of these hypothetical analyses without actually seeing how the scheduler assigns these threads on a low core count HT-enabled CPU.  There's a lot of talk about it--both about how games use HT cores for rendering threads, and about how Windows doesn't use them because a real core is faster--but no evidence as to which is actually the case.

    There's also, to muddy the waters, the fact that Zambezi (and later) cores are actually paired up into dual-core modules, with a lot of resources shared between the pair--not totally dissimilar to HT, but a totally different implementation and concept--kinda like how a Wankel rotary is an internal combustion engine, but not really like a traditional reciprocating engine.  They both attempt to do the same thing:  present more cores available for thread execution while minimizing the amount of silicon and maximizing die utilization (like both engines attempt to turn a crankshaft), but they go about it in totally different manners.

  • Ridelynn Member Epic Posts: 7,383

    That, and GPUs are so massively parallel that SLI/CFX is totally a driver issue.  The problem isn't multiple GPUs; each GPU is made of many SMXes/GCN CUs, each with hundreds of CUDA cores/Stream Processors.  Adding another GPU is not terribly hard to imagine: you go from 2000+ SPs to 4000+.  Sure, that's a lot of processors to play with, but you already had 2000+; now you just have double that amount to throw at the problem.  That isn't a huge jump when you look at the scale of everything.

    The problem lies in having to coordinate with them over multiple interfaces (the PCI Express bus and the bridge connectors).  It is totally up to the driver to select the appropriate rendering mode (AFR/SFR/some specialized modes--this is part of what those profile updates are) and to provide any additional adjustments/customizations on top of that, so that each GPU knows what data it is supposed to operate on--and it's been that way since the Voodoo's Scan-Line Interleave (the first rendition of SLI).

    The main crux of the problem is still:
    You have a lot of processing power available, both in multi-core CPU's and in these massively parallel GPU devices (be they in SLI or whatever). However, they have to talk to each other, and that is much, much slower than either of those devices can compute.
