2017-11-24

Multithreaded game engines and rendering

With OpenGL

There has been this one question since the beginning of multicore processors: How do you multithread your game engine? I can't answer this question in general, but I can answer the question: How do you multithread the rendering part of a game engine?

Problem

The simplest case is a single-threaded game engine that does this:

 while(true) {
   simulateWorld();
   renderWorld();
 }

The update step produces a consistent state of the world. After each update, the rendering is done. After the rendering is done, the next cycle is processed. No synchronization needed, everything works. The bad thing: simulateWorld is a cpu-heavy task, renderWorld is a gpu-heavy task. So while the world is updated, the gpu idles, and while the gpu renders, the cpu idles (more or less). And the worst thing: Your framerate is limited by both the cpu and the gpu. Your frametime is timeOf(simulateWorld) + timeOf(renderWorld), while the target frametime is timeOf(renderWorld), kept as low as possible.

Foundations

Okay, I lied: The problem mentioned above is only the overall, global problem we solve. There's a shitload of other problems that mostly depend on the language, platform, graphics API, hardware, your skillset and so on. Since our lifetime is limited, and I want to give practical, real-world-relevant help, I make some assumptions. One of these is that you use OpenGL and a pretty new version of it, let's say 4.3. Multithreaded rendering could probably be done better and faster with DX 12 or Vulkan, but let's stay cross-platform and simpler with OpenGL.

I assume that you already have at least a little knowledge about how game engines work: timesteps, scene representations and so on.

Multiple gpu contexts

In order to be able to issue commands to OpenGL, you have to use a thread that has an OpenGL context bound to it. Architecture-wise, there is the CPU, the OpenGL driver and the GPU. The driver holds a queue where the commands from the CPU are buffered somehow (implementation dependent). Then again, there is a queue for the GPU that buffers commands that came from the driver (also implementation dependent). All commands the CPU issues are asynchronous, which means there's no actual work done on the GPU yet, but the command is queued by the driver. The order is preserved, that's what you can count on. The implementation decides how many commands are queued up until the cpu thread will block on command issuing, and it decides how many commands are buffered on the GPU side of things.

Now, having said that, let's think about multithreading. What we can control is the CPU side (unless you write your own driver, which I doubt you want to do). Multiple cpu threads would require multiple gpu contexts - one for each thread. While this is theoretically possible with OpenGL, it is a very dumb idea in 19 out of 20 cases. As with traditional multithreading, context switching comes with a very high overhead - so you have to ensure that the context switching overhead doesn't eat up the performance you gained from using multiple threads. And here's what's wrong: It seems that all OpenGL drivers (where you issue your commands) except the one for iOS are implemented single-threaded. That means you can issue commands from multiple cpu threads, but on the driver thread those commands are processed sequentially, with a context switch in between. You can find a more detailed explanation here, so that I don't have to lose many further words, but: Don't use multiple contexts for anything other than asynchronous resource streaming.

I assume we only use one OpenGL context. That means all OpenGL calls must happen on a single thread. The construct used for this is the command pattern. A few hints: You should issue as few commands to OpenGL as possible, and they should be as lightweight as possible. Obviously, using only one thread for command execution limits the amount of work that can be done by this (cpu) thread. Nonetheless, you want a single, infinitely running worker thread that owns your OpenGL context and takes commands out of a non-blocking command queue. For convenience, you can separate commands that should return a result (hence block) from commands that are just fired and forgotten. It's not necessary that all commands are properly implemented by your context wrapper/worker thread class, even though it makes your application more predictable in terms of performance. I implemented a non-blocking, generic command queue in Java here that uses Runnables and Callables as commands. The principle is also explained in more detail here.
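A minimal sketch of such a worker thread, assuming a ConcurrentLinkedQueue as the non-blocking queue and a CompletableFuture for commands that need a result (class and method names are made up for illustration, this is not my actual implementation):

 import java.util.Queue;
 import java.util.concurrent.Callable;
 import java.util.concurrent.CompletableFuture;
 import java.util.concurrent.ConcurrentLinkedQueue;

 // Owns the OpenGL context; all GL calls are funneled through this single thread.
 class GpuCommandQueue extends Thread {

   private final Queue<Runnable> commands = new ConcurrentLinkedQueue<>();
   private volatile boolean running = true;

   @Override
   public void run() {
     // makeContextCurrent(); // bind the GL context to exactly this thread
     while (running) {
       Runnable command = commands.poll();
       if (command != null) {
         command.run(); // executed on the GL thread
       }
     }
   }

   // Fire-and-forget: the caller never blocks.
   void execute(Runnable command) {
     commands.offer(command);
   }

   // Command with a result: the caller can block on the returned future if it needs to.
   <T> CompletableFuture<T> submit(Callable<T> command) {
     CompletableFuture<T> future = new CompletableFuture<>();
     commands.offer(() -> {
       try {
         future.complete(command.call());
       } catch (Exception e) {
         future.completeExceptionally(e);
       }
     });
     return future;
   }
 }

Any thread can then call execute for fire-and-forget commands, or submit(...).join() when it actually needs the result back.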

Extractor pattern

Extractor pattern is a term one never finds on the internet when searching for information about rendering. The first time I heard it was when a colleague of mine (a professional graphics programmer) explained the basics of renderer architectures to me many, many years ago. It's a fancy name for a simple thing: If you share data between two consumers (not exactly threads in this case), you either have to synchronize the access somehow, or you extract a copy of the data - in this case to pass it to the renderer. That implies that you need a synchronized/immutable, command-like data structure that represents all the information you need to push a render command. That plays nicely with the chapter above, you know, command queue etc. In languages other than Java where you have true value types, the implementation could be easier - although having very large renderstate objects copied over and over might be a drawback here. The chapter below shows another way of handling this problem.
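As a rough sketch (all names are made up for illustration, assuming a GameObject class with the obvious getters): the update thread extracts an immutable snapshot of everything the renderer needs, and only that snapshot crosses the thread boundary, so the live game objects are never touched by the render thread.

 import java.util.ArrayList;
 import java.util.List;

 // Immutable snapshot of one renderable object - safe to hand to the render thread.
 final class ExtractedRenderCommand {
   final int meshId;
   final int textureId;
   final float[] modelMatrix; // defensive copy, never shared with the update thread

   ExtractedRenderCommand(int meshId, int textureId, float[] modelMatrix) {
     this.meshId = meshId;
     this.textureId = textureId;
     this.modelMatrix = modelMatrix.clone();
   }
 }

 final class Extractor {
   // Runs on the update thread, after the world update for this timestep has finished.
   static List<ExtractedRenderCommand> extract(Iterable<GameObject> objects) {
     List<ExtractedRenderCommand> result = new ArrayList<>();
     for (GameObject o : objects) {
       result.add(new ExtractedRenderCommand(o.getMeshId(), o.getTextureId(), o.getModelMatrix()));
     }
     return result;
   }
 }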

Multibuffering

Let's assume that the GPU only consumes renderstate, but doesn't alter it. That means if the GPU calculates any fancy physics, or results that should be passed to the next render cycle, then it isn't seen as part of our traditional renderstate for now.
That means the only producer of renderable state is the CPU. Since we don't want to render a scene where the first half of the objects is in timestep n and the second half is still in n-1, we have to be able to access a state of our game that is consistent somehow. This is easily achievable with a task-based architecture that basically has only one update thread, which sequentially crunches down what you want to do in parallel. For example, if you have 10 game objects, the update thread pushes 10 update commands into a pool of worker threads, and after all of them have returned a result, we have a consistent world state and have used all the threads our system can push (see the sketch below). I don't think there's an alternative approach to multithreading that can be used in a general-purpose game engine, but I may be wrong. Instead of creating a new renderstate object and pushing a command to the renderqueue, we use something that has been known in multithreading environments for ages: multibuffering.
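A minimal sketch of that task-based update, assuming an ExecutorService worker pool and a GameObject class with an update method (both just for illustration):

 import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;

 class UpdateStep {

   private final ExecutorService workers =
       Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

   // Pushes one update task per game object into the worker pool and blocks until
   // all of them are done - afterwards the world is in a consistent state for this timestep.
   void update(List<GameObject> objects, float deltaSeconds) throws InterruptedException {
     List<Callable<Void>> tasks = new ArrayList<>();
     for (GameObject o : objects) {
       tasks.add(() -> {
         o.update(deltaSeconds);
         return null;
       });
     }
     workers.invokeAll(tasks);
   }
 }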

Multibuffering should not be confused with the double or triple buffering that is used to display images on your monitor, although the principle behind it is the same.

I'll skip explaining why double buffering isn't enough to satisfy our needs. Just use triple buffering and be happy, if you can afford the additional memory consumption. It works the following way.

You have three instances of your renderstate and a wrapper that encapsulates these three instances. Instance A is the currently readable state. The renderer uses this state for rendering, so while rendering is in progress, this instance mustn't be touched by anyone other than the renderer. Instance B is the current staging copy of the renderstate. It is the state that the renderer will use for the next frame, after the current frame is finished. Instance C is the instance the update is currently applied to.
Now, while this sounds simple, the important thing is how to swap those instances correctly. Our main purpose is to keep the GPU busy and to be able to render at maximum fps, although one could theoretically (and practically) limit the framerate somehow (vsync, fps cap etc.).

I realized this with an ever-running thread that constantly pushes and waits for a render-scene drawcall. After the frame is finished, the read copy is swapped with the staging copy. It's important that the renderer gets a fresh copy on every swap: if the update thread isn't fast enough to provide a newly updated renderstate copy, the renderer would render a state that is older than the one just drawn, which causes flickering objects. I solved this with a monotonically increasing counter, and I prevent swapping to a state with a lower number than the current state has. Let's assume your engine and machine are capable of pushing 60 updates per second and a 60 fps framerate.
Let's face the other swap: The update thread updates the write copy C constantly, as fast as possible. When finished, the write state becomes the new staging state B, so we have to swap them. We can't do this if the renderer is currently swapping its read state and the staging state, so we need a lock for swap(A,B) and a lock for swap(B,C). If rendering can be done at okayish framerates (which we assume), the update thread can wait (block) until swap(A,B) is finished. If you are only using coherent memory access, then you're done here. But you wouldn't be searching for state-of-the-art rendering if you didn't use incoherent access. So let's get to the final ingredient we need for our super fast renderer.
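Here's a minimal sketch of the wrapper and the two swaps. I'm assuming a RenderState base class that carries the cycle counter mentioned above, and a single shared lock is one simple way to keep swap(A,B) and swap(B,C) from running at the same time (all names are illustrative):

 // Triple buffer wrapper: A = read, B = staging, C = write.
 final class TripleBuffer<T extends RenderState> {

   private T read;    // A: owned by the renderer while a frame is in flight
   private T staging; // B: newest completed state, waiting to be picked up
   private T write;   // C: currently being filled by the update thread
   private final Object swapLock = new Object();

   TripleBuffer(T a, T b, T c) {
     this.read = a;
     this.staging = b;
     this.write = c;
   }

   // Render thread, after a frame has finished: only swap if staging is newer
   // than read, so we never go back to an older state (no flickering).
   T swapReadAndStaging() {
     synchronized (swapLock) {
       if (staging.getCycle() > read.getCycle()) {
         T tmp = read;
         read = staging;
         staging = tmp;
       }
       return read;
     }
   }

   // Update thread, after a new consistent state has been produced.
   void swapStagingAndWrite() {
     synchronized (swapLock) {
       T tmp = staging;
       staging = write;
       write = tmp;
     }
   }

   T getWrite() {
     return write;
   }
 }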

Lowering draw calls and persistent mapped buffer synchronization

At the very beginning, there is the need to reduce the amount of work the gpu has to do in order to render your scene. There are paths through the OpenGL API that are more expensive than others, and there are paths with very, very small overhead, if you can afford the loss of flexibility that comes with using them. This can give you a rough overview of how costly API calls are. I suggest we skip a long explanation and you read this. Summed up: You want unsynchronized, persistently mapped buffers for all your data and handle synchronization yourself. This results in no driver or gpu work for the synchronization, which is crucial for maxing out your performance. Remember that Vulkan was designed precisely to kill the driver overhead of OpenGL. Furthermore, we reduce draw calls and state changes with bindless resources and one of the most important things: indirect and instanced rendering.
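To illustrate, this is roughly how such a persistently mapped buffer could be created with LWJGL (I'm assuming LWJGL as the Java binding here; the target, flags and size are just an example). The driver no longer synchronizes access for us, which is exactly what the fences in the next paragraph are for:

 import static org.lwjgl.opengl.GL15.*;
 import static org.lwjgl.opengl.GL30.*;
 import static org.lwjgl.opengl.GL44.*;

 import java.nio.ByteBuffer;

 final class PersistentMappedBuffer {

   final int bufferId;
   final ByteBuffer mapped; // stays mapped for the whole lifetime of the buffer

   PersistentMappedBuffer(long sizeInBytes) {
     bufferId = glGenBuffers();
     glBindBuffer(GL_ARRAY_BUFFER, bufferId);

     int flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
     // Immutable storage the CPU can write to while the GPU reads from it -
     // no implicit driver synchronization, we handle it ourselves with fences.
     glBufferStorage(GL_ARRAY_BUFFER, sizeInBytes, flags);
     mapped = glMapBufferRange(GL_ARRAY_BUFFER, 0, sizeInBytes, flags);
   }
 }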

Having all these things implemented, there's one last thing to add to our triple buffering: synchronization. Since the driver works asynchronously, and the gpu does too, there's no guarantee that the gpu isn't still using the buffer of our current write copy that the update thread is writing to. To address this, one has to insert a fence sync object into the pipeline and associate it with a state whenever that state becomes the current write state. When the render thread has used this state and all render commands of the current frame have been pushed, the fence is inserted. Whenever this fence is signaled by the gpu, the update thread can write to the associated state without any problems. The update thread can prepare the renderstate update and wait until the prepared data can be applied. This effectively blocks the update thread if the renderer is more than 2 frames behind the update thread.
Special care has to be taken if you have any timestep-dependent calculations in your rendering step: Since rendering is decoupled from the update step, you have to calculate the timestep yourself somehow in the render thread, based on the last tick and the time passed since then.
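A sketch of that fence handling, again assuming LWJGL (the class is made up; the waiting call has to run on the thread that owns the context, so in our setup it would itself be pushed as a command):

 import static org.lwjgl.opengl.GL32.*;

 final class FencedRenderState {

   private long fence = 0; // GLsync handle, 0 means "no fence pending"

   // Render thread: call after all draw commands for this state's frame were issued.
   void insertFence() {
     if (fence != 0) {
       glDeleteSync(fence);
     }
     fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
   }

   // Blocks until the GPU has passed the fence, i.e. it no longer reads from
   // this state's buffers, so the update thread may write to them again.
   void waitUntilGpuIsDone() {
     if (fence == 0) {
       return;
     }
     int result;
     do {
       result = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1_000_000L); // 1 ms
     } while (result == GL_TIMEOUT_EXPIRED);
     glDeleteSync(fence);
     fence = 0;
   }
 }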

That's it

  • Use the fastest possible paths through the OpenGL API. You need shader storage buffers and indirect (and instanced) rendering, probably combined with bindless textures
  • Model a renderstate class that can be your complete draw command
  • Run a permanent thread that constantly pushes rendercommands