Adventures in Machine Learning with Real-Time Rendering

Machine learning adventures from a graphics engineer's point of view.

*Raytraced ambient occlusion denoised in real-time by our custom neural network. From the Amazon Lumberyard Bistro scene, by 2017 Amazon Lumberyard under License CC BY 4.0*

Introduction

When I first joined Eidos-Montréal in 2018, I was given carte blanche to investigate the potential of Machine Learning (ML) in the context of real-time rendering, sometimes called Graphics AI (Artificial Intelligence). Intrigued and curious, I eagerly accepted the offer and started looking into potential applications.

Before starting this job, I did not have any prior knowledge of ML; this project was mostly educational and a way to check whether applied ML was production-ready for a video game company. In this blog post, I would like to candidly share a high-level view of my experience from the perspective of a newcomer to the world of Graphics AI.

After getting more familiar with the wonderful world of Machine Learning and some of its libraries, such as TensorFlow and PyTorch, I chose a first project inspired by the Neural Network Ambient Occlusion (NNAO) paper by Daniel Holden, because it looked simple enough for a first test. This paper shows that it is possible to train a small neural network to reproduce ambient occlusion (AO) from the rendered normals and depth.

Neural Network Ambient Occlusion

To achieve this in a prototyping environment, I decided to use NVIDIA’s Falcor. Falcor comes with a Horizon-Based Ambient Occlusion (HBAO) algorithm, which acts as our reference target. It also fully supports raytracing via the DXR API, which I would later use for an even better-quality AO reference.

The proposed neural network model is a Multi-Layer Perceptron (MLP), which is easy to execute in a GPU shader because it involves no complex operations, i.e., it does not require cuDNN or DirectML. Coming from a real-time environment, I wanted to take things a bit further and make sure that we could train fast, iterate easily, and, if possible, interactively. Furthermore, a benefit of working in 3D environments is that we can take advantage of game engines to render an unlimited amount of training data (unlike regular machine learning models, which train on a fixed number of images), simply by collecting it while playing the game.
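To illustrate why an MLP maps so easily to a shader, here is a minimal sketch of a forward pass that turns per-pixel normal and depth features into a single AO value. The layer sizes, the random weights, and the feature layout (four neighbouring samples, each with a normal and a depth) are assumptions made for illustration; they are not the architecture from the NNAO paper. Everything reduces to multiply-adds and two simple activation functions.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(x, layer, activation):
    """One fully connected layer: activation(W * x + b)."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp_ao(features, layers):
    """Hidden layers use ReLU; the output layer squashes AO into (0, 1)."""
    x = features
    for layer in layers[:-1]:
        x = dense(x, layer, relu)
    return dense(x, layers[-1], sigmoid)[0]

rng = random.Random(0)
N_SAMPLES = 4               # hypothetical number of neighbouring pixels
N_FEATURES = N_SAMPLES * 4  # normal.xyz + depth for each sample
layers = [make_layer(N_FEATURES, 8, rng),
          make_layer(8, 8, rng),
          make_layer(8, 1, rng)]

features = [rng.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]
ao = mlp_ao(features, layers)
print(f"inferred AO for one pixel: {ao:.3f}")
```

In a real renderer the same loops would run per pixel in a shader, with the trained weights uploaded as constants or a buffer.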

In that spirit, I extended Falcor to continuously train on data coming from the engine. With asynchronous Python execution, one can smoothly “play the game” while the model is training in the background, using new inputs from the current rendering state as soon as a user-defined training step finishes. At each iteration, the inference weights are also updated from the current model state, so that we can monitor training progress interactively, directly on screen, with real-time inferred ambient occlusion.
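The general pattern can be sketched with a background thread, as below. This is only an illustration under stated assumptions: the `BackgroundTrainer` class, its method names, and the one-parameter "model" are hypothetical, and the actual Falcor integration is not described in enough detail here to reproduce. The key points are that the render loop never blocks on training and that it always reads the most recently published weights.

```python
import queue
import threading

class BackgroundTrainer:
    """Trains a toy model y = weight * x on samples pushed by the render loop."""

    def __init__(self, learning_rate=0.1):
        self._samples = queue.Queue(maxsize=64)
        self._lock = threading.Lock()
        self._weight = 0.0
        self._lr = learning_rate
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def submit(self, x, y):
        """Called every frame; drops the sample if the trainer is behind."""
        try:
            self._samples.put_nowait((x, y))
        except queue.Full:
            pass  # rendering must never wait for training

    def latest_weight(self):
        """Weights the renderer should use for inference this frame."""
        with self._lock:
            return self._weight

    def stop(self):
        self._samples.put(None)  # sentinel: finish queued work, then exit
        self._thread.join()

    def _run(self):
        while True:
            item = self._samples.get()
            if item is None:
                break
            x, y = item
            w = self.latest_weight()
            grad = 2.0 * (w * x - y) * x  # gradient of the squared error
            with self._lock:
                self._weight = w - self._lr * grad

trainer = BackgroundTrainer()
trainer.start()
for frame in range(200):
    trainer.submit(1.0, 0.5)          # stand-in for (inputs, reference AO)
    weight = trainer.latest_weight()  # stand-in for updating shader weights
trainer.stop()
final_weight = trainer.latest_weight()
print(f"weight after training: {final_weight:.4f}")
```

Dropping samples when the queue is full is one possible design choice; it keeps the frame rate stable at the cost of not training on every frame.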

A training step takes around 1 or 2 seconds, and the model converges, i.e., gives results close to the reference target, in about 30 seconds. Interestingly, the quality of the results plateaus after those 30 seconds, but the model keeps adapting to unseen cases when confronted with them.

*Top is reference HBAO, bottom is inferred Neural Network AO (NNAO) result. Some ringing artefacts appear due to under-sampling. From the San Miguel scene, by Guillermo M. Leal Llaguno under CC BY 3.0*

*Top is reference HBAO capture, bottom is inferred Neural Network AO (NNAO) results. From the San Miguel scene, by Guillermo M. Leal Llaguno under CC BY 3.0*

Raytraced Ambient Occlusion

This first project was rewarding because it was simple to set up, with extremely fast training and good results. From here, it was time to move on to my second educational project, using raytraced ambient occlusion (RTAO) with the newly introduced DXR raytracing API. For a deeper introduction to raytracing and DXR, please refer to the Ray Tracing Gems book, which is available online.

Raytraced ambient occlusion is based on randomly picking directions (also called random sampling) on the hemisphere around the target location and checking whether each outgoing direction is occluded. Unfortunately, you need to draw a lot of directions (samples) for the result to converge to the ground truth, which can be prohibitively costly.

*With only a few samples (< 1 second). From the Amazon Lumberyard Bistro scene, by 2017 Amazon Lumberyard under License CC BY 4.0*

*After accumulating a lot of samples (~15 seconds).*
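The sampling procedure can be sketched in a few lines. The setup below is a made-up test case, not the Bistro scene: the shaded point sits at the origin with its normal along +Z, and the only occluder is a sphere of radius 1 centred 2 units above it. Directions are drawn with a cosine-weighted distribution, which is a common choice but an assumption here. For this configuration the exact answer is known, 1 - (1/2)^2 = 0.75, which makes it easy to see how noisy a few samples are compared with many.

```python
import math
import random

def sample_cosine_hemisphere(rng):
    """Random unit direction around +Z, with density proportional to cos(theta)."""
    u1, u2 = rng.random(), rng.random()
    r, phi = math.sqrt(u1), 2.0 * math.pi * u2
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u1))

def hits_sphere(direction, center, radius):
    """True if a ray from the origin along `direction` hits the sphere."""
    b = sum(d * c for d, c in zip(direction, center))
    c = sum(v * v for v in center) - radius * radius
    return b > 0.0 and b * b - c >= 0.0

def ambient_visibility(n_samples, rng, center=(0.0, 0.0, 2.0), radius=1.0):
    """Fraction of sampled directions that escape without hitting the occluder."""
    visible = sum(
        0 if hits_sphere(sample_cosine_hemisphere(rng), center, radius) else 1
        for _ in range(n_samples)
    )
    return visible / n_samples

rng = random.Random(0)
noisy = [ambient_visibility(4, rng) for _ in range(5)]  # 4 samples per estimate
converged = ambient_visibility(20000, rng)              # many samples
print("4-sample estimates:", noisy)
print("20000-sample estimate:", round(converged, 3), "(exact: 0.75)")
```

With 4 samples, each estimate can only take the values 0, 0.25, 0.5, 0.75, or 1, which is exactly the kind of per-pixel noise visible in the first image.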

With real-time rendering, we can only afford a few samples per frame, so we randomly sample a small number of directions, which results in a lot of noise in the render. To combat this, we use an additional denoising step to smooth out the results and remove the noise associated with random sampling.

Although the denoising step introduces a few approximations, the results are still of very high quality and better than those of screen-space methods such as the HBAO used in the previous NNAO project.
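How quickly the noise shrinks with the sample count can be written down directly. The notation below is an assumption introduced for this note: V(p, ω) is 1 when direction ω is unoccluded from point p and 0 otherwise, n is the surface normal, and directions are drawn with a cosine-weighted density.

```latex
A(p) = \frac{1}{\pi} \int_{\Omega} V(p, \omega)\, (n \cdot \omega)\, d\omega ,
\qquad
\hat{A}_N(p) = \frac{1}{N} \sum_{i=1}^{N} V(p, \omega_i),
\quad \omega_i \sim \frac{n \cdot \omega}{\pi}

\operatorname{Var}\big[\hat{A}_N(p)\big] = \frac{A(p)\,\big(1 - A(p)\big)}{N}
\quad \Longrightarrow \quad
\sigma_N \propto \frac{1}{\sqrt{N}}
```

Because the standard deviation only falls as 1/√N, halving the noise costs four times as many rays, and going from 4 to 1024 samples per pixel reduces it by a factor of 16. This is why a denoiser is far cheaper than simply tracing more rays.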

While my original goal was to train NNAO with RTAO as a reference, I got sidetracked by the potential of denoisers and decided to refocus my project on this ongoing research area.

Noise2Noise

During SIGGRAPH 2018, I attended Jaakko Lehtinen’s Noise2Noise presentation, which made me want to try it out immediately upon returning to the studio. In a nutshell, the paper shows that you can train a model to produce converged results by feeding it only noisy images. Unlike conventional supervised denoisers, Noise2Noise uses another noisy render as the target image, instead of the converged raytracing output.

Given the ease with which we can get noisy images from raytracing, I thought our raytraced AO was a great test case. Instead of waiting for the raytracing to converge (i.e., a few seconds) before feeding the result to the model, we can now “instantly” (i.e., in a few milliseconds) populate the training dataset, leading to a significant gain in time.
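The statistical argument behind this can be checked on a toy problem. The sketch below is not the Noise2Noise network: it replaces the neural network with a two-parameter linear "denoiser" (output = a * input + b) fitted by least squares, and it simulates AO pixels as the average of 4 random visibility samples. Both are assumptions chosen to keep the example small. Because the noise in the targets averages out to zero, fitting against noisy targets lands on nearly the same parameters as fitting against the clean values.

```python
import random

def noisy_ao(clean, spp, rng):
    """Monte Carlo estimate of an AO value: mean of `spp` visibility samples."""
    return sum(1.0 if rng.random() < clean else 0.0 for _ in range(spp)) / spp

def fit_line(xs, ys):
    """Closed-form least squares for y ~ a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x

rng = random.Random(0)
clean = [rng.random() for _ in range(20000)]          # hidden ground truth
inputs = [noisy_ao(c, 4, rng) for c in clean]         # noisy render (input)
noisy_targets = [noisy_ao(c, 4, rng) for c in clean]  # second, independent render

a_n2n, b_n2n = fit_line(inputs, noisy_targets)  # never sees `clean`
a_ref, b_ref = fit_line(inputs, clean)          # conventional supervised fit

print(f"noisy targets: a={a_n2n:.3f}, b={b_n2n:.3f}")
print(f"clean targets: a={a_ref:.3f}, b={b_ref:.3f}")
```

The two fits agree up to sampling error, and the gap shrinks as more training pairs are added, which matches the observation that noisy targets are enough as long as there are plenty of them.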

Unlike the one in our previous project, this neural network is more complex and based on the U-Net architecture, leading to longer training times. Below are results captured during training on the Bistro scene (under CC BY 4.0), with inputs (first row) and references (second row) rendered at four samples per pixel. The third row displays the output of the neural network model.

*After 116 training iterations. Top row: input noise, middle row: target noise, bottom row: inference results during training*

*After 53572 training iterations. Top row: input noise, middle row: target noise, bottom row: inference results during training*

*Some other results in different parts of the scene*

Although they are not perfect, I found the quality of those results quite impressive, knowing that the neural network model never saw a converged image. My first experiments used 2 or 4 samples per pixel (results above), but even training on a single sample per pixel gave similar results. This opens a lot of doors for other real-time algorithms where we can only afford a few samples per pixel per frame.

Results

This model is a fully convolutional network (FCN), meaning it is independent of image resolution. Hence, even though the model is trained on small 256×256 portions of the framebuffer, final inference can be applied directly to a full-resolution framebuffer with the same quality, as shown in the following screenshot:

*Ambient Occlusion inference on entire framebuffer using Noise2Noise neural network*
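The reason a fully convolutional model does not care about resolution is that its weights are small kernels slid over the image, so nothing in them depends on the image size. A minimal sketch with a single 3×3 convolution follows; the image sizes, the crop position, and the box-blur kernel are arbitrary stand-ins, and a real network stacks many such layers. It applies the same kernel to a small crop and to the full image, then checks that the results agree wherever the crop's border padding does not interfere.

```python
import random

def conv3x3(image, kernel):
    """3x3 convolution with zero padding; output has the same size as the input."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy, sx = y + ky - 1, x + kx - 1
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out

rng = random.Random(0)
HEIGHT, WIDTH = 12, 16   # stand-in for the full framebuffer
TOP, LEFT, SIZE = 4, 2, 8  # stand-in for a 256x256 training crop

full = [[rng.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
crop = [row[LEFT:LEFT + SIZE] for row in full[TOP:TOP + SIZE]]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # same weights for both sizes

out_full = conv3x3(full, kernel)
out_crop = conv3x3(crop, kernel)

# Away from the crop border, both outputs are identical.
max_diff = max(
    abs(out_crop[y][x] - out_full[y + TOP][x + LEFT])
    for y in range(1, SIZE - 1)
    for x in range(1, SIZE - 1)
)
print("largest interior difference:", max_diff)
```

The only pixels that differ are those within one kernel radius of the crop edge, where the crop sees padding instead of real neighbours.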

Of course, ML is neither perfect nor magical, and there are a lot of other things to consider and cases that do not work well (yet?). For example, predictability and robustness are some of the most difficult things to achieve with machine learning in general (although I am not an expert, and there are already numerous works trying to solve these issues). Model inference on unseen data can give good results in some places and poor results elsewhere (see following image). Even if I could add these problematic areas to the training data, it is impossible to foresee all failures, even in the limited scope of a AAA video game.

*Poor inference results on areas that we did not train on (untrained data)*

Building on top of this, temporal coherence can also be really challenging, but several papers now show remarkable results for frame-to-frame transitions.

Conclusion

My foray into Graphics AI has been, and still is, a compelling journey. For a graphics programmer, machine learning opens a lot of new perspectives, even though results are usually a bit unstable and unpredictable. Training and running ML algorithms was surprisingly fast to set up, but getting even better results requires considerably more work and knowledge. While the limitations mentioned above currently make it difficult to exploit ML in the real-time context of computer graphics, I feel confident that it is only a matter of time before video games casually start using such techniques, just like NVIDIA is doing with its DLSS technology.

Author

Thibault Lambert is a Senior R&D Graphics Engineer and joined Eidos-Montréal in 2018. He has worked in both the visual effects and video games industries over the last 13 years and has been lucky enough to work on several well-received and renowned movies and AAA games. He is passionate about real-time graphics and is dedicated to bridging the gap between offline and real-time rendering as much as possible, bringing real-time technologies to the film industry and reproducing complex offline techniques in real time for video games.