Deep Learning Assisted Authoring
5 October 2021
Thibault Lambert, Ugo Louche
Aerial View . Digital Elevation Model . Generative Adversarial Network . Guided-Image Generation . Image Translation . World Map

# WorldGen: Painting the world, one layer at a time

Part 2: The devil is in the details.

## Introduction

In our previous blog post, we introduced our terrain generation project using a deep neural network based on GauGAN. This model was trained on real-life geological and topological data to learn the intricate interdependencies between the geography and the appearance of a landscape when viewed from a very high altitude. Previously, we demonstrated impressive results on similar maps not seen during training, but quickly pointed out the limitations of our model with input data created by humans, due to the complexity of the real-life labels that we train on. We were missing a few steps to make this model user friendly, and we dedicate this second blog post to describing the modifications we made to the network to reach our initial goal of making this tool easy to use.

## Simplifying labels

Geological agencies each provide their own set of geological or topographical metadata. We mentioned in the first blog post that we used two different data modalities from the United States Geological Survey (USGS): LandCover and LandSurface.

LandCover and LandSurface have 21 and 10 distinct labels, respectively. To simplify user input labels, we reduced the number of distinct categories to 7 and 5, respectively. For example, we regrouped flat plains, smooth plains and irregular plains into a single plain category and regrouped all forest types into a single forest category.
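A regrouping of this kind can be expressed as a lookup table from original label ids to simplified ones. The sketch below is only illustrative: the class ids, the four-class subset and the mapping are hypothetical, since the actual USGS codes and the exact grouping are not listed here.

```python
import numpy as np

# Hypothetical class ids, for illustration only; the real USGS codes and
# the exact grouping used in the project are not given in this post.
FLAT_PLAINS, SMOOTH_PLAINS, IRREGULAR_PLAINS, HILLS = 0, 1, 2, 3
PLAIN, HILL = 0, 1

# Lookup table: index = original label id, value = simplified label id.
REMAP = np.array([PLAIN, PLAIN, PLAIN, HILL], dtype=np.uint8)


def simplify_labels(label_map: np.ndarray) -> np.ndarray:
    """Replace every original label id by its simplified category."""
    return REMAP[label_map]


labels = np.array([[0, 1, 3],
                   [2, 3, 3]])
simplified = simplify_labels(labels)  # the three plain types collapse to PLAIN
```

Indexing the table with the whole label map remaps every pixel at once, so the same table can be applied to any patch of the dataset.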

Unfortunately, the biggest drawback of using generative adversarial networks (GANs) is the difficulty in quantifying progress, with no clear loss function to monitor. Assessing the performance of the network had to rely on subjective, manual evaluation of the quality of output images, where visual inspection was the key criterion of comparison. Simplifying labels certainly benefitted our artists, but we believed it could also help reduce overfitting by introducing some regularizing noise in our data. As we will later discuss, there is a delicate balance between simplifying inputs and keeping detailed outputs.

Our labels exhibit extremely complex shapes and features, and it is difficult for users to reproduce such patterns. This encouraged us to apply a variety of image filters on our inputs, such as median filters or majority filters, as shown in the following figure. Filtered labels display patterns that are easier to reproduce and are more consistent with what an artist might produce.
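For readers unfamiliar with majority filters, here is a minimal NumPy sketch: each pixel is replaced by the most frequent label in its neighbourhood. The function name, the window size and the edge padding are assumptions made for this example, not the exact filter settings used in the project.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def majority_filter(labels: np.ndarray, num_classes: int, size: int = 3) -> np.ndarray:
    """Replace each pixel by the most frequent label in its size x size window.

    `size` is expected to be odd so that the window is centred on the pixel.
    """
    pad = size // 2
    padded = np.pad(labels, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))  # (H, W, size, size)
    # Count how often each class occurs in every window.
    counts = np.stack(
        [(windows == c).sum(axis=(2, 3)) for c in range(num_classes)], axis=-1
    )
    # argmax picks the lowest class id when two classes are tied.
    return counts.argmax(axis=-1).astype(labels.dtype)


# A one-pixel-wide vertical line of class 1 on a background of class 0.
thin = np.zeros((5, 5), dtype=np.int64)
thin[:, 2] = 1
filtered = majority_filter(thin, num_classes=2)
```

In this toy case the thin line never holds the majority in any window, so it is erased entirely, while large uniform regions are left untouched.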

Sadly, despite all our efforts, we were never able to achieve satisfying results by training with filtered input labels. A common issue was that filtered labels overlapped with mismatching areas. While this could sometimes be harmless, it was often detrimental to more contrasted label categories such as land versus water or urban areas versus forests. Moreover, our network was converging towards fuzzier results. Compared to the crisp details typically found in aerial imagery, our results were missing the look of a satellite view.

Therefore, we decided to abandon this idea and started looking for other options. This is when we considered using another neural network to convert user-drawn labels into real-life looking labels. This seemingly reasonable endeavor, which we dubbed the Label2Label model, is the focus of the next section.

## Label2Label

Although our world generator could generate high-quality and complex terrains, it missed one of the major achievements of GauGAN: user friendliness. Input labels used to train the model are complex and we could not find a proper way to simplify them without significantly damaging the model outputs. To alleviate this issue, we introduced Label2Label, a new neural network specifically trained to generate labels with the real-life patterns exhibited in our captured data. For simplicity's sake, we re-used a GauGAN base for our new model. We probably could have achieved better results by using a simpler model, but we thought the time investment of adapting a new architecture to our problem was not justified given our previous expertise with GauGAN.

Using large filters iteratively (median or majority), we managed to produce labels resembling brushstrokes. We decided to use them as the reference “simple” input labels for this model. In hindsight, although we used the same filter with the same parameters on our whole dataset, we wish we had randomly applied different filters and parameters to diversify the data and support different types of inputs.

The above figure demonstrates how the model learns to output complex labels (right) based on “simple” inputs (left), with the middle column being the ground-truth reference from USGS. We can notice some uncontrollable effects, such as the unwanted introduction of new labels in some places, or the difficulty in handling thin labels that tend to disappear after filtering.

Even if imperfect, this model performed well enough to be of interest to our users. Manual tweaking of Label2Label outputs is of course possible and allows modifying and cleaning labels before feeding them to the WorldGen model.

Here is an example of programmer art input labels going through our Label2Label model and then subsequently fed into the WorldGen models.

## Visual guides

Another way we enabled user control over the created terrains was by adding another set of low-resolution inputs, which we termed anchors, that act as visual and elevation guides on generated data. These anchors have a direct impact on the aesthetic of the outputs, as shown in the figures below. This is very reminiscent of GauGAN’s multi-modal mode since we exploited a variational auto-encoding (VAE) scheme. These anchors can be of any resolution, but we experienced better results with low-resolution inputs. Results shown use anchors of 1/64th the size of input labels (1/8th in each direction).
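To make the anchor sizes concrete: reducing a patch by a factor of 8 in each direction leaves 1/8 × 1/8 = 1/64 of the pixels. The sketch below builds such an anchor by averaging 8×8 blocks. Block averaging and the `make_anchor` name are assumptions for illustration; this post does not specify how anchors are actually produced.

```python
import numpy as np


def make_anchor(image: np.ndarray, factor: int = 8) -> np.ndarray:
    """Downsample an (H, W, C) image by averaging factor x factor blocks."""
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image size must be a multiple of factor")
    # Split each spatial axis into (blocks, pixels per block), then average
    # over the two per-block axes.
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


# Toy 256x256 patch with 4 channels (for example RGB + elevation).
patch = np.random.default_rng(0).random((256, 256, 4))
anchor = make_anchor(patch)  # 32x32 pixels, i.e. 1/64th of the 256x256 patch
```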

In addition to elevation anchors, we allowed the user to specify a mean altitude for the generated patch as additional information to help inference. Surface color and, to a lesser extent, elevation can greatly differ depending on the altitude. For example, a plateau does not have the same characteristics as a plain. We helped the model learn this distinction by providing it with the average altitude of the data currently being trained on. We were happy to see that this specific information really improved the model's convergence and its adaptation to different biomes depending on the altitude, as shown in the figure below.

Same input labels inferred at different average altitudes. Respectively, altitude 0m, 1000m and 2000m.
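One simple way to hand a scalar such as the mean altitude to a convolutional network is to broadcast it into an extra constant input channel. The sketch below assumes that approach purely for illustration; the actual conditioning mechanism is not described in this post, and the function name, the rescaling by 1000 m and the toy shapes are all hypothetical.

```python
import numpy as np


def with_altitude_channel(labels_onehot: np.ndarray,
                          mean_altitude_m: float,
                          scale_m: float = 1000.0) -> np.ndarray:
    """Append a constant channel carrying the (rescaled) mean altitude."""
    h, w, _ = labels_onehot.shape
    channel = np.full((h, w, 1), mean_altitude_m / scale_m, dtype=np.float32)
    return np.concatenate([labels_onehot.astype(np.float32), channel], axis=-1)


# Training: the value comes from the elevation map of the current patch.
elevation = np.full((4, 4), 1500.0)   # toy elevation patch, in metres
labels_onehot = np.zeros((4, 4, 12))  # toy one-hot label channels
train_input = with_altitude_channel(labels_onehot, float(elevation.mean()))

# Inference: the user picks the value directly, for example 2000 m.
user_input = with_altitude_channel(labels_onehot, 2000.0)
```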

## Tiles

GauGAN is a heavyweight neural network that requires a lot of GPU memory, which meant that we could only train models on 256×256 or 512×512 images. This was unfortunate because we would typically expect high-definition images when looking at satellite views.

Generating high-definition images with machine learning is notoriously difficult; it usually requires an impressive amount of video memory that is only available on professional graphics cards. Anna Frühstück et al. presented TileGAN at SIGGRAPH 2019 with the idea of tiling different inferences in the latent space. Although this is very interesting, we wanted to try something a bit simpler that could retain the precise features of our complex inputs.

Our proposed solution is a naive overlapping tiling scheme, with tiles averaged together with a cosine weight. Although simple, this idea provided good enough results throughout our project and we decided not to investigate alternatives further. The only downside we experienced was that blending tiles together would sometimes blur out details from inference. This was nonetheless negligible most of the time due to the detailed complexity of our input labels and the overall stability of our model.

Blending with a cosine weight gave more importance to the center of the inferred images, which also helped avoid artefacts close to the image borders caused by the lack of information outside them. Also note that having an average altitude per tile (as described in the previous section) definitely improved the consistency between far away tiles with possibly very different elevations.
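The tiling scheme can be sketched in a few lines of NumPy. The exact weight function is not specified in this post, so the raised-cosine (Hann-style) window below is an assumption, as are the function names and the 50% overlap used in the toy example.

```python
import numpy as np


def cosine_window(size: int) -> np.ndarray:
    """2D weight that peaks at the tile centre and falls towards the borders."""
    # Sample at pixel centres so that no weight is exactly zero.
    x = (np.arange(size) + 0.5) / size
    w = 0.5 - 0.5 * np.cos(2.0 * np.pi * x)  # raised cosine (Hann-style)
    return np.outer(w, w)


def blend_tiles(tiles, origins, out_shape, tile_size):
    """Weighted average of overlapping (tile_size, tile_size, C) tiles.

    `origins` holds the (row, col) of each tile's top-left corner, and the
    tiles are assumed to cover every pixel of the output at least once.
    """
    channels = tiles[0].shape[-1]
    acc = np.zeros(tuple(out_shape) + (channels,))
    weight_sum = np.zeros(tuple(out_shape) + (1,))
    window = cosine_window(tile_size)[..., None]
    for tile, (y, x) in zip(tiles, origins):
        acc[y:y + tile_size, x:x + tile_size] += tile * window
        weight_sum[y:y + tile_size, x:x + tile_size] += window
    return acc / weight_sum


# Two 8x8 tiles overlapping by half: one all zeros, one all ones.
tiles = [np.zeros((8, 8, 1)), np.ones((8, 8, 1))]
out = blend_tiles(tiles, origins=[(0, 0), (0, 4)], out_shape=(8, 12), tile_size=8)
```

In the toy example, the output equals each tile where only that tile contributes and ramps smoothly from 0 to 1 across the overlap, which is what hides the seams. Dividing by the accumulated weights keeps the result a true average even when the overlap is not exactly half a tile.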

## Results

After incorporating the above modifications, our proposed WorldGen model learns the intricate rules of ecosystems and reproduces detailed and realistic satellite views on untrained data in high resolution.

It also handles modifications to existing topological maps.

We detail the training process of this generative adversarial network in the next figure.

Since we could not achieve high-end results by simplifying input labels, we opted for a second Label2Label model that learns how to reproduce the complex patterns of real-life ecosystems and geological maps.

You will find a link below to download some of our raw results. We recommend using a library like ipyvolume to visualize them. Alternatively, you can download baked HTML files from the second link to explore our data interactively; however, those files are much larger.

RGB + elevation PNGs (2.7 GB)

Interactive content (5.7 GB)

## Conclusion

We introduced two complementary neural networks capable of generating highly realistic terrains from an aerial perspective. Making this process user-friendly was challenging and leaves considerable room for future work. We will keep investigating ways to improve topo-geographical data representations and define our own artist-friendly labels to improve the overall quality of our models.

## Authors

Thibault Lambert is a Senior R&D Graphics Engineer and joined Eidos-Montréal in 2018. He has worked in both the visual effects and video games industries in the last 13 years and has been lucky enough to work on several well-received and renowned movies and AAA games. He is passionate about real-time graphics and is dedicated to bridging the gap between offline and real-time rendering as much as possible, bringing real-time technologies to the film industry and reproducing complex offline techniques in real-time for video games.

Ugo Louche joined Eidos-Montréal in 2019 as a Machine Learning Specialist. He obtained his PhD in Machine Learning in 2016 under the supervision of Pr. Liva Ralaivola. His research activities were mostly focused on Active Learning and Linear Classification. Ugo believes that Machine Learning has the potential to change how we make games, for the better, by allowing developers to focus on creative tasks and relegating menial work to AIs. He is committed to making this change happen and his work at Eidos-Montréal focuses on all aspects of Machine Learning that can help with game development.
