WorldGen: Painting the world, one layer at a time

Part 2: The devil is in the details.

Introduction

In the previous blog post, we introduced our terrain generation project using a deep neural network based on GauGAN. This model was trained on real-life geological and topographical data to learn the intricate interdependencies between the geography of a landscape and its appearance when viewed from a very high altitude. Previously, we demonstrated impressive results on maps the network had not been trained on, but quickly pointed out the limitations of our model with input data created by humans, due to the complexity of the real-life labels we train on. A few steps were still missing to make this model user friendly, and we dedicate this second blog post to describing the modifications we made to the network to reach our initial goal of making this tool easy to use. 

 

Simplifying labels

Geological agencies each provide their own set of geological or topographical metadata. We mentioned in the first blog post that we used two different data modalities from the United States Geological Survey (USGS): LandCover and LandSurface. 

A different LandCover classification used by the North American Land Change Monitoring System (NALCMS). @Copyright NALCMS

LandCover and LandSurface have 21 and 10 distinct labels, respectively. To simplify user input labels, we reduced the number of distinct categories to 7 and 5, respectively. For example, we regrouped flat plains, smooth plains, and irregular plains into a single plain category, and merged all forest types into a single forest category. 
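As a purely illustrative sketch, such a merge can be expressed as a lookup table over class IDs; the IDs and groupings below are placeholders, not the actual USGS codes.

```python
import numpy as np

# Hypothetical class IDs: the real USGS codes differ, this only illustrates the merge.
LANDCOVER_REMAP = {
    0: 0,  # water            -> water
    1: 1,  # flat plains      -> plain
    2: 1,  # smooth plains    -> plain
    3: 1,  # irregular plains -> plain
    4: 2,  # deciduous forest -> forest
    5: 2,  # evergreen forest -> forest
    6: 2,  # mixed forest     -> forest
    # ... remaining original classes mapped to the 7 merged categories
}

def simplify_labels(label_map: np.ndarray, remap: dict) -> np.ndarray:
    """Collapse a per-pixel integer class map into the reduced category set."""
    lut = np.arange(max(remap) + 1)          # identity lookup table
    for src, dst in remap.items():
        lut[src] = dst
    return lut[label_map]                    # vectorized remapping
```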

Unfortunately, the biggest drawback of using generative adversarial networks (GANs) is the difficulty in quantifying progress, with no clear loss function to monitor. Assessing the performance of the network therefore relied on subjective, manual evaluation of the quality of output images, with visual inspection as the key criterion of comparison. Simplifying labels certainly benefitted our artists, but we believed it could also help the model avoid overfitting by introducing some regularizing noise into our data. As we will discuss later, there is a delicate balance between simplifying inputs and keeping detailed outputs. 

Left: original LandCover labels (including an undefined area). Right: preprocessed, cleaned, and simplified labels. European Corine LandCover (CLC) @Copyright Copernicus program

Our labels exhibit extremely complex shapes and features that are difficult for users to reproduce. This encouraged us to apply a variety of image filters to our inputs, such as median or majority filters, as shown in the following figure. Filtered labels display patterns that are easier to reproduce and are more consistent with what an artist might produce. 

Left: merged labels, unfiltered. Right: median filter applied to the left image.
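For reference, here is a minimal sketch of how such filters can be applied to an integer label map with SciPy; the filter size is illustrative, not the value we actually used.

```python
import numpy as np
from scipy import ndimage

def median_filter_labels(labels: np.ndarray, size: int = 9) -> np.ndarray:
    """Median filter on an integer label map; smooths out small, noisy regions."""
    return ndimage.median_filter(labels, size=size)

def majority_filter_labels(labels: np.ndarray, size: int = 9) -> np.ndarray:
    """Replace each pixel by the most frequent label in its neighbourhood."""
    def majority(window: np.ndarray) -> int:
        return int(np.bincount(window.astype(np.int64)).argmax())
    return ndimage.generic_filter(labels, majority, size=size)
```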

Sadly, despite all our efforts, we were never able to achieve satisfying results when training with filtered input labels. A common issue was that filtered labels overlapped with mismatching areas. While this could sometimes be harmless, it was often detrimental for strongly contrasting label categories, such as land versus water or urban areas versus forests. Moreover, our network converged towards fuzzier results. Compared to the crisp details typically found in aerial imagery, our results were missing the look of a satellite view. 

Therefore, we decided to abandon this idea and started looking for other options. This is when we considered using another neural network to convert user-drawn labels into realistic-looking labels. This seemingly reasonable endeavor, which we dubbed the Label2Label model, is the focus of the next section. 

 

Label2Label

Although our world generator could generate high-quality and complex terrains, it missed one of the major achievements of GauGAN: user friendliness. The input labels used to train the model are complex, and we could not find a proper way to simplify them without significantly damaging the model outputs. To alleviate this issue, we introduced Label2Label, a new neural network specifically trained to generate labels with the real-life patterns exhibited in our captured data. For simplicity's sake, we re-used a GauGAN base for our new model. We probably could have achieved better results by using a simpler model, but we thought the time investment of adapting a new architecture to our problem was not justified given our previous expertise with GauGAN.

Label2Label learns how to reproduce details and patterns (right column) from the reference target (middle column) based on “simple” inputs (left column).

Using large filters applied iteratively (median or majority), we managed to produce labels resembling brushstrokes, which we used as the “simple” input labels for this model. In hindsight, although we used the same filter with the same parameters across our entire dataset, we wish we had randomly varied the filters and their parameters to diversify the data and support different types of inputs.  
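Had we randomized the preprocessing in this way, it might have looked something like the following sketch, where the iteration count and kernel-size range are hypothetical.

```python
import numpy as np
from scipy import ndimage

def brushstroke_labels(labels: np.ndarray,
                       rng: np.random.Generator,
                       iterations: int = 3,
                       size_range: tuple = (7, 21)) -> np.ndarray:
    """Iteratively apply large median filters with randomized (odd) kernel sizes,
    yielding coarse, brushstroke-like regions to use as "simple" Label2Label inputs.
    The iteration count and size range are illustrative only."""
    out = labels.copy()
    for _ in range(iterations):
        size = int(rng.integers(size_range[0], size_range[1] + 1)) | 1  # force odd size
        out = ndimage.median_filter(out, size=size)
    return out
```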

The above figure demonstrates how the model learns to output complex labels (right) based on “simple” inputs (left), with the middle column being the ground-truth reference from USGS. We can notice some uncontrollable effects, such as the unwanted introduction of new labels in some places, or the difficulty in handling thin labels, which tend to disappear after filtering. 

Even if imperfect, this model performed well enough to be of interest to our users. Manual tweaking of Label2Label outputs is of course possible, allowing labels to be modified and cleaned before being fed to the WorldGen model. 

Here is an example of programmer-art input labels going through our Label2Label model and then being fed into the WorldGen model. 

Manually drawn LandSurface (top row) and LandCover (bottom row) labels in the left column. The right column displays results from our Label2Label model, reproducing patterns based on our real-life training data. Based on the Siggraph logo from ACM Siggraph.

 

Results from our WorldGen model, using hand-drawn inputs modified by our Label2Label model. A thumbnail of the generated elevation map is shown in the upper left corner. Note that this is an especially difficult case for our model because of the highly unnatural, perfectly round and geometrical shape of our labels. Based on the Siggraph logo from ACM Siggraph.

 

Visual guides

Another way we enabled user control over the created terrains was by adding another set of low-resolution inputs, which we termed anchors, that act as visual and elevation guides for the generated data. These anchors have a direct impact on the aesthetics of the outputs, as shown in the figures below. This is very reminiscent of GauGAN’s multi-modal mode, since we exploited a variational auto-encoding (VAE) scheme. These anchors can be of any resolution, but we obtained better results with low-resolution inputs. The results shown use anchors of 1/64th the size of the input labels (1/8th in each direction). 
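To make the idea more concrete, here is a minimal PyTorch sketch of one way such a color anchor could be built and attached to the generator input; the function names and the conditioning strategy are assumptions, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def make_color_anchor(target_rgb: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Build a low-resolution color anchor from a (B, 3, H, W) reference image.

    With factor=8 the anchor holds 1/64th of the pixels (1/8th per direction),
    matching the ratio quoted above. How the anchor is injected into the network
    (extra input channels, VAE-style conditioning, ...) is only sketched here.
    """
    low = F.avg_pool2d(target_rgb, kernel_size=factor)               # (B, 3, H/8, W/8)
    return F.interpolate(low, scale_factor=factor, mode="nearest")   # back to (B, 3, H, W)

def concat_anchor(labels_onehot: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
    """One simple option: stack the upsampled anchor with the one-hot labels."""
    return torch.cat([labels_onehot, anchor], dim=1)
```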

Three different generated terrains from the same input labels, but with different low-resolution color anchors (thumbnails in upper left corners).

 

Same input labels with different input elevation anchors (upper left corners). Note that not only does the final elevation change, but also the surface color output, which remarkably follows the rules learned by the neural network.

 

In addition to elevation anchors, we allowed the user to specify a mean altitude for the generated patch as additional information to help inference. Surface color, and to a lesser extent elevation, can greatly differ depending on the altitude. For example, a plateau does not have the same characteristics as a plain. We helped the model learn this distinction by providing it with the average altitude of the current training patch. We were happy to see that this specific information really improved the model’s convergence and adaptation to different biomes depending on the altitude, as shown in the figure below. 
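Below is a small sketch of one way this altitude conditioning could be wired in, by broadcasting the patch's mean elevation as a constant extra input channel; the exact conditioning used in our model may differ.

```python
import torch

def add_mean_altitude_channel(model_input: torch.Tensor,
                              elevation: torch.Tensor) -> torch.Tensor:
    """Append the patch's mean altitude as a constant extra input channel.

    model_input: (B, C, H, W) stacked labels/anchors; elevation: (B, 1, H, W).
    Broadcasting one scalar per patch is a simple way to expose average altitude
    to the generator; the actual conditioning used in the project may differ.
    """
    mean_alt = elevation.mean(dim=(2, 3), keepdim=True)            # (B, 1, 1, 1)
    mean_map = mean_alt.expand(-1, -1, *model_input.shape[2:])     # (B, 1, H, W)
    return torch.cat([model_input, mean_map], dim=1)
```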

Same input labels inferred at different average altitudes: 0m, 1000m, and 2000m, respectively.  


 

Tiles

Because GauGAN is a heavy-weight neural network requiring a lot of GPU memory, we could only train models on 256×256 or 512×512 images. This was unfortunate, because we would typically expect high-definition images when looking at satellite views. 

Generating high-definition images with machine learning is notoriously difficult; it usually requires an impressive amount of video memory that is only available on professional graphics cards. Anna Frühstück et al. presented TileGAN at SIGGRAPH 2019, with the idea of tiling different inferences in the latent space. Although very interesting, we wanted to try something a bit simpler that could retain the precise features of our complex inputs. 

Our proposed solution is a naive overlapping tiling scheme, in which tiles are averaged together with a cosine weight. Although simple, this idea provided good enough results throughout our project, and we decided not to investigate alternatives further. The only downside we experienced was that blending tiles together would sometimes blur out details from inference. This was nonetheless negligible most of the time, thanks to the detailed complexity of our input labels and the overall stability of our model. 

Each pixel of the high-definition result (right) is a blend of four different inferences (middle), offset by a half-tile in each direction (left).

 

Blending with a cosine weight gives more importance to the center of each inferred image, which also helps avoid artefacts close to image borders caused by the lack of information outside them. Also note that having an average altitude per tile (as described in the previous section) definitely improved the consistency between far-apart tiles with possibly very different elevations. 
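The following NumPy sketch illustrates the blending step described above, assuming a hypothetical `infer` callable that wraps a single WorldGen inference on one label tile, and a label map padded by half a tile on each side.

```python
import numpy as np

def cosine_window(tile: int) -> np.ndarray:
    """2D raised-cosine (Hann) weight, maximal at the tile centre, near zero at edges."""
    w = 0.5 - 0.5 * np.cos(2.0 * np.pi * (np.arange(tile) + 0.5) / tile)
    return np.outer(w, w)

def blend_tiles(infer, labels: np.ndarray, tile: int = 256) -> np.ndarray:
    """Naive overlapping tiling: run `infer` on tiles from four grids offset by
    half a tile in each direction, then average them with cosine weights.

    `infer` is assumed to map a (tile, tile, C_in) label patch to a
    (tile, tile, 3) image; `labels` is assumed to be padded so every output
    pixel is covered by the offset grids.
    """
    h, w = labels.shape[:2]
    out = np.zeros((h, w, 3), dtype=np.float64)
    acc = np.zeros((h, w, 1), dtype=np.float64)
    win = cosine_window(tile)[..., None]
    half = tile // 2
    for oy in (0, half):                      # four half-tile offsets
        for ox in (0, half):
            for y in range(oy, h - tile + 1, tile):
                for x in range(ox, w - tile + 1, tile):
                    out[y:y + tile, x:x + tile] += win * infer(labels[y:y + tile, x:x + tile])
                    acc[y:y + tile, x:x + tile] += win
    return out / np.maximum(acc, 1e-8)        # normalize by accumulated weights
```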

We can infer small tiles (e.g., 256×256) and naively stitch them together to get a high-definition result (e.g., 2048×2048). Unfortunately, because each tile is inferred independently, a lot of artefacts appear at the seams. We decided to compute a batch of four different inferences, each offset by a half-tile in both dimensions, and then apply a weighted average (thumbnails in upper right corners).
Smooth high-definition result after weighted average blend, using a cosine weight centered on each sub-tile.

 

Results

After incorporating the above modifications, our proposed WorldGen model learns the intricate rules of ecosystems and reproduces detailed and realistic satellite views of untrained data in high resolution. 

Inference on an untrained region (Long Island, USA), using original input labels from USGS.

 

It also handles modifications to existing topographical maps. 

Generated terrain from tweaked USGS labels. 3D view using ipyvolume.

 

We detail the training process of this generative adversarial network in the next figure. 

Tiled WorldGen model: train a neural network to generate high-definition, realistic satellite images from input labels, with optional additional low-resolution visual guides.

 

Since we could not achieve high-end results by simplifying input labels, we opted for a second model, Label2Label, which learns how to reproduce the complex patterns of real-life ecosystems and geological maps. 

Label2Label is our second neural network, turning user-drawn input labels into labels that exhibit patterns from real-life topographical maps. End results based on Label2Label outputs display more realistic features.

 

Conclusion

We introduced two complementary neural networks capable of generating highly realistic terrains from an aerial perspective. Making this process user-friendly was challenging and leaves considerable room for future work. We will keep investigating ways to improve topo-geographical data representations and define our own artist-friendly labels to improve the overall quality of our models. 

 

Authors

Thibault Lambert is a Senior R&D Graphics Engineer and joined Eidos-Montréal in 2018. He has worked in both the visual effects and video games industries in the last 13 years and has been lucky enough to work on several well-received and renowned movies and AAA games. He is passionate about real-time graphics and is dedicated to bridging the gap between offline and real-time rendering as much as possible, bringing real-time technologies to the film industry and reproducing complex offline techniques in real-time for video games. 

Ugo Louche joined Eidos-Montréal in 2019 as a Machine Learning Specialist. He obtained his PhD in Machine Learning in 2016 under the supervision of Pr. Liva Ralaivola. His research activities were mostly focused on Active Learning and Linear Classification. Ugo believes that Machine Learning has the potential to change how we make games for the better, by allowing developers to focus on creative tasks and relegating menial work to AIs. He is committed to making this change happen, and his work at Eidos-Montréal focuses on all aspects of Machine Learning that can help with game development. 
