WorldGen: Painting the world, one layer at a time

Part 1: As it turns out, the world is a complicated place.

*Horseshoe Bend, Google Earth, ©2020 Landsat / Copernicus*

Introduction

One of our goals at Eidos-Labs is to help artists as much as we can and relieve them of redundant and menial tasks so they can iterate faster and spend more time on interesting assignments. The rise of machine learning (ML) has naturally led us to investigate potential solutions for artificial intelligence (AI) assisted content creation.

In 2019, Eidos-Labs started a new project with the goal of investigating the possibilities of AI-assisted creation of large and detailed 3D landscapes. Although there were several commercial software tools capable of generating realistic elevation data, none of them were capable, to the best of our knowledge, of generating corresponding high-quality surface color information. As we aimed for realistic results, we wanted to take advantage of the vast supply of realistic landscapes provided by our planet Earth. To process this large amount of data, we explored deep learning models, as they are conveniently specialized in learning inherent relations and patterns from their training dataset.

World Generator

The state of the art for generating virtual terrains was, and still is, using (lots of) procedural rules to build credible environments. Such systems can be complex to maintain and still require a lot of manual work. Our goal was to benefit from recent deep learning advances and have our neural network learn these rules without human help, at the cost of having less control over the results. This blog post will go through our numerous attempts to train a deep neural network to learn the intricate relationships of elevation and flora depending on the surrounding biomes while still allowing simple artist control over generated data.

Elevation, in computer graphics, is usually simplified as a single displacement value representing the altitude at each pixel. This allows storing grayscale floating-point values in a texture that can subsequently be applied to a grid mesh of the same dimensions (see figure below). Such a texture is also referred to as a digital elevation model (DEM), which is the term usually used in geographic information systems (GIS). As only a single displacement value is stored per pixel, the result is considered a two-and-a-half-dimensional (2.5D) terrain rather than a truly three-dimensional (3D) one. In other words, while we still capture the general elevation of the terrain, we miss any overhang or recess in the geography of our map.

*Left: Color map, center: elevation map, right: example of 2.5D mesh displacement on the Horseshoe Bend location displayed on a grid mesh in Blender. Data are courtesy of the U.S. Geological Survey.*
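To make the 2.5D idea concrete, here is a minimal sketch of how a heightmap can be turned into a displaced grid mesh. It assumes NumPy; the function name `heightmap_to_mesh` and the `cell_size` parameter are ours for illustration and are not part of the WorldGen code.

```python
import numpy as np

def heightmap_to_mesh(height, cell_size=1.0):
    """Turn an (H, W) heightmap into grid-mesh vertices and triangle indices."""
    h, w = height.shape
    # One vertex per pixel: x and y come from the grid, z from the heightmap.
    ys, xs = np.mgrid[0:h, 0:w]
    vertices = np.stack(
        [xs.ravel() * cell_size, ys.ravel() * cell_size, height.ravel()], axis=1
    ).astype(np.float32)

    # Two triangles per grid cell, indexing into the flattened vertex array.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate(
        [np.stack([tl, bl, tr], axis=1), np.stack([tr, bl, br], axis=1)]
    )
    return vertices, faces

# Tiny 2 x 3 elevation map used only as an example.
dem = np.array([[0.0, 1.0, 0.0],
                [1.0, 2.0, 1.0]], dtype=np.float32)
vertices, faces = heightmap_to_mesh(dem)
print(vertices.shape, faces.shape)
```

Because every (x, y) position carries exactly one z value, a mesh built this way can never fold over itself, which is precisely the 2.5D limitation described above.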

Giving artistic control over a neural network is one of the most challenging aspects of machine learning. Ideally, we would want artists to give a rough sketch as input and still get detailed and realistic outputs.

Before talking about the model and its inputs, let us first look at the data we were able to gather.

Data Gathering

Training a neural network requires as much high-quality reference data as possible, especially for a deep neural network (DNN). It is well known that a neural network can only be as good as its training data.

Following a thorough investigation of geographic information systems, we decided to focus on reproducing satellite/aerial views. One remarkable and reliable source is the United States Geological Survey (USGS), which produces, provides and maintains geological and topographic maps of the USA. USGS has data for a wide range of resolutions, sometimes as fine as one meter, that can be fetched via their REST API. After setting up a few preprocessing scripts, we were able to gather captures of the USA landscapes. We randomly captured regions of 2km x 2km, 20km x 20km and 100km x 100km at 2k or 4k image resolution. As our priority was learning diverse topographies, we focused on uneven terrain and mountainous regions rather than large patches of uniform landscape. Among all the possible choices given by USGS, we collected data from the WorldImagery, Elevation, LandCover and LandSurface databases.
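To give a sense of scale for these captures, the short sketch below computes the ground distance covered by one pixel for each combination of region size and image size. It assumes that "2k" and "4k" mean 2048 and 4096 pixels per side, which is our reading rather than something stated by the capture scripts.

```python
# Assumption: "2k" and "4k" mean 2048 and 4096 pixels per side.
REGION_SIZES_KM = (2, 20, 100)
IMAGE_SIZES_PX = (2048, 4096)

def meters_per_pixel(region_km, image_px):
    """Ground distance covered by one pixel of a square capture."""
    return region_km * 1000.0 / image_px

for region_km in REGION_SIZES_KM:
    for image_px in IMAGE_SIZES_PX:
        mpp = meters_per_pixel(region_km, image_px)
        print(f"{region_km:>3} km at {image_px} px: {mpp:.2f} m/pixel")
```

Under that assumption, the captures range from roughly half a meter per pixel (2 km at 4096 px) to about 49 m per pixel (100 km at 2048 px).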

LandCover is closely related to ecosystems (perennial snow/ice, forest, grassland, developed areas, crops, etc.), so we use these as an approximation for our biomes. It has 21 different labels (which we will later simplify to 7).

LandSurface gives topographic descriptions related to the elevation (water, plain, hills, mountains, etc.). It has 10 different labels (which we will later simplify to 5).

WorldImagery and Elevation are our two “target” views, that is, the aerial view and the DEM.

*Captures from USGS showing different information. Respectively, World Imagery, Elevation, Landcover and LandSurface*
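Simplifying a label set, as mentioned above for LandCover and LandSurface, usually boils down to a lookup table applied to the label image. The sketch below assumes NumPy and uses a made-up grouping of 6 fine labels into 3 coarse ones; the actual 21-to-7 and 10-to-5 groupings we used are not listed in this post, so every id and class name here is hypothetical.

```python
import numpy as np

# Hypothetical grouping: 6 fine-grained label ids merged into 3 coarse classes.
# The real 21 -> 7 and 10 -> 5 groupings are not given in this post.
FINE_TO_COARSE = {
    0: 0,  # e.g. open water       -> water
    1: 1,  # e.g. deciduous forest -> forest
    2: 1,  # e.g. evergreen forest -> forest
    3: 1,  # e.g. mixed forest     -> forest
    4: 2,  # e.g. pasture          -> cultivated
    5: 2,  # e.g. crops            -> cultivated
}

def simplify_labels(label_map, mapping):
    """Remap every pixel of an integer label image through a lookup table."""
    lut = np.zeros(max(mapping) + 1, dtype=np.uint8)
    for fine, coarse in mapping.items():
        lut[fine] = coarse
    # Fancy indexing applies the table to all pixels at once.
    return lut[label_map]

fine_labels = np.array([[0, 1], [3, 5]])
coarse_labels = simplify_labels(fine_labels, FINE_TO_COARSE)
print(coarse_labels)
```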

Capturing geological data comes in many forms with many difficulties along the way. For example, different topographic maps might use different spatial references. We sometimes had to use the Geospatial Data Abstraction Library (GDAL), an open-source library, to warp different maps to the same spatial reference system and obtain pixel-perfect overlapping images. USGS and other agencies also change their content, and the way to access it, surprisingly often; most of the time, collecting a new batch of data meant updating our collection scripts to keep up with the latest changes.

*Same geographical region using two different spatial references. Left: EPSG:4326. Right: EPSG:3857*
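The difference between the two images comes from the projection itself. EPSG:4326 stores plain longitude/latitude in degrees, while EPSG:3857 (Web Mercator) projects them onto a sphere of radius 6378137 m. In practice a library such as GDAL handles the warping; the sketch below only illustrates the underlying point conversion using the standard spherical Mercator formulas, and the function names are ours.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """EPSG:4326 (degrees) -> EPSG:3857 (meters), spherical Mercator."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat_deg) / 2.0)
    )
    return x, y

def web_mercator_to_lonlat(x, y):
    """EPSG:3857 (meters) -> EPSG:4326 (degrees)."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat

# One degree of latitude covers more projected meters away from the equator,
# which is why the same region looks stretched in EPSG:3857.
step_equator = lonlat_to_web_mercator(0.0, 1.0)[1] - lonlat_to_web_mercator(0.0, 0.0)[1]
step_lat_60 = lonlat_to_web_mercator(0.0, 61.0)[1] - lonlat_to_web_mercator(0.0, 60.0)[1]
print(step_equator, step_lat_60)
```

Two maps only overlap pixel for pixel once they have been resampled into the same reference system, which is the step we delegated to GDAL.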

Sadly, there are no international standards on metadata for topographic maps, especially for categorizing ecosystems. Europe, for example, provides more detailed imagery and labels that are better suited to our needs than their USGS counterparts. However, the elevation data are less detailed, the LandCover uses a distinct set of categories, and there is no equivalent for the LandSurface database. If we were to train a successful model on the USGS data, it would be impossible to test it on other parts of the world, or even on existing data from the Moon or other planets like Mars, which was one of the original ideas we wanted to try out.

Also note that urban areas are often captured in higher resolution and more often than unpopulated areas, which is inconvenient for our project because we wanted to focus on regions where nature is predominant.

GauGAN

At GTC 2019, Nvidia unveiled an amazing piece of technology called GauGAN, capable of generating impressively realistic images via Generative Adversarial Networks (GAN) from simple user-defined segmentation maps (labels-to-image). The biggest contribution of the accompanying research paper was the introduction of spatially-adaptive normalization (SPADE), “a conditional normalization layer […] that can effectively propagate semantic information throughout the neural network”. In other words, GauGAN creates images with contextual relationships between input labels. For example, a body of water would match the color of the sky and reflect its immediate surroundings.

Impressed by the results of this technology and particularly interested in the capabilities of SPADE to give us contextual links between input labels, we decided to use it as a base for our neural network.
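For readers unfamiliar with SPADE, the core idea is to normalize a layer's activations and then scale and shift them per pixel with values predicted from the label map. The sketch below is a deliberately simplified NumPy version and not the reference implementation: it handles a single sample, normalizes per channel over the spatial dimensions, and replaces the learned convolutions that predict the scale and shift with per-pixel linear maps (`w_gamma` and `w_beta`, hypothetical names).

```python
import numpy as np

def spade(features, segmap_onehot, w_gamma, w_beta, eps=1e-5):
    """Simplified spatially-adaptive normalization for one sample.

    features:      (C, H, W) activations of a generator layer
    segmap_onehot: (L, H, W) one-hot label map at the same resolution
    w_gamma/w_beta:(C, L) per-pixel linear maps standing in for learned convs
    """
    # Parameter-free normalization: zero mean, unit variance per channel.
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / np.sqrt(var + eps)

    # Scale and shift are functions of the label at each pixel.
    gamma = np.einsum("cl,lhw->chw", w_gamma, segmap_onehot)
    beta = np.einsum("cl,lhw->chw", w_beta, segmap_onehot)
    return gamma * normalized + beta

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4, 4))
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1                                # right half uses label 1
segmap = np.eye(2)[labels].transpose(2, 0, 1)    # (L, H, W)
w_gamma = np.ones((2, 2))
w_beta = np.array([[0.0, 5.0], [0.0, 5.0]])      # label 1 shifts activations by 5
out = spade(features, segmap, w_gamma, w_beta)
print(out.shape)
```

Because the modulation depends on the label at each location, the semantic layout is re-injected at every layer where this normalization is applied instead of being washed out after the first one.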

While GauGAN is originally meant for label (N categorical dimensions) to color (3 dimensions) generation, we also wanted to integrate an additional elevation channel (4th dimension). Based on the captured data from USGS, we decided to use LandSurface as input label for elevation and LandCover for color input (N + M overlapping categorical dimensions input), with a separate loss function for the elevation and color outputs.
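As an illustration of the input and output layout described above, the sketch below one-hot encodes both label maps, stacks them into a single N + M channel tensor, and splits a 4-channel output into color and elevation. It assumes NumPy and the simplified class counts mentioned earlier (7 LandCover and 5 LandSurface classes); the function names are hypothetical and the loss functions themselves are not shown.

```python
import numpy as np

def one_hot(label_map, num_classes):
    """(H, W) integer labels -> (num_classes, H, W) float32 one-hot planes."""
    planes = np.arange(num_classes)[:, None, None] == label_map[None]
    return planes.astype(np.float32)

def build_generator_input(land_cover, land_surface, n_cover=7, n_surface=5):
    """Stack both label sets along the channel axis: (N + M, H, W)."""
    return np.concatenate(
        [one_hot(land_cover, n_cover), one_hot(land_surface, n_surface)], axis=0
    )

def split_generator_output(output):
    """(4, H, W) output -> RGB (3, H, W) and elevation (1, H, W)."""
    return output[:3], output[3:]

land_cover = np.array([[0, 6], [3, 3]])
land_surface = np.array([[4, 0], [1, 2]])
net_input = build_generator_input(land_cover, land_surface)
rgb, elevation = split_generator_output(np.zeros((4, 2, 2), dtype=np.float32))
print(net_input.shape, rgb.shape, elevation.shape)
```

Each pixel has exactly two active channels, one per label set, which is what "overlapping" categorical dimensions means here. Splitting the output this way is what makes it possible to apply a separate loss to each part.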

Data Preprocessing

Now that we had a more precise idea of what to do and how to do it, it was time to preprocess the data and train the model. Unlike projects that rely on datasets readily available on the web, we had to curate our own, which turned out to be a massive amount of work.

It is also worth pointing out that the reference model of GauGAN requires a lot of memory and a lot of time to train (several days on a single high-end GPU), not to mention the additional cost of the custom modifications we added to it. This resulted in exceptionally long iteration periods, and each modification and decision unfortunately took several days to test out. For brevity we will only describe our successful experiments.

Early training experiments revealed that we should favor far-away satellite views, for several reasons:

It is difficult to retrieve (public and free-to-use) data at high resolution.
The Earth's surface is so complex that no model could ever learn all the subtleties of nature.
Zooming out smooths out small details so that only the predominant ones remain.
Topographic agencies have already processed zoomed-out views to reduce noise such as clouds, strong lighting and shadows from the sun.
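The smoothing effect of zooming out mentioned in the list above can be illustrated with a simple block average, which is one common way to reduce resolution. This is a minimal NumPy sketch for intuition only; it is not how the agencies produce their zoomed-out products, and `downsample_mean` is a hypothetical helper.

```python
import numpy as np

def downsample_mean(image, factor):
    """Average non-overlapping factor x factor blocks of an (H, W) array."""
    h, w = image.shape
    # Drop trailing rows/columns that do not fill a complete block.
    h, w = h - h % factor, w - w % factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# A checkerboard is the extreme case of pixel-scale detail.
checker = np.indices((4, 4)).sum(axis=0) % 2
zoomed_out = downsample_mean(checker.astype(np.float64), 2)
print(checker.std(), zoomed_out.std())
```

After averaging, the pixel-scale pattern disappears entirely and only the large-scale mean remains.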

After testing out a few color-only neural models like pix2pixHD and ProGAN (see figure below), we quickly moved on to training a full-scale GauGAN-based model that also incorporated elevation data, using appropriately preprocessed data from our dataset. We called this neural network WorldGen.

*ProGAN’s output throughout its training with our data (aerial view only)*

Model Training – Color

Thanks to a hyperparameter search, we were able to achieve satisfying results using Instance Normalization. Here are some of our first results:

*Early results for the aerial view only with a single layer as input labels. Left: labels, middle: target reference, right: inference (generated image). Note on the top right a welcome side-effect: clouds are automatically gone, probably because they appear to the model as random artefacts that are not correlated with its inputs.*

Here is another early experiment on a small European dataset consisting of only 10 different images. During training, we randomly cropped regions of 256×256 pixels from HD images and applied data augmentation via random rotation and flipping to improve generalization. This is legitimate because captures are so zoomed out that we can safely ignore the fact that mountains, for example, may have different features depending on their geographic orientation. Here are inference results on the entire 2048×2048 image following this limited-scope training:

*Trying to overfit a small training dataset from Corine Landsat. Left: reference image; right: generated image (overfitting on small dataset). © 2021 Copernicus Programme*
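The cropping and augmentation step described above can be sketched as follows. This assumes NumPy and restricts rotations to multiples of 90 degrees, which is our simplification; the post does not state which rotation angles were used. The important detail is that the label map and the target image must receive exactly the same geometric transform.

```python
import numpy as np

def random_crop_and_augment(labels, target, crop=256, rng=None):
    """labels: (H, W), target: (H, W, C). Same geometric transform for both."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = labels.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    lab = labels[top:top + crop, left:left + crop]
    tgt = target[top:top + crop, left:left + crop]

    # Assumption: rotations are multiples of 90 degrees.
    k = int(rng.integers(0, 4))
    lab, tgt = np.rot90(lab, k), np.rot90(tgt, k)  # rotates the first two axes

    # Random horizontal flip, applied to both arrays or to neither.
    if rng.integers(0, 2):
        lab, tgt = np.fliplr(lab), np.fliplr(tgt)
    return lab, tgt

# Synthetic data where channel 0 of the target mirrors the labels,
# so any misalignment between the two would be easy to detect.
labels = np.arange(512 * 512).reshape(512, 512)
target = np.stack([labels] * 4, axis=-1)
lab, tgt = random_crop_and_augment(labels, target, rng=np.random.default_rng(0))
print(lab.shape, tgt.shape)
```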

Now let us show inference results on unseen data, meaning parts of the globe that were not part of the training set:

*Examples of results inferred on previously unseen inputs.*

These surface color results were of surprisingly good quality considering our small training dataset, but there were obviously a lot of issues and imperfections in other cases, as shown in the figure below. On top of that, our first elevation results were unsatisfying, and we had to significantly adapt our model and iterate numerous times before reaching better quality results, as discussed in the next section.

*An example of failure case where our network produced visible artefacts, generating unrealistic patterns*

Model training – Elevation

A major part of our research was to add a fourth channel for elevation data on top of the first three channels representing surface color. Training on DEM data required more time to converge than its color counterpart, probably because we were trying to learn all features at the same time. We were finally able to achieve pleasing results after many iterations and much model tweaking. However, our elevation outputs often suffered from high variability, resulting in very jagged mountain ranges, and results only got better with very long training runs. Although it would be simple to apply a smoothing filter as a post-process, we chose to only show original outputs from the model to give a better outlook on our research methodology.

*Left: original elevation data, Right: inference from our DNN*

Data comes from the dataset we collected from USGS consisting of approximately 3000 high resolution images. LandSurface labels are superimposed.
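For reference, the kind of smoothing post-process mentioned above (which we deliberately did not apply to the results shown here) could be as simple as a box blur over the elevation channel. The following is a minimal NumPy sketch; the filter type and radius are arbitrary choices for illustration.

```python
import numpy as np

def box_blur(elevation, radius=1):
    """Average each pixel with its (2*radius+1)^2 neighborhood, replicating edges."""
    size = 2 * radius + 1
    padded = np.pad(elevation, radius, mode="edge")
    h, w = elevation.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the shifted copies of the padded map, one per neighborhood offset.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (size * size)

# A single-pixel spike stands in for a jagged peak.
spiky = np.zeros((5, 5))
spiky[2, 2] = 9.0
smoothed = box_blur(spiky, radius=1)
print(smoothed[2, 2], smoothed[0, 0])
```

The trade-off is that such a filter flattens genuine sharp ridges as well as noise, so the radius would need to be tuned per terrain scale.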

*Elevation results on a high definition 2048 by 2048 map. LandSurface labels are superimposed. Rendered with ipyvolume.*

*Examples of final results with elevation and color inference from our WorldGen model, using ipyvolume.*

Generalization

We showed that we could successfully train a GauGAN-like neural network to model surface color and elevation at the same time, with strong correlations between these two components. Testing the trained model on unseen data gave us encouraging results, validating the first step of our project. Nonetheless, the input labels required to achieve such results are impressively complex, as shown in the following figure.

*Real-life LandCover and LandSurface labels*

Synthesizing such detailed input labels is extremely difficult, even for skilled artists. On top of that, real labels usually follow some geo-topographical correctness that artists might not be able to reproduce, which would make our network unstable. Testing inference on naively simple labels did not meet our expectations and generated inferior quality results. In other words, our model was only able to generate data from places that already existed on Earth; hand-drawn labels proved to be too simplistic.

*Testing our trained model on naively drawn input labels gives mediocre results. Left: drawn input LandCover, middle: drawn input LandSurface, right: model inference (generated image)*

Putting more effort into generating input labels that resemble nature improved the results, but they were still quite different from the outputs we had on unseen data. Warning: beware, programmer art.

*Upper left: user drawn LandCover labels, Upper right: user drawn LandSurface labels. Bottom left: inferred surface color, bottom right: inferred elevation*

*3D view of generated data, using ipyvolume.*

Although talented artists continuously amaze us and could surely create real-life looking input labels, it would still require a lot of time, effort, and skills, which defeats our initial purpose of creating a simple to use tool. The tricks and modifications we explored to solve this problem will be the focus of a follow-up blog article.

Although it is difficult to draw new labels, it is quite simple to modify and tweak an existing topographic map. Out of curiosity, we tried changing a few labels from an original map, e.g., replacing a city with a forest or plains with mountains. We were happy to see that the results were convincing, and this trick could be a way to do landscape inpainting.

*When modifying pre-existing labels, the model can adapt and fill the gaps as a form of inpainting. The first row uses the real-world labels captured from the Golden Gate National Recreation Area. In the second row, we replace “Human Developed” areas with Forests. The third row shows how we swapped Mountains with Hills.*
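Editing an existing label map in this way is a simple masked replacement. The sketch below assumes NumPy; the label ids and the `replace_label` helper are hypothetical and only illustrate the kind of edit described above, optionally restricted to a region of the map.

```python
import numpy as np

# Hypothetical label ids, for illustration only.
FOREST, DEVELOPED = 1, 3

def replace_label(label_map, old, new, region=None):
    """Return a copy where `old` becomes `new`, optionally only inside `region`."""
    mask = label_map == old
    if region is not None:
        mask &= region  # restrict the edit to a boolean (H, W) selection
    edited = label_map.copy()
    edited[mask] = new
    return edited

land_cover = np.array([[3, 3, 0],
                       [1, 3, 0],
                       [1, 1, 3]])
# Replace developed areas with forest everywhere...
all_forest = replace_label(land_cover, DEVELOPED, FOREST)
# ...or only in the top two rows.
top_rows = np.zeros_like(land_cover, dtype=bool)
top_rows[:2] = True
partly_forest = replace_label(land_cover, DEVELOPED, FOREST, region=top_rows)
print(all_forest)
print(partly_forest)
```

The edited label map is then fed to the trained model as usual, and the network fills in color and elevation that are consistent with the untouched surroundings.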

Conclusion

WorldGen is a research project we worked on at Eidos-Montréal during 2019. The goal was to create a machine learning assisted tool to let users create realistic terrains thanks to a generative adversarial network based on NVIDIA’s GauGAN. We showed how we managed to achieve great results from aerial and topographical data. However, creating new input labels from scratch can be a challenging task and defeats our initial goal of creating a tool that is simple to use and requires no particular skill. We will further discuss this issue and our attempts to alleviate it in a future blog entry.

Authors

Thibault Lambert is a Senior R&D Graphics Engineer and joined Eidos-Montréal in 2018. He has worked in both the visual effects and video game industries over the last 13 years and has been lucky enough to work on several well-received and renowned movies and AAA games. He is passionate about real-time graphics and is dedicated to bridging the gap between offline and real-time rendering as much as possible, bringing real-time technologies to the film industry and reproducing complex offline techniques in real-time for video games.

Ugo Louche joined Eidos-Montréal in 2019 as a Machine Learning Specialist. He obtained his PhD in Machine Learning in 2016 under the supervision of Pr. Liva Ralaivola, where his research activities were mostly focused on Active Learning and Linear Classification. Ugo believes that Machine Learning has the potential to change how we make games, for the better, by allowing developers to focus on creative tasks and relegating menial work to AIs. He is committed to making this change happen and his work at Eidos-Montréal focuses on all aspects of Machine Learning that can help with game