Deep Learning for 3D Data

Introduction

There is no doubt that by now you have encountered deep learning in some form in your day-to-day life. Whether it appears as a remarkable scientific breakthrough or as a new and innovative selfie filter, deep learning and its many manifestations are hard to avoid. Pair this with an affection for video games, and you might wonder how such exciting and recent technologies could be applied to the gaming industry. While there are many roads by which deep learning can find its way into video games, one exciting application is computer graphics. Like a fork in the road, deep learning for computer graphics itself branches into several avenues, most of which can be indexed by the type of data being processed. This blog focuses on motivating geometric deep learning, the collection of efforts aimed at extending deep learning to irregularly structured data, which in our case means three-dimensional data structures, with an emphasis on mesh representations. The code used to build, train, and test the model in this article is publicly available on GitHub.

Three-dimensional data structures are tools of the trade in computer graphics, appearing in an abundance of contexts and applications, from the manufacturing of industrial parts to video game production. These data structures often represent 3D shapes we find commonplace. Finding a suitable computational representation for such a shape depends on both the context and the general nature of the shape. Moreover, because deep learning is data-driven, it helps to first build an intuition for the distinct types of data its tools are designed for. Below, we discuss some common three-dimensional data structures found in the video game industry. Following this overview, we will take a closer look at the mesh representation and some efforts within geometric deep learning to treat it.

Below are some common ways to represent three-dimensional data structures (refer to Figure [1] for a visualization of the different structures):

Point cloud – A scatter of points embedded in three-dimensional Euclidean space, usually sampled from some underlying structure.
Mesh – A polygonal representation of a point cloud in which the relation amongst the vertices is given by triplets or quadruplets of vertices (triangles or quadrilaterals, respectively).
Voxel – A volumetric representation of a shape defined over a 3D grid. A voxel, a cell in the 3D grid, is “filled” if it lies inside the shape.
2D representation / Rendering – Representing your 3D object by taking “snapshots” or “renders” in the presence of external influences, e.g. light sources, and storing these as images. It is also common for each pixel in the image to contain depth information.
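To make the first three representations concrete, here is a minimal NumPy sketch. The tetrahedron, the array layouts, and the helper `voxelize_points` are illustrative assumptions rather than part of the published code. Note also that the helper only marks grid cells that contain a sample point, which is a simplification of the "inside the shape" test described above.

```python
import numpy as np

# Toy shape (hypothetical example data): the four corners of a tetrahedron.
vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Point cloud: only the (N, 3) coordinates, with no connectivity at all.
point_cloud = vertices.copy()

# Triangle mesh: the same points plus (F, 3) triplets of vertex indices.
faces = np.array([
    [0, 2, 1],
    [0, 1, 3],
    [0, 3, 2],
    [1, 2, 3],
])

def voxelize_points(points, resolution):
    """Return a boolean (R, R, R) grid; a cell is True if it holds a point.

    Simplification: this marks occupied cells only, it does not fill the
    interior of the shape.
    """
    lo = points.min(axis=0)
    extent = np.maximum(points.max(axis=0) - lo, 1e-12)  # avoid divide-by-zero
    cells = ((points - lo) / extent * resolution).astype(int)
    cells = np.minimum(cells, resolution - 1)  # points on the upper boundary
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[cells[:, 0], cells[:, 1], cells[:, 2]] = True
    return grid

grid = voxelize_points(point_cloud, resolution=4)
print(point_cloud.shape, faces.shape, grid.shape, int(grid.sum()))
```

The mesh is the only one of the three that stores explicit connectivity (the `faces` array), which is what the rest of this article relies on.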

Note that these data structures are specifically related to three-dimensional structures and do not, for example, include graph representations. Refer to this summary for a more thorough overview of geometric deep learning and a wider scope of the various data representations addressed.

**Figure 1**: Different representations of the same three-dimensional structure. (Source: Towards Data Science)

What is wrong with the everyday CNN?



**Figure 2**: (a) Mathematical formulation of a 2D Convolutional Neural Network applied to a single pixel of an image (indexed as \( (u,v) \)). Note that the weights of the kernel, \( \alpha _{ij} \), are independent of \( (u,v) \). (b) Visualization of the grid-like structure used in the domain of the 2D CNN. The indices refer to the indices in the mathematical formulation. (c) Visualization of the “sliding window” interpretation of 2D Convolutional Neural Networks (Source: Towards Data Science).

We begin by trying to understand why deep learning for meshes, and for irregularly structured data in general, is singled out. To do so, it is necessary to understand some of the nuances of the machinery used for more commonplace data, such as images. Convolutional Neural Networks (CNNs) have made waves in tasks like image classification and semantic segmentation of images. Processing an image with a convolution can be imagined via the “sliding window” interpretation, found in Figure [2]. The sliding window is known as a kernel or a filter, words often used interchangeably. The window, typically centered at a pixel, infers new information from the values (colours) of the neighbouring pixels via some local operation (e.g., a weighted sum). The window then moves to an adjacent pixel and performs the same procedure. This process is repeated until the entire image has been covered. Processing an image this way, over several sequential applications of convolutions, forms the backbone of many mainstream deep learning architectures, including those used for the tasks mentioned above. Refer to Figure [2a] for a mathematical expression of the kernel applied to a single pixel of an image.
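The sliding window can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the usual weighted-sum form, out[u, v] = Σ α_ij · image[u + i, v + j], with a square, odd-sized kernel and zero padding. The function name `conv2d_single_channel` and the box-blur kernel are hypothetical choices for the example.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Slide one square, odd-sized kernel over every pixel of a 2D image.

    The same kernel weights are reused at every (u, v); pixels outside the
    image are treated as zeros.
    """
    r = kernel.shape[0] // 2
    padded = np.pad(image, r, mode="constant")  # zero padding at the boundary
    out = np.zeros_like(image, dtype=float)
    height, width = image.shape
    for u in range(height):
        for v in range(width):
            # The window is centred on pixel (u, v) of the original image.
            window = padded[u:u + 2 * r + 1, v:v + 2 * r + 1]
            out[u, v] = np.sum(kernel * window)
    return out

image = np.arange(9, dtype=float).reshape(3, 3)
kernel = np.ones((3, 3)) / 9.0  # box blur: every neighbour weighted equally
blurred = conv2d_single_channel(image, kernel)
print(blurred)
```

The loop never has to ask which pixels are the neighbours of `(u, v)`: the grid answers that implicitly through array indexing.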

You might be motivated to extend the same machinery used for image-based tasks to meshes, but you would quickly run into difficulty. Taking a closer look at the expression in Figure [2a], you can see that it relies on the fact that for any pixel (pad with zeros if you are on the boundary) you already know its neighbours. Loosely speaking, the neighbours of any given pixel can always be located as up, up-right, right, down-right, down, down-left, left, up-left (refer to Figure [2]). This grid structure is inherent to the image itself and is independent of the actual pixel values or location. Geometrically speaking, the local structure (e.g., curvature and normal vectors at a given pixel) of every image is identical, so it makes sense to use the same kernel for every pixel.

Now, going back to our irregularly structured 3D data, you might be able to see where such an expression stops being useful. A quick glance at any vertex in Figure [1] shows that the irregularity stems from the fact that the notion of a neighbour is not obvious. Using the above image parlance, what constitutes the “up” or “up-right” vertex for a point in 3D? Adding some connectivity, e.g. faces, helps the situation a bit, since each vertex is then accompanied by vertex-pairs corresponding to edges. However, in many cases the faces, and consequently the edges, are affected by the underlying local structure: in humanoid meshes, for example, there might be a finer tessellation around the eyes than around the forehead to better capture realistic details. In general, some vertices might have more vertex-pairs (i.e., more edges) than others, have edges with varying lengths, have edges with varying angular distances, and so on. Addressing and incorporating these concerns into suitable operators analogous to convolutions for images constitutes a large part of the work that falls under the umbrella of geometric deep learning. This also serves as a good starting point for extending deep learning to meshes.
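The irregularity is easy to see on even a tiny mesh. The sketch below builds neighbourhoods from faces and reports how many neighbours each vertex has and how long the edges around one vertex are. The five-vertex mesh and the helper `vertex_neighbours` are hypothetical, chosen only to illustrate the point.

```python
import numpy as np

def vertex_neighbours(faces, num_vertices):
    """For each vertex, collect the other vertices that share a face with it."""
    neighbours = [set() for _ in range(num_vertices)]
    for face in faces:
        for a in face:
            for b in face:
                if a != b:
                    neighbours[int(a)].add(int(b))
    return neighbours

# A unit square split into four triangles around an off-centre interior vertex.
vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.25, 0.25, 0.0],
])
faces = np.array([[0, 1, 4], [1, 2, 4], [2, 3, 4], [3, 0, 4]])

neighbours = vertex_neighbours(faces, len(vertices))
degrees = [len(n) for n in neighbours]

# Edge lengths around the interior vertex (index 4) are all different from
# the unit edges of the square, and differ from each other.
centre_lengths = {
    j: float(np.linalg.norm(vertices[j] - vertices[4]))
    for j in sorted(neighbours[4])
}
print(degrees)
print(centre_lengths)
```

Unlike a pixel, which always has the same eight neighbours at the same offsets, the vertices here have different neighbour counts and different edge lengths, so a single fixed kernel has nothing consistent to index.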

Building a geometrically motivated CNN



**Figure 3**: (a) Formulation of the geometric convolutional neural network designed to serve as an analog to the 2D CNN. \( x_i \) is the vector of features for the \( i \)-th vertex, \( \alpha _{ij} \) is the weight assigned to each vertex-pair, and \( \theta \) is a matrix that transforms the number of input features to a desired number of output features. The dependence of \( \alpha _{ij} \) only on the vertices decouples the geometry from the features in this formulation. (b) Visualization of the aggregation method used by the CNN proposed in (a).

**Figure 4**: How \( \alpha _{ij} \) is calculated. Note that \( \alpha _{ij} \) depends only on the vertices. \( 𝒩(i) \) is the set of indices that correspond to the neighbours of vertex \( i \), including \( i \) itself. This set of neighbours is given by the faces that are incident to vertex \( i \).

From the above discussion, it seems that having a “one-size-fits-all” kernel, like those used for images, is not feasible for meshes. In this section, we will explore a simple mesh-based CNN, inspired by Kipf et al., whose edge, or kernel, weights are defined locally. Note that in Kipf et al. the proposed CNN is intended for graphs, but it readily extends to meshes given that meshes are a type of graph.

Please refer to Figure [3a] and Figure [4] for an explicit formulation. Notice that we have separated the geometry of the underlying mesh, via the vertices, from the actual feature extraction. That is, the role of the sum in Figure [3a], using the weights \( \alpha _{ij} \), is to “gather” or “aggregate” information from the neighbours of a reference vertex (refer to Figure [4]), weighting each neighbour according to its Euclidean distance from the reference vertex. In this case, the further away a neighbouring vertex is, the less weight it is assigned in the aggregation. Moreover, the aggregation is done in a manner agnostic to the feature values. While this formulation is quite simple, it has two significant advantages: (1) it adapts to varying neighbourhood sizes, since the sum is taken over all neighbours and normalized by the total length of edges; and (2) it adapts to varying edge lengths, by using the edge lengths themselves to weight the neighbouring vertices. Moreover, because the weights are defined and normalized over each neighbourhood, the formulation remains local. Finally, once the neighbouring vertices have been aggregated, a filter \( \theta \) is applied to the aggregated features to transform the number of channels to the desired number of output channels.
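A rough NumPy sketch of this layer follows. The structure (aggregate with \( \alpha _{ij} \), then apply \( \theta \)) follows the description above, but the exact formula from Figure [4] is not reproduced here. As a loudly labelled assumption, the sketch uses weights proportional to exp(-distance), normalized over each neighbourhood, which gives closer neighbours more weight. All function names are hypothetical.

```python
import numpy as np

def neighbourhoods(faces, num_vertices):
    """N(i): every vertex sharing a face with i, including i itself."""
    hoods = [{i} for i in range(num_vertices)]
    for face in faces:
        for a in face:
            hoods[int(a)].update(int(b) for b in face)
    return [sorted(h) for h in hoods]

def aggregation_weights(vertices, hoods):
    """Dense (N, N) matrix of alpha_ij, zero outside N(i), rows summing to 1.

    ASSUMPTION: alpha_ij is proportional to exp(-||v_i - v_j||). This is a
    stand-in for the formula in Figure 4; it only shares the stated
    properties (depends on vertices alone, decreases with distance,
    normalized per neighbourhood).
    """
    n = len(vertices)
    alpha = np.zeros((n, n))
    for i, hood in enumerate(hoods):
        dist = np.linalg.norm(vertices[hood] - vertices[i], axis=1)
        w = np.exp(-dist)
        alpha[i, hood] = w / w.sum()
    return alpha

def mesh_conv(features, alpha, theta):
    """Aggregate neighbour features with alpha, then map channels with theta."""
    return alpha @ features @ theta

vertices = np.array([
    [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0], [0.25, 0.25, 0.0],
])
faces = np.array([[0, 1, 4], [1, 2, 4], [2, 3, 4], [3, 0, 4]])

alpha = aggregation_weights(vertices, neighbourhoods(faces, len(vertices)))
rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 8))         # 3 input channels -> 8 output channels
out = mesh_conv(vertices, alpha, theta)     # vertex positions used as input features
print(out.shape)
```

Because `alpha` is computed from `vertices` alone, it can be built once per mesh and reused by every layer, which is the decoupling of geometry from features noted above.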

Mesh Segmentation

The above CNN is used to build a model for mesh segmentation, the task of classifying each vertex on a mesh. Mesh segmentation is important for video game development, especially in the context of UV unfolding, which is the primary method by which one “paints” a mesh. UV unfolding, briefly, is the projection of a mesh onto a 2D plane, flattening it out while still maintaining its structure. However, when handling a more elaborate mesh, it may be easier to first segment it into semantically distinct parts (e.g., for a humanoid mesh, segments can include arms, legs, etc.) and then project those parts separately.

To test our CNN and train our deep learning model, we use a subset of the COSEG dataset, which comprises 3 different classes of meshes: tele-aliens, vases, and chairs. The size of the meshes varies both within and between classes. We use a deep learning architecture comprising 16 layers of the proposed CNN, each accompanied by a residual layer. In Figure [5], we see that the results fare well. Not bad for such a simple model that relies only on the CNN!
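As a rough picture of what such a stack might look like, here is a NumPy forward pass with 16 mesh convolutions and residual (skip) connections. Everything beyond "16 layers with residuals" is an assumption made for the sketch: the ReLU nonlinearity, the hidden width, the final linear classifier, the random untrained weights, and the uniform stand-in for \( \alpha \). The actual architecture is in the GitHub code mentioned in the introduction.

```python
import numpy as np

def init_params(rng, in_channels, width, num_classes, num_layers=16):
    """Random (untrained) filters for `num_layers` convolutions and a classifier."""
    def random_filter(rows, cols):
        return 0.1 * rng.standard_normal((rows, cols))
    return {
        "first": random_filter(in_channels, width),
        "hidden": [random_filter(width, width) for _ in range(num_layers - 1)],
        "classify": random_filter(width, num_classes),
    }

def segmentation_forward(features, alpha, params):
    """Per-vertex class scores, shape (num_vertices, num_classes)."""
    # First convolution lifts the input channels to the hidden width.
    x = np.maximum(alpha @ features @ params["first"], 0.0)
    for theta in params["hidden"]:
        # Residual connection: add each layer's output to its input.
        x = x + np.maximum(alpha @ x @ theta, 0.0)
    return x @ params["classify"]

rng = np.random.default_rng(0)
num_vertices = 5
# Stand-in aggregation matrix: every vertex averages over all vertices.
alpha = np.full((num_vertices, num_vertices), 1.0 / num_vertices)
features = rng.standard_normal((num_vertices, 3))

params = init_params(rng, in_channels=3, width=16, num_classes=8)
scores = segmentation_forward(features, alpha, params)
labels = scores.argmax(axis=1)  # one predicted segment label per vertex
print(labels)
```

Since the output has one row per vertex regardless of how many vertices there are, the same parameters can be applied to meshes of different sizes, which matters given that mesh sizes vary within and between classes.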

We also test our method on a harder dataset, namely the HumanSegmentation dataset, which, as the name implies, involves the segmentation of humanoid meshes. In this case, the meshes have been segmented into 8 parts. While the results on this dataset are not as clean as before, the model still offers reasonably good segmentations.

**Figure 5**: Results of segmentation model on COSEG dataset. Note that the performance on the tele-aliens class is not as good as the others.

**Figure 6**: Results of segmentation model on HumanSegmentation dataset.

Too good to be true?

The above results seem good. However, are they too good? In general, we would like our model to be resistant to changes in the data that do not really affect our perception of it. For example, in the case of semantic human segmentation above, stretching our meshes slightly in different directions does not affect our perception of them: we can still identify the legs, arms, etc. Does our model share this resilience? Figure [7] investigates this by taking a mesh from the HumanSegmentation dataset and stretching and shearing it (respectively). We see that the segmentation performed by the model deteriorates. A closer examination of Figure [4], which indicates how the weights are computed, explains why: the weights depend on vertex-pair distances. The stretching and shearing transformations applied to the mesh do not preserve vertex-pair distances, thereby affecting the weights used by the CNN and altering the end results (refer again to Figure [3a]). On the other hand, transformations of the mesh that leave vertex-pair distances unchanged should be safe. This is particularly important, for example, for a set of meshes that are different poses of one mesh, e.g. different poses of a Non-Playable Character (NPC). In such a case, we would expect the vertex-pair distances to be the same whether the humanoid is sitting, standing, etc., so that the segmentation is left unchanged.
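This argument can be checked numerically on a toy mesh. The sketch below compares edge lengths before and after a rigid rotation, an uneven stretch, and a shear. The tetrahedron, the specific transformation matrices, and the helper `edge_lengths` are illustrative assumptions. A rigid rotation is used here as the simplest example of a distance-preserving transformation.

```python
import numpy as np

def edge_lengths(vertices, faces):
    """Lengths of the three edges of every triangle, shape (F, 3)."""
    tri = vertices[faces]  # (F, 3, 3): the corner coordinates of each face
    return np.linalg.norm(tri - np.roll(tri, -1, axis=1), axis=2)

vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])

angle = np.deg2rad(30.0)
rotation = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
stretch = np.diag([1.5, 1.0, 0.8])        # uneven scaling along the axes
shear = np.array([
    [1.0, 0.5, 0.0],                      # x is shifted in proportion to y
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

base = edge_lengths(vertices, faces)
for name, matrix in [("rotation", rotation), ("stretch", stretch), ("shear", shear)]:
    transformed = vertices @ matrix.T
    unchanged = np.allclose(edge_lengths(transformed, faces), base)
    print(f"{name}: edge lengths preserved = {unchanged}")
```

Any weights computed purely from edge lengths are therefore identical for the rotated mesh and different for the stretched and sheared ones, which matches the behaviour seen in Figure [7].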

**Figure 7**: Results of segmentation model on HumanSegmentation dataset after transforming some of the meshes. The segmentation results are not resilient to these basic transformations since they do not preserve vertex-pair distances. Left: Original mesh. Middle: Mesh scaled unevenly along perpendicular directions. Right: Mesh sheared along different directions.

In general, a model being adversely affected by subtle changes to the input is a persistent problem throughout deep learning, and geometric deep learning is no exception. As such, much work has been devoted to understanding how models are affected by such perturbations, and how we can remedy the situation. For example, a common and simple method is data augmentation which, in our case, could mean adding the transformed meshes above to the original dataset. Alternatively, one could tackle the problem from an architecture perspective and build a more robust CNN, which makes up a large chunk of the ongoing work within geometric deep learning (referring again to this survey). In any case, our example above only scratches the surface of the potential of geometric deep learning and its relevance to the gaming industry.
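As a minimal sketch of what such augmentation could look like, the function below applies a random mild stretch and shear to the vertices of a mesh. The function name `random_stretch_shear`, the parameter ranges, and the toy tetrahedron are assumptions for illustration, not the settings used for the results above.

```python
import numpy as np

def random_stretch_shear(vertices, rng, max_stretch=0.2, max_shear=0.2):
    """Return a randomly stretched and sheared copy of the (N, 3) vertices.

    Faces and per-vertex labels are untouched: only positions change, so the
    same segmentation labels can be reused for the augmented mesh.
    """
    # Diagonal entries scale each axis by a factor in [1 - s, 1 + s].
    transform = np.diag(1.0 + rng.uniform(-max_stretch, max_stretch, size=3))
    # Off-diagonal entries shear one axis in proportion to another.
    shear = rng.uniform(-max_shear, max_shear, size=(3, 3))
    np.fill_diagonal(shear, 0.0)
    return vertices @ (transform + shear).T

vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
rng = np.random.default_rng(0)
augmented = [random_stretch_shear(vertices, rng) for _ in range(4)]
print(augmented[0])
```

During training, each mesh would be passed through a function like this with fresh random draws, so the model sees many distorted variants of every labelled example.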

Conclusion

Extending the success of deep learning for image-related tasks to three-dimensional data structures is not an easy task, but it is most certainly a tractable one. While the modern workhorse of deep learning, namely the convolutional neural network, does not naturally translate to the 3D domain, we have shown that you can still construct simple yet powerful analogs. Moreover, we have seen that their simplicity comes at the cost of some instability under basic perturbations of the input data. Nonetheless, great strides have been made in this vein under the geometric deep learning moniker. The increasing prevalence of deep learning, in tandem with the ubiquity of three-dimensional data, makes the pairing of the two a natural one and might just push computer graphics to unprecedented levels.

Author

Bilal ABBASI is a researcher at Eidos-Sherbrooke. He obtained his PhD in applied mathematics in 2018, where his research focused on building numerical methods to solve geometrically motivated equations. In the final year of his PhD he joined the budding deep learning movement. However, despite his change in research direction, his work at Eidos-Sherbrooke has remained parallel to his affinity for geometry. That is, his research can be characterized by the data that he works with: three-dimensional structures. His research is at the intersection of geometry processing and deep learning, and he is a firm believer that this combination has the potential to push the boundary of computer graphics in video games to extraordinary levels.