Introduction:

Problem statement: “Learning texture (and shape) representations: Learn high-quality textures of 3D data to enable learning of probabilistic generative models for texturing unseen 3D models.”

AtlasNet introduced a way to use latent shape representations to generate 3D objects from given meshes, and its authors suggest that such a parametrization also makes it easy to store texture information on surfaces. MeshCNN, a contemporaneous work, introduced CNNs for meshes, analogous to CNNs in the image domain, with operations and features defined on edges. DeepSDF is another approach that uses neural networks to approximate the signed distance function in order to learn a shape representation. These works have shown promising results on downstream tasks such as reconstruction, segmentation, and synthesis, leveraging the powerful representations learned by their networks. This motivates us to study whether we can build a model that also takes in texture information (appearance) and draws relations between shape and appearance.

Generative models in the image domain, specifically image-to-image translation networks, have shown how to learn a mapping from one domain A to another domain B in a conditional setting, where the generator combines the input with either a single learned latent code or codes sampled from a distribution. [2, 6, 7] have shown impressive results in various downstream applications, and these works inspired us to adapt their ideas to 3D data.

Broadly speaking, we hope to train a network that, when given a 3D mesh as input, is able to produce a realistically textured output. The ambiguity of this mapping (from a textureless 3D model to a textured 3D model) is distilled into a low-dimensional latent vector, which at test time can be randomly sampled. Our goal is to learn meaningful latent representations for appearance (simplified as texture) and for shape, such that this task produces realistic and physically plausible textures.

Related Work:

As mentioned in the introduction, we were inspired by generative models in the image domain. As just one example, in “Toward Multimodal Image-to-Image Translation”, Zhu et al. [2] created a deterministic generator that, when given an input image, can use stochastically sampled latent codes to produce a wide variety of outputs. Their architecture consists of a generator (an encoder-decoder) and a discriminator. They explored three different objective functions. In the first, the ground-truth image is encoded into the latent space, giving the generator the ability to reconstruct the output image. In the second, a random latent vector is provided to the generator, and an encoder then attempts to recover that latent vector from the output image. The third is a hybrid combination of the two. By providing both qualitative and quantitative evaluation criteria, they showed how their results compared with respect to the two main goals of realism and diversity; their hybrid model achieved especially impressive results.
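To make the two latent objectives concrete, here is a minimal sketch in PyTorch (our own illustration, not the authors' code; `encoder`, `generator`, `x`, and `y` are hypothetical placeholders for the encoder, generator, input image, and ground-truth output):

```python
import torch
import torch.nn.functional as F

def reconstruction_objective(encoder, generator, x, y):
    # Objective 1: encode the ground-truth output y into a latent code,
    # then ask the generator to reconstruct y from (x, z).
    z = encoder(y)
    y_hat = generator(x, z)
    return F.l1_loss(y_hat, y)

def latent_recovery_objective(encoder, generator, x, latent_dim=8):
    # Objective 2: draw a random latent code, generate an output,
    # then ask the encoder to recover that code from the output.
    z = torch.randn(x.size(0), latent_dim, device=x.device)
    y_hat = generator(x, z)
    z_hat = encoder(y_hat)
    return F.l1_loss(z_hat, z)

# Objective 3 (the hybrid) combines both terms, alongside the usual GAN losses.
```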

A living list of current implementations: link link link link link

We have also included a list of relevant works and resources at the end of this document to keep track of all papers and outside information that we use and reference.

Data:

We plan to use data from the 3D Future Dataset [4], which was developed by professional designers and contains high-quality 3D instances of furniture with high-resolution, informative textures. The dataset contains both synthetic scenes and 3D instances of furniture, but for our purposes we will only be dealing with the 3D furniture instances, of which there are 9,992 across 34 different categories. These 34 categories can be grouped into broader categories such as “bed”, “cabinet”, or “chair”. Because we are unsure of the power and versatility of our model, and to keep data size and compute time manageable, we will begin by training and testing only on furniture that falls into the broader category of “sofa”. The diagram below, taken from [5], shows the distribution of the 9,992 instances of 3D furniture data. The relevant information for our purposes is that there are 5 categories within the broader “sofa” category: Three-Seat Sofa, Loveseat Sofa, L-Shaped Sofa, Lazy Sofa, and Chaise Longue Sofa. Within these categories there are a total of 1,588 instances, though we have not yet decided whether we will use all of them or only a subset.

[Figure: distribution of the 9,992 3D-FUTURE furniture instances across the 34 categories, taken from [5]]

Each instance of the dataset contains four components: a rendered image (.jpg), a raw model (.obj), a normalized model (.obj), and a corresponding UV texture (.png). We plan to use the normalized model rather than the raw model to aid in training.

We should not have to perform much preprocessing on this data, such as re-meshing, subdivision, or simplification. We may have to manipulate the data so that each mesh has approximately the same number of edges. In the case of unrealistic runtimes, we may also have to downsample the meshes, though we hope to avoid this since we were drawn to the 3D Future dataset because of its intricate details. A rough sketch of how we might inspect and, if needed, downsample a mesh is shown below.
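This is a minimal sketch assuming the Open3D library; the file path and target triangle count are placeholders, not values taken from the dataset:

```python
# Inspect a 3D-FUTURE mesh and, only if it is too dense for training,
# reduce it with quadric decimation (hypothetical path and threshold).
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("path/to/normalized_model.obj")
print(f"vertices: {len(mesh.vertices)}, triangles: {len(mesh.triangles)}")

TARGET_TRIANGLES = 20_000  # illustrative value only
if len(mesh.triangles) > TARGET_TRIANGLES:
    mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=TARGET_TRIANGLES)

o3d.io.write_triangle_mesh("normalized_model_simplified.obj", mesh)
```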

Methodology:

We aim to develop a pipeline that can take a given mesh and fit a range of realistic textures from a latent texture space. This involves training a function that takes as input a shape latent vector (representing the input mesh), a texture latent vector, and an xyz-point on the mesh; the output is the texture's RGB color at that xyz-point on the input shape. We would train this model using the meshes and high-resolution textures of the 3D Future dataset. The shape latent vector will be extracted in one of two ways: 1) by sampling 3D points from the input mesh and applying PointNet, or 2) by running MeshCNN directly on the input mesh. During training, the texture latent vector will be sampled from a Gaussian distribution. Given the shape and texture latent vectors, the network learns a continuous function mapping xyz-points to the RGB colors of the texture. The model is trained as a generative adversarial network, where the discriminator compares a 2D image of the predicted texture against 2D images of same-class objects in the 3D Future dataset. Once the model is trained, we should be able to interpolate over the texture latent vector to produce varying textures for the input shape.
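As a concrete illustration, below is a minimal sketch of the kind of conditional texture function we have in mind (our own assumptions, not the Texture Fields authors' implementation; the latent dimensions and layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class TextureFunction(nn.Module):
    """Maps (xyz point, shape latent, texture latent) to an RGB color."""
    def __init__(self, shape_dim=512, texture_dim=16, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + shape_dim + texture_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, xyz, z_shape, z_texture):
        # xyz: (B, N, 3) surface points; z_shape: (B, shape_dim); z_texture: (B, texture_dim)
        cond = torch.cat([z_shape, z_texture], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, xyz.shape[1], -1)  # broadcast to every point
        return self.mlp(torch.cat([xyz, cond], dim=-1))        # (B, N, 3)

# During adversarial training the texture latent would be sampled from a Gaussian:
# z_texture = torch.randn(batch_size, 16)
```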

If we run into issues training the model with a generative adversarial network, we can instead try to extract the texture latent vector during training by passing a 2D image of the input object through an image encoder. The loss would then be calculated by directly comparing the predicted texture against the ground truth texture of the input object.
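A minimal sketch of this fallback, assuming a standard torchvision ResNet-18 as the image encoder (an assumption on our part, not a committed design choice; the latent dimension and loss are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ImageEncoder(nn.Module):
    """Encodes a rendered 2D view of the object into a texture latent vector."""
    def __init__(self, latent_dim=16):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)
        self.backbone = backbone

    def forward(self, image):           # image: (B, 3, H, W)
        return self.backbone(image)     # (B, latent_dim)

def texture_reconstruction_loss(texture_fn, z_shape, z_texture, xyz, gt_rgb):
    # Compare predicted colors at sampled surface points against the
    # ground-truth texture colors at the same points.
    pred_rgb = texture_fn(xyz, z_shape, z_texture)
    return F.l1_loss(pred_rgb, gt_rgb)
```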

We can possibly extend this model by incorporating a sinusoidal representation network (e.g., SIREN) in the hope of generating more accurate textures. SIREN's capacity to approximate high-frequency functions is a natural fit for learning detailed textures; a sketch of a SIREN-style layer is shown below.
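This is a minimal sketch of a sine-activated layer that could replace the ReLU layers in the texture function above; the frequency omega_0 = 30 and the initialization bounds follow the SIREN paper, and everything else is a placeholder:

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sine activation, as in SIREN."""
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        # Initialization scheme recommended in the SIREN paper.
        with torch.no_grad():
            bound = 1.0 / in_features if is_first else math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```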
[Figure: pipeline from the Texture Fields paper [8]]

Metrics:

Qualitative:

  1. Explore the latent space via interpolation and check whether the generated textures are satisfactory.
  2. Human evaluation.

Quantitative: It’s difficult to quantitatively evaluate generative models. However, we propose to use the following metrics:

  1. As our goal is to model realistic appearance, we can employ other learned models to “judge” the realism of our output; i.e., we can use a pre-trained segmentation network trained on the same object class to segment our model’s outputs, and run the same network on our training data. We can then compare standard segmentation metrics such as pixel accuracy and IoU (a minimal sketch of these two metrics follows this list).
  2. Perceptual Similarity: link
  3. SSIM: structural similarity index measure
  4. FID (Fréchet Inception Distance)
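A minimal sketch of pixel accuracy and mean IoU computed from integer class masks (the mask shapes and class count are placeholders):

```python
import torch

def pixel_accuracy(pred, target):
    # pred, target: (H, W) integer class masks
    return (pred == target).float().mean().item()

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = (pred_c | target_c).sum().item()
        if union == 0:
            continue  # class absent from both masks; skip it
        inter = (pred_c & target_c).sum().item()
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```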

Base goal: Build a training pipeline that takes in 3D meshes and learns a function to predict texture information (without worrying about accuracy).
Target goal: Produce reasonable outputs on the test set and determine which shape encoder works best.
Stretch goal: Achieve the same or higher accuracy on our tests compared with other contemporary research.

Ethics:

This problem clearly has an industry application to the development of video games, virtual reality, and movies. As these are highly commercialized products, the major stakeholders in this problem are the companies that put time, money, and resources into developing products that they believe will appeal to a certain audience. At the core, our project is about producing realistic output and the most obvious mistakes will be those that result in unrealistic textures for a certain rendering. If our algorithm was employed in a product and failed to deliver realistic outputs, users would most likely be dissatisfied. Video games, virtual reality, and movies function as a form of escapism, and an unrealistic visual input greatly detracts from this goal. A product that performs poorly would be harmful for the major stakeholders by costing time, money, and possibly company reputation.

One relevant consideration for our project is how culturally representative our training set is. As quotidian life becomes more virtual, the diversity of cultural representation within digital environments is increasingly important. In an application such as generating textures from a texture latent space, one should evaluate the training dataset and consider how representative the learned latent space really is. Popular methods for generating 3D environments could shape the aesthetics of our digital lives, perhaps with an unwitting bias towards particular cultures or socio-economic classes. This can prove isolating for many underrepresented communities. If generative approaches became the mainstay of digital design, it would be worthwhile to ensure that the learned latent spaces are truly representative of the diversity in our global, digital society.

Division of labor:

The major tasks, in chronological order, are data preprocessing, model construction, training, and testing. Specifically, they are:

  1. Preprocess the data from 3D Future
  2. Build the model based on the Texture Fields paper (and presumably also related components from models like MeshCNN)
  3. Train the model on our dataset
  4. Compare the performance of different shape encoders when training on our dataset
  5. Test the result using our metrics

Tasks 1 and 2 can be approached in parallel. We plan to split the work 1:3 or 2:2 depending on how long preprocessing takes. One possible plan is that Vikas will do task 1 and Angelina, Joshua, and Moyi will build the models (we may have one more person work on preprocessing, and assignments may change).

After the first two tasks, tasks 3 and 4 can be done in parallel. We will split the training evenly, with each member using a different encoder.

Finally, for testing (task 5), we will split the test set and each member will run a portion of it; the reported accuracy will be the average of our results.

(We will make a more refined plan soon and possibly change some details of the current rough plan.)

Relevant Works/Resources:

  1. Groueix, Thibault, Fisher, Matthew, Kim, Vladimir G., Russell, Bryan, and Aubry, Mathieu. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. arXiv: link
  2. Zhu, Jun-Yan, et al. "Toward multimodal image-to-image translation." Advances in neural information processing systems. 2017. link
  3. Hanocka, Rana, Hertz, Amir, Fish, Noa, Giryes, Raja, Fleishman, Shachar, and Cohen-Or, Daniel. MeshCNN: A Network with an Edge. ACM Transactions on Graphics (TOG), 2019. link
  4. 3D-FUTURE: 3D FUrniture shape with TextURE: link
  5. link
  6. Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. link
  7. Semantic Image Synthesis with Spatially-Adaptive Normalization : link
  8. (Texture Fields) link
  9. (Deep Geometric Prior) link
  10. (Scene Representation Networks) link
  11. (Inferring Semantic Information) link
  12. (SynSin) link
  13. 3D Future link

Updates

Introduction

Problem statement: “Learning texture (and shape) representations: Learn high-quality textures of 3D data to enable learning of probabilistic generative models for texturing unseen 3D shapes.”

Recent works on 3D shape representations (such as AtlasNet, MeshCNN, and DeepSDF) and on image-domain generative models (such as image-to-image translation with conditional adversarial networks) inspired us to build a model that learns texture information (appearance) and draws relations between shape and appearance in 3D space. In our background research, we found a closely related 2019 work, Texture Fields, which learns a parameterized continuous function for representing texture information in 3D space. Our project will build on this work and train a network that produces more meaningful latent representations for appearance and shape. Such a representation, when randomly sampled at test time, should produce more physically plausible, higher-quality textures. We plan to incorporate more recent techniques (such as DeepSDF and SIREN) and to evaluate efficiency and output quality across different models using training data from 3D Future and ShapeNet.

Challenges

The base of our project is the Texture Fields code and architecture, and although we plan on making various structural changes with the intent of improved performance, our first step was making sure we were able to run their code as it is. Specifically, we wanted to make sure each of us individually had the code set up and running somewhere we had access. While the authors of Texture Fields made their code publicly available, getting the code running locally has been a challenge and more of a time-consuming process than previously anticipated.

One upcoming challenge is figuring out how to store and share the data. The training data provided by Texture Fields contains only the car category, which is already 33 GB. This file size is cumbersome for our local machines, and clearly we will need to shift to GCP to train our model. As no member of our group has prior experience with GCP, we anticipate a learning curve as we determine the best way to run our project.

Insights

The demo provided by Texture Fields runs on our local machines, and we have gotten both the conditional and unconditional generators to work. The output contains rendered images from different viewpoints. However, the current model cannot reproduce finer details with high quality; this is an area we want to improve. We also realized that the released implementation does not include the step of sampling point clouds from the input meshes, so this will need to be incorporated into our data preprocessing.

Plan

Understanding the Texture Fields code has taken longer than we previously anticipated, and we discovered that more preprocessing steps are necessary than we expected. The Texture Fields architecture first preprocesses each mesh into a point cloud, but the publicly available code does not include the preprocessing scripts. As such, we now realize that we will have to devote time to preprocessing the 3D Future data so that it is compatible with Texture Fields. We are a little behind where we had hoped to be at this point, but we have a plan for moving forward; finding the preprocessing scripts used for ShapeNet would help speed this up.

There are two main tasks for our team at the current stage: preprocessing and building the model. One group will preprocess the 3D meshes into point clouds, depth maps, and rendered images to be fed into training, and the other group will dive into the source code (Texture Fields, DeepSDF, SIREN), understand the detailed structures, and port the model to GCP. We will then swap out the shape and image encoders to test whether newer methods can learn more meaningful parameters. Finally, we will attempt to train our model on a high-quality dataset of 3D textures to explore how it performs on higher-frequency input. A minimal sketch of the point-cloud step is shown below.
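This sketch assumes the trimesh library; the path and sample count are placeholders:

```python
import numpy as np
import trimesh

mesh = trimesh.load("path/to/normalized_model.obj", force="mesh")

# Sample points uniformly over the surface area; `face_idx` records which
# triangle each point came from, which we can later use to look up UV
# coordinates and texture colors for that point.
points, face_idx = trimesh.sample.sample_surface(mesh, count=100_000)

np.savez("pointcloud.npz", points=points.astype(np.float32), faces=face_idx)
```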
