Problem statement: “Learning texture (and shape) representations: Learn high-quality textures of 3D data to enable learning of probabilistic generative models for texturing unseen 3D models.”
AtlasNet introduced a way to use latent representations of shapes to generate 3D objects from given meshes, and its authors suggest that such a parametrization makes it easy to store texture information on surfaces. MeshCNN, a contemporaneous work, introduced CNNs for meshes, analogous to CNNs in the image domain, with operations and features defined on edges. DeepSDF is another approach, using neural networks to approximate the signed distance function in order to learn shape representations. These works have shown promising results on various downstream tasks such as reconstruction, segmentation, and synthesis, leveraging the powerful representations learned by their networks. This motivates us to study whether we can build a model that also takes in texture information (appearance) and draws relations between shape and appearance.
Generative models in the image domain, specifically image-to-image translation networks, have shown how to learn a mapping from one domain A to another domain B in a conditional setting, where the generator learns the mapping conditioned either on a single learned latent code or on codes sampled from a distribution (multiple latent codes). [2,6,7] have shown impressive results in various downstream applications. These works have inspired us to adapt their ideas to 3D data.
Broadly speaking, we hope to train a network that, when given a 3D mesh as input, is able to produce a realistically textured output. The ambiguity of this mapping (from a textureless 3D model to a textured 3D model) is distilled into a low-dimensional latent vector, which at test time can be randomly sampled. Our goal is to learn meaningful latent representations for both appearance (simplified as texture) and shape, such that the aforementioned task produces realistic and physically plausible textures.
As mentioned in the introduction, we were inspired by generative models in the image domain. As just one example, in "Toward Multimodal Image-to-Image Translation," Zhu et al. created a deterministic generator that, when given an input image, is able to use stochastically sampled latent codes to produce a wide variety of outputs. Their network architecture consists of a generator (composed of an encoder-decoder) and a discriminator, and they explored three different objective functions. In the first, the ground-truth image is encoded into the latent space, giving the generator the ability to reconstruct the output image. In the second, a random latent vector is provided to the generator, which then attempts to recover that latent vector from the output image. The third is a hybrid combination of the two. Using both qualitative and quantitative evaluation criteria, they showed how their results compared with respect to the two main goals of realism and diversity; their hybrid model achieved especially impressive results.
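For reference, the hybrid objective combines the terms of the first two variants (restated here from the BicycleGAN paper to the best of our reading; G is the generator, E the encoder, D the discriminator, and the λ's are loss weights):

$$
G^{*}, E^{*} = \arg\min_{G,E}\,\max_{D}\;
\mathcal{L}_{\mathrm{GAN}}^{\mathrm{VAE}}(G,D,E)
+ \lambda\,\mathcal{L}_{1}^{\mathrm{VAE}}(G,E)
+ \mathcal{L}_{\mathrm{GAN}}(G,D)
+ \lambda_{\mathrm{latent}}\,\mathcal{L}_{1}^{\mathrm{latent}}(G,E)
+ \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}(E)
$$

The VAE-superscripted terms and the KL term come from the image-reconstruction (cVAE-GAN) objective, while the remaining GAN and latent-recovery terms come from the second (cLR-GAN) objective.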
References are included at the end of this paper to keep track of all papers and outside information that we use.
We plan to use data from the 3D-FUTURE dataset, which was developed by professional designers and contains high-quality 3D instances of furniture with high-resolution, informative textures. The dataset contains both synthetic scenes and 3D instances of furniture, but for our purposes we will only use the 3D furniture instances, of which there are 9,992 across 34 categories. These 34 categories can be grouped into broader categories such as "bed," "cabinet," or "chair." As we are unsure of the power and versatility of our model, and taking into consideration data size and compute time, we will begin by training and testing only on furniture in the broader "sofa" category. The diagram below, taken from the 3D-FUTURE paper, shows the distribution of the 9,992 instances of 3D furniture data. The relevant information for our purposes is that five categories fall under the broader "sofa" category: Three-Seat Sofa, Loveseat Sofa, L-Shaped Sofa, Lazy Sofa, and Chaise Longue Sofa. Together these categories contain 1,588 instances, though we are currently unsure whether we will use only a subset of them.
Each instance of the dataset contains four components: a rendered image (.jpg), a raw model (.obj), a normalized model (.obj), and a corresponding UV texture (.png). We plan to use the normalized model rather than the raw model to aid in training.
We will not have to perform much preprocessing on this data, such as re-meshing, subdivision, or simplification. We may have to manipulate the data so that each mesh has approximately the same number of edges. In the case of unrealistic runtimes, we may have to downsample the meshes, though we hope to avoid this, since we were drawn to the 3D-FUTURE dataset for its intricate details.
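As a first preprocessing step, we can inspect how mesh sizes vary across the dataset before deciding whether downsampling is needed. A minimal sketch, assuming the `trimesh` Python library (the file path is a placeholder):

```python
# Minimal sketch: inspect mesh size statistics before deciding on downsampling.
# Assumes the `trimesh` library; "normalized_model.obj" is a placeholder path.
import trimesh

mesh = trimesh.load("normalized_model.obj", force="mesh")
print("vertices:    ", len(mesh.vertices))
print("faces:       ", len(mesh.faces))
print("unique edges:", len(mesh.edges_unique))
```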
We aim to develop a pipeline that can take a given mesh and fit a range of realistic textures drawn from a latent texture space. This involves training a function that takes as input a shape latent vector (representing the input mesh), a texture latent vector, and an xyz-point on the mesh, and outputs the texture's RGB color at that point on the input shape. We would train this model using the meshes and high-resolution textures of the 3D-FUTURE dataset. The shape latent vector will be extracted in one of two ways: 1) by sampling 3D points from the input mesh and applying PointNet; or 2) by running MeshCNN directly on the input mesh. During training, the texture latent vector will be sampled from a Gaussian distribution. Given the shape and texture latent vectors, the network learns a continuous function mapping xyz-points to the RGB colors of the texture. The model is trained as a generative adversarial network, where the discriminator compares a 2D image of the predicted texture against 2D images of same-class objects in the 3D-FUTURE dataset. Once the model is trained, we should be able to interpolate over the texture latent vector to produce varying textures for the input shape.
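As a concrete illustration, here is a minimal PyTorch sketch of this texture function (all layer widths and latent dimensions are hypothetical placeholders; a full implementation would follow the Texture Fields architecture more closely):

```python
import torch
import torch.nn as nn

class TextureField(nn.Module):
    """Maps (xyz, shape code, texture code) -> RGB; all sizes hypothetical."""
    def __init__(self, shape_dim=512, tex_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + shape_dim + tex_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB
        )

    def forward(self, xyz, z_shape, z_tex):
        # xyz: (B, N, 3) surface points; z_shape: (B, shape_dim); z_tex: (B, tex_dim)
        n = xyz.shape[1]
        cond = torch.cat([z_shape, z_tex], dim=-1)   # (B, shape_dim + tex_dim)
        cond = cond.unsqueeze(1).expand(-1, n, -1)   # broadcast code to every point
        rgb = self.mlp(torch.cat([xyz, cond], dim=-1))
        return torch.sigmoid(rgb)                    # colors in [0, 1]

# During training, z_shape would come from PointNet or MeshCNN and
# z_tex ~ N(0, I), e.g.:
#   rgb = TextureField()(points, z_shape, torch.randn(batch_size, 64))
```

The discriminator (not shown) would then operate on 2D renders of these predicted colors.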
If we run into issues training the model adversarially, we can instead extract the texture latent vector during training by passing a 2D image of the input object through an image encoder. The loss would then be computed by directly comparing the predicted texture against the ground-truth texture of the input object.
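A sketch of this fallback, reusing the texture function above (the encoder architecture here is a hypothetical stand-in):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Tiny CNN (hypothetical) mapping a rendered view to a texture code."""
    def __init__(self, tex_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, tex_dim)

    def forward(self, img):  # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))

# Training step sketch: reconstruct the known texture instead of fooling a
# discriminator (gt_rgb would be sampled from the ground-truth UV texture):
#   z_tex = encoder(rendered_view)
#   loss = F.l1_loss(texture_field(points, z_shape, z_tex), gt_rgb)
```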
We can possibly extend this model by incorporating a sinusoidal representation network (e.g., SIREN) to generate more accurate textures. SIREN's capacity to approximate high-frequency functions is a natural fit for learning detailed textures.
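Concretely, the ReLU layers in the sketch above could be swapped for sine layers. A minimal version of one such layer, following the published SIREN formulation but omitting its specialized weight initialization:

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """y = sin(w0 * (Wx + b)); w0 = 30 is the value suggested by the SIREN paper.
    Note: SIREN also prescribes a specific weight initialization, omitted here."""
    def __init__(self, in_dim, out_dim, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))
```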
Qualitative:
- Explore the latent space via latent-space interpolation and check whether the generated textures are satisfactory.
- Human evaluation.
Quantitative: It’s difficult to quantitatively evaluate generative models. However, we propose to use the following metrics:
- As our goal is to model realistic appearance, we can employ other learned models to "judge" the realism of our output; i.e., we can use a pre-trained segmentation network, trained on the same object class, to segment our model's outputs, and run the same network on our training data. We can then compare using standard segmentation metrics such as pixel accuracy and IoU (see the sketch after this list).
- Perceptual Similarity: link
- SSIM: the structural similarity index measure.
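A minimal sketch of the two segmentation-based metrics above, assuming binary NumPy masks (how generated and real renders are paired for comparison is left open here):

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels on which the two segmentation masks agree."""
    return float((pred == gt).mean())

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union for binary masks (1 = object, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    inter = np.logical_and(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0
```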
Base goal: build a training pipeline that takes in 3D meshes and learns a function to predict texture information (without regard to accuracy).
Target goal: produce reasonable outputs on the test set and determine which shape encoder works best.
Stretch goal: achieve accuracy on our tests that matches or exceeds other contemporary research.
This problem has clear industry applications in the development of video games, virtual reality, and movies. As these are highly commercialized products, the major stakeholders are the companies that put time, money, and resources into developing products they believe will appeal to a certain audience. At its core, our project is about producing realistic output, and the most obvious mistakes will be those that result in unrealistic textures for a given rendering. If our algorithm were employed in a product and failed to deliver realistic outputs, users would most likely be dissatisfied. Video games, virtual reality, and movies function as a form of escapism, and unrealistic visual input greatly detracts from this goal. A product that performs poorly would harm the major stakeholders by costing time, money, and possibly company reputation.
One relevant consideration for our project is how culturally representative our training set is. As quotidian life becomes more virtual, the diversity of cultural representation within digital environments is increasingly important. In an application such as generating textures from a texture latent space, one should evaluate the training dataset and consider how representative the learned latent space really is. Popular methods for generating 3D environments could shape the aesthetics of our digital lives, perhaps with an unwitting bias towards particular cultures or socio-economic classes. This can prove isolating for many underrepresented communities. If generative approaches became the mainstay of digital design, it would be worthwhile to ensure that the learned latent spaces are truly representative of the diversity in our global, digital society.
Division of labor:
The major tasks, in chronological order, are data preprocessing, model construction, training, and testing. Specifically, they are:
- Preprocess the data from 3D-FUTURE
- Build the model described in the Texture Fields paper (presumably also incorporating related components from models like MeshCNN)
- Train the model on our dataset
- Compare the performance of different shape encoders when trained on our dataset
- Test the result using our metrics
Tasks 1 and 2 can be approached in parallel. We plan to split the team 1:3 or 2:2 depending on how long preprocessing takes. One possible plan is that Vikas will do task 1 while Angelina, Joshua, and Moyi build the models (we may put one more person on preprocessing, and assignments may change).
After the first two parts, tasks 3 and 4 can be done in parallel. We will split the training runs evenly across the different encoders.
Finally, for testing (task 5), we will split the test set and each member will run a portion of it. The overall accuracy will be the average of our results.
(We will make a more refined plan soon and possibly change some details of the current rough plan.)
- Groueix, Thibault, et al. "AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. arXiv: link
- Zhu, Jun-Yan, et al. "Toward multimodal image-to-image translation." Advances in neural information processing systems. 2017. link
- Hanocka, Rana, et al. "MeshCNN: A Network with an Edge." ACM Transactions on Graphics (TOG). 2019. link
- 3D-FUTURE: 3D FUrniture shape with TextURE: link
- Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. link
- Park, Taesung, et al. "Semantic Image Synthesis with Spatially-Adaptive Normalization." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. link
- Oechsle, Michael, et al. "Texture Fields: Learning Texture Representations in Function Space." Proceedings of the IEEE International Conference on Computer Vision. 2019. link
- Williams, Francis, et al. "Deep Geometric Prior for Surface Reconstruction." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. link
- Sitzmann, Vincent, et al. "Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations." Advances in Neural Information Processing Systems. 2019. link
- (Inferring Semantic Information) link
- Wiles, Olivia, et al. "SynSin: End-to-End View Synthesis from a Single Image." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. link