Recreation of Pix2Vox using TensorFlow

Group Members

Hannah, Mustafa, Angela

Intro

We will be implementing the paper “Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images” by Xie et al. This model recovers 3D representations of objects from single-view and multi-view images and aims to improve upon existing models that use RNN-based approaches. Instead of an RNN, Pix2Vox uses three components: an encoder-decoder that computes a set of features and recovers a coarse 3D shape of the object from each input image, a context-aware fusion module that selects high-quality reconstructions for each part of the object and fuses them into a single volume, and a refiner that corrects parts of the 3D volume that were recovered incorrectly. Altogether, this is a structured prediction problem.

Related Work

Summary

Similar to Pix2Vox, the 3D-R2N2 approach performs single- and multi-view 3D reconstruction and employs an encoder-decoder structure within its network. The three main components of the 3D-R2N2 architecture are the encoder (a 2D CNN), the recurrence component (a 3D convolutional LSTM), and the decoder (a 3D deconvolutional neural network). The encoder first encodes each input image into a low-dimensional feature vector. The feature vector is then passed into the 3D convolutional LSTM. The main goal of this component is to allow the network to retain what it has seen and to update its memory when it sees a new image. The network has this ability because of its structure: it consists of 3D-LSTM units that are distributed in a grid. Each unit is responsible for a particular part of the final output, so when a new image contains information about a part of the object that differs from the current predicted reconstruction, the network updates the unit corresponding to that area. Finally, the 3D-LSTM passes its hidden state to the decoder, which generates the probabilistic 3D voxel reconstruction of the object.

Public Implementations

Original Implementation

Pix2Vox++ (extension of Pix2Vox Paper) Code

Data

ShapeNet: This is a large-scale repository of 3D CAD models. It has indexed over 3 million models, of which about 220,000 are classified into 3,135 categories organized using WordNet synsets.

ShapeNet Dataset

Pix3D: This dataset is a large-scale benchmark of diverse image-shape pairs with pixel-level 2D-3D alignment.

Pix3D Dataset

ScanObjectNN: This dataset was not used in the paper; we will be trying our implementation on it. This is a newly published dataset that holds 2,902 3D objects in 15 categories. The categories are ‘bag’, ‘box’, ‘desk’, ‘pillow’, ‘sofa’, ‘bed’, ‘table’, ‘cabinets’, ‘display’, ‘shelves’, ‘bin’, ‘chair’, ‘door’, ‘sink’, and ‘toilet’. This is a large dataset, so it will require significant preprocessing.

ScanObjectNN

RBO: This dataset was not used in the paper; we will be trying our implementation on it. It covers 14 articulated objects commonly found in human environments and provides RGB-D video sequences, along with recorded wrenches, of humans interacting with them.

Methodology

We aim to reconstruct the 3D shape of an object from either a single RGB image or multiple RGB images. To do this, we will first use an encoder to produce a feature map from each input image. The output of the encoder will then go into a decoder, which will use each feature map to generate a corresponding coarse 3D volume. Next, the coarse 3D volumes (one per input view) will go through a context-aware fusion module, which will select high-quality reconstructions for each part of the object from the coarse volumes to obtain a single fused 3D volume. Lastly, a refiner with skip connections will refine the fused 3D volume to generate the final reconstruction.
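The fusion step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes, following our reading of the paper, that each view gets a per-voxel score, that the scores are normalized across views with a softmax, and that the fused volume is the weighted sum of the coarse volumes. The function name `fuse_volumes` is hypothetical, and the scores here are random placeholders standing in for the output of a learned scoring network.

```python
import numpy as np


def fuse_volumes(coarse_volumes, scores):
    """Fuse per-view coarse volumes using per-voxel softmax weights.

    coarse_volumes: array of shape (n_views, D, H, W) with occupancy
        probabilities, one coarse volume per input view.
    scores: array of the same shape with unnormalized per-voxel scores
        (assumption: in the real model a learned network produces these).
    """
    coarse_volumes = np.asarray(coarse_volumes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    # Softmax across the view axis, shifted by the max for numerical stability.
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)

    # The weighted sum over views yields a single fused volume.
    return (weights * coarse_volumes).sum(axis=0)


# Usage with placeholder data: 3 views of a 32x32x32 voxel grid.
rng = np.random.default_rng(0)
volumes = rng.random((3, 32, 32, 32))
view_scores = rng.standard_normal((3, 32, 32, 32))
fused = fuse_volumes(volumes, view_scores)
print(fused.shape)  # (32, 32, 32)
```

With a single input view the softmax weight is 1 everywhere, so the fused volume equals the coarse volume, which matches the single-view case described above.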

We will use RGB images of 3D objects from our ScanObjectNN dataset to train our model. We will implement our network in TensorFlow and train both Pix2Vox-F and Pix2Vox-A using the Adam optimizer. The initial learning rate will be set to 0.001 and will decay by a factor of 2 after 150 epochs. First, we will train both networks, excluding the context-aware fusion module, on single-view input for 250 epochs. Then we will train the whole network jointly, feeding it a random number of input images, for 100 epochs.
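The learning-rate schedule described above can be written as a small function. This is a sketch under our own assumptions: epochs are counted from 0, "decay by 2" means the rate is halved, and the constant and function names are hypothetical. The optional second argument is there only so the function has the `(epoch, lr)` shape that a Keras `LearningRateScheduler` callback expects, should we choose to use one.

```python
INITIAL_LR = 1e-3     # initial learning rate stated in the text
DECAY_EPOCH = 150     # number of epochs trained before the decay applies
DECAY_FACTOR = 2.0    # assumption: "decay by 2" means divide by 2


def learning_rate(epoch, current_lr=None):
    """Return the learning rate for a zero-based epoch index.

    current_lr is ignored; it is accepted only for compatibility with
    schedulers that pass the current rate as a second argument.
    """
    if epoch < DECAY_EPOCH:
        return INITIAL_LR
    return INITIAL_LR / DECAY_FACTOR


# The first 150 epochs (indices 0..149) use 0.001; later epochs use 0.0005.
for epoch in (0, 149, 150, 249):
    print(epoch, learning_rate(epoch))
```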

We believe that the hardest part of implementing this model will be training it on our new dataset; we are using a large dataset, and it may need a lot of preprocessing. Another potential challenge is implementing the model in TensorFlow.

We are trying our implementation on a completely new dataset called RBO. We are also implementing this model using TensorFlow.

Metrics

Experiments

Once we have the Pix2Vox-F implementation, we plan on evaluating its performance on synthetic images (the ShapeNet dataset) and on real-world images (the Pix3D dataset). Once we have the Pix2Vox-A implementation, we plan on doing the same (evaluating the model on synthetic and real-world images). Additionally, we plan on evaluating the model’s performance on unseen objects (with ShapeNetCore). Once these experiments are completed and match the results of the paper, we plan on evaluating both the Pix2Vox-F and Pix2Vox-A implementations on ScanObjectNN and on the RBO Dataset of Articulated Objects and Interactions.

Metric

In the paper, the authors use IoU (intersection over union) as a similarity measure to compare the predicted voxel occupancy with the ground truth. The higher the IoU value, the better the reconstruction result. The authors were hoping to find that their single-view and multi-view reconstruction results were better than those of existing methods. In other words, they were hoping for a higher IoU value for their results compared to other methods.
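To make the metric concrete, here is a minimal NumPy sketch of voxel IoU. The function name `voxel_iou` is hypothetical, the default threshold of 0.5 used to binarize the predicted probabilities is an assumption rather than the paper's setting, and returning 1.0 when both volumes are empty is a convention we chose.

```python
import numpy as np


def voxel_iou(predicted_probs, ground_truth, threshold=0.5):
    """Intersection over union between a predicted and a true voxel grid.

    predicted_probs: array of occupancy probabilities in [0, 1].
    ground_truth: array of the same shape with 0/1 occupancy.
    threshold: assumed binarization threshold for the prediction.
    """
    pred = np.asarray(predicted_probs) > threshold
    truth = np.asarray(ground_truth).astype(bool)

    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()

    # Assumed convention: two empty volumes count as a perfect match.
    if union == 0:
        return 1.0
    return float(intersection / union)


# Tiny example: one voxel is occupied in both grids,
# and three voxels are occupied in at least one of them.
pred = np.array([[0.9, 0.2], [0.6, 0.1]])
truth = np.array([[1, 0], [0, 1]])
iou = voxel_iou(pred, truth)  # 1 / 3
```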

Base Goal

Implementing Pix2Vox-F and replicating their results using the ShapeNet and Pix3D Datasets (the datasets used in their paper)

Target Goal

Implementing Pix2Vox-A and replicating their results using the ShapeNet and Pix3D Datasets (the datasets used in their paper)

Stretch Goal

Using ScanObjectNN and the RBO Dataset of Articulated Objects and Interactions with our implementations of Pix2Vox-F and Pix2Vox-A

Ethics

What broader societal issues are relevant in your chosen problem space?

One application of this method is damage assessment on buildings after earthquakes, as suggested by this paper. If the model produces accurate results, it could expedite post-earthquake damage evaluation by serving as an alternative to manual evaluation, and thus support more timely disaster-relief work. However, if the model is not trained and evaluated properly, it could produce inaccurate 3D reconstructions, which could result in 1) a reconstruction containing damage that was not there, causing resources to be wasted on unnecessary construction projects, or 2) a reconstruction that misses major damage and therefore puts people’s safety at risk.

Why is Deep Learning a good approach to this problem?

The reason differs by application, but for several applications, solving the problem through deep learning is beneficial. In the above example, timeliness is important so that resources can be allocated to fix the issues quickly and ensure people’s safety. Another application is medical image construction. In this case, deep learning can speed up the process of diagnosis and may produce more accurate results than human interpretation alone. This has the potential to increase safety for the patient.