Who: Snerfs up!
Sylvie Bartusek (sbartuse) Zihan Zhu (zzhu92) Charlotte Introcaso (cintroca) Emre Arslan (earslan)
Introduction:
We aim to implement the 2020 paper “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” by Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. The paper presents a supervised learning method for visual computing in which photorealistic novel 3D views of a scene with complex geometry and appearance are synthesized from any angle, given a small set of input images. The method optimizes a continuous volumetric scene function, represented by a multilayer perceptron that takes as input a 5D coordinate (3D position and 2D viewing direction) and outputs volume density and view-dependent color. The weights of the neural network are a volumetric representation of the scene. We chose this topic because some members of our group are pursuing the visual computing pathway and expressed interest in NeRFs.
Related Work:
Neural Volumes (NV) https://arxiv.org/abs/1906.07751
This paper presents a learning-based approach to representing dynamic objects and scenes with an integral projection model, using 2D images from multi-view capture. The method is able to model real-world scenes that contain complex occlusions, reflectance variability, and topological evolution. It has two parts: an encoder/decoder network and a ray-marching component. The encoder/decoder network takes input images of a dynamic scene, creates a latent representation, and outputs a novel 3D volume representation. The encoder feeds images from multiple cameras into camera-specific CNNs to obtain the latent representation. This representation supports conditional decoding, in which only a portion of the scene’s state is modified. The ray-marching algorithm is differentiable and renders an image from the 3D volume representation given a particular point of view. The model trains the weights of the encoder/decoder network by minimizing the squared pixel reconstruction loss.
Implementation:
Data:
The original paper provides two kinds of datasets:
Datasets of synthetic renderings of objects:
– DeepVoxels: Learning Persistent 3D Feature Embeddings (link): DeepVoxels contains four Lambertian objects with simple geometry. Each object is rendered at 512 × 512 pixels from viewpoints sampled on the upper hemisphere (479 for input and 1000 for testing). (DeepVoxels also includes real scene data, but it is not used in the NeRF paper.)
– The authors’ own dataset (link): It contains path-traced images of eight objects that exhibit complicated geometry and realistic non-Lambertian materials. Six are rendered from viewpoints sampled on the upper hemisphere and two from viewpoints sampled on a full sphere. The authors render 100 views of each scene as input and 200 for testing, all at 800 × 800 pixels.
Real images of complex scenes, consisting of 8 scenes captured with a handheld cellphone:
– 5 taken from the LLFF paper (Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines, link).
– 3 captured by the authors themselves (link).
If we want to experiment with more datasets, we can consider the datasets from the 3D Gaussian Splatting paper, which consist of real images: link
The data size varies depending on how many images a dataset contains, and each image’s size depends on its resolution; individual images range from roughly 100 KB to 400 KB. There is no significant preprocessing; we can load the images directly. One thing that may need to be considered is that the synthetic images come with ground-truth camera poses, intrinsic parameters, and scene bounds, but the real scene images do not. As suggested in the paper, we can use the COLMAP structure-from-motion package to estimate these parameters (Structure-from-Motion Revisited: link / link).
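As a rough sketch of what loading the synthetic data could look like (assuming the Blender-style transforms_train.json layout distributed with the NeRF synthetic scenes, which stores a camera_angle_x field and a 4 × 4 transform_matrix per frame; the helper name load_synthetic_split is our own):

```python
import json
import os

import numpy as np
from PIL import Image

def load_synthetic_split(scene_dir, split="train"):
    """Load images and camera poses from a NeRF-style synthetic scene.

    Assumes the Blender-format layout: transforms_<split>.json with a
    camera_angle_x field and a per-frame 4x4 transform_matrix.
    """
    with open(os.path.join(scene_dir, f"transforms_{split}.json")) as f:
        meta = json.load(f)

    images, poses = [], []
    for frame in meta["frames"]:
        img_path = os.path.join(scene_dir, frame["file_path"] + ".png")
        img = np.asarray(Image.open(img_path), dtype=np.float32) / 255.0
        images.append(img)
        poses.append(np.asarray(frame["transform_matrix"], dtype=np.float32))

    images = np.stack(images)  # (N, H, W, 4) -- synthetic renders are RGBA
    poses = np.stack(poses)    # (N, 4, 4) camera-to-world matrices
    H, W = images.shape[1:3]
    # Focal length recovered from the horizontal field of view.
    focal = 0.5 * W / np.tan(0.5 * meta["camera_angle_x"])
    return images, poses, (H, W, focal)
```

The real-scene captures would instead be run through COLMAP first to recover the poses, intrinsics, and scene bounds mentioned above.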
Methodology:
To learn scenes and synthesize new views, the paper uses a multilayer perceptron (MLP). The MLP is trained on images to create an implicit representation of the scene. Given a position in space and a viewing direction, the network returns what is seen from that angle as RGB and volume density values. To train it, a number of 2D images (with known camera positions) representing the scene are used, and the MLP is essentially overfit to that particular scene. For each pixel in the images, we shoot a ray into the scene, sample points along that ray, feed those sampled points to the network, and use the returned colors and densities to composite a pixel color via volumetric rendering (a rough sketch of this per-ray pipeline is given below).
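To make the per-ray pipeline concrete, here is a minimal sketch in PyTorch of the steps described above: sample points along each ray, query a network for color and density, and alpha-composite the results into a pixel color. The TinyNeRF module is a placeholder stand-in rather than the paper’s full 8-layer architecture, and the sampling here is uniform rather than stratified.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Placeholder MLP: maps a 3D point (viewing direction omitted for brevity)
    to an RGB color and a volume density."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def forward(self, x):
        out = self.net(x)
        rgb = torch.sigmoid(out[..., :3])  # colors in [0, 1]
        sigma = torch.relu(out[..., 3])    # non-negative density
        return rgb, sigma

def render_rays(model, rays_o, rays_d, near=2.0, far=6.0, n_samples=64):
    """Sample points along each ray, query the network, and alpha-composite."""
    # Evenly spaced depths (stratified sampling would add jitter here).
    t = torch.linspace(near, far, n_samples)                           # (S,)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]   # (R, S, 3)

    rgb, sigma = model(pts)                                            # (R, S, 3), (R, S)

    # Distances between adjacent samples (last one padded with a large value).
    deltas = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])         # (S,)
    alpha = 1.0 - torch.exp(-sigma * deltas)                           # (R, S)
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j).
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                            # (R, S)
    return (weights[..., None] * rgb).sum(dim=1)                       # (R, 3) pixel colors
```

Training would then compare these rendered pixel colors against the ground-truth pixels with a squared-error loss, as in the paper.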
Though the network architecture is a simple MLP, implementing the computer graphics aspects and the optimizations described in the paper will be a challenge, such as training two networks with different numbers of samples per ray to increase training speed. Also, even with these optimizations, training the network will be hard, as this paper is the first iteration of neural radiance fields; representing scenes at high resolution and detail will take quite a long time.
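For the two-network (coarse/fine) scheme mentioned above, the fine network’s extra samples are drawn from the distribution implied by the coarse network’s compositing weights. A simplified sketch of that resampling step, assuming PyTorch (the real implementation interpolates within bins rather than reusing the coarse depths directly):

```python
import torch

def sample_fine_points(t_coarse, weights, n_fine=128):
    """Draw extra sample depths from the distribution implied by the coarse
    network's compositing weights (inverse-transform sampling along each ray)."""
    pdf = weights + 1e-5                                  # (R, S) avoid division by zero
    pdf = pdf / pdf.sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)                       # (R, S)
    u = torch.rand(cdf.shape[0], n_fine)                  # (R, F) uniform samples
    # For each u, find the coarse bin whose CDF first exceeds it,
    # then take that bin's depth as the new sample location.
    idx = torch.searchsorted(cdf, u, right=True).clamp(max=t_coarse.shape[-1] - 1)
    return torch.gather(t_coarse, -1, idx)                # (R, F) new depths
```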
Metrics:
As seen in Table 1 from the paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (the table can be found at the top of this writeup), three metrics are used to evaluate the performance of top-performing techniques for view synthesis:
– Peak Signal-to-Noise Ratio (PSNR)
– Structural Similarity Index (SSIM)
– Learned Perceptual Image Patch Similarity (LPIPS)
We do not have to use all three to evaluate our model’s performance. Since each metric behaves differently depending on the input, we will most likely switch between them as we go and settle on the one that best reflects our results.
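As a concrete example of the simplest of these metrics, PSNR can be computed directly from the mean squared error between a rendered view and the ground-truth image. A minimal sketch, assuming images stored as float arrays scaled to [0, 1]:

```python
import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio (in dB) between two images in [0, max_val]."""
    mse = np.mean((rendered - ground_truth) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```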
Other features:
“We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation.”
– Base features: none
– Target: positional encoding
– Added features/optimizations (stretch): hierarchical sampling, view dependence
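A minimal sketch of the target feature, positional encoding: each input coordinate p is mapped to (sin(2^0 π p), cos(2^0 π p), …, sin(2^(L−1) π p), cos(2^(L−1) π p)) before being fed to the MLP. The paper uses L = 10 frequencies for position and L = 4 for viewing direction; the version below assumes PyTorch and interleaves sines and cosines per coordinate, which differs slightly from the paper’s notation but carries the same information.

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """Map each coordinate of x to sines and cosines at frequencies (2^0 .. 2^(L-1)) * pi.

    x: tensor of shape (..., D); returns a tensor of shape (..., 2 * num_freqs * D).
    """
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi  # (L,)
    scaled = x[..., None] * freqs                                            # (..., D, L)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)          # (..., D, 2L)
    return enc.flatten(start_dim=-2)                                         # (..., D * 2L)
```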
Ethics:
How are you planning to quantify or measure error or success? What implications does your quantification have?
Since we are creating scenes and images, there is a degree of subjectivity in assessing our performance. Even though there are quantitative measures like PSNR, they might not accurately reflect image quality. Human eyes are naturally a great way to judge whether an image looks high quality or has artifacts, but if humans are judging the results, different people might have different opinions. This means that a research paper author could argue that they found novel results because their PSNR (or other measurement) score is high, even when the images do not look great, or when the high PSNR is due to some form of image manipulation or selection. Similar arguments could be made from human observation alone, claiming that rendered images look great even if the PSNR score is low, which cannot be objectively assessed because different people have different perceptions and opinions. Overall, the measurements need to be chosen carefully, and research must be criticized accordingly.
What broader social issues are relevant to your chosen problem space?
Like any image-generating model, NeRFs are subject to the risk of misuse. As NeRF-related technology advances, it might become possible for people to easily capture 2D images of a scene on their cell phones and synthesize 3D novel views that could be altered to create fake content for the purpose of spreading false information. For example, research at Adobe showed how NeRFs could be further developed into a method for generating deepfakes (link, link). Unlike other deepfake technology that superimposes onto an existing image, RigNeRF synthesizes an entirely new scene based on volumetric neural rendering and is able to separate poses and expressions when synthesizing novel views. With further advancements, RigNeRF could lead to the development of full-body deepfakes with rendered movement, poses, and expressions.
Division of Labor:
– Loading data & preprocessing -> Charlotte
– Building the model -> Sylvie
– Volumetric rendering -> Emre
– Metrics and loss implementation and testing -> Zihan
– Poster (everyone)