Inspiration

NeRF reconstructions are local to the scene and do not generalize to other scenes. I am trying to canonicalize NeRF reconstructions of single-object scenes, enabling viewing, annotating, and generating novel views from a canonicalized NeRF reconstruction.

What it does

Given a few NeRF models of unaligned objects from the same category, my model aligns all of these objects into a canonical frame, enabling them to be viewed from the same direction simply by querying the NeRF models.

How we built it

We use equivariant neural networks built over Spherical Harmonic functions to learn pose-equivariant features from 3D volumes, and then compute their dot product to obtain pose-invariant canonical features that enable processing the NeRF reconstruction.

Challenges we ran into

Training NeRF models is data-hungry; moreover, NeRF reconstructions are often noisy, which makes canonicalizing them challenging.

Accomplishments that we're proud of

Aligned object models with consistent part segmentations.

What we learned

We learned to build equivariant implicit neural networks and to canonicalize them for 3D pose.

What's next for Canonical Neural Fields

Applying it to real-world objects and using it for robotic tasks that require better object understanding. Extending it to multi-object scenes is also important future work.

Check-in 2

Title

Canonical Neural Fields

Author

Rahul Sajnani

Introduction

Neural Fields are becoming increasingly popular for representing indoor scenes: they capture a scene's reconstruction and enable estimation of novel views. A Neural Field stores a field in a neural network that can be queried at arbitrary spatial locations or time steps. We (referring to I as we) estimate pose-invariant features from neural fields that can be leveraged to perform multiple downstream tasks such as pose estimation, canonicalization, and segmentation.

Related Work

Our work is closely related to ConDor [1], but we canonicalize density fields instead of point clouds. Canonicalizing density fields enables capturing features for any queried point in space and allows estimating motion-invariant features that can be leveraged to perform further tasks over the field, for instance pose estimation, manipulation, and editing of objects within fields. We further discuss additional related work that is important to this project.

Recent research has shown that self/weak supervision is sufficient for learning pose canonicalization on point clouds [1, 2, 3], 2D key-points [4], and images [5, 6]. None of these previous self-supervised methods can operate directly on neural fields -- to the best of our knowledge, ours is the first.

Data

The input to our method is a voxel density grid obtained by sampling a pre-trained Neural Radiance Field. We train NeRF models using TensoRF, which accelerates scene reconstruction. Each scene is viewed from 54 cameras and contains a single object at the origin. The rotated voxel density fields are input to our canonicalizer. We predict a point-wise pose-invariant embedding and a rotation-equivariant transformation that canonicalizes the input density field.
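The grid-sampling step above can be sketched as follows. This is a minimal illustration, not our training code; `density_fn` is a hypothetical callable standing in for whatever density-query API the trained TensoRF model exposes, and the resolution and scene bound are placeholder values.

```python
import numpy as np

def sample_density_grid(density_fn, resolution=64, bound=1.0):
    """Query a trained NeRF's density at uniformly spaced grid points.

    density_fn: hypothetical callable mapping (N, 3) xyz coordinates
    to (N,) sigma values (e.g. a TensoRF density query).
    Returns a (resolution, resolution, resolution) density volume.
    """
    axis = np.linspace(-bound, bound, resolution)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    pts = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)
    sigma = density_fn(pts)  # one density value per grid point
    return sigma.reshape(resolution, resolution, resolution)

# Toy stand-in density: a solid sphere of radius 0.5 at the origin.
grid = sample_density_grid(
    lambda p: (np.linalg.norm(p, axis=-1) < 0.5).astype(float),
    resolution=32,
)
```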

Methodology

Given a uniformly sampled density field from a pre-trained NeRF model, we compute rotation-equivariant convolutional features using Tensor Field Networks, which are built using Spherical Harmonic functions. We compute two such rotation-equivariant features and take their dot product to obtain canonical pose-invariant features. We can use these canonical pose-invariant features to compute canonical coordinates for each point in space and to obtain a rotation-equivariant transformation matrix that canonicalizes the input density field to the canonical frame.
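The key algebraic fact behind the dot-product step can be illustrated in isolation (this is not the full TFN pipeline): for type-1 (vector) equivariant features that transform as f -> R f under a rotation R, the dot product of two such features is unchanged when both rotate together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random rotation matrix via QR decomposition of a Gaussian matrix;
# flipping the sign when det < 0 ensures det(R) = +1 for a 3x3 matrix.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = q * np.sign(np.linalg.det(q))

f = rng.normal(size=3)  # stand-in type-1 equivariant feature
g = rng.normal(size=3)  # second equivariant feature

inv_before = f @ g            # invariant computed in the original pose
inv_after = (R @ f) @ (R @ g)  # both features rotate with the input
assert np.allclose(inv_before, inv_after)
```

The same cancellation holds for higher-order spherical-harmonic features, where the rotation acts through Wigner-D matrices instead of R.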

Metrics

We measure the performance of our canonicalizer on three metrics introduced in ConDor [1]: (1) Ground Truth Consistency -- measures agreement between the ground-truth canonical pose and the predicted canonicalization, (2) Instance-Level Consistency -- measures agreement between canonicalizations of rotated poses of the same instance, and (3) Category-Level Consistency -- measures agreement between canonicalized shapes of different instances within the same category. Of these, we do not evaluate the Ground Truth Consistency metric, as it has a degeneracy: the metric is minimized if the canonicalizer simply predicts the identity.
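As a concrete (hypothetical) sketch of how Instance-Level Consistency could be computed, one can canonicalize several rotated copies of the same instance and average the pairwise Chamfer distances between the results; the exact formulation in [1] may differ, and `canonicalize` is a placeholder for the learned canonicalizer.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a, b of shape (N, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def instance_level_consistency(canonicalize, shape, rotations):
    """Sketch of an IC-style metric: canonicalize rotated copies of one
    instance and average the pairwise Chamfer distance between them.
    canonicalize: placeholder mapping an (N, 3) point set to its
    canonical frame. A perfect canonicalizer yields 0."""
    canon = [canonicalize(shape @ R.T) for R in rotations]
    dists = [chamfer(canon[i], canon[j])
             for i in range(len(canon))
             for j in range(i + 1, len(canon))]
    return float(np.mean(dists))
```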

We introduce a new canonicalization metric, Ground Truth Equivariance Consistency (GEC), which measures canonicalization performance against ground-truth canonicalized shapes and does not suffer from the degeneracy of the Ground Truth Consistency metric of [1]. The final submission will evaluate our canonicalization performance on (1) Instance-Level Consistency (IC), (2) Category-Level Consistency, and (3) Ground Truth Equivariance Consistency (GEC), and compare it against the canonicalization performance of [1] to benchmark our field canonicalization.

Ethics

Who are the major stakeholders of this problem?

A broader goal of this project is to infer scene and object positions from a neural field. Robots nowadays use variants of NeRF to infer their surroundings and act on these scenes. For instance, if I ask a robot to move a cup, it must first understand where the cup is located (pose-equivariant information) and act accordingly. In cases where this algorithm fails to estimate the appropriate canonical pose or a pose-invariant embedding, the robot will perform the task incorrectly, which may not align with the needs of the end users, who are the primary stakeholders.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

We use a synthetic dataset that contains densities sampled from NeRF models trained over Blender-generated data. This data does not entirely align with the real world, as it does not replicate noise due to improper pose estimation, camera distortions, etc. The only noise in this dataset comes from NeRF reconstruction, which brings it closer to the real world but does not replicate all the noise sources in real scenes.

Algorithmic issue

Our method samples a uniform density field and then uses it to canonicalize the field. However, it would be more powerful if we could canonicalize fields by sampling randomly across the entire scene. This would allow feature aggregation at arbitrary locations in space instead of interpolating between voxel grid samples.
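The interpolation mentioned above is standard trilinear interpolation over the voxel grid; a minimal sketch (not our implementation) of querying a density grid at continuous points looks like this:

```python
import numpy as np

def trilinear(grid, pts):
    """Trilinearly interpolate a (D, D, D) density grid at continuous
    points pts of shape (N, 3), given in voxel coordinates [0, D-1]."""
    d = grid.shape[0]
    p = np.clip(pts, 0.0, d - 1 - 1e-6)  # keep i0 + 1 in bounds
    i0 = np.floor(p).astype(int)         # lower corner of each cell
    t = p - i0                            # fractional offset in the cell
    out = np.zeros(len(p))
    # Accumulate the 8 cell corners, weighted by their volumes.
    for corner in np.ndindex(2, 2, 2):
        c = np.asarray(corner)
        w = np.prod(np.where(c, t, 1.0 - t), axis=1)
        out += w * grid[tuple((i0 + c).T)]
    return out
```

Because trilinear interpolation is exact for functions that are linear in each index, a grid such as `grid[i, j, k] = 4i + 2j + k` is reproduced exactly at fractional coordinates.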
