Geodesic Flow: An Incremental Approach to Dense Semantic Correspondence

Cole Foster: cfoste18


Introduction

Geodesic Flow is a continuation of work done by Berk Sevilmis within the Laboratory of Engineering Man and Machine Systems (LEMS) in the Department of Electrical and Computer Engineering at Brown University.

The dense correspondence task takes two images, a source and a target, and produces a flow field such that each pixel in the source points to its matching pixel in the target. Semantic correspondence involves two images of the same class, where the goal is to align the two semantic objects.

Deep-learning-based approaches currently dominate the field, as they are able to learn semantic features. This model aims to improve any existing dense correspondence method by providing a series of intermediate images that can be used to incrementally warp the source image to the target image. For this, we construct a Latent Image Manifold by organizing a dataset of images, and we return the geodesic path as the shortest path between two images along the manifold.

In order to construct our manifold, we need a way to measure the similarity of two images. For this, we train an Image Embedding Network. We leverage the features extracted by SFNet to provide semantic awareness, and we add additional convolution and fully-connected layers to reduce the image to a 1024D embedding. The Euclidean distance between these embeddings can then be used as a metric of similarity between the images they were extracted from.
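As a concrete illustration of the similarity metric, here is a minimal sketch of comparing two 1024-D embeddings by Euclidean distance (the vectors here are random stand-ins for real EmbeddingNet outputs):

```python
import numpy as np

def embedding_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dissimilarity between two images, measured as the Euclidean
    (L2) distance between their embedding vectors."""
    return float(np.linalg.norm(a - b))

# Random 1024-D vectors standing in for EmbeddingNet outputs.
rng = np.random.default_rng(0)
e1, e2 = rng.standard_normal(1024), rng.standard_normal(1024)
d = embedding_distance(e1, e2)
```

A smaller distance means the two source images are considered more semantically similar.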

With the Image Embeddings, we construct the Latent Image Manifold over the SPair-71k train/validation dataset. We extract the embeddings from each image and create the manifold using the Relative Neighborhood Graph. Finally, given a source and target image, we can return the geodesic path by finding their nearest neighbors in the manifold and computing the shortest path along the manifold. The images can then be incrementally warped along this path to compute the Geodesic Flow.

Related Work

The Dense Correspondence problem is an active research area, with state-of-the-art methods including Transformers (CATs, 2021) and graph matching (Deep Graph Matching via Blackbox Differentiation of Combinatorial Solvers, 2020).

SFNet is a slightly older CNN-based approach to the dense semantic correspondence problem. It uses ResNet-101 feature extraction and trains adaptation layers using binary foreground masks. These masks restrict the loss to the semantic objects, allowing SFNet to learn semantic features. SFNet then builds a 20x20x20x20 correlation map that is matched to produce a 20x20 flow field, which is bilinearly interpolated to return the final flow field. Github: https://github.com/cvlab-yonsei/SFNet
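To make the correlation-map-to-flow step concrete, here is a simplified sketch that converts a (H, W, H, W) correlation map into a displacement field by taking a hard argmax per source cell (SFNet itself uses a differentiable kernel soft-argmax; the hard argmax here is an assumption made for illustration only):

```python
import numpy as np

def correlation_to_flow(corr: np.ndarray) -> np.ndarray:
    """Convert a (H, W, H, W) correlation map into a (H, W, 2) flow
    field: for each source cell, find the best-matching target cell
    (hard argmax) and store the (dx, dy) displacement to it."""
    H, W = corr.shape[:2]
    flow = np.zeros((H, W, 2))
    for y in range(H):
        for x in range(W):
            idx = np.argmax(corr[y, x])   # flat index over target cells
            ty, tx = divmod(idx, W)
            flow[y, x] = (tx - x, ty - y)
    return flow
```

The resulting coarse 20x20 field is what gets bilinearly interpolated up to image resolution.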

This project uses SFNet for a variety of purposes:

  • Feature extraction for the Embedding Network
  • Its loss term is used to train EmbeddingNet
  • Pretrained SFNet is used for the incremental warps

Data

  • VOC2012. SFNet was trained on the segmentation set from VOC2012. It provides 2,791 images (training and validation) with segmentations. These segmentations are used to produce the binary foreground masks. This data is also used to train the Embedding Network.
  • SPair-71k. This is a popular dataset used for evaluating dense semantic correspondences. It contains 1,800 images, and the 1,304 training/validation images are used to construct the Latent Image Manifold.
  • PF-Pascal. The Proposal Flow paper introduces the PF-PASCAL dataset for dense semantic correspondence evaluation. This dataset is used to evaluate our method's accuracy by PCK (percentage of correct keypoints).

Methodology

Image Embedding Network

To provide a measure of image similarity, we decided to produce an Image Embedding Network (creatively named EmbeddingNet). The similarity (or dissimilarity) of two images can be measured by the Euclidean distance between their embeddings. When approaching the architecture of this EmbeddingNet, we wanted the following characteristics:

  • A 1024D embedding vector is produced.
  • This embedding is dependent on semantic features, not background noise.

EmbeddingNet was trained in a potentially novel way. Typically, embedding networks are trained with triplet loss. This requires specifying three images, two of the same class (anchor and positive) and one of a different class (negative). The loss function enforces that the distance between the anchor and positive embeddings is less than the distance between the anchor and negative embeddings.
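The standard hinge-style triplet loss described above can be sketched as follows (a minimal numpy version; in practice a framework implementation such as PyTorch's `TripletMarginLoss` would be used):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pushes d(anchor, positive) to be at
    least `margin` smaller than d(anchor, negative). Zero loss once
    the margin is satisfied."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```

Minimizing this over many triplets pulls same-class embeddings together and pushes different-class embeddings apart.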

However, our semantic flow problem inherently deals with images of the same class, so the standard positive/negative assignment does not apply directly. The following solution was proposed:

  • Take three images of the same class. Choose one image as the anchor image.
  • The other two images are the positive and negative images, but the assignment is unknown.
  • Use an existing image similarity metric to decide which of the two images is more similar to the anchor image. That image becomes the positive; the other becomes the negative.
  • Use triplet loss on these three images.

In this project, pretrained SFNet was used to provide the metric of image similarity. In SFNet training, binary foreground masks are used to compute the loss of a correspondence. Here, the intuition is that the lower the loss, the better the warp, and thus the more similar the images are. This was extended to measure the similarity of anchor to positive and anchor to negative.
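The positive/negative assignment step can be sketched as a small helper. Here `warp_loss` is a hypothetical stand-in for the pretrained-SFNet correspondence loss (lower loss means a better warp, hence a more similar image):

```python
def assign_triplet(anchor, img_a, img_b, warp_loss):
    """Given three same-class images, use an existing correspondence
    model's loss to decide which of img_a / img_b is the positive.
    `warp_loss(src, tgt)` is a placeholder for the pretrained-SFNet
    loss used in the project. Returns (positive, negative)."""
    if warp_loss(anchor, img_a) <= warp_loss(anchor, img_b):
        return img_a, img_b
    return img_b, img_a
```

The returned pair then feeds directly into the triplet loss above, turning same-class triplets into valid training signal.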

Latent Image Manifold

Once the image embeddings are extracted, they can be compared by Euclidean distance, or the 2-norm. The nearest neighbor of a query image can be returned as the image whose embedding is closest to the query embedding.
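The nearest-neighbor lookup is just an argmin over L2 distances to the stored embeddings; a minimal sketch (function and variable names are illustrative, not from the project code):

```python
import numpy as np

def nearest_neighbor(query: np.ndarray, embeddings: np.ndarray) -> int:
    """Return the index of the manifold image whose embedding is
    closest (in L2 distance) to the query embedding.
    `embeddings` has shape (n_images, dim)."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    return int(np.argmin(dists))
```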

The latent image manifold was constructed over 1304 images from the SPair-71k dataset. Embeddings were extracted from each image, and each embedding acts as a node in the 1024D latent space. The Relative Neighborhood Graph was used to construct the manifold.
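The Relative Neighborhood Graph connects two points p and q iff no third point r is closer to both of them than they are to each other. A brute-force sketch (O(n^3), which is workable for ~1,300 nodes; a real implementation might use smarter pruning):

```python
import numpy as np

def relative_neighborhood_graph(points: np.ndarray):
    """Build the Relative Neighborhood Graph over (n, dim) points:
    p and q share an edge iff no third point r satisfies
    max(d(p, r), d(q, r)) < d(p, q)."""
    n = len(points)
    # Pairwise distance matrix.
    D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    edges = []
    for p in range(n):
        for q in range(p + 1, n):
            blocked = any(max(D[p, r], D[q, r]) < D[p, q]
                          for r in range(n) if r != p and r != q)
            if not blocked:
                edges.append((p, q))
    return edges
```

Because the RNG contains the minimum spanning tree, the resulting graph is always connected, which is what guarantees a geodesic path exists between any two embeddings.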

Geodesic Flow

Since the Relative Neighborhood Graph is a connected graph, we can always define a shortest path between two embeddings on the manifold. Given a source and target image, we find their nearest neighbors on the manifold and return the shortest path between them as the geodesic path.
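A sketch of the geodesic-path step, using Dijkstra's algorithm with edge weights assumed to be the embedding distances along each edge (the adjacency/weight representation here is illustrative, not the project's actual data structures):

```python
import heapq

def geodesic_path(adj, weights, src, dst):
    """Dijkstra over the manifold graph. `adj[u]` lists neighbors of
    u; `weights[(u, v)]` is the embedding distance along edge (u, v).
    Returns the node sequence of the geodesic path from src to dst."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue   # stale heap entry
        for v in adj[u]:
            nd = d + weights[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

The intermediate nodes on this path are the images used for the incremental warps.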

We directly warp consecutive images by SFNet (or any method) on the path to produce a series of incremental warps, which are combined (in order) to produce the final flow field. This flow is called the Geodesic Flow.
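Chaining consecutive warps amounts to composing flow fields: the combined flow follows the first field, then samples the second field at the landing position. A minimal sketch with nearest-neighbor sampling (bilinear sampling would be the assumption-free choice in practice):

```python
import numpy as np

def compose_flows(f1: np.ndarray, f2: np.ndarray) -> np.ndarray:
    """Compose two (H, W, 2) flow fields storing (dx, dy) per pixel:
    out(p) = f1(p) + f2(p + f1(p)), with nearest-neighbor sampling
    of f2 and clamping at the image border."""
    H, W = f1.shape[:2]
    out = np.zeros_like(f1)
    for y in range(H):
        for x in range(W):
            dx, dy = f1[y, x]
            xx = int(np.clip(round(x + dx), 0, W - 1))
            yy = int(np.clip(round(y + dy), 0, H - 1))
            out[y, x] = f1[y, x] + f2[yy, xx]
    return out
```

Folding this composition over every consecutive pair along the geodesic path yields the final Geodesic Flow field.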

Metrics

The metric used to measure success in dense semantic correspondence is Percentage of Correct Keypoints (PCK). In the test set, hand annotations identify semantic keypoints in the images. The keypoints of the warped source image are then compared to the keypoints of the target image, and PCK measures the fraction of keypoints that land within a distance threshold of their targets.
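A minimal PCK sketch, assuming the common convention where the threshold is a fraction alpha of the object bounding-box size (the exact normalization varies between benchmarks):

```python
import numpy as np

def pck(pred_kps: np.ndarray, gt_kps: np.ndarray,
        bbox_size: float, alpha: float = 0.1) -> float:
    """Percentage of Correct Keypoints: a predicted keypoint counts as
    correct if it lies within alpha * bbox_size of the ground truth.
    pred_kps and gt_kps have shape (n_keypoints, 2)."""
    thresh = alpha * bbox_size
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float(np.mean(dists <= thresh))
```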

Ideally, success of this project would show higher PCK on PF-PASCAL, PF-WILLOW, and SPair-71k when using Geodesic Flow compared to not using it. We directly compare SFNet with and without Geodesic Flow.

Ethics

Deep Learning is a good solution to the problem of producing an image embedding because models can learn semantic features from large collections of semantic images. Image embeddings are a low-risk area, as they are mostly used for retrieval. However, prior work on word embeddings has shown that bias can exist in embeddings, and such bias could show up in the geodesic paths between images.

We measure the success of this algorithm by PCK. Success in this project requires performing better than regular SFNet within the metric of PCK since we are using SFNet as a base. Luckily, this topic of semantic correspondence is low risk, so the implications of error or success do not extend far.

Division of Labor

Cole

  • Creating EmbeddingNet Architecture.
  • Training EmbeddingNet based on Triplet Loss Function.

Cole Foster

  • Manifold Creation using EmbeddingNet on SPair-71k
  • Full Pipeline from dataset to embeddings to manifold, and Saving/Loading it.
  • Nearest Neighbor Search and Geodesic Path Along Images

Cole Riley Foster

  • Incremental Warping via Geodesic Path with normal SFNet
  • Parent Geodesic Flow Class to handle NNS, Geodesic Path, and Incremental Warp
  • Evaluation on PF-PASCAL


Updates


Introduction

The purpose of the dense correspondence problem is to create a dense vector field from each pixel in the source image to its matching pixel in a target image. This can be a challenging problem when images of different objects, or at different viewpoints, are involved.

Deep Learning is currently dominating the field of dense semantic correspondences. These types of models can match learned semantic features rather than traditional local features (like SURF, SIFT, etc.). However, the problem is far from solved: new datasets like SPair-71k pose more difficult warps, with increased viewpoint variation, occlusion, and more.

This project aims to help existing models deal with difficult, large-variation warps by using a latent image manifold to provide a series of smaller-variation, less difficult warps that incrementally warp the source to the target image. Constructing this latent image manifold is itself a difficult problem: we use a graph to represent the natural relationships between semantic images, where similar images are connected by edges.

In order to create the hierarchy, we need a metric of dissimilarity between two given images. In this project, we propose that the success of a dense correspondence between two images also reflects their similarity: if the dense correspondence is very successful, we expect the images to be similar, and if not, dissimilar. The primary component of this project is the creation and training of an Image Embedding Neural Network where the Euclidean distance between two image embeddings reflects the success of a dense correspondence between the two images.

With this network, we can create the image manifold. In the Geodesic Flow process, the source and target image find their nearest neighbor (most similar image) among the latent image manifold. Then, the geodesic path is found as the shortest path between the two nearest neighbors traversing the graph. This provides a series of images that more naturally represent the transformation from the source to the target image.

From here, the original model can be used to warp the images incrementally. We expect some training to be required for the model to handle the geodesic warping well. The goal of this project is to show that geodesic warps have higher accuracy than the standalone warps; we aim to improve the effectiveness of SFNet (a current state-of-the-art model) with Geodesic Flow.

Challenges

After figuring out how exactly I wanted to set up my image embedding network, I had difficulty with the training process.

  • I built the network off of the features extracted by SFNet so it inherits the "object awareness".
  • However, the following layers are uninitialized (3 convolution layers and 2 FC layers). Thus, I needed to train the network, and I thought of two training objectives which both use the triplet loss function:
  • One: The Embedding Space should act as a classifier; images of the bicycle class should be separated from images of boats and trains. This required modification of the training data (using VOC2012) to be aimed towards classification.
  • Two: The distance between two vectors in the Embedding Space should reflect the quality of a dense correspondences between two images. This requires use of SFNet to warp images during training, which will most likely be a time-consuming training process (not yet done).
  • Training for both of these objectives may work against each other. Since objective two is the more important one, I plan to train for objective one and then use transfer learning before training on objective two.
  • There is a difficult question on how to measure the success of the embedding network- besides lowering loss, we need some way to quantify how well it is performing (so we know how to adjust training parameters).

Insights

There aren't really any good results to show yet. SFNet already produces quality warps, so further results can only be shown when the manifold is created and we can produce the geodesic paths.

I was impressed that my first training round (classification) was able to reduce the triplet loss successfully, but without a proper accuracy measure, I can't trust it. This is something I need to work out.

Plans

I am not quite on track with my original plan; I am definitely behind. Luckily, I believe the embedding network will be the majority of the project, with the rest being rather straightforward.

I will need to devote some more time to quantifying the success of the embedding network. Its main purpose is to produce a meaningful series of images between the source and target image, so I plan to incorporate visualizations as a measure of quality. For this reason, I will have to rework some of the plans to have the visualization ability ready before I can fully use the embedding net.

I am hoping that I will have time to retrain SFNet to use a larger feature space. Currently, it uses a 20x20 feature space to represent the flow over the entire image. The training objective here would be to increase the feature space (to 32x32 or 40x40) and retrain in order to produce smoother flow fields, which are better suited for geodesic flow (smaller-variation warps).
