Introduction
The goal of the dense correspondence problem is to compute a dense vector field that maps each pixel in a source image to its matching pixel in a target image. This is challenging when the images show different objects or are taken from different viewpoints.
Deep learning currently dominates the field of dense semantic correspondence. These models learn to match learned semantic features rather than traditional local features (like SURF, SIFT, etc.). However, the problem is far from solved: new datasets like SPair-71k pose more difficult warps, with increased viewpoint variation, occlusion, and more.
This project aims to help existing models deal with difficult, large-variation warps by using a latent image manifold to provide a series of smaller-variation, less difficult warps that incrementally warp the source image to the target image. Constructing this latent image manifold is a difficult problem. We use a graph to represent the natural relationships between semantic images, where similar images are connected by edges in this graph.
In order to create the hierarchy, we need a metric of dissimilarity between two given images. In this project, we propose that the success of a dense correspondence between two images also reflects their dissimilarity. If the dense correspondence is very successful, we expect the images to be similar, and if not, we expect them to be dissimilar. The primary component of this project is the creation and training of an Image Embedding Neural Network where the Euclidean distance between two image embeddings reflects the success of a dense correspondence between the two images.
With this network, we can create the image manifold. In the Geodesic Flow process, the source and target images each find their nearest neighbor (most similar image) on the latent image manifold. Then, the geodesic path is found as the shortest path through the graph between the two nearest neighbors. This provides a series of images that more naturally represent the transformation from the source to the target image.
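The Geodesic Flow lookup described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes the manifold is a k-nearest-neighbor graph over embedding vectors, that edge weights are Euclidean distances, and that the shortest path is found with Dijkstra's algorithm. The function names (`build_knn_graph`, `nearest_node`, `geodesic_path`) and the toy 2-D embeddings are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(embeddings, k):
    """Connect every image to its k nearest neighbors (undirected, weighted).

    Assumption: the manifold is approximated by a k-NN graph.
    """
    graph = {name: {} for name in embeddings}
    for name, vec in embeddings.items():
        others = sorted(
            (euclidean(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name
        )
        for dist, other in others[:k]:
            graph[name][other] = dist
            graph[other][name] = dist
    return graph


def nearest_node(embeddings, query):
    """Name of the manifold image whose embedding is closest to the query."""
    return min(embeddings, key=lambda name: euclidean(embeddings[name], query))


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns the node list from start to goal, or None."""
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph[node].items():
            new_dist = dist + weight
            if new_dist < best.get(nxt, math.inf):
                best[nxt] = new_dist
                parent[nxt] = node
                heapq.heappush(heap, (new_dist, nxt))
    return None


def geodesic_path(embeddings, graph, source_vec, target_vec):
    """Snap source and target to the manifold, then walk the graph between them."""
    start = nearest_node(embeddings, source_vec)
    goal = nearest_node(embeddings, target_vec)
    return shortest_path(graph, start, goal)


# Toy 2-D embeddings (hypothetical values) lying roughly on a line.
toy_embeddings = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.1, 0.0), "d": (3.3, 0.0)}
toy_graph = build_knn_graph(toy_embeddings, k=1)
toy_path = geodesic_path(toy_embeddings, toy_graph, (0.1, 0.0), (3.2, 0.0))
```

In this toy case the path visits every intermediate image in order, which is the behavior the manifold is meant to provide: each hop is a small-variation warp.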
From here, the original model can be used to warp the images incrementally. We expect some training to be required in order for the model to handle the geodesic warping better. The goal of this project is to show that geodesic warps have higher accuracy than standalone warps. We aim to improve the effectiveness of SFNet (a current state-of-the-art model) with Geodesic Flow.
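One way to turn a sequence of incremental warps into a single source-to-target flow is to compose the per-step flow fields. The sketch below is an assumption about how that could work, not a description of how the project does it: each flow is a grid of `(dx, dy)` displacements on the same H x W grid, and the intermediate lookup uses nearest-pixel rounding with clamping instead of bilinear interpolation. The names `compose_flows` and `chain_flows` are hypothetical.

```python
def compose_flows(flow_ab, flow_bc):
    """Compose flow A->B with flow B->C to get a flow A->C.

    Each flow is a list of rows; each entry is (dx, dy), the displacement
    from a pixel to its match in the next image. For a pixel p in A the
    composed displacement is flow_ab(p) + flow_bc(p + flow_ab(p)).
    """
    height, width = len(flow_ab), len(flow_ab[0])
    composed = []
    for y in range(height):
        row = []
        for x in range(width):
            dx1, dy1 = flow_ab[y][x]
            # Where this pixel lands in the intermediate image B
            # (nearest pixel, clamped to the grid: a simplifying assumption).
            xb = min(max(int(round(x + dx1)), 0), width - 1)
            yb = min(max(int(round(y + dy1)), 0), height - 1)
            dx2, dy2 = flow_bc[yb][xb]
            row.append((dx1 + dx2, dy1 + dy2))
        composed.append(row)
    return composed


def chain_flows(flows):
    """Compose a list of consecutive flows along a path into one flow."""
    total = flows[0]
    for next_flow in flows[1:]:
        total = compose_flows(total, next_flow)
    return total


# Toy 2 x 3 example: the first step shifts everything one pixel right;
# the second step shifts only the middle column one pixel down.
step_one = [[(1, 0)] * 3 for _ in range(2)]
step_two = [[(0, 1) if x == 1 else (0, 0) for x in range(3)] for _ in range(2)]
total_flow = chain_flows([step_one, step_two])
```

Here the pixel at column 0 moves right into the middle column and then picks up the downward shift, so its total displacement is `(1, 1)`, while the other pixels only move right.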
Challenges
After figuring out how exactly I wanted to set up my image embedding network, I had difficulty with the training process.
- I built the network off of the features extracted by SFNet so it inherits the "object awareness".
- However, the following layers are uninitialized (3 convolution layers and 2 FC layers). Thus, I needed to train the network, and I thought of two training objectives which both use the triplet loss function:
- One: The Embedding Space should act as a classifier; images of the bicycle class should be separated from images of boats and trains. This required modification of the training data (using VOC2012) to be aimed towards classification.
- Two: The distance between two vectors in the Embedding Space should reflect the quality of a dense correspondence between two images. This requires using SFNet to warp images during training, which will most likely be a time-consuming training process (not yet done).
- Training for both of these objectives may work against each other. Since objective two is the more important one, I plan to train for objective one first and then use "transfer learning" before training on objective two.
- A difficult open question is how to measure the success of the embedding network. Besides a decreasing loss, we need some way to quantify how well it is performing (so we know how to adjust training parameters).
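For reference, the triplet loss used by both objectives can be sketched as follows. This is a common form of the loss (non-squared Euclidean distance with a margin), not necessarily the exact variant used in this project; the margin value is an assumption. The `triplet_accuracy` helper is a hypothetical example of one simple measure beyond the raw loss: the fraction of held-out triplets in which the anchor ends up closer to the positive than to the negative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. The margin of 1.0 is an assumed default.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly.

    Hypothetical evaluation measure: a triplet counts as correct when the
    anchor is strictly closer to the positive than to the negative.
    """
    correct = sum(
        1 for anchor, positive, negative in triplets
        if euclidean(anchor, positive) < euclidean(anchor, negative)
    )
    return correct / len(triplets)


# Toy embeddings (hypothetical values).
anchor_vec, near_vec, far_vec = (0.0, 0.0), (0.0, 1.0), (3.0, 4.0)
good_loss = triplet_loss(anchor_vec, near_vec, far_vec)  # negative is far: no loss
bad_loss = triplet_loss(anchor_vec, far_vec, near_vec)   # roles swapped: penalized
accuracy = triplet_accuracy([
    (anchor_vec, near_vec, far_vec),
    (anchor_vec, far_vec, near_vec),
])
```

For objective one, positives and negatives would come from class labels. For objective two, one possible choice (again an assumption) is to pick the positive as the image with the better correspondence quality against the anchor.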
Insights
There aren't really any good results to show yet. SFNet already produces quality warps, so further results can only be shown when the manifold is created and we can produce the geodesic paths.
I was impressed that my first training round (classification) was able to reduce the triplet loss successfully, but without a proper accuracy measure, I can't trust it. This is something I need to work out.
Plans
I am not quite on track with my original plan; I am definitely behind. Luckily, I believe the embedding network will be the majority of the project, with the rest being rather straightforward.
I will need to devote some more time to quantifying the success of the embedding network. Its main purpose is to produce a meaningful series of images between the source and target image, so I plan to incorporate visualizations as a measure of quality. For this reason, I will have to rework some of the plans so that the visualization tooling is ready before I can fully use the embedding net.
I am hoping that I will have time to retrain SFNet to use a larger feature space. Currently, it uses a 20x20 feature space to represent the flow over the entire image. The training objective here would be to increase the feature space (to 32x32 or 40x40) and retrain in order to produce smoother flow fields, which are better suited for geodesic flow (smaller-variation warps).