Introduction:
As humans turn their sights toward worlds beyond Earth, understanding the landscapes of other planets will play a crucial role in what we learn about our planetary neighbors and in our prospects for future missions to them. NASA has provided 73,031 Martian landmark images from the High Resolution Imaging Science Experiment (HiRISE). While some of the images are labeled, many are unclassified but thought to be of importance. Given this data, we are interested in developing a deep learning-based clustering model that generates class labels for these unclassified images.
Challenges:
The k-means clustering algorithm that we use as a baseline for our model is rather weak. Although it will certainly provide a better-than-random comparison point for our performance metrics, a weak baseline makes it hard to judge how meaningful any improvement over k-means really is. In addition, the performance of the deep adaptive clustering model will vary strongly with the cluster thresholds and other hyperparameters. Given the seemingly arbitrary nature of the initial clusters and the high computational cost, choosing these parameters will be challenging.
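To illustrate why the thresholds matter, here is a minimal sketch of threshold-based pair selection in the spirit of deep adaptive clustering. It is not our model code: the function name `select_pairs`, the threshold values, and the random stand-in outputs are all assumptions made for illustration. Pairs whose cosine similarity is above the upper threshold are treated as "same cluster", pairs below the lower threshold as "different cluster", and everything in between is ignored.

```python
import numpy as np

def select_pairs(label_features, upper, lower):
    """Return boolean masks (similar, dissimilar) over all sample pairs.

    Hypothetical helper: pairs with cosine similarity above `upper` are
    treated as similar, pairs below `lower` as dissimilar, and the rest
    are left out of training.
    """
    # L2-normalise rows so the dot product equals the cosine similarity.
    norms = np.linalg.norm(label_features, axis=1, keepdims=True)
    unit = label_features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    return sim > upper, sim < lower

rng = np.random.default_rng(0)
# Stand-in for network outputs: softmax scores for 6 images over 3 clusters.
logits = rng.normal(size=(6, 3))
features = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Two illustrative (assumed) threshold settings: strict and relaxed.
for upper, lower in [(0.95, 0.45), (0.80, 0.60)]:
    similar, dissimilar = select_pairs(features, upper, lower)
    used = int(similar.sum() + dissimilar.sum())
    print(f"upper={upper}, lower={lower}: {used} of {similar.size} pairs selected")
```

Moving the two thresholds closer together selects more pairs, but with less confident pseudo-labels, which is one reason the choice is hard to make up front.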
Insights:
So far, we have extracted parts of our data set to gain a better understanding of how it is formatted and any modifications we might need to make for it to work with our model. Although the preprocessing stage is not done yet, we have laid most of the groundwork for completing that part.
We have also run k-means on part of our data set, which gives us a better idea of our minimum performance target and of the number of clusters that best fits the unlabeled data. For this, we implemented a feature extractor using MobileNet followed by PCA, which reduces the features to 10 dimensions before k-means is run. We evaluate with the silhouette score, which ranges from -1 to 1: values near -1 indicate that points are assigned to the wrong clusters, values near 0 indicate overlapping clusters, and values near 1 indicate compact, well-separated clusters. On a small sample of ~800 unlabeled images, the score for most values of k falls between 0.2 and 0.3, indicating a difficult task ahead. The highest score occurs at k=3, suggesting three broadly defined clusters. This basic intuition will be quite helpful in guiding our full model implementation using deep adaptive clustering.
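The baseline described above can be sketched roughly as follows, assuming scikit-learn is available. This is not our exact code: the MobileNet embeddings are replaced by synthetic feature vectors so that the example runs on its own, and the helper name `silhouette_by_k`, the feature dimension, and the range of k are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def silhouette_by_k(features, k_values, n_components=10, seed=0):
    """Reduce features with PCA, run k-means for each k, return {k: score}."""
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
        scores[k] = silhouette_score(reduced, labels)
    return scores

rng = np.random.default_rng(0)
# Synthetic stand-in for image embeddings (assumption): three well-separated
# blobs of 100 points each in a 64-dimensional space.
centers = rng.normal(scale=3.0, size=(3, 64))
features = np.vstack([c + rng.normal(size=(100, 64)) for c in centers])

scores = silhouette_by_k(features, range(2, 7))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

The synthetic blobs are far cleaner than real HiRISE features, so the scores printed here will be much higher than the 0.2 to 0.3 we observe on actual images. Only the shape of the pipeline carries over.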
Plan:
As of now, we believe that we are on track with our project. We plan to stick to our existing project plan, in which we compare our model against a baseline k-means clustering algorithm to establish a minimum performance threshold. Once our model meets that threshold on metrics such as the silhouette score, we will examine the resulting clusters qualitatively to judge how effective they are.
While we are currently working on our project on local machines, we are likely going to move our work to a cloud provider like Google Cloud Platform or AWS once we have some concrete results. Currently, we plan to devote more time to laying the groundwork for an easy transition to cloud computing once our model is ready.