Probing Relational Reasoning in Pretrained ViTs

Project Idea: Short Description (100-500 words) Pretrained vision models such as DINO (self-supervised) and CLIP (trained with language supervision) have been successful at learning general visual representations that encode rich image features. These representations have been shown to be broadly useful for tasks such as object detection, depth estimation, and classification. However, it is less clear how much information these embeddings carry about spatial relationships between objects within an image. This spatial context is important to evaluate because it has implications for how the embeddings should be used downstream. We propose to investigate the presence of spatial understanding in DINO and CLIP by training linear probes on embeddings from multiple layers, using datasets such as SpatialSense and VRD. Probing across model layers lets us see where these capabilities emerge and whether specific layers are more informative than others. Analyzing both DINO and CLIP lets us test whether the semantic alignment of CLIP's visual embeddings enables them to perform better on spatial tasks.
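
As a rough illustration of the layer-wise probing idea, the sketch below trains one linear probe (binary logistic regression) per layer and compares held-out accuracy. It is a minimal sketch under stated assumptions, not the planned pipeline: the features are synthetic stand-ins rather than real DINO or CLIP embeddings, the layer names `early_layer` and `late_layer` and all helper functions are hypothetical, and the binary label stands in for a spatial relation such as "left of" versus "right of".

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_layer_features(labels, signal, dim, rng):
    # Synthetic stand-in for frozen embeddings from one layer (an assumption):
    # Gaussian noise, with the label shifting the first coordinate by +/- signal.
    feats = []
    for y in labels:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        feats.append(x)
    return feats


def train_linear_probe(feats, labels, epochs=200, lr=0.5):
    # Binary logistic regression fit by full-batch gradient descent.
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, feats, labels):
    # Fraction of examples where the sign of the linear score matches the label.
    correct = 0
    for x, y in zip(feats, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += pred == y
    return correct / len(labels)


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(600)]
# Hypothetical layers: the "early" one carries no label signal, the "late" one does.
layer_signal = {"early_layer": 0.0, "late_layer": 2.0}

results = {}
for name, signal in layer_signal.items():
    feats = make_layer_features(labels, signal, dim=8, rng=rng)
    # The first 400 examples train the probe; the last 200 are held out.
    w, b = train_linear_probe(feats[:400], labels[:400])
    results[name] = probe_accuracy(w, b, feats[400:], labels[400:])

for name, acc in results.items():
    print(f"{name}: held-out probe accuracy = {acc:.2f}")
```

In the actual project, the synthetic features would be replaced by embeddings extracted from frozen model layers, and the probe would stay deliberately simple so that high accuracy reflects information already present in the embedding rather than capacity added by the probe.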

What are some key limitations you anticipate facing when working on this project? (100-500 words) It will take some time to understand the DINO and CLIP architectures and to decide where it makes sense to intercept embeddings for probing. Additionally, although comparing DINO and CLIP probing results is an appealing way to understand how semantic alignment affects embeddings, setting up a comparison that is actually meaningful will take planning, since the two models differ in architecture and embedding dimension. Access to compute is another limitation. Although we expect embedding extraction and probe training to be relatively lightweight, they will still require GPU resources, so we will need to weigh the size of the probing datasets and which size variant of each vision foundation model to use.

Project Data Ideas (attach links)
DINO: https://arxiv.org/pdf/2104.14294
CLIP: https://openai.com/index/clip/
SpatialSense: https://arxiv.org/abs/1908.02660
VRD: https://www.kaggle.com/datasets/apoorvshekher/visual-relationship-detection-vrd-dataset

Team name: Team 2D