Inspiration

A paper: Do Vision and Language Encoders Represent the World Similarly?

What it does

My tool assesses how well video content matches a textual description. It embeds sampled video frames, embeds the provided text, and quantifies the alignment between the two representations. This is useful for applications such as content verification, where video clips must accurately reflect their associated captions or descriptions.
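The frame-vs-text scoring step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the frames and the caption have already been embedded (e.g. by separate vision and text encoders) into vectors of the same dimension, and simply averages the per-frame cosine similarity against the text embedding.

```python
import numpy as np

def video_text_alignment(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean cosine similarity between each frame embedding and the text embedding.

    frame_embs: array of shape (n_frames, d) -- one embedding per sampled frame
    text_emb:   array of shape (d,)          -- embedding of the caption
    """
    # L2-normalize frames and text so dot products become cosine similarities
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    # Average similarity across frames gives one alignment score for the clip
    return float((frames @ text).mean())
```

A clip whose frames all point in the same direction as the caption embedding scores near 1; unrelated content scores near 0 (or below, depending on the encoders).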

How we built it

My tool uses Centered Kernel Alignment (CKA) and cosine similarity to measure the relationship between image and text embeddings. Because the two encoders produce embeddings of different dimensions, it trains a linear transformation that maps image embeddings into the text embedding space, enabling a direct comparison. This makes it possible to evaluate how closely images correspond to their textual descriptions, which matters in tasks where congruence between visual content and text is crucial, such as automated captioning systems or multimedia content indexing.
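The pieces above can be sketched with numpy. This is an illustrative sketch rather than the project's implementation: it uses the standard linear-CKA formula, fits the linear map by least squares (one simple way to train such a transformation; the project may use gradient descent instead), and the array shapes are assumptions.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two sets of features, shapes (n, d1) and (n, d2)."""
    # Center each feature dimension
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # CKA = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

def fit_linear_map(X_img: np.ndarray, Y_txt: np.ndarray) -> np.ndarray:
    """Least-squares W of shape (d_img, d_txt) so that X_img @ W ≈ Y_txt."""
    W, *_ = np.linalg.lstsq(X_img, Y_txt, rcond=None)
    return W

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With a fitted `W`, `cosine_sim(x_img @ W, y_txt)` compares a projected image embedding against a text embedding, while `linear_cka` compares the two representation spaces as a whole (it is invariant to such linear transformations, which is why it can be computed without the projection step).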

Challenges we ran into

Accomplishments that we're proud of

What we learned

What's next for Cosine Sim and CKA for text-video comparison

Built With
