Inspiration

Bill - "Blindness is a major problem today and we hope to have a solution that takes a step in solving this"

George - "I like engineering"

We hope our tool makes a nonzero contribution to society.

What it does

SenseSight generates a description of a scene and reads it aloud for visually impaired users. It leverages CLIP, recent research advances, and our own contributions to take a stab at the unsolved generalized object detection problem, i.e. object detection without training labels.

How we built it

SenseSight consists of three modules: recorder, CLIP engine, and text2speech.

Pipeline Overview

Once the user presses the button, the recorder captures a short clip and sends it to the compute cluster server. The server runs a temporally representative video frame through the CLIP engine, our novel pipeline that emulates human sight to generate a scene description. Finally, the generated description is sent back to the user side, where the text is converted to audio and read aloud.
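
A minimal sketch of this control flow, assuming placeholder interfaces for the three modules (recorder, CLIP engine, text2speech) and a simple middle-frame heuristic standing in for however the representative frame is actually chosen; the names here are illustrative, not the actual SenseSight code:

```python
import cv2  # only used to pull a representative frame from the clip

def pick_representative_frame(video_path: str):
    """Grab the middle frame of the clip as a cheap stand-in for a
    'temporally representative' frame."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    return frame

def run_pipeline(recorder, clip_engine, tts):
    # 1. Recorder (user side): capture a short clip when the button is pressed.
    video_path = recorder.capture_clip(seconds=3)
    # 2. CLIP engine (compute cluster): describe a representative frame.
    frame = pick_representative_frame(video_path)
    description = clip_engine.describe(frame)
    # 3. text2speech (user side): read the generated description aloud.
    tts.speak(description)
    return description
```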

CLIP

CLIP is a model proposed by OpenAI that maps images to embeddings via an image encoder and text to embeddings via a text encoder. Similar (image, text) pairs have a higher dot product between their embeddings.
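
For illustration, here is how that similarity can be computed with the openai/CLIP package (the image path and candidate captions are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: one image and a few candidate text descriptions.
image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a dog on a couch", "a plate of food", "a city street"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize so the dot product behaves like cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T  # higher = better match

print(similarity)
```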

Image captioning with CLIP

We can map image embeddings to text embeddings via a simple MLP (since image -> text can be thought of as lossy compression). The mapped embedding is fed as a prefix into a transformer decoder (GPT-2) that is fine-tuned to produce text. We call this component the CLIP text decoder.
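
A rough sketch of such a decoder in the spirit of ClipCap: a small MLP maps the CLIP image embedding to a prefix of GPT-2 input embeddings, and GPT-2 decodes the caption from that prefix. The dimensions and prefix length below are assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ClipPrefixCaptioner(nn.Module):
    """Map a CLIP image embedding to a prefix of GPT-2 embeddings (ClipCap-style)."""

    def __init__(self, clip_dim=512, prefix_len=10, gpt2_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(gpt2_name)
        gpt2_dim = self.gpt2.config.n_embd
        self.prefix_len = prefix_len
        # Simple MLP: CLIP embedding -> prefix_len GPT-2 token embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt2_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt2_dim * prefix_len // 2, gpt2_dim * prefix_len),
        )

    def forward(self, clip_embedding):                     # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)                  # (batch, prefix_len * gpt2_dim)
        prefix = prefix.view(-1, self.prefix_len, self.gpt2.config.n_embd)
        # Feed the prefix as input embeddings; GPT-2 is fine-tuned so that
        # decoding from this prefix yields a caption for the image.
        return self.gpt2(inputs_embeds=prefix)
```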

Recognition of Key Image Areas

The issue with captioning the full input image is that a scene is composed of many smaller images. The CLIP text decoder is trained only on images with a single subject (e.g. ImageNet / MS COCO images), so we need to extract crops of the objects in the scene and then apply the CLIP text decoder to each crop. This task is called generalized object detection.

Generalized object detection is unsolved: most object detection requires training with labels. We propose a viable approach. We sample crops in the scene, just like how human eyes dart around their view. We evaluate the fidelity of each crop, i.e. how much information (objects) it contains, by embedding the crop with CLIP and searching a database of text embeddings. The database is composed of noun phrases that we extracted. Because the database can be huge, we rely on ScaNN (Google Research), a library for fast, machine-learning-based vector similarity search.
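
A sketch of this scoring step, assuming a precomputed, L2-normalized array of CLIP text embeddings for the noun phrases; the file path is a placeholder and the ScaNN index parameters are copied from the library's example configuration, so they would need tuning:

```python
import numpy as np
import scann  # Google Research's ScaNN library

# Assumed placeholder: (N, 512) float32 CLIP text embeddings of our noun phrases.
phrase_embeddings = np.load("noun_phrase_clip_embeddings.npy")

searcher = (
    scann.scann_ops_pybind.builder(phrase_embeddings, 10, "dot_product")
    .tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

def crop_fidelity(crop_embedding: np.ndarray) -> float:
    """Score a crop by how strongly its CLIP embedding matches any noun phrase.
    A low best-match similarity suggests an uninformative crop."""
    neighbors, similarities = searcher.search(crop_embedding)
    return float(similarities.max())
```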

We then filter out all subpar crops. From the remaining crops, we select a set using an algorithm that tries to maximize the spatial coverage of k crops: we sample many sets of k crops and keep the set with the highest all-pairs distance.
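
A minimal version of this selection step, assuming crops are (x, y, w, h) boxes and using the distance between crop centers:

```python
import random
import itertools
import math

def select_spread_out_crops(crops, k=5, num_trials=200):
    """Sample candidate sets of k crops and keep the one whose crop centers
    have the largest sum of pairwise distances (a proxy for spatial coverage)."""
    def center(box):
        x, y, w, h = box
        return (x + w / 2, y + h / 2)

    best_set, best_score = None, -1.0
    for _ in range(num_trials):
        candidate = random.sample(crops, k)
        score = sum(
            math.dist(center(a), center(b))
            for a, b in itertools.combinations(candidate, 2)
        )
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set
```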

Challenges we ran into

The hackathon went smoothly, except for the minor inconvenience of getting the server + user side to run in sync.

Accomplishments that we're proud of

Our platform replicates the human visual process with decent results. A key subproblem is generalized object detection, for which we proposed an approach based on CLIP embeddings and fast vector similarity search. We also got the hardware, local client, server (machine learning models on the MIT cluster), and remote APIs to work in sync.

What's next for SenseSight

A better CLIP text decoder: crops tend to generate redundant sentences, so additional pruning is needed. We could use GPT-3 to remove the redundancy and make the speech flow more naturally.

Real-time performance can be achieved by using proper networking protocols instead of scp + time.sleep hacks. To accelerate inference on the crops, we can use multiple GPUs.
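
For example, the scp + time.sleep handoff could be replaced with a plain length-prefixed TCP socket (a sketch only; the port number and framing are arbitrary choices, not part of the current system):

```python
import socket
import struct

def send_text(host: str, text: str, port: int = 9000) -> None:
    """Send one length-prefixed UTF-8 message over TCP."""
    payload = text.encode("utf-8")
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exactly(conn: socket.socket, n: int) -> bytes:
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        data += chunk
    return data

def receive_text(port: int = 9000) -> str:
    """Block until one length-prefixed UTF-8 message arrives."""
    with socket.create_server(("", port)) as server:
        conn, _ = server.accept()
        with conn:
            (length,) = struct.unpack("!I", _recv_exactly(conn, 4))
            return _recv_exactly(conn, length).decode("utf-8")
```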

Fun Fact

The logo is generated by DALL-E :p
