Inspiration
We wanted to make visual content accessible to visually impaired users by describing images aloud.
What it does
The model generates an audio description from an image: it first converts the image to a text caption, then converts that caption to speech.
How we built it
We used OpenAI's GPT-4o model to generate a caption for the image, then used an ESPnet text-to-speech model to convert the caption into audio.
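The two-stage pipeline could be sketched in Python roughly as below. This is a hypothetical illustration, not the project's actual code: the function names, the captioning prompt, and the ESPnet model tag (`kan-bayashi/ljspeech_vits`) are all assumptions, and running it requires the `openai`, `espnet2`, and `soundfile` packages plus a valid `OPENAI_API_KEY`.

```python
# Hypothetical sketch of the image -> caption -> audio pipeline.
# Assumes the OpenAI Python SDK and ESPnet2 are installed; names are illustrative.
import base64


def encode_image(path):
    # OpenAI's vision input accepts base64-encoded image data
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_caption_request(image_b64):
    # Chat message payload sent to GPT-4o: a text instruction plus the image
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image for a visually impaired listener."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]


def caption_image(client, image_path):
    # Stage 1: GPT-4o generates a caption (requires an API key)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_caption_request(encode_image(image_path)),
    )
    return response.choices[0].message.content


def speak(caption, wav_path="caption.wav"):
    # Stage 2: ESPnet2 text-to-speech; the pretrained model tag is an assumption
    from espnet2.bin.tts_inference import Text2Speech
    import soundfile as sf
    tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")
    sf.write(wav_path, tts(caption)["wav"].numpy(), tts.fs)
```

Keeping the two stages as separate functions makes it easy to swap either model, e.g. to test different TTS voices without re-captioning.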
Challenges we ran into
Getting access to a valid API key for OpenAI models was challenging.
Accomplishments that we're proud of
Our model successfully identified and described the Statue of Liberty in an image where it appeared very small.
What we learned
We learned how to access the OpenAI API and how to fine-tune large language models.
What's next for Audible Frames
We envision smart glasses that capture images and provide real-time auditory descriptions to assist visually impaired individuals.
Built With
- espnet
- gpt
- openai
- python