Inspiration

I am a visual learner. Learning with visuals and images lets people like me grasp concepts faster. EYES is a tool that generates storybook videos from audio to help visual learners understand concepts and ideas.

What it does

Introducing EYES, a tool designed to convert speech into storybook videos. EYES takes an audio file and automatically adds captions and AI-generated background images that correspond to what is being said. It then edits everything together and adds visually engaging image animations.

How we built it

EYES takes in an audio file through a Tkinter user interface and uses Whisper, an open source speech recognition model, to convert the audio into text. The text is then split into short segments for captions. Each caption segment is mapped to its corresponding location in the audio file using aeneas, an open source forced alignment tool. The original text is also split into larger segments. Each of these larger segments is sent to the open source LLM Mistral-7B, which is prompted to convert the sentence fragment into a vivid image description. Each description is sent to another program that uses Stable Diffusion XL to generate an image. These images are then sent to Mapling, an open source software for image animations. Finally, everything is edited together using MoviePy.

Challenges we ran into

Designing an algorithm to efficiently use multiple GPUs for image generation instead of just one. Finding AI models that fit on my system. Developing the captioning system: the original idea was to break the audio file into segments and link each one to a segment of text, but the result was too choppy.
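One simple way to spread image generation over several GPUs is to deal the prompts out round-robin. The sketch below shows that idea only; it is not the algorithm EYES ended up with, and the function name and the use of integer GPU ids are assumptions for illustration.

```python
def assign_prompts(prompts, num_gpus):
    """Distribute prompts across GPUs in round-robin order.

    Returns a dict mapping a GPU id (0..num_gpus-1) to a list of
    (original_index, prompt) pairs. The index is kept so the generated
    images can be put back in story order afterwards.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    batches = {gpu: [] for gpu in range(num_gpus)}
    for index, prompt in enumerate(prompts):
        batches[index % num_gpus].append((index, prompt))
    return batches


prompts = ["a fox at dawn", "a winding river", "a calm sea", "a red sunset", "a night sky"]
batches = assign_prompts(prompts, num_gpus=2)
```

Each batch could then be handed to a separate worker process that loads the image model on its own device. Round-robin assumes every prompt takes about the same time to render; a shared work queue would balance better if that is not the case.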

What we learned

Spend a little less time on software development and a little more time on the submission.

What's next for EYES: Audio to Storybook Video generation

Fine-tuned image and description models for more consistent results. A better user interface. Applying EYES to audiobooks to generate a "movie" for each book.
