Inspiration

Time is a-tickin' and ain't nobody got time to take down notes by hand and organise them into an infographic these days

What it does

It converts what you are saying in real time to text, summarises the transcribed text, extracts keywords from it, and uses those keywords to generate an image. The generated image is then combined with the summarised text to create an infographic.

How we built it

  1. Used AssemblyAI's real-time transcription API to generate a transcription of what the user says
  2. The generated text is then saved locally
  3. The saved transcription is run through KeyBERT to extract keywords
  4. A summarised version of the text is generated using an all-MiniLM-L6-v2 model
  5. The keywords are fed into a custom Stable Diffusion model to generate an image
  6. The generated image and summarised text are combined into one image using PIL
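To give a feel for step 4, here is a toy extractive summariser written with only the standard library: it scores sentences by word frequency and keeps the top-scoring ones in their original order. This is a stdlib stand-in for illustration only; the project itself uses embeddings from all-MiniLM-L6-v2 rather than raw frequency counts, and the function name and parameters are hypothetical.

```python
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 2) -> str:
    """Toy extractive summariser (frequency-based stand-in for the
    embedding-based all-MiniLM-L6-v2 approach the project uses)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Global word frequencies over the whole transcript.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        # Average frequency of the words in this sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Keep the n highest-scoring sentences, preserving original order.
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)
```

The idea is the same as the embedding version: rank sentences by how representative they are of the whole transcript, then stitch the winners back together in reading order.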
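Step 6 can be sketched with Pillow: stack the generated image on top of a white panel and draw the wrapped summary text into it. The function name, paths, and panel size below are illustrative assumptions, not the project's actual values.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont

def compose_infographic(image_path: str, summary: str, out_path: str,
                        panel_height: int = 200) -> None:
    """Hypothetical sketch of step 6: paste the generated image onto a
    taller white canvas, then render the summary in the panel below it."""
    art = Image.open(image_path).convert("RGB")
    canvas = Image.new("RGB", (art.width, art.height + panel_height), "white")
    canvas.paste(art, (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    # Rough character-count wrap so the text stays inside the image width.
    wrapped = textwrap.fill(summary, width=max(20, art.width // 7))
    draw.text((10, art.height + 10), wrapped, fill="black", font=font)
    canvas.save(out_path)
```

A real version would pick a proportional font size and measure the text to size the panel, but the paste-then-draw pattern is the core of combining the two pieces into one infographic image.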

Challenges we ran into

As a solo entrant, I was unfamiliar with Streamlit and WebSockets, which led to "too many active streams" errors while using AssemblyAI's real-time transcription API, and I was unable to split or delegate the work as I did not have a team.

Accomplishments that we're proud of

Used multiple SOTA models and integrated them into a Streamlit app that can run locally on a laptop with at least 6 GB of GPU VRAM.

What we learned

I should enter with a team next time.

What's next for Realtime Multimodal Speech to ImageGenerator 4 Infographics

Polishing and refinement

Video links: https://youtu.be/Yhiq_EVyBcg and https://youtu.be/Kg7bUWykPLk
