Inspiration
Time's a-tickin', and nobody has time these days to take notes by hand and organise them into an infographic.
What it does
It converts what you are saying in real time to text, then summarises the transcribed text, extracts keywords from it, and uses those keywords to generate an image. The generated image is then combined with the summarised text to create an infographic.
How we built it
- Used AssemblyAI's real-time transcription to generate a transcript of what the user says
- The generated text is then saved locally
- The saved transcript is run through KeyBERT to extract keywords
- A summarised version of the text is generated using the all-MiniLM-L6-v2 model
- The keywords are then fed into a custom Stable Diffusion model to generate an image
- The generated image and summarised text are then combined into one image using PIL
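The final compositing step can be sketched with Pillow (PIL). This is a minimal illustration, not the project's actual code: the function name `make_infographic` and the layout (image stacked above a white text panel) are assumptions, and a placeholder image stands in for the Stable Diffusion output.

```python
# Hypothetical sketch of the last pipeline step: combine the generated
# image and the summarised text into one infographic using Pillow.
from PIL import Image, ImageDraw

def make_infographic(generated: Image.Image, summary: str,
                     text_height: int = 120) -> Image.Image:
    """Stack the generated image above a white panel holding the summary."""
    w, h = generated.size
    canvas = Image.new("RGB", (w, h + text_height), "white")
    canvas.paste(generated, (0, 0))           # image on top
    draw = ImageDraw.Draw(canvas)
    # Default bitmap font; a TTF could be loaded for nicer typography.
    draw.text((10, h + 10), summary, fill="black")
    return canvas

# Placeholder standing in for the Stable Diffusion output:
fake_generated = Image.new("RGB", (512, 512), "lightblue")
infographic = make_infographic(fake_generated, "Summary of the talk ...")
```

In the real app the `generated` image would come from the diffusion model and `summary` from the summarisation step.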
Challenges we ran into
As a solo entrant, I was unfamiliar with Streamlit and websockets, which led to "too many active streams" errors while using AssemblyAI's real-time transcription API, and I had no teammates to split or delegate the work to.
Accomplishments that we're proud of
Used multiple SOTA models and integrated them into a Streamlit app that runs locally on a laptop with at least 6 GB of GPU VRAM.
What we learned
I should enter with a team next time
What's next for Realtime Multimodal Speech to ImageGenerator 4 Infographics
Polishing and refinement
Video links: https://youtu.be/Yhiq_EVyBcg and https://youtu.be/Kg7bUWykPLk
Built With
- assemblyai
- bert
- natural-language-processing
- python
- pytorch
- stablediffusion
- streamlit