Inspiration
Time's a-tickin', and nobody has time these days to take notes by hand and organise them into an infographic.
What it does
It converts what you are saying in real time to text, then summarises the transcribed text, extracts keywords from it, and uses those keywords to generate an image. The generated image is then combined with the summarised text to create an infographic.
How we built it
- Used AssemblyAI's real-time transcription to generate a transcript of what the user says
- The generated text is then saved locally
- The saved transcript is run through KeyBERT to extract keywords
- A summarised version of the text is generated using the all-MiniLM-L6-v2 model
- The keywords are then fed into a custom Stable Diffusion model to generate an image
- The generated image and summarised text are then combined into one image using PIL
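The final compositing step can be sketched with Pillow (PIL). This is a minimal illustration, not the project's actual code: the function name `make_infographic` and the layout (image stacked above a white text panel) are assumptions, and a placeholder image stands in for the Stable Diffusion output.

```python
# Hypothetical sketch of the last pipeline step: combine the generated
# image and the summarised text into one infographic using Pillow.
from PIL import Image, ImageDraw

def make_infographic(generated: Image.Image, summary: str,
                     text_height: int = 120) -> Image.Image:
    """Stack the generated image above a white panel holding the summary."""
    w, h = generated.size
    canvas = Image.new("RGB", (w, h + text_height), "white")
    canvas.paste(generated, (0, 0))           # image on top
    draw = ImageDraw.Draw(canvas)
    # Default bitmap font; a TTF could be loaded for nicer typography.
    draw.text((10, h + 10), summary, fill="black")
    return canvas

# Placeholder standing in for the Stable Diffusion output:
fake_generated = Image.new("RGB", (512, 512), "lightblue")
infographic = make_infographic(fake_generated, "Summary of the talk ...")
```

In the real app the `generated` image would come from the diffusion model and `summary` from the summarisation step.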
Challenges we ran into
As a solo entrant, I was unfamiliar with Streamlit and websockets, which led to "too many active streams" errors while using AssemblyAI's real-time transcription API, and I had no teammates to split or delegate the work to.
Accomplishments that we're proud of
Used multiple SOTA models and integrated them into a Streamlit app that runs locally on a laptop with at least 6 GB of GPU VRAM.
What we learned
I should enter with a team next time
What's next for Realtime Multimodal Speech to ImageGenerator 4 Infographics
Polishing and refinement
Video links: https://youtu.be/Yhiq_EVyBcg and https://youtu.be/Kg7bUWykPLk
Built With
- assemblyai
- bert
- natural-language-processing
- python
- pytorch
- stablediffusion
- streamlit