Inspiration

This project was inspired by the paper Zero-Shot Text-to-Image Generation from OpenAI, which introduces DALL·E, an autoregressive transformer trained to model text and image tokens as a single stream, where each image is paired with a natural language description.

What it does

Using the AssemblyAI audio transcription API, we reproduce elements of the zero-shot capabilities presented in the DALL·E paper in real time, with a much less complex model. Our model is trained on a summarized version of the training data, following the meta-dataset approach described in the paper Less is More: Summary of Long Instructions is Better for Program Synthesis.

This is only possible because of two strengths of the AssemblyAI API: a robust feature set that integrates seamlessly with our machine learning model and web interface framework, and corrective language modeling that repairs malformed input and isolates sentences as they are spoken.

How we built it

There are two major components to this project:

  1. Real-time audio transcription
    • Interfaces with the AssemblyAI API
    • The client-side interface was built with HTML and CSS.
    • The server runs on Node.js using Express
      • When a button press is detected on the client side, the server opens a connection to the AssemblyAI real-time API (a sketch of this flow appears after this list)
  2. Text to Image generation
    • Pretrained models are deployed on the local machine, and a PyTorch model runs in parallel with the client and server
    • Once an asynchronous connection to the AssemblyAI API is established, Selenium passes messages between the client and the pretrained model as the API returns text transcribed from the audio data
    • Selenium updates the image on the client side while the model also saves images to the local drive (see the second sketch below)
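
Our server performs this handshake in Node.js, but the flow is easiest to show compactly in Python. This is a minimal sketch, assuming the AssemblyAI v2 real-time WebSocket endpoint and the `websockets` and `pyaudio` packages; the API key, buffer size, and the hand-off to the image model are placeholders.

```python
# Sketch: stream microphone audio to AssemblyAI and collect transcripts.
import asyncio, base64, json
import pyaudio
import websockets

API_KEY = "YOUR_ASSEMBLYAI_KEY"  # placeholder
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"

async def transcribe():
    mic = pyaudio.PyAudio().open(
        format=pyaudio.paInt16, channels=1, rate=16000,
        input=True, frames_per_buffer=3200)
    # `extra_headers` is the keyword used by older `websockets` releases;
    # newer ones call it `additional_headers`.
    async with websockets.connect(
            URL, extra_headers={"Authorization": API_KEY}) as ws:
        await ws.recv()  # wait for the session-begins message

        async def send_audio():
            while True:
                data = mic.read(3200, exception_on_overflow=False)
                await ws.send(json.dumps(
                    {"audio_data": base64.b64encode(data).decode()}))
                await asyncio.sleep(0.01)

        async def receive_text():
            async for message in ws:
                result = json.loads(message)
                # Final transcripts arrive as complete sentences thanks
                # to the API's sentence isolation.
                if result.get("message_type") == "FinalTranscript":
                    print(result["text"])  # hand off to the image model

        await asyncio.gather(send_audio(), receive_text())

asyncio.run(transcribe())
```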
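The generation side then consumes each final transcript. Below is a sketch of that loop; `generate_image` is a stand-in for our pretrained PyTorch model (here it just saves a blank canvas so the example runs end to end), and the page URL, port, and `output-img` element id are assumptions.

```python
# Sketch: turn each final transcript into an image and push it to the page.
import os, time
from PIL import Image
from selenium import webdriver

os.makedirs("out", exist_ok=True)

def generate_image(prompt: str) -> str:
    """Stand-in for the pretrained PyTorch model's forward pass.

    The real model maps `prompt` to pixels; here we save a blank
    canvas so the loop is runnable.
    """
    path = f"out/{int(time.time())}.png"
    Image.new("RGB", (256, 256)).save(path)
    return path

# Drive the client page served by the Express server (port assumed).
driver = webdriver.Chrome()
driver.get("http://localhost:3000")

def on_final_transcript(text: str) -> None:
    # Called once per final transcript from the AssemblyAI stream.
    path = generate_image(text)  # the image is also saved to disk
    # Swap the displayed image on the client side via injected JS;
    # "output-img" is a hypothetical element id.
    driver.execute_script(
        "document.getElementById('output-img').src = arguments[0];",
        os.path.abspath(path))
```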

Challenges we ran into

  • The language model has to be small enough to run on our laptops without a GPU
  • The round trip from the front end to the server, to the model, and back to the front end has to be seamless and operate at or near real time

Accomplishments that we're proud of

  • Designing and implementing an application in a short amount of time
  • Surprisingly good zero-shot results, considering the size of the training set and the incoherence of naturally spoken language

What we learned

  • Audio transcription tools are getting more impressive all the time
  • Producing results comparable to the state of the art with significantly less data suggests that the larger models are trained on some amount of superfluous data
  • Changing the pretraining paradigm can make a big difference

What's next for Speech-to-Image Generation

  • Using a knowledge graph such as ATOMIC as a commonsense basis to check the semantic correctness of the object associations in the image during the generation phase, in order to produce better images
  • Associating natural language with a series of images, forming a vector that visually describes a sequence of events written in text
    • Generating a video from the resulting image vector