Inspiration: Standup meetings can be hell at worst and a slight waste of time at best. What if you could fully automate yourself out of the process - and still get a transcript of the resulting call?
How we built it: We first take a picture of the subject we are automating. To capture the other party's speech, we use virtual audio cabling to route the call's incoming audio into our microphone input, which our local instance then passes to GPT's realtime speech recognition. The conversation itself is powered by a GPT-3 model fine-tuned on our own prompts to generate realistic video-call responses. Each generated sentence is sent to our speech synthesizer, a real-time implementation of SV2TTS. The lifelike, personalized audio files then go to our inference implementation of Audio2Head, which generates a video of our subject lip-syncing (deepfaking) the conversation in the audio .wav file. Finally, we stream that video via OBS into an active Google Meet call.
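The loop above can be sketched as a chain of stages. This is a minimal illustration, not our actual code: every function body here is a stub, and all names (`transcribe`, `generate_reply`, `synthesize`, `animate`) are placeholders for the real speech-recognition, GPT-3, SV2TTS, and Audio2Head calls.

```python
# Hypothetical sketch of the pipeline: each stage is a stub standing in
# for the real component named in its docstring.

def transcribe(audio_chunk: bytes) -> str:
    """Stub: would call the realtime speech-recognition service."""
    return "How was your sprint?"

def generate_reply(transcript: str, history: list) -> str:
    """Stub: would query the fine-tuned language model with the running history."""
    return "Yesterday I closed two tickets; no blockers today."

def synthesize(text: str) -> bytes:
    """Stub: would run SV2TTS to produce a .wav in the subject's voice."""
    return b"RIFF"

def animate(wav: bytes, subject_image: bytes) -> str:
    """Stub: would run Audio2Head inference and return a video path for OBS."""
    return "/tmp/reply.mp4"

def respond(audio_chunk: bytes, subject_image: bytes, history: list) -> str:
    """One turn of the conversation: hear, think, speak, lip-sync."""
    transcript = transcribe(audio_chunk)
    history.append(transcript)
    reply = generate_reply(transcript, history)
    history.append(reply)
    wav = synthesize(reply)
    return animate(wav, subject_image)
```

In the real system the returned video is what OBS picks up and streams into the call; the history list is what becomes the transcript at the end.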
- Infrastructure hell: provisioning a local compute environment with 24 GB of GPU RAM, setting up CUDA, PyTorch, and FFmpeg, compiling CPython for an outdated Python version, etc.
- Timing the audio conversation: detecting when the other party has finished speaking, so we know when to stop recording and respond
- Connecting all the pieces together was a pain
- Performance: the full pipeline needs about 1.5 s of inference per second of generated output, so we spent time figuring out ways to reduce latency
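The end-of-speech problem in the second bullet can be handled with simple energy-based silence detection. This is only a sketch of that idea, not the detector we shipped; the threshold and frame-count values are illustrative tuning knobs.

```python
import math

def rms(samples):
    """Root-mean-square energy of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def speech_finished(frame_energies, threshold=500.0, min_silent=20):
    """Return True once the last `min_silent` frames are all quieter
    than `threshold` - i.e. the speaker has trailed off and it is our
    turn to stop recording and respond. Values are illustrative."""
    if len(frame_energies) < min_silent:
        return False
    return all(e < threshold for e in frame_energies[-min_silent:])
```

In practice each incoming microphone frame would be reduced to `rms(frame)` and appended to a rolling list; a dedicated VAD library would be more robust against background noise, but a threshold like this is enough to cut recording at a pause.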
We're running out of time to finish our story paragraph! More info on our GitHub README.