✨ Inspiration

In the era of online communications, where online meetings are already draining enough [1], st-st-stammering and uh, uhm, filler words seem to creep into every conversation. Filler words—like, uh, um, you know—have been found to lower listeners’ comprehension, interrupt the natural flow of speech, and reduce credibility [2].

Relevate, my project for Hack the North 2023, helps to tackle this issue by using multiple neural networks, one trained on my own synthesized dataset, to automatically perfect oration.


🚀 What it does

Relevate is a desktop app that elevates discussions by removing irrelevant filler words and stuttering in real time. It takes your voice from your microphone and creates a virtual microphone that is free of stutters and filler words. You can use this virtual microphone anywhere you could use a regular microphone as an input source: during chats with friends on Discord, interviews with recruiters on Zoom, or even podcast recordings in Voice Memos!

relevate = relevant and elevated oration 🧩


⚙️ How I built it

[Figure: flowchart]

I built the user-facing GUI with Python and Tkinter, and used PyAudio for audio interfacing. The GUI communicates over WebSockets with the backend server, which is written in Python with FastAPI. The backend runs OpenAI’s state-of-the-art speech-to-text model, Whisper, locally (using the API would be too slow because an entire file would need to be uploaded rather than streamed). Whisper usually omits filler words from its output, so I wrote a custom few-shot prompt to bring them back. To map Whisper’s transcript to actual timestamps in the audio, I used dynamic time warping. At this point the backend has a timestamped transcript of each audio chunk; next, it needs to detect the filler words and stammering in that transcript and link them back to timestamps in the audio.
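
To illustrate the alignment step, here is a minimal, generic sketch of dynamic time warping in pure Python. It is not Relevate's actual code: the toy inputs, the function names `dtw_path` and `token_spans_ms`, and the 20 ms frame length are all assumptions made for illustration. A real pipeline would align model-derived features rather than hand-written numbers.

```python
from typing import Dict, List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (total cost, alignment path) between sequences a and b.

    The path is a list of (index_in_a, index_in_b) pairs, in order.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def token_spans_ms(path: List[Tuple[int, int]], frame_ms: int) -> Dict[int, Tuple[int, int]]:
    """Collapse a DTW path into one (start_ms, end_ms) span per token index."""
    frames: Dict[int, Tuple[int, int]] = {}
    for tok, frame in path:
        first, last = frames.get(tok, (frame, frame))
        frames[tok] = (min(first, frame), max(last, frame))
    return {tok: (f * frame_ms, (l + 1) * frame_ms) for tok, (f, l) in frames.items()}


if __name__ == "__main__":
    # Toy data: three "tokens" and six "audio frames" (made-up feature values).
    tokens = [0, 1, 2]
    audio_frames = [0, 0, 1, 1, 1, 2]
    total, path = dtw_path(tokens, audio_frames)
    print(total, path)
    print(token_spans_ms(path, frame_ms=20))  # 20 ms per frame is an assumption
```

The useful property here is that one token can stretch over several audio frames, so each word ends up with a start and end time even though speech rate varies.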

Detecting filler words and stammering is non-trivial because it is often context-dependent: the word “like” is a filler in “I, like, don’t know” but not in “I like hackathons”. So I trained my own bidirectional transformer (BERT-style) to detect stammering and filler words. Specifically, I trained RoBERTa-large on a token classification task (whether each word should be removed or not), using a dataset that I synthesized myself by taking many sentences from OpenWebText, adding random filler words, and adding stammer to the text. After training, the backend uses this model to find the spans of text that need to be removed and the corresponding timestamps in the audio. Once those segments are cut from the audio in real time, the result is streamed directly to BlackHole, a virtual audio loopback driver, which makes the virtual microphone work.
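
The dataset synthesis described above can be sketched roughly as follows. This is an illustrative guess at the general approach, not the actual script: the filler list, the probabilities, the way stammer is rendered (a broken-off first letter such as "h-"), and the function name `add_disfluencies` are all assumptions. Each output token gets a label, 1 for "remove" and 0 for "keep", which is the shape of data a token classification model expects.

```python
import random
from typing import List, Tuple

# Illustrative filler list (an assumption, not the list used in the project).
FILLERS = ["um", "uh", "like", "you know"]


def add_disfluencies(
    sentence: str,
    rng: random.Random,
    p_filler: float = 0.2,
    p_stammer: float = 0.1,
) -> Tuple[List[str], List[int]]:
    """Insert random fillers and stammer fragments into a clean sentence.

    Returns (tokens, labels) where labels[i] is 1 if tokens[i] was inserted
    (and should be removed) and 0 if it belongs to the original sentence.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for word in sentence.split():
        if rng.random() < p_filler:
            # Multi-word fillers such as "you know" become several tokens.
            for piece in rng.choice(FILLERS).split():
                tokens.append(piece)
                labels.append(1)
        if rng.random() < p_stammer:
            # Hypothetical stammer rendering: a cut-off first letter.
            tokens.append(word[0] + "-")
            labels.append(1)
        tokens.append(word)
        labels.append(0)
    return tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    toks, labs = add_disfluencies("I like hackathons", rng, p_filler=0.5, p_stammer=0.5)
    for tok, lab in zip(toks, labs):
        print(lab, tok)
```

Because an inserted "like" is labeled 1 while a "like" from the original sentence stays 0, data built this way gives the model examples of the same word in both roles, which is the context-dependence the classifier has to learn.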


🚧 Challenges I ran into

  • Latency in machine learning models and existing APIs was a major issue while building this app
    • OpenAI’s Whisper API was not sufficient because it requires uploading an entire file and cannot easily stream audio; hence, I hosted the model locally (where streaming is possible)
  • Real-time audio processing is very error-prone and I ran into lots of difficult-to-debug issues


😁 Accomplishments I’m proud of

I’m extremely proud of being able to complete this project on my own. Dealing with audio processing (especially real-time streaming) was a first for me (and I found it quite challenging), so I’m pleased about being able to learn and turn my project into a usable app.


📚 What I learned

  • How to deal with real-time audio processing
  • Frameworks/libraries: PyAudio, Tkinter, BlackHole
  • On another note, I learned that speech therapy doesn’t work for some 20% of people who stutter, according to one study [3]; a tool like this could be especially valuable for them


⏭ What’s next for Relevate

Overall I’m really happy with what I made! In the future, I plan to…

  • Package this as an application
  • Make it work on Windows (BlackHole currently works only on macOS, but it should be fairly straightforward to support a Windows alternative)



References

  1. Troje, Nikolaus F. “Zoom disrupts eye contact behaviour: problems and solutions”, Trends in Cognitive Sciences, Volume 27, Issue 5, May 2023
  2. Seals, Douglas, et al. “We, um, have, like, a problem: excessive use of fillers in scientific speech”, Advances in Physiology Education, Volume 46, Issue 4, December 2022
  3. Langevin, Marilyn, et al. “Five-year longitudinal treatment outcomes of the ISTAR Comprehensive Stuttering Program”, Journal of Fluency Disorders, Volume 35, Issue 2, June 2010

Built With

  • blackhole
  • fastapi
  • huggingface
  • pyaudio
  • python