Inspiration

Struggling to spitball ideas after being told the challenge category was Open, I received a long voice memo and realised the best problems to solve are the frustrations I personally experience!

What it does

The idea was a distilled version of Whisper that runs on-device and accurately transcribes Singlish and Singaporean-accented English.

How we built it

Theory: scrape Singlish phrases from Mozilla Common Voice, the IMDA National Speech Corpus (NSC) and YouTube, then chuck them into Whisper and let it figure it out.

I chose Whisper because it is the best transcription model currently available that is free to use and open, but also because it is trained on multilingual data, which in theory should improve performance when fine-tuned, since Singlish is a creole blending multiple languages.

The plan was to fine-tune Whisper on Singlish audio-text pairs, using the Hugging Face demo with their pre-written Colab notebook.

Then, as a stretch goal, distil the model down to run on-device (something Aleksandras Kostarevas at FUTO has already accomplished), exposed through the native Android keyboard voice input, or the android.speech.action.RECOGNIZE_SPEECH implicit intent for apps that support it. If that wasn't ready in time, the MVP demo would be a web endpoint that accepts an mp3 file.
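The MVP fallback could look something like this minimal sketch, using only the Python standard library. The route name and `fake_transcribe` stand-in are my own assumptions; a real server would load Whisper and run the uploaded audio through it instead.

```python
# Minimal "POST an mp3, get a transcript" endpoint sketch (stdlib only).
# fake_transcribe is a placeholder for the real Whisper call.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_transcribe(audio_bytes: bytes) -> str:
    # Placeholder: a real server would run Whisper on the audio here.
    return f"received {len(audio_bytes)} bytes of audio"

class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/transcribe":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)  # raw mp3 bytes from the request body
        body = json.dumps({"text": fake_transcribe(audio)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), TranscribeHandler).serve_forever()
```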

Challenges we ran into

Datasets of Singlish speakers are sparse, or already in Whisper's training data. For example, the Mozilla Common Voice dataset has Singaporean English speakers across a wide range of ethnicities, ages, hardware and background noise (good diversity for training data!), but OpenAI has already absorbed that open dataset into Whisper's training set. Additionally, Singlish phrases are not included in the dataset (e.g. "steady pom pipi", or "knnccb").

IMDA has one high-quality dataset with 1,000 diverse speakers, but it uses studio-quality audio that is not representative of the noisy environments where the model would actually be used. Additionally, you need to fill in a Google Form, and the data is hosted on Dropbox, which makes it a little challenging to import into Kaggle/Colab for processing (it's 1.2 TB of data as well 💀)

Additionally, I was working solo; I was partway through cleaning the data when I lost my laptop and went on a journey through Singapore. A heavy reminder to use Git for all work :_)

Accomplishments that we're proud of

Making other people laugh.

What we learned

Rather than stressing over fine-tuning, simply setting initial_prompt to flag Singlish and feeding examples of Singlish phrases into the context window may have been sufficient??? Always try the easier route first.
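That easier route might look like the sketch below, assuming the open-source `openai-whisper` package. The phrase list and prompt wording are purely illustrative; `initial_prompt` is a real parameter of Whisper's `transcribe` that biases the decoder's context window.

```python
# Sketch: bias Whisper toward Singlish via initial_prompt instead of fine-tuning.
# The example phrases and prompt format are illustrative assumptions.

SINGLISH_EXAMPLES = ["can lah", "steady pom pipi", "walao eh", "don't play play"]

def build_singlish_prompt(phrases):
    """Join example phrases into a single initial_prompt string."""
    return "Singlish conversation: " + ", ".join(phrases) + "."

def transcribe_singlish(audio_path):
    import whisper  # pip install openai-whisper
    model = whisper.load_model("base")
    # initial_prompt seeds the decoder's context window with Singlish vocabulary
    result = model.transcribe(audio_path,
                              initial_prompt=build_singlish_prompt(SINGLISH_EXAMPLES))
    return result["text"]

print(build_singlish_prompt(SINGLISH_EXAMPLES))
```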

What's next for My Life is a Comedy - Building Singlish-tuned transcription

Google actually stores your voice search history; it would be interesting to tailor a general-purpose speech recognition model to your own voice. This could also have great potential for those with speech impediments or non-standard speech to access accurate transcription. Personally, I lisp my consonants a lot, so this has really good future-work potential.
