Inspiration

Over 2 billion people across Asia speak languages like Mandarin, Hindi, Japanese, Korean, Tamil, and Cantonese — yet the overwhelming majority of speech recognition and NLP tools are built for English. We saw this gap firsthand in our own communities: parents who couldn't use voice assistants in their native tongue, students whose accents were misread, and professionals whose multilingual meetings produced garbled transcripts. We asked ourselves: what if speech technology actually worked for Asian languages — not just transcribing words, but understanding structure, tone, and meaning? That question became this project.

What it does

Structured Speech Intelligence for Asian Languages takes audio input in multiple Asian languages — including Mandarin, Hindi, Japanese, and Korean — and transforms raw speech into clean, structured, actionable output. It goes beyond simple transcription: it identifies speakers, detects sentence boundaries, extracts key topics and action items, and formats everything into a readable, organized document. Think of it as a smart meeting assistant, but built from the ground up for Asian language speakers.

How we built it

We used OpenAI's Whisper model as our speech-to-text backbone, fine-tuned for tonal and script-heavy Asian languages. Language detection runs automatically on incoming audio. From there, we pipe transcriptions through a custom NLP layer powered by the Anthropic Claude API to extract structure (summaries, entities, and action items), with prompts tailored per language. The frontend is built in React, the backend in Python with FastAPI, and we used Firebase for real-time storage and session management.

Challenges we ran into

Tonal languages like Mandarin and Cantonese were our biggest technical hurdle: the same syllable can carry several unrelated meanings depending on pitch (Mandarin has four main tones, Cantonese has six), and standard models often flatten that nuance. We also struggled with code-switching, where speakers fluidly mix languages mid-sentence (e.g. Hinglish or Taglish), which broke most off-the-shelf pipelines. Punctuation and sentence segmentation required entirely different post-processing logic for Japanese, which is written without spaces between words, and for Korean, whose spacing rules differ from English. And of course, doing all of this in a hackathon timeframe was its own challenge.

Accomplishments that we're proud of

We successfully built a working pipeline that handles Mandarin, Hindi, Japanese, and Korean end-to-end, from raw audio to structured document, in under 30 seconds. We're especially proud of our code-switching detection, which gracefully handles mixed-language input rather than crashing or defaulting to one language. We also built a clean, intuitive UI that makes the tool accessible to non-technical users who just want to upload an audio file and get useful output back.

What we learned

We learned that language is never just about words: structure, culture, and context are deeply embedded in how people speak. We also learned how much of the existing NLP ecosystem assumes English, and how many small but important design decisions (like punctuation rules, reading direction, or honorific levels in Korean) completely change how you build a pipeline. Most importantly, we learned that building for underserved communities requires genuine curiosity and humility, not just technical skill.

What's next for Structured Speech Intelligence for Asian Languages

We want to expand support to 10+ Asian languages, including Vietnamese, Thai, Tagalog, and Bengali. We're exploring real-time transcription so the tool works live during meetings or lectures. Longer term, we want to build dialect-aware models, because Cantonese is not Mandarin and Punjabi is not Hindi, and partner with schools and community organizations across Asian diaspora communities to put this tool in the hands of people who need it most.

Built With

  • claude