Inspiration
Learning languages has been a long-time interest of mine. The idea of orAIte presented the perfect opportunity for me to combine my interest in language learning with programming and AI. I was inspired by how pronunciation proved to be one of the more challenging aspects of learning languages. Learners often cannot hear the pronunciation errors they make in a new language (hence the persistence of native-language-influenced phonetics in second languages). It is also often hard to find an opportunity to converse in your target language, which exacerbates the issue.
In addition, accents and speech impediments have long formed a significant barrier to advancement for marginalized communities, leading to difficulties communicating with other people and, as a result, diminished opportunities for employment, education, and more. Many existing language-learning tools disregard pronunciation and do not provide the nuanced, personalized guidance that people with atypical speech patterns need. My goal with orAIte is to democratize pronunciation training: anyone should be able to receive instant, accurate, and useful feedback in a no-judgement zone where they can feel safe practicing their pronunciation as many times as they need to master it.
orAIte uses AI to solve this issue, serving as an effective pronunciation trainer that will tell you exactly what you are pronouncing wrong and how you are pronouncing it wrong. It compares your speech against authentic native speakers, which provides an objective standard with which your own speech can be accurately assessed.
The name "orAIte" is a portmanteau of "orate," a Latinate verb meaning "to speak," and "AI," reflecting its aim to leverage AI to improve people's oratory skills. The name is intended to be a homophone of "orate." It also reflects the primary goal of the program: by using AI to improve people's speech, it aims to give people the confidence they need to be the orators they always deserved to be.
What it does
orAIte allows users to practice their pronunciation of individual words. By leveraging Microsoft's Azure Speech processing and the vast corpus of authentic native pronunciations available on Wiktionary, orAIte systematically compares your pronunciation to that of native speakers. It gives users an easy way to listen to their target word's pronunciation, a percentage accuracy score relative to native speakers, and a phonetic breakdown of exactly where their pronunciation differed from native speakers. Accuracy is computed via dynamic time warping (DTW) between the user's spoken input and the native speaker sample. Hovering over a component in the phonetic breakdown shows its IPA (International Phonetic Alphabet) transcription, the gold standard for phonetic accuracy in linguistics, along with an example of how the sound is pronounced (e.g., the tooltip for "ah" reads "IPA: ɑ, Example: a in father"), leaving no ambiguity about how words should be pronounced.
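As a rough illustration of the phonetic breakdown described above, here is a minimal sketch of how differences between two phoneme sequences could be listed. It is not orAIte's actual code: the function name `phonetic_breakdown`, the phoneme labels, and the use of Python's standard `difflib` module for the comparison are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def phonetic_breakdown(native, spoken):
    """List the places where `spoken` differs from `native`.

    Both arguments are lists of phoneme labels (hypothetical labels here).
    difflib is a stand-in comparison method chosen for this sketch.
    """
    matcher = SequenceMatcher(None, native, spoken, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # matching stretches need no feedback
        differences.append(
            {"kind": tag, "expected": native[i1:i2], "heard": spoken[j1:j2]}
        )
    return differences


if __name__ == "__main__":
    # Made-up transcriptions of a target word and a learner's attempt.
    native = ["f", "ah", "th", "er"]
    spoken = ["f", "a", "th", "er"]
    for diff in phonetic_breakdown(native, spoken):
        print(diff)
```

Each entry names the kind of difference (`replace`, `delete`, or `insert`) along with the expected and heard phonemes, which is the sort of information a per-phoneme tooltip could be attached to.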
Beyond language learning, orAIte also has significant potential to aid those who have speech impediments and those who want to speak more clearly. Its ability to point out precisely how one's speech differs from standard speech samples can help people of all kinds improve their speech and gain newfound empowerment and confidence.
How I built it
For the backend, I used Flask to provide the key logic for converting audio into phonetic transcriptions and for scraping Wiktionary. Wiktionary is the source of the native speaker audio, and Azure processes the audio of both the native speaker sample and the user-provided speech. To calculate the accuracy of user speech against native speech, I used the dynamic time warping algorithm. For the frontend, I used HTML, CSS, and JavaScript.
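To make the dynamic time warping step more concrete, here is a minimal sketch of DTW over two phoneme sequences. Several things are assumptions for illustration rather than details of orAIte itself: that DTW runs on phoneme labels (instead of, say, audio features), the 0/1 mismatch cost, the way the distance is turned into a percentage, and the names `dtw_distance` and `accuracy_percent`.

```python
def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Classic dynamic time warping distance between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = cheapest cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # extend the cheapest of: stretch a, stretch b, or advance both
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def accuracy_percent(native, spoken):
    """Map the DTW distance to a 0-100 score (assumed normalization)."""
    if not native or not spoken:
        return 0.0
    # With a 0/1 cost, the distance never exceeds the longer sequence length.
    worst = max(len(native), len(spoken))
    return round(100.0 * (1.0 - dtw_distance(native, spoken) / worst), 1)


if __name__ == "__main__":
    native = ["f", "ah", "th", "er"]  # made-up phoneme labels
    spoken = ["f", "a", "th", "er"]
    print(accuracy_percent(native, spoken))
```

DTW is a natural fit here because it tolerates differences in timing: a phoneme that is held longer or repeated can still align with its counterpart without being penalized as a mismatch.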
Challenges I ran into
Figuring out how to scrape Wiktionary properly for native audio was quite a challenge. I also had some difficulty making sure that I was only scraping the audio for the target language (that is, not accidentally scraping Spanish words when the language is set to English). I ended up having to learn a lot about regex functions in order to split up the HTML properly and get what I needed.
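The language-filtering problem above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the sample HTML is made up, it assumes each language section begins with an `<h2>` tag whose `id` is the language name, and the function name `audio_sources_for_language` is hypothetical. Real Wiktionary markup is more involved.

```python
import re

# Made-up, heavily simplified page: one <h2> heading per language section.
SAMPLE_HTML = (
    '<h2 id="English">English</h2>'
    '<audio><source src="//example.org/En-us-pie.ogg"></audio>'
    '<h2 id="Spanish">Spanish</h2>'
    '<audio><source src="//example.org/Es-pie.wav"></audio>'
)


def audio_sources_for_language(html, language):
    """Return audio URLs found only inside the given language's section."""
    # Split just before every <h2 ...> tag so each chunk is one section.
    sections = re.split(r"(?=<h2\b)", html)
    heading = re.compile(r'<h2\b[^>]*\bid="' + re.escape(language) + r'"')
    for section in sections:
        if heading.match(section):
            # Collect src attributes that point at common audio file types.
            return re.findall(r'src="([^"]+\.(?:ogg|oga|wav|mp3))"', section)
    return []


if __name__ == "__main__":
    print(audio_sources_for_language(SAMPLE_HTML, "English"))
    print(audio_sources_for_language(SAMPLE_HTML, "Spanish"))
```

Splitting on the heading first and searching for audio second is what keeps a Spanish recording from being returned when the language is set to English.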
Accomplishments that I'm proud of
I am proud that my project is able to provide conclusive and solid feedback for anyone who is seeking to improve their pronunciation of words. The fact that it has the potential to help and empower language learners, those with speech impediments, and anyone else who seeks to improve their speech and intelligibility makes me incredibly proud. It is quite accurate in its transcriptions as well. I am also proud that I was able to ship a fully functional concept within 12 hours.
What I learned
As I mentioned earlier, I learned a lot about regex functions. I also learned a lot about the Azure Speech API, which I had never used before this project, and about the dynamic time warping algorithm, which I used to compute the accuracy of users' spoken words. In addition, I had to interact significantly with Wiktionary's REST API and make heavy use of the Python requests library, which I previously did not have much experience with. Extracting audio from Wiktionary taught me the most, since I had to figure out how to sift through the massive HTML at the core of each Wiktionary page to get to exactly what I needed.
What's next for orAIte
I think the future of this project includes adding more languages as well as exploring and implementing other ways AI can be used to improve and accelerate language learning. Expanding the concept beyond individual words to full phrases and sentences would be incredibly useful. There is also still plenty of room for AI to augment the pronunciation training functionality of this app: perhaps an LLM could use knowledge of your native language to recommend exactly how you can move towards pronouncing words in your target language correctly. Finally, aesthetic upgrades could be considered, such as a visualization of one's recording and general UI polish, to make the experience feel more refined.
