We saw a remarkable opportunity in the ChatGPT API to deliver a radically personalized educational experience based on each user's unique spoken input, and decided to build this speaking tool to enhance language learning through personalized conversation practice. Our tutor client gently corrects mistakes and keeps pushing the envelope on how complex a conversation the user can handle in the foreign language. With the advent of GPT-4 and recent voice-related AI advancements, we knew our goal of creating this personalized learning tool was achievable.
What it does
The tutor client’s primary objective is to sustain an engaging conversation with the user, entirely in the user’s desired language of study, through which it can help the user practice, learn, and grow. The client first listens to the speaker, recording an audio file under keyboard control and passing it to our model. It currently supports English, Russian, and Spanish, with the potential to scale to many more languages. Our model then speaks back to the user while following various behavioral guidelines, such as finding mistakes in the user's speech, suggesting corrections, defining confusing words, answering questions, and continuing the conversation with new content based on the user's preferences. This ultimately allowed us to create a highly personalized and interactive language learning experience centered on building practical experience.
How we built it
We used the Whisper API to convert the input speech (recorded by a clipping client we wrote in Python) into text in the input language. We then passed this text to the GPT-4 API. We carefully engineered a prompt that takes in the user's native language and the language they are trying to learn, and guides the user toward catching and correcting mistakes, defining unfamiliar words, and so on, in the target language. The GPT-4 API returns the model’s response as a string, which we then pass to the Google Cloud text-to-speech model (choosing a voice appropriate to the language) so the response is read aloud in the user’s desired language.
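The behavioral prompt is the heart of this pipeline. A minimal sketch of how such a prompt can be parameterized by the user's native and target languages might look like the following (the function name and wording are illustrative, not our exact production prompt):

```python
def build_tutor_prompt(native_language: str, target_language: str) -> str:
    """Build a system prompt for the GPT-4 tutor.

    Illustrative sketch only -- our real prompt is longer and
    more carefully tuned.
    """
    return (
        f"You are a friendly language tutor. The user is a native "
        f"{native_language} speaker learning {target_language}. "
        f"Respond entirely in {target_language}. "
        "Gently point out mistakes in the user's speech, suggest "
        "corrections, define unfamiliar words on request, answer "
        "questions, and keep the conversation going with new topics "
        "matched to the user's interests and skill level."
    )

# Example: a native English speaker practicing Spanish.
prompt = build_tutor_prompt("English", "Spanish")
```

Passing this string as the system message keeps the tutor's behavior consistent across every turn of the conversation.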
Each of these functions existed as a separate component; we joined them into a single coherent pipeline that takes audio (user speech) as input and outputs an audio file (client speech), while letting the conversation build on and extend ideas discussed in previous prompts. This added to the fluidity and connectivity of our model. We took painstaking care to ensure that each element of the pipeline fit together well.
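Concretely, the glue between components is a message list that carries earlier turns forward, so each GPT-4 call sees the whole conversation so far. A minimal sketch of that stateful loop (a simplification of our pipeline, with transcription, the chat model, and text-to-speech assumed to be handled elsewhere) might look like:

```python
class TutorConversation:
    """Keeps prior turns so each model call can build on earlier prompts.

    Sketch only: Whisper transcription, the GPT-4 call, and Google Cloud
    TTS are assumed to be supplied by the surrounding pipeline.
    """

    def __init__(self, system_prompt: str):
        # The system prompt encodes the tutor's behavioral guidelines.
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user_turn(self, transcribed_text: str) -> list:
        """Append the transcribed user speech and return the full
        message history to send to the chat model."""
        self.messages.append({"role": "user", "content": transcribed_text})
        return self.messages

    def add_model_turn(self, reply_text: str) -> None:
        """Record the model's reply so the next turn can reference it."""
        self.messages.append({"role": "assistant", "content": reply_text})

# One round trip of the loop:
convo = TutorConversation("You are a Spanish tutor.")
convo.add_user_turn("Hola, como estas?")
convo.add_model_turn("¡Hola! Estoy bien. Nota: se escribe '¿cómo estás?'.")
```

Because the entire history is resent on each turn, the model can correct a mistake made several exchanges earlier or return to a topic the user raised at the start.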
Challenges we ran into
Connecting all of the different APIs and modules behind each small piece of functionality into a single cohesive model was far more complex than we expected. We ran into many issues keeping dependencies and package versions consistent across the project. Another difficulty was engineering the behavioral prompts that define the model’s conduct, so that it returned the best possible answer to each question.
Accomplishments that we're proud of
We’re proud that our final product is as personalized as it is: users can ask the model about almost anything and get a response specifically tailored to their desired language of study, level of sophistication, interests, and requests. Our model is not the kind of voice assistant that says “sorry, I don’t know about that”: it can adapt to almost any request or situation.
What we learned
We learned that a great deal can be done in 36 hours with enough ambition and vision (and, to some extent, caffeine). Specifically, we learned how powerful APIs can be for bringing together many existing pieces of software to realize our vision. Another crucial lesson was the importance of clean, abstracted, organized code, which makes debugging far easier and lets us add or modify features with greater ease. We also learned that debugging is a vital skill, and that learning tools such as bash/zsh can make development easier in the future.
What's next for PolyglotPro
We hope to make PolyglotPro remember previous interactions and keep track of the types of mistakes the user makes in a given language. This would let the model pick up where it left off with a particular user, rather than preserving experience data for only a single learning session. It would also let the model provide personalized guidance by specifically practicing the parts of the language the user struggles with, and track the user’s improvement in those areas over time. We could even add improvement dashboards to help the user see their progress in a tangible way, similar to the graphical components on Khan Academy or LeetCode.
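As a first step toward that persistence, mistakes could be tallied per category so future sessions can target weak areas. A hypothetical sketch of such a tracker (the category names and structure are our assumptions, not shipped functionality):

```python
from collections import Counter


class MistakeTracker:
    """Tally mistakes by category across sessions.

    Hypothetical sketch of the planned persistence feature, not an
    existing part of PolyglotPro.
    """

    def __init__(self):
        self.counts = Counter()

    def record(self, category: str) -> None:
        """Log one mistake, e.g. 'verb conjugation' or 'gender agreement'."""
        self.counts[category] += 1

    def weakest_areas(self, n: int = 3) -> list:
        """Return the n most frequent mistake categories, so the tutor
        can steer the next session's practice toward them."""
        return [category for category, _ in self.counts.most_common(n)]


# Example session with two conjugation slips and one agreement slip:
tracker = MistakeTracker()
for cat in ["verb conjugation", "verb conjugation", "gender agreement"]:
    tracker.record(cat)
```

Feeding `weakest_areas()` back into the system prompt would close the loop between sessions, and the same counts could drive the progress dashboards mentioned above.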
Another area we could work on is using better voice generation APIs, as our current model can sound slightly monotone at times and would benefit from improved intonation.
We also hope to tailor the model more closely to the user’s level of sophistication in a particular language, so that it mirrors the user’s skill level and provides a more fitting track for improvement. This would allow for more sustainable growth in both vocabulary and grammar, without excessive jumps in learning difficulty.
DISCORD: BeepBoop#0417 s_elez9#0670