Inspiration
The three teammates hail from India, a country where nearly 75% of students are not taught in English at school. That is roughly 200 million people who are educated on paper but cannot access the wealth of high-quality education available online in English: YouTube videos, Khan Academy, Coursera, even university degree programs. This language barrier not only limits access to content; by extension, it limits how far these individuals can grow as citizens of a global society. The problem is glaring in India, and it extends to dozens of other non-English-speaking countries and more than a billion students in them. We felt that the recent accelerated progress in AI could assist in this problem space, and our project is a strong starting point for that initiative.
What it does
Our product, Nalanda.AI, is in essence a set of tools that augment the online education experience by bringing it into your native language. In its simplest form, it translates an online lecture from English into a local language, providing both captions and an audio track in that language. On top of that, it generates a summary in the user's language, followed by a short quiz to test their grasp of the concepts taught. The same technology powering this user-controlled experience is also offered to businesses such as universities, letting them deliver their world-class programs in other languages and extend their reach and market beyond what was possible before, fully automated by AI. Our business offerings can be highly customized: the context of an institution's entire content base yields a superior experience to what a single translated video can provide. This dual model also doubles as a rough but valid business plan, where revenue from business clients subsidizes the user-facing offering, keeping it as cheap as possible, if not free.
How we built it
As the core MVP to showcase our product's functionality, we built APIs that businesses can use in their own pipelines and that also serve as the backend for a demo web app users can use to translate their learning experience from a YouTube video. The web app lets the user submit a video URL and select the language they are fluent in. For this MVP, Hindi is the only target language, since as native speakers we could test it confidently. Once the URL is submitted, the following flow runs through our backend API services (it can also be seen in the attached diagram):
- Extract audio from the video using the `yt-dlp` Python package.
- Extract the English subtitles and a text transcript from the audio using the OpenAI Speech API. We use the additional `prompt` parameter provided by the API to give it some context and improve the output. For example, prompting it with "These are subtitles from a lecture on … in …" produces much better transcriptions of domain-specific terms, such as not mistaking "GPT-3" for "GDP-3".
- Translate this subtitle file to a Hindi subtitle file using the Azure Cognitive Services Translate API.
- Generate a Hindi audio file from these subtitles, which is then passed back to the client to be played over the original video.
- The transcript generated in step 2 is used to generate a summary and a quiz using the OpenAI Conversation API. The summary is then translated with Azure and passed to the client.
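A key part of the flow above is carrying subtitle cues from the English SRT file through translation while keeping their timings intact, so the Hindi captions and audio stay aligned with the video. The sketch below shows one way this step could look; the helper names and the injected `translate` callable are our illustration, not the project's actual code (in the real pipeline that callable would wrap the Azure Cognitive Services Translate API).

```python
import re

def parse_srt(srt_text):
    """Split an SRT file into (index, timing, text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3:
            # lines[0] = cue index, lines[1] = "start --> end" timing,
            # the rest is the caption text (possibly multi-line)
            cues.append((lines[0], lines[1], "\n".join(lines[2:])))
    return cues

def translate_srt(srt_text, translate):
    """Rebuild the SRT with each cue's text run through `translate`,
    leaving indices and timings untouched so captions stay in sync."""
    out = []
    for idx, timing, text in parse_srt(srt_text):
        out.append(f"{idx}\n{timing}\n{translate(text)}")
    return "\n\n".join(out) + "\n"
```

Because the timing lines pass through unchanged, the translated file can be handed straight to the text-to-speech step, where each cue's time slot bounds the synthesized audio.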
Challenges we ran into
- Syncing the generated audio with the video was the most challenging aspect algorithmically. The text-to-speech output has its own cadence, which may not match the speaker's. To solve this, we compared the generated audio to the original, calculated a speed multiplier for each sentence, and passed it back to the API to synthesize audio at variable speeds that eventually match the speaker.
- We had some issues working with YouTube embeds in the web page and overlaying our generated subtitles and audio. We eventually solved it by not modifying the YouTube player at all and rendering our output alongside it instead.
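The per-sentence speed multiplier described above can be sketched as a small pure function. This is our illustration of the idea, not the team's actual code: the clamp bounds (`lo`, `hi`) are assumed values to keep sped-up speech intelligible, and the durations would come from the subtitle timings and the length of the synthesized clip.

```python
def speed_multiplier(tts_seconds, slot_seconds, lo=0.8, hi=1.5):
    """Speed factor to fit a synthesized sentence into the speaker's
    original time slot. >1.0 means the TTS clip ran long and must be
    sped up; the result is clamped so speech stays intelligible."""
    if slot_seconds <= 0:
        return hi
    return max(lo, min(hi, tts_seconds / slot_seconds))
```

For example, a 3-second synthesized sentence that must fit a 2-second slot gets a 1.5x multiplier, while a clip already shorter than its slot is slowed only down to the 0.8x floor.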
Accomplishments that we're proud of
Listening to popular lectures in our native language was an experience we never imagined could happen. We are proud to have created something that can bring the same joy to millions of other people :)
What we learned
The power of emerging LLM technologies, used in the right manner, can be immense. The product we built in 36 hours not only has the potential to change someone's life through education, it can do so at scale. The APIs provided by companies like OpenAI go a long way toward democratizing these technologies, allowing engineers like us to build user-oriented tools and fully realize the potential at hand.
What's next for NalandaAI?
There are many further improvements that would make the offering truly comprehensive:
- Translating text displayed in the video itself; think of a slide-heavy lecture. We knew how to do this but unfortunately did not have time to include it in the MVP. It would make the experience even more inclusive.
- Incorporating other languages. Hindi worked quite well, but other languages would require solid modularity in our services as well as quality testing.
- Separating the speech track from the other audio and using just that for translation, so background music and effects remain intact.
- Beyond technical increments, more thought on the business side would help us gauge the profitability and reach our services can have.