Inspiration
We encountered the need for translation services firsthand by helping our parents call stores, online services, and government offices. However, existing solutions are either slow (high latency, with gaps between turns of translation), complicated to use (interfaces unsuitable for elders), or expensive (such as Apple's AirPods, which cost $300+). Therefore, we wanted to build a live language translation app that is fast (low latency), easy to use, and cheap.
What it does
YinLink translates between two languages, for instance English and Mandarin, live on any phone call, so both people can speak naturally while hearing the other language in real time: no app needed on the other end.
How we built it
We built YinLink by streaming live call audio with the LiveKit API: the caller and callee join a single session room, and we spawn live translation voice agents to carry out real-time English-to-Mandarin translation. We used the gpt-4o-realtime model for low-latency, direct audio-to-audio translation instead of a modularized pipeline (speech-to-text, then text-to-text translation, then text-to-speech). YinLink handles outbound calls through SIP trunking, which places the callee into the same session as the caller; this adds zero overhead on the callee's side and keeps the experience smooth and easy.
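The call-setup flow above can be sketched schematically. This is a minimal, illustrative model of the steps (caller joins a room, SIP dial pulls the callee into the same room, one translator agent is spawned per participant); the `Session` class and `start_call` function are hypothetical stand-ins, not the actual LiveKit SDK calls.

```python
# Schematic of YinLink's call setup. Illustrative only: the real system
# uses LiveKit rooms, SIP trunking, and gpt-4o-realtime agents; these
# names are hypothetical stand-ins for the sake of the sketch.
from dataclasses import dataclass, field

@dataclass
class Session:
    room: str
    participants: list = field(default_factory=list)
    agents: list = field(default_factory=list)

def start_call(caller_id: str, callee_number: str) -> Session:
    session = Session(room=f"call-{caller_id}")
    # 1. The caller joins the session room from the web app.
    session.participants.append(caller_id)
    # 2. An outbound SIP dial places the callee into the SAME room,
    #    so the callee just answers a regular phone call -- no app.
    session.participants.append(f"sip:{callee_number}")
    # 3. One realtime translation voice agent is spawned per participant
    #    (direct audio-to-audio, no separate STT/translate/TTS stages).
    for p in session.participants:
        session.agents.append(f"translator-for-{p}")
    return session
```

The key property the sketch captures is that both humans and both agents share one room, so audio routing stays inside a single session.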
Challenges we ran into
A big challenge we faced was configuring the phone numbers and carrying out the testing steps to verify the SIP trunking feature. Our numbers are on Canadian carriers without roaming enabled, so our phones cannot receive calls in the United States and we could not test with our own numbers. We had to borrow other participants' US-based phones to verify that they could receive calls from our program. Additionally, we tested against 24-hour customer service lines as well as restaurant take-out counters to validate the real-world feasibility of our app.
Accomplishments that we're proud of
We are proud of achieving the "ghost" voice-bridge translator with the LiveKit API by spawning an individual voice agent for each participant in the session room. We first tried one voice agent per session, but quickly noticed that a voice agent can only subscribe to one participant at a time. As a workaround, we spawn a voice agent per participant and use our ghost translator protocol to carry out live, real-time translation. We are also proud of enabling international outbound calling, placing calls to both US and Canadian phone numbers and verifying that the translation logic works on real calls. When we first heard our friends' voices through the app, it was an "ah-ha" moment: our app works in the real world, and we have built a real, feasible solution that can improve cross-language communication for users around the world.
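The one-agent-per-participant workaround can be shown as a small pairing function: each agent subscribes to exactly one participant and translates that participant's language into the other one. This is a hypothetical sketch of the routing decision, not the actual agent code.

```python
# Sketch of the "ghost translator" pairing. Each spawned agent listens to
# one participant (the only one it can subscribe to) and speaks the
# opposite language. Names and structure are illustrative assumptions.
def ghost_agents(participants: dict[str, str]) -> list[dict]:
    """participants maps participant id -> language spoken ('en' or 'zh')."""
    agents = []
    for pid, lang in participants.items():
        target = "zh" if lang == "en" else "en"
        agents.append({
            "listen_to": pid,              # single subscribed participant
            "translate": f"{lang}->{target}",
        })
    return agents
```

With a caller speaking English and a callee speaking Mandarin, this yields one en->zh agent bound to the caller and one zh->en agent bound to the callee, which together form the two-way bridge.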
What we learned
A big takeaway for us is that testing and evaluating a product is just as important as building it. The majority of our effort was spent testing the phone numbers, placing calls to different people, and working around our carrier limitations. We had not anticipated that our own phone numbers would not work in the United States, and we had no backup number or backup testing mechanism. Moving forward, before diving into implementation, we will plan feasibility testing and evaluation strategies up front, so that once the product is built we can easily test, debug, and improve it.
What's next for YinLink
YinLink currently supports English to Mandarin and has a web app interface for visualization. Moving forward, we would like to extend support to 5+ additional languages, such as French, Spanish, Japanese, Cantonese, and German, to serve more users around the world. We also want to turn the web app into a mobile app by porting from Next.js to React Native, so that YinLink runs on both iOS and Android. Lastly, we want to test the app on long conversations, where a call lasts anywhere from 30 minutes to an hour, and optimize the LLM context window to avoid hallucinations and garbled output.
Built With
- livekit
- openai
- python
- react
- typescript