Inspiration

Our generation is witnessing the growing pains of globalisation, as long-standing systems clash with the complexities of a connected world. Cultures are meeting in unprecedented ways, yet barriers remain: people struggle to understand and empathise with those who speak different languages. Historically, nations have been forged by language, often through violence or political manoeuvring. Despite this tension, human greatness has emerged from collaboration, sharing, trade, friendship, and love. These connections happen in small ways every day, yet so does the frustration born from an inability to communicate fully. Our global society has empowered us to communicate across cultures, established agreements like the Universal Declaration of Human Rights, and created policies welcoming refugees. But globalisation has also led to conflicts and geopolitical tensions. We've seen friendships blossom across borders, and we've watched language barriers fuel misunderstandings everywhere from international friendships to football hooliganism.

Language has the power to bring us together. A shared "lingua franca" can bridge divides. English, Arabic, Spanish, French, Hindi, Urdu, Russian, Persian, Mandarin, and others have connected people across regions and beyond borders. The ability to understand multiple languages has empowered unlikely figures throughout history. La Malinche, for example, used her knowledge of Nahuatl, Mayan, and Spanish to influence Hernán Cortés and change the course of Mexican history. Translation wields power and grants access, but there’s more to it than just words.

Beyond geopolitics and history, understanding others is essential. Yet translation often loses context, missing the subtle cultural elements that make communication rich. Reading Gabriel García Márquez's magical realism in English translation, for instance, is not the same as hearing him describe it in his native Spanish. Specialists, thinkers, and storytellers often operate in specific languages, making much of their knowledge inaccessible to the broader world. An international couple meeting each other’s families might struggle to tell their story in their own words and tone, relying instead on an intermediary who speaks both languages. This "clunky" process complicates the establishment of trust, rapport, and connection.

What if we could use technology to bridge these gaps? What if we could enable people to connect deeply and authentically, regardless of language?

What it does

PROBLEM: As a human, I want to connect with other humans with whom I don’t share a language.

MISSION: To enable people to communicate in their mother tongues with people who do not speak the same language, where they most need it.

SOLUTION:

We are building an AI-powered mobile app for real-time translation, enabling seamless, meaningful conversations across languages so people can connect with anyone, anywhere.

Product pillars:

  • Quasi-real-time translation
  • High quality translation: Content accuracy & Human quality
  • Portable
  • Affordable
  • Personalised: Supported by a knowledge base, and able to leverage context and any necessary prompting. Integrates with tools that provide context, for example Substack, WhatsApp, or Meta. Leverages Instant Voice Cloning by ElevenLabs.

Use cases

  • Connecting - Meeting new family members: Andrej relies on his fiancée to translate whenever he spends time with her Romanian parents. Andrej would have loved to call her father himself to ask for permission to marry her.
  • Travel - Preparing a mountain route in the Atlas: Jaime is preparing a trip to the Moroccan Atlas to climb a mountain, and he relies on English and French to speak to the Amazigh guide prior to arrival. Jaime would have liked to tap into the wealth of knowledge this guide has of the mountain, rather than exchanging simple, generic messages.
  • Networking - Connect with people & knowledge sharing: Nonso is building the new interface for the Middle East version of the Personal Banking App, and he relies on people who translate content into Arabic. Nonso would have liked to access the knowledge and concepts of Product Managers who focus on Islamic Finance, as they operate in Arabic.
  • Education & Research - Access lectures and seminars: Marina would like to attend the most advanced medical classes during her 6-month secondment in China, but she doesn’t speak Mandarin.

Aspirational use cases

  • Diplomacy - UN or COP: Diplomats at international conferences rely on translators to establish consensus. Imagine them communicating directly.
  • Migration - Welcoming refugees and asylum seekers: Abi welcomes refugees from all over Africa to the reception centre in Lampedusa. She needs to communicate with them to register them and keep a record of their arrival, so that they can access their rights as people fleeing conflict, climate catastrophes and discrimination. Abi doesn’t speak English, nor does she speak Arabic or Amharic, and so she feels a deep frustration when she has to rely on a translator's availability.

Product breakdown

  • Live Translation - Speech-to-text & back: This is the core functionality that unlocks everything else. The tool converts speech to text (this needs to be streamed), then translates the text (streaming the result back), and finally converts the translated text to speech via ElevenLabs.

  • Personalised - Voice Cloning & more: By enabling the Instant Voice Cloning feature of ElevenLabs, and by integrating with tools such as Substack, WhatsApp, and Facebook, the tool can learn your context as well as your voice. This will make interactions more personal and more human, despite being translated.

  • Portable - Enmerkar App: A mobile-first app can fit into your usual calling habits and goes wherever you go, especially when you travel. The whole live translation flow has been designed to take place within the app. The output audio can go to another Enmerkar user, or to a phone number via a Twilio integration.

  • Keeps learning - Feedback loops: The more you interact with the Enmerkar app, the more it learns about you and about the people that you want to interact with.
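
The live translation flow described in the first item above can be sketched as three chained streaming stages. This is a minimal sketch under stated assumptions, not the POC's code: `microphone`, `transcribe`, `translate`, `synthesise` and `run_pipeline` are hypothetical names, and each stage returns placeholder values instead of calling a real speech-to-text, translation or text-to-speech service.

```python
import asyncio
from typing import AsyncIterator


async def microphone(phrases: list[str]) -> AsyncIterator[bytes]:
    """Hypothetical audio source: pretends each phrase is one audio chunk."""
    for phrase in phrases:
        yield phrase.encode("utf-8")


async def transcribe(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Hypothetical speech-to-text stage (placeholder: decodes the bytes)."""
    async for chunk in audio:
        yield chunk.decode("utf-8")


async def translate(texts: AsyncIterator[str], target: str) -> AsyncIterator[str]:
    """Hypothetical translation stage (placeholder: tags the text with the target)."""
    async for text in texts:
        yield f"[{target}] {text}"


async def synthesise(texts: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Hypothetical text-to-speech stage (placeholder: encodes the text)."""
    async for text in texts:
        yield text.encode("utf-8")


async def run_pipeline(phrases: list[str], target: str) -> list[bytes]:
    """Chain the stages so each chunk flows through as soon as it is ready."""
    audio_out = synthesise(translate(transcribe(microphone(phrases)), target))
    return [chunk async for chunk in audio_out]
```

Because every stage is an async generator, a chunk can move to the next stage without waiting for the whole utterance, which is the property the streaming design depends on. For example, `asyncio.run(run_pipeline(["hola"], "en"))` returns `[b"[en] hola"]` with these placeholder stages.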

How we built our POC

[See Deck for Visual components]

The POC was built by cloning Jaime’s voice, and was optimised for Spanish to English translation, as that was what we could test efficiently within our team. Most of the time was spent cracking the streaming of voice input and output. The POC focused on the core functionality: creating a stream where the input is in one language and the output is in another. We chose not to work on the other elements, namely the onboarding flow, the integrations and the interfaces.

Architecture Components

  • Mobile App (Flutter/React Native) – Purpose: The client interface where users interact with the Enmerkar app. Primary Functions: Sends and receives voice data, displays translations, manages user settings, and handles onboarding.
  • Voice-to-Text Processing – Technology: WebSocket connection with Rev.ai. Function: Receives raw audio input from the app, converts it to text, and sends the transcribed text back to the app for further processing.
  • Translation – Technology: GPT-4 (fine-tuned for translation). Function: Receives transcribed text and translates it into the target language. Returns the translated text to the app or passes it on to the text-to-voice component.
  • Text-to-Voice Processing – Technology: ElevenLabs Turbo 2.5 (Multi-language). Function: Converts translated text into synthetic speech using a voice similar to the user’s or one selected during onboarding. Returns audio back to the app.
  • [Not attempted] Experience Enrichment and Contextual Integration – Scraping Integrations: Pulls contextual data (e.g., from sources like Substack, WhatsApp, Facebook) during onboarding or whenever relevant, using APIs or scraping tools. Context Processing: Processes the collected data to understand user preferences, commonly used terms, and conversational style using GPT-4, to personalise translations and voice output.
  • Voice Cloning for Personalization – Technology: Instant Voice Cloning (likely provided via ElevenLabs or similar). Function: Creates a synthetic voice that mimics the user’s natural tone and inflection, making conversations sound more authentic.
  • [Not attempted] Feedback Loop and Learning – Purpose: Tracks user interactions and feedback on translation quality and voice accuracy. Continuously improves the model and personalises the experience further. Feedback Storage: Stores data for continuous learning, allowing GPT-4 to adapt over time based on usage.
  • [WIP] Backend Server – Function: Manages WebSocket connections, authentication, user data, contextual data storage, and feedback processing. Components: Database: Stores user preferences, context data, and feedback. WebSocket Manager: Manages real-time data flow between the mobile app and services like Rev.ai and ElevenLabs. API Gateway: Provides secure access to external APIs, manages requests to Rev.ai, GPT-4, ElevenLabs, and scraping tools.

Challenges we ran into

There were challenges around latency, chunk sizing, and the way chunk size interacts with ordering: a larger chunk of text takes longer to translate than a smaller chunk recorded after it, so translated chunks could come back out of order and the output would be a nonsensical phrase.
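
One common way to handle the ordering problem is to tag each chunk with a sequence number and hold back any translation that arrives early. The sketch below illustrates that idea only; it is not what the POC implemented. `ReorderBuffer`, `fake_translate` and `translate_in_order` are hypothetical names, and `fake_translate` simulates a translation call whose delay grows with the length of the text.

```python
import asyncio


class ReorderBuffer:
    """Holds out-of-order results and releases them in sequence order."""

    def __init__(self) -> None:
        self._pending: dict[int, str] = {}
        self._next_seq = 0

    def push(self, seq: int, text: str) -> list[str]:
        """Store one result and return every result that is now ready to emit."""
        self._pending[seq] = text
        ready: list[str] = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready


async def fake_translate(seq: int, text: str) -> tuple[int, str]:
    """Hypothetical translation call: longer chunks take longer to come back."""
    await asyncio.sleep(0.001 * len(text))
    return seq, text.upper()  # placeholder for a real translation


async def translate_in_order(chunks: list[str]) -> list[str]:
    """Translate all chunks concurrently, but emit them in recording order."""
    tasks = [asyncio.create_task(fake_translate(i, c)) for i, c in enumerate(chunks)]
    buffer = ReorderBuffer()
    emitted: list[str] = []
    for finished in asyncio.as_completed(tasks):
        seq, translated = await finished
        emitted.extend(buffer.push(seq, translated))
    return emitted
```

The trade-off is latency: a slow early chunk delays every chunk behind it, so this approach fixes ordering but makes chunk sizing matter even more.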

Accomplishments that we're proud of

Over a weekend, we explored the complexities behind building such a tool, and we were inspired by the countless use cases that make the concept so powerful.

What we learned

This is a very hard challenge to crack, both technically and in terms of interaction design. You need a lot more time, expertise and resources to address it meaningfully. There was quite a lot of complexity in building the streaming with AI tools that support it. Streaming was important so that we could ensure a continuous flow of speech, text and their respective translations.
