Inspiration

Have you ever stood in front of a beautiful historical landmark, wanting to understand its story, but ended up scrolling through long, boring articles instead?

We realized that while modern travelers seek deeper cultural connections, existing tools are slow, fragmented, and uninspiring. We wanted to create something more immersive—an experience that feels like having a cinematic tour guide in your pocket. That’s how VietStory Lens was born.

What it does

VietStory Lens is a multimodal AI application that transforms real-world landmarks into engaging, real-time stories.

Users take a photo of a historical site, and within moments the app recognizes it and begins streaming a rich, AI-generated narrative with lifelike voice narration.

It’s not just recognition—it’s an instant, immersive storytelling experience.

How we built it

We designed an end-to-end multimodal RAG pipeline to deliver fast, accurate, and scalable results.

We used CLIP to generate image embeddings and stored them in Zilliz Cloud for fast vector similarity search. For content generation, we used OpenAI models to produce contextual narratives and ElevenLabs for real-time voice synthesis.
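
A minimal sketch of the retrieval idea, not the production code: the three-dimensional vectors below are made-up stand-ins for real CLIP embeddings, the in-memory dictionary stands in for a Zilliz Cloud collection, and the entries and the names `LANDMARKS`, `cosine_similarity`, and `nearest_landmark` are purely illustrative.

```python
import math

# Illustrative entries only: toy vectors in place of real CLIP image embeddings.
LANDMARKS = {
    "Independence Palace": [0.9, 0.1, 0.2],
    "Notre-Dame Cathedral Basilica of Saigon": [0.1, 0.8, 0.3],
    "Ben Thanh Market": [0.2, 0.3, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_landmark(query, index=LANDMARKS, min_score=0.8):
    """Return (name, score) for the closest entry, or (None, score) if too weak."""
    best_name, best_score = None, -1.0
    for name, vector in index.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        # Below the threshold we prefer "unknown" over a wrong story.
        return None, best_score
    return best_name, best_score
```

Calling `nearest_landmark([0.85, 0.15, 0.25])` would pick the first entry here. A managed vector database replaces the linear scan with an approximate nearest-neighbor index, but the similarity threshold plays the same role: a weak match is better reported as unknown than narrated incorrectly.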

To reduce perceived latency, we implemented a streaming architecture using Server-Sent Events (SSE), with caching layers to cut both cost and response time. The backend is built with FastAPI and deployed on scalable cloud infrastructure.
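
A rough sketch of how SSE framing and a cache can fit together, under stated assumptions: `cached_story` is a hypothetical placeholder for the expensive retrieval and generation step, and the event names `chunk` and `done` are invented for illustration. Only the `event:` / `data:` frame layout comes from the SSE format itself.

```python
import json
from functools import lru_cache


def sse_event(data, event=None):
    """Format one Server-Sent Events frame (terminated by a blank line)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads need one "data:" field per line.
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


@lru_cache(maxsize=256)
def cached_story(landmark_id):
    """Hypothetical placeholder for retrieval plus LLM generation."""
    return f"This is a placeholder narrative for {landmark_id}."


def stream_story(landmark_id, chunk_size=16):
    """Yield the story as a sequence of SSE frames, then a final 'done' frame."""
    story = cached_story(landmark_id)  # repeat requests are served from the cache
    for start in range(0, len(story), chunk_size):
        payload = json.dumps({"text": story[start:start + chunk_size]})
        yield sse_event(payload, event="chunk")
    yield sse_event("{}", event="done")
```

In a FastAPI app, a generator like this could be returned through `StreamingResponse(stream_story(landmark_id), media_type="text/event-stream")`, so the client starts rendering text before the full story is ready.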

Challenges we ran into

One of the biggest challenges was synchronizing real-time text generation with audio playback. Text streams quickly, but high-quality voice synthesis takes longer, so the narration can fall out of sync with the text on screen.

We solved this through asynchronous processing and careful state management to ensure a smooth, aligned user experience.
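
A simplified sketch of one way to keep the two streams aligned, not the production code: synthesis for every sentence starts concurrently, but each sentence is released only together with its own audio and in the original order. The `synthesize` coroutine is a hypothetical stand-in for a real text-to-speech request, with a sleep simulating variable latency.

```python
import asyncio


async def synthesize(sentence):
    """Hypothetical TTS stand-in: longer sentences take longer to 'synthesize'."""
    await asyncio.sleep(0.001 * len(sentence))
    return f"<audio:{sentence}>".encode()


async def narrate(sentences):
    """Yield (sentence, audio) pairs in order while synthesis runs concurrently."""
    # Start all synthesis tasks up front so slow segments overlap with fast ones.
    tasks = [asyncio.create_task(synthesize(s)) for s in sentences]
    for sentence, task in zip(sentences, tasks):
        # Wait only for the next segment in sequence; later ones keep running.
        audio = await task
        yield sentence, audio


async def collect(sentences):
    """Small helper that gathers the whole narration into a list."""
    return [pair async for pair in narrate(sentences)]
```

Running `asyncio.run(collect(["A long opening sentence.", "Short."]))` returns the pairs in their original order even though the short sentence finishes synthesizing first, which is the property that keeps on-screen text and narration together.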

Additionally, transitioning from a local vector database to a cloud-based system required rethinking our retrieval logic and optimizing for performance at scale.

Accomplishments that we're proud of

Successfully integrated multiple AI systems into a seamless pipeline.

Built a high-quality, low-hallucination dataset for Vietnamese landmarks.

Delivered a product that is both technically robust and emotionally engaging.

What we learned

Building a real-world GenAI application requires much more than calling APIs. It demands strong data pipelines, efficient retrieval systems, and scalable infrastructure.

We also learned that combining vision, text, and audio creates a far more powerful and emotional user experience than any single modality alone.

What's next for VietStory Lens

We plan to expand beyond Ho Chi Minh City to cover major landmarks across Vietnam.

Next, we will introduce multilingual support to serve international travelers and integrate augmented reality to overlay historical views directly onto the real world.

Our vision is to make history accessible, immersive, and alive—anywhere in the world.
