Project Overview

We built a voice-driven multimodal web app that transforms speech into intelligent, context-aware interactions using Fish Audio, Fetch.ai uAgents, Letta AI, and ChromaDB. The system performs both speech-to-text and text-to-speech, retrieves related image or text data from embeddings, and generates emotionally rich, narrated responses, all through a modular agent-based architecture.

Inspiration

A friend of ours who’s blind once said something that really stuck with us. We were all sitting in a room, scrolling through photos and memories from one of our past trips, when she said, “I have so many pictures people have taken for me, but I don’t know what they look like.”

That hit us: for most of us, photos are how we relive moments. But for someone who’s blind, those images are silent. While there are many existing tools to help with everyday tasks like navigation and communication, we realized there aren’t many tools to help with the emotional losses associated with blindness.

We wanted to change that: to give those photos a voice and turn visuals into sound, emotion, and story.

That’s why we built Synesthesia: an AI system that uses digital synesthesia, modeled on the biological phenomenon in which one sense triggers an experience in another, to help blind and visually impaired users feel and hear the memories they can’t see.

What it does

Synesthesia transforms a photo into an expressive audio narrative, describing not just what’s in an image, but what it feels like to experience the moment captured.

The user starts by speaking about a specific memory. Using speech-to-text, the system then finds a matching photo in their collection. Prompts can include a range of details: past trips, locations, dates, descriptions of settings and people, and the many other descriptors we associate with our past. Example prompts include “Paris trip last year” or “studying with friends at the library.”

From there, Synesthesia converts the speech to text and searches the user’s camera roll for a matching photo. It uses OpenAI’s CLIP model for semantic search, together with ChromaDB’s retrieval system, to match on both qualitative descriptions (adjectives, emotions, etc.) and concrete details (dates, locations, etc.) from the prompt.
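The matching step can be sketched roughly as follows. This is an illustration only, using the standard library: the three-number vectors are made-up stand-ins for real CLIP embeddings, the in-memory list stands in for ChromaDB, and names such as `Photo` and `search` are hypothetical. It shows the general idea of filtering on concrete details (here, a year) and then ranking by embedding similarity.

```python
import math
from dataclasses import dataclass


@dataclass
class Photo:
    name: str
    embedding: list  # toy stand-in for a CLIP image embedding
    year: int        # concrete metadata, e.g. taken from the photo's date


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(photos, query_embedding, year=None, top_k=1):
    """Filter by metadata first, then rank the rest by similarity."""
    candidates = [p for p in photos if year is None or p.year == year]
    ranked = sorted(
        candidates,
        key=lambda p: cosine_similarity(p.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up data: in a real system these vectors would come from CLIP.
photos = [
    Photo("eiffel_tower.jpg", [0.9, 0.1, 0.0], 2024),
    Photo("library_study.jpg", [0.1, 0.9, 0.1], 2024),
    Photo("paris_cafe_old.jpg", [0.8, 0.2, 0.1], 2019),
]
query = [1.0, 0.0, 0.0]  # pretend embedding of the prompt "Paris trip"

best = search(photos, query, year=2024)
```

Here `best[0].name` is `"eiffel_tower.jpg"`: the 2019 café photo is similar to the query but is excluded by the year filter.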

Then, the system uses agents to generate a storyboard in two steps. Perception analyzes the photo’s objects, colors, and lighting to infer emotion and context. Narration generates a vivid, human-like story with voices and sounds, letting the user hear their memories.

How we built it: Core System Design

Voice Transcriber Agent (Entry Point): This serves as the first contact in the agent network, responsible for converting user speech to text in real time using Fish Audio STT. It captures the raw audio stream, processes it for clean transcription, and forwards the resulting text to the Coordinator Agent, enabling a seamless voice input layer that triggers the rest of the intelligent pipeline.

Coordinator Agent (Task Orchestrator): This agent receives transcribed text or visual data requests and decides which agents to engage. It also routes communication efficiently between the perception, emotion, and narration agents.
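The routing idea can be sketched in plain Python, without the Fetch.ai uAgents framework. Everything here is an assumption made for illustration: the `Request` type, the `"transcript"` and `"visual"` request kinds, the route table, and the stub agents that only record that they ran.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    kind: str      # "transcript" or "visual" (hypothetical labels)
    payload: str
    trace: list = field(default_factory=list)  # which agents handled it


def perception_agent(request):
    request.trace.append("perception")
    return request


def emotion_agent(request):
    request.trace.append("emotion")
    return request


def narration_agent(request):
    request.trace.append("narration")
    return request


AGENTS = {
    "perception": perception_agent,
    "emotion": emotion_agent,
    "narration": narration_agent,
}

# Hypothetical route table: which agents to engage for each request kind.
ROUTES = {
    "transcript": ["perception", "emotion", "narration"],
    "visual": ["perception", "emotion"],
}


def coordinate(request):
    """Look up the route for this request and pass it through each agent in order."""
    route = ROUTES.get(request.kind)
    if route is None:
        raise ValueError(f"no route for request kind: {request.kind}")
    for name in route:
        request = AGENTS[name](request)
    return request


result = coordinate(Request(kind="transcript", payload="Paris trip last year"))
```

Keeping the routes in a table rather than in branching code means a new agent can be added by registering it and listing it in a route.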

Perception Agent (Semantic Understanding): This agent analyzes photo and textual inputs with CLIP embeddings to extract high-level meaning. It interacts with the Chroma persistent DB, allowing retrieval of semantically similar media or concepts from stored embeddings.

Emotion Agent (Context and Tone): This agent determines the emotional tone or intent behind the input. It tailors the response style (empathetic, neutral, expressive) by working with Letta AI for sentiment and context mapping.

Narration Agent (Response Composer): This agent synthesizes final natural-language responses based on the insights from the perception and emotion layers. It also uses Letta AI to generate coherent, story-like replies or narrations suited to the detected emotional tone.

AI Reasoning Layer (Letta AI Cloud): Each agent consults Letta AI’s cloud reasoning capabilities for deeper linguistic understanding and decision logic. This offloads complex NLP tasks, allowing lightweight agent processes to operate in parallel locally.

Knowledge and Embedding Layer: Multimodal data is stored in the Chroma database as CLIP embeddings, allowing scalable vector-based retrieval across image and text domains. This enables context-driven photo recall and cross-modal linking between user queries and stored content.

Voice Interface and Frontend: Once the Narration Agent produces text, the Voice Transcriber Agent or a dedicated voice output module uses Fish Audio TTS to vocalize it. The frontend, styled with Tailwind CSS, serves as the clean, responsive user interface displaying both visual retrievals and voice interactions.

Challenges we ran into

This was our first time building a full-stack application under such a tight time frame, which made it both exciting and intense. One of the biggest challenges was designing agent-to-agent communication: getting multiple Fetch.ai and Letta agents to coordinate smoothly, share state, and manage asynchronous messaging across several APIs. Debugging timing and data flow between them took a lot of iteration and careful orchestration.
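To give a feel for the kind of asynchronous hand-off involved, here is a minimal sketch using only Python's standard `asyncio` queues. It is not the Fetch.ai or Letta messaging API; the three coroutines, the `None` end-of-stream sentinel, and the message shape are assumptions chosen for the example.

```python
import asyncio


async def transcriber(outbox):
    """Pretend to transcribe two utterances and pass them on."""
    for utterance in ["Paris trip last year", "studying with friends at the library"]:
        await outbox.put(utterance)
    await outbox.put(None)  # sentinel: no more input


async def coordinator(inbox, outbox):
    """Wrap each transcript in a message and forward it."""
    while True:
        text = await inbox.get()
        if text is None:
            await outbox.put(None)  # propagate shutdown downstream
            break
        await outbox.put({"query": text, "route": "perception"})


async def perception(inbox, handled):
    """Consume messages until the sentinel arrives."""
    while True:
        message = await inbox.get()
        if message is None:
            break
        handled.append(message["query"])


async def run_pipeline():
    to_coordinator = asyncio.Queue()
    to_perception = asyncio.Queue()
    handled = []
    # All three agents run concurrently and communicate only through queues.
    await asyncio.gather(
        transcriber(to_coordinator),
        coordinator(to_coordinator, to_perception),
        perception(to_perception, handled),
    )
    return handled


handled = asyncio.run(run_pipeline())
```

Because each queue is first-in, first-out, `handled` ends up holding the two queries in the order they were spoken. Passing an explicit sentinel is one simple way to avoid agents waiting forever on an empty queue.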

Another challenge, and one of the most fun to experiment with, was emotion detection from static images. We wanted the narration to sound human and emotionally expressive, not robotic. Finding the right balance between visual cues, color psychology, and tone modeling pushed us to think creatively about how AI interprets feeling rather than just plain content.
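As a toy example of the color-psychology side, the sketch below maps a photo's average color to a rough mood label using the standard library's `colorsys` module. The thresholds, the warm/cool hue ranges, and the function names are all assumptions made for illustration, and a real system would combine many more cues than average color.

```python
import colorsys


def average_color(pixels):
    """Mean (r, g, b) of a non-empty list of 0-255 RGB tuples."""
    n = len(pixels)
    r = sum(p[0] for p in pixels) / n
    g = sum(p[1] for p in pixels) / n
    b = sum(p[2] for p in pixels) / n
    return r, g, b


def mood_from_pixels(pixels):
    """Label the image as bright/dim and warm/cool/neutral (heuristic)."""
    r, g, b = average_color(pixels)
    hue, saturation, value = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hue_degrees = hue * 360

    if saturation < 0.15:
        temperature = "neutral"   # nearly gray: hue is not meaningful
    elif hue_degrees < 90 or hue_degrees >= 300:
        temperature = "warm"      # reds, oranges, yellows, magentas
    else:
        temperature = "cool"      # greens, cyans, blues

    brightness = "bright" if value >= 0.5 else "dim"
    return f"{brightness}, {temperature}"


sunset = [(250, 140, 60), (230, 120, 50)]
night = [(10, 20, 60), (20, 30, 80)]

sunset_mood = mood_from_pixels(sunset)
night_mood = mood_from_pixels(night)
```

With these sample pixels, `sunset_mood` is `"bright, warm"` and `night_mood` is `"dim, cool"`. A label like this could then be one input among several when choosing the narration's tone.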

Accomplishments that we're proud of

Built a complete, voice-driven multimodal web app.
Tackled a real accessibility challenge: helping visually impaired users experience old photos through intelligent sound and narration.
Integrated multiple cutting-edge systems, including Fish Audio, Fetch.ai uAgents, Letta AI, and Chroma embeddings, into a seamless agent-based architecture.
Engineered modular agents for transcription, coordination, perception, emotion, and narration, each communicating autonomously.
Delivered a working end-to-end pipeline that converts speech → intelligent understanding → emotional narration → speech output.
Collaborated efficiently under time pressure, demonstrating strong teamwork, rapid prototyping, and creative problem-solving.
Created a polished, responsive web interface using Tailwind CSS, enabling accessible voice and visual interactions.

What we learned

Learned to interface with diverse AI tools and APIs, including Fish Audio for speech-to-text and text-to-speech, Letta AI for reasoning and language understanding, Fetch.ai uAgents for distributed multi-agent orchestration, and CLIP + Chroma for multimodal embeddings and semantic retrieval.
Discovered the power of modular, agent-based design, where each component operates autonomously yet collaborates intelligently.
Gained experience building full-stack AI apps, from backend orchestration to frontend UX.
Learned to debug, deploy, and integrate cloud and local AI services under tight time constraints.
Engaged with industry mentors and workshops, learning directly from professionals about real-world AI deployment.
Experienced hackathon culture first-hand.
Felt empowered and inspired to keep building ambitious, meaningful AI-driven solutions in the future.

What's next for Synesthesia

We would love to add more agents for more advanced search using ChromaDB and Letta, as well as multimodal generation capabilities such as custom voice generation with Fish Audio, all coordinated through Fetch.ai’s agent system.
