Inspiration
We started with a simple thought: "What if AI could be more than smart? What if it could be a companion?" That question stayed with us when we thought about people who experience the world differently, especially those who are visually impaired. For them, something as ordinary as crossing the road or walking into a room can feel uncertain. We wanted to build something that doesn't just describe the world, but does so with warmth: a hand to hold, a voice to guide.
What it does
That’s how Companion AI was born. Using computer vision and text-to-speech, it looks at the world through a camera and narrates what it sees in real time: “A man is crossing the road holding a bag.” It turns vision into voice, so no one has to miss the story happening around them.
How we built it
We began by connecting three worlds: vision, language, and speech. First, we broke the video feed into individual frames so the AI could keep pace with the scene. Then, we used a Vision-Language Model to "see" each frame and generate a caption for it. Finally, we gave those captions a voice through text-to-speech. Piece by piece, we stitched these elements together until the AI could look at a scene and narrate it instantly, almost like giving sight a voice.
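To make that pipeline concrete, here is a minimal sketch of one way the loop could be wired up. The specific libraries are our illustrative assumptions, not necessarily the stack behind Companion AI: OpenCV for frame capture, the BLIP captioning model from Hugging Face Transformers as the Vision-Language Model, and pyttsx3 for offline text-to-speech.

```python
import cv2                       # pip install opencv-python
import pyttsx3                   # pip install pyttsx3
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative model choice; any image-captioning VLM would fit here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
tts = pyttsx3.init()

cap = cv2.VideoCapture(0)        # default camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV frames are BGR; the captioning model expects RGB.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    inputs = processor(images=rgb, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)
    tts.say(caption)             # queue the caption as speech...
    tts.runAndWait()             # ...and block until it has been spoken

cap.release()
```

Run as-is, a loop like this captions every frame it can, which is far slower than the camera's frame rate; that tension is exactly the first challenge below.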
Challenges we ran into
The hardest part was speed. Narration only feels useful if it happens in real time, so we had to balance accuracy against latency. Another challenge was simplicity: an AI can describe a lot, but too much information overwhelms the listener. We had to teach the system to speak clearly, not endlessly. Integrating different tools smoothly also tested our patience, but step by step, the pieces began to flow together.
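One simple way to keep the narration both timely and uncluttered, sketched here as our own illustration rather than the team's exact fix, is to rate-limit speech and skip captions that barely differ from the last one spoken. The two-second interval and word-overlap threshold below are hypothetical tuning choices.

```python
import time

def caption_overlap(a: str, b: str) -> float:
    """Jaccard word overlap between two captions (0.0 = disjoint, 1.0 = identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

class Narrator:
    """Rate-limits speech and suppresses near-duplicate captions."""

    def __init__(self, tts, min_interval: float = 2.0, dup_threshold: float = 0.6):
        self.tts = tts
        self.min_interval = min_interval    # seconds between narrations
        self.dup_threshold = dup_threshold  # overlap above this counts as "same scene"
        self.last_caption = ""
        self.last_time = 0.0

    def maybe_speak(self, caption: str) -> None:
        now = time.monotonic()
        if now - self.last_time < self.min_interval:
            return                          # too soon: stay quiet
        if caption_overlap(caption, self.last_caption) >= self.dup_threshold:
            return                          # scene barely changed: skip it
        self.tts.say(caption)
        self.tts.runAndWait()
        self.last_caption, self.last_time = caption, now
```

Dropping `tts.say(caption)` from the capture loop and calling `narrator.maybe_speak(caption)` instead keeps the voice calm even when the model produces a caption for every frame.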
Accomplishments that we're proud of
Even though Companion AI is still in its early stage, we’re proud of the progress we’ve made in bringing the idea to life. We managed to stitch together the core pipeline — from capturing frames, to generating captions, to hearing the first text-to-speech output. It might not be perfect yet, but hearing the AI describe even a simple scene felt like proof that this could really work. Most of all, we’re proud of the vision itself: building something that puts empathy at the heart of technology.
What we learned
We learned how to bridge computer vision, natural language, and speech into one seamless experience. But more importantly, we learned that accessibility is about more than technology — it’s about empathy. Designing for people who truly need this reminded us why we build: not for the code itself, but for the lives it can touch.
What's next for Companion AI
This is only the beginning. We imagine Companion AI becoming multilingual, so it can narrate in the user’s own language. We want it to be context-aware — not just “a person,” but “your friend is waving at you.” And one day, we see it built into glasses or earphones, offering hands-free support everywhere. Our dream is simple: to make the world feel a little more inclusive, one voice at a time.