Inspiration
The world is overwhelmingly visual. For the millions of people who are visually impaired, navigating daily life presents constant challenges that many of us take for granted. We were inspired by the recent explosion in multi-modal AI and wanted to channel it into something more than a novelty. We didn't just want to build a tool; we wanted to build a companion: a conversational assistant that acts as a friendly, helpful pair of eyes, letting users simply ask about the world and get a natural, human-like response that fosters independence and confidence.
What it does
Aura is a conversational AI vision assistant. In its current demo form, a user can access a simple web portal to upload an image of their surroundings—be it a room, a document, or a product on a shelf. They can then ask a question in plain English, such as "What's on the desk?" or "Can you read me the third line of this letter?"
Our powerful backend orchestrates multiple AI services to understand the image in the context of the question and speaks back a clear, helpful description in a natural voice. Because our system is session-aware, users can ask follow-up questions about the same image, creating a true, helpful dialogue, not just a one-off analysis.
How we built it
Aura is built on a sophisticated, multi-modal backend and a clean, accessible front-end.
The Backend (The Brain): We engineered a robust server using Python and FastAPI. It manages a session-based architecture to handle conversational context. We used the Vapi Server SDK to orchestrate the voice conversation, the Google Gemini 1.5 Flash API for its state-of-the-art image analysis, and Google Cloud's Text-to-Speech API to generate a natural, high-quality voice for Aura. All API keys and credentials were securely managed using environment variables.
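To make that flow concrete, here is a minimal sketch of how an endpoint like ours can be wired together with FastAPI and the google-generativeai client. The route name, form fields, and in-memory session store are illustrative stand-ins, not our exact production code.

```python
# Minimal sketch of an Aura-style analysis endpoint (names are illustrative).
import os

import google.generativeai as genai
from fastapi import FastAPI, File, Form, UploadFile

# Credentials live in environment variables, never in source control.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

app = FastAPI()

# In-memory session store: session_id -> {"image": bytes, "history": [turns]}.
sessions: dict[str, dict] = {}


@app.post("/analyze")
async def analyze(
    session_id: str = Form(...),
    question: str = Form(...),
    image: UploadFile | None = File(None),
):
    session = sessions.setdefault(session_id, {"image": None, "history": []})

    # A new upload replaces the session's image; follow-up questions reuse the cached one.
    if image is not None:
        session["image"] = await image.read()
    if session["image"] is None:
        return {"answer": "Please upload an image first."}

    # One multi-modal request: the image bytes plus the user's question.
    parts = [{"mime_type": "image/jpeg", "data": session["image"]}, question]
    response = model.generate_content(parts)

    session["history"].append({"question": question, "answer": response.text})
    return {"answer": response.text}
```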
The Front-End (The Body): We initially targeted the Meta Quest 3S for a fully immersive, hands-free experience, building a native app in Unity. However, to ensure a stable and polished demo for the hackathon, we executed a strategic pivot to a universally accessible web application built with standard HTML, CSS, and JavaScript.
Infrastructure: We used git and GitHub for version control throughout the project and ngrok to expose our local development server for live, real-time testing between the front-end and back-end.
Challenges we ran into
Our biggest challenge was a classic hackathon story of ambition versus the clock.
Our primary technical hurdle was on the front-end. We were incredibly excited about the potential of the Meta Quest 3S, but as beginners to the platform, we ran into blocking issues with the Unity development environment and a slow build-deploy-test cycle. Making the tough but decisive call to pivot our entire front-end to a web application with only hours to spare was a major challenge that required rapid re-planning and intense, focused execution.
We also wrestled with dependency management on the backend, specifically finding the correct, compatible version of the Vapi Python SDK for our modern Python 3.12 environment, teaching us a valuable lesson in debugging package ecosystems.
Accomplishments that we're proud of
First and foremost, we are incredibly proud of executing a successful, high-pressure pivot. Instead of giving up when our primary hardware platform failed, we adapted, re-planned, and built a fully functional web demo in just a few hours.
Technically, our biggest accomplishment is the sophisticated, session-aware backend architecture. It's not just a simple script; it's an event-driven system that can handle conversational context, making the user experience feel genuinely interactive.
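As a rough illustration of what "session-aware" means in practice, the backend can fold earlier turns back into each new request, so a follow-up like "read me the third line" still refers to the same uploaded letter. The helper below is a sketch using the same illustrative session structure as above, not our exact event-handling code (the real voice flow is orchestrated through the Vapi Server SDK).

```python
def build_parts(session: dict, question: str) -> list:
    """Combine the cached image, prior Q&A turns, and the new question into one prompt."""
    parts = [{"mime_type": "image/jpeg", "data": session["image"]}]
    for turn in session["history"]:
        parts.append(f"User previously asked: {turn['question']}")
        parts.append(f"Aura answered: {turn['answer']}")
    parts.append(question)
    return parts

# Each follow-up then becomes: model.generate_content(build_parts(session, question))
```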
Finally, we're proud of building a complete, end-to-end multi-modal AI pipeline—successfully orchestrating image uploads, voice processing, advanced AI vision analysis, and text-to-speech generation into a single, seamless user experience.
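The final hop of that pipeline, turning the model's text answer into Aura's voice, is sketched below using the google-cloud-texttospeech client; the specific voice name is an assumed example rather than the exact configuration we shipped.

```python
# Sketch of the text-to-speech step; the voice name is an assumed example.
from google.cloud import texttospeech

tts_client = texttospeech.TextToSpeechClient()


def speak(answer_text: str) -> bytes:
    """Turn a text answer into MP3 audio for playback in the web client."""
    response = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=answer_text),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US",
            name="en-US-Neural2-F",  # assumed voice; any natural-sounding option works
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return response.audio_content  # raw MP3 bytes sent back to the front-end
```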
What we learned
Technically, we learned an immense amount about the practicalities of building a voice-first AI system. We gained hands-on experience with FastAPI, the Vapi SDK, the Gemini API, and the complexities of managing asynchronous operations and state in a real-time application.
The most important lesson, however, was in agile decision-making. We learned that in a time-constrained environment, knowing when to persevere and when to make a strategic pivot is the most critical skill. This hackathon drilled home the importance of clear communication and teamwork, especially when facing unexpected roadblocks.
What's next for Aura: AI-Powered Eyes for the Blind
The web application is a fantastic proof-of-concept, but it's just the beginning. Our vision for Aura is clear:
Complete the Native App: Our immediate next step is to solve the development hurdles on the Meta Quest 3S. We want to deliver the truly immersive, hands-free, "always-on" assistant that we first envisioned.
Enhance Conversational Memory: We will evolve the session management to include long-term memory, allowing Aura to remember a user's common environments (like their home or office) and personal preferences.
Real-Time Video Analysis: The ultimate goal is to move from static images to analyzing a real-time video stream. This would transform Aura from a reactive tool into a proactive co-pilot, capable of warning users about upcoming obstacles or describing events as they unfold.