Architecture Diagram

What motivated you to start this project during the competition?

I believe that the true value of artificial intelligence (AI) is not to replace people, but to expand what they can do, how they feel, and how they live. Emerging technologies therefore have the power to foster inclusion, autonomy, and empowerment, and can offer their users a transformative experience. Based on this premise, I created Vibe Check specifically to participate in the Gemini Live Agent Challenge Hackathon on the Devpost platform.

Seeking inspiration from Google applications, I discovered Lookout, an incredibly accurate and functional accessibility tool. I watched some demonstrations on YouTube and admired how computer vision and generative AI are being used to help blind or visually impaired people perform tasks more quickly and easily.

Researching the topic, I found a study published in the medical journal The Lancet Global Health in 2017 that estimated there were approximately 36 million blind people in the world at that time [1]. At that moment, I began to imagine how I would develop a solution capable of redefining the user's interaction with the world around them through an immersive experience, exploring the power of Google's extraordinary multimodal AI in the Gemini Live Agent Challenge in the Live Agents category using the Gemini Live API.

So, how can the application of a rich environmental narrative (Environmental Storytelling - via prompt engineering), combined with the Clock Face Method, generate a sensory immersion capable of evoking real feelings?

Thus, Vibe Check was born — a multimodal AI agent capable of seeing, hearing, and speaking to help blind or visually impaired people as an Orientation and Mobility assistant, using empathetic storytelling to evoke emotions.

[1] Bourne, RRA; Flaxman, SR; Braithwaite, T; et al. Magnitude, temporal trends, and projections of the global prevalence of blindness and distance and near vision impairment: a systematic review and meta-analysis. Lancet Glob Health. 2017; 5:e888-e897. Available at: https://www.thelancet.com/pdfs/journals/langlo/PIIS2214-109X(17)30293-0.pdf. Accessed on: March 13, 2026.

What does Vibe Check do?

Using the application is easy: click “CLICK TO START”, accept the permissions (microphone, camera, and geolocation access) — done!

Vibe Check uses the Clock Face Method, a classic Orientation and Mobility (O&M) technique used by blind or visually impaired people, which maps the environment using the face of an analog clock as a spatial reference to ensure safe physical navigation. Its value proposition lies in applying the “Vibe Experience” concept: a rich environmental narrative that generates sensory immersion and seeks to evoke real feelings in the user. That is, instead of simply labeling objects ("Door at 12 o'clock. Table at 12 o'clock. Sofa. Table at 12 o'clock. Picture frame at 12 o'clock. Door at 12 o'clock."), imagine something like:

"Your path is not clear at noon. About a meter away, a large golden retriever is lying on the rug (note: without the user needing to specifically point the camera at each object to capture the spatial environment, they have already been alerted to a rug sliding and a dog in front of them). Behind the dog at twelve o'clock, about a meter and a half away, there is a sofa. Two men are sitting on this sofa conversing in a cozy and casual atmosphere. The man on the left at eleven o'clock speaks and gestures. While the man on the right, at one o'clock, holds a mug and listens. There is a side table at nine o'clock and another at three o'clock, both with lamps and plants. The space is well-lit and inviting. Can you confirm that you are not moving now?"

There's a demonstration of this working waiting for you here: https://youtu.be/rgojPKV4Le4
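
To make the Clock Face Method described above concrete, here is a minimal sketch of how a horizontal bearing could be mapped to a clock hour. It is an illustration only, not Vibe Check's implementation (in the project this reasoning is guided by prompt engineering); the function names `clock_position` and `describe`, and the convention that 0 degrees means straight ahead, are assumptions made for this example.

```python
def clock_position(bearing_deg: float) -> int:
    """Map a horizontal bearing to a clock-face hour (1-12).

    Assumed convention: 0 degrees is straight ahead (12 o'clock) and
    positive angles turn clockwise, so 90 is 3 o'clock and -90 is 9 o'clock.
    """
    # Each hour covers a 30-degree sector centred on its hour mark.
    hour = round((bearing_deg % 360) / 30) % 12
    return 12 if hour == 0 else hour


def describe(label: str, bearing_deg: float, distance_m: float) -> str:
    """Build the bare positional label that the narrative then enriches."""
    hour = clock_position(bearing_deg)
    return f"{label} at {hour} o'clock, about {distance_m:g} m away"


print(describe("Sofa", 0, 1.5))
print(describe("Side table", -90, 2))
```

A label such as the one produced by `describe` is only the starting point: the "Vibe Experience" narrative adds atmosphere, people, and safety cues around it.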

How was Vibe Check built?

Vibe Check was built using Google's multimodal AI model, gemini-live-2.5-flash-native-audio. With it, you don't need, and shouldn't use, separate models for TTS (gemini-2.5-pro-tts) or web search (gemini-2.5-pro), as its main features are: Native/proactive audio; Audio transcription; Voice activity detection; Affective computing; and Tool usage (web search). This made it the perfect orchestrator for what was defined as "Google Gemini Intelligence" in the application's data flow.
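
As a rough sketch of how the features listed above might be switched on for a single Live session, the following builds a session configuration as a plain dictionary. The field names reflect my understanding of the Google Gen AI SDK's Live configuration and should be checked against the current documentation; the helper name `build_live_config` is hypothetical, and this is not the project's actual configuration.

```python
MODEL_ID = "gemini-live-2.5-flash-native-audio"  # model named in the text


def build_live_config(system_instruction: str) -> dict:
    """Return a Live session config as a plain dict (field names assumed)."""
    return {
        "response_modalities": ["AUDIO"],          # native audio output
        "system_instruction": system_instruction,  # the O&M prompt
        "input_audio_transcription": {},           # transcribe user speech
        "output_audio_transcription": {},          # transcribe agent speech
        "enable_affective_dialog": True,           # affective computing
        "proactivity": {"proactive_audio": True},  # proactive audio
        "tools": [{"google_search": {}}],          # web search tool
        # Voice activity detection is left at the API defaults here.
    }


config = build_live_config("You are an Orientation and Mobility assistant.")
print(sorted(config))
```

A dictionary like this would typically be passed as the `config` argument when opening a Live session with the SDK, together with `MODEL_ID`.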

The goal is to ensure continuous, real-time interactive narration at a conversational level while the agent searches the web in verified sources, looks up nearby places via geolocation, or reads text (including upside-down handwritten text, demonstrated in real time in the Vibe Check presentation at https://youtu.be/rgojPKV4Le4).

The frontend used HTML5, CSS3, and JavaScript to create the responsive user interface. Web Audio API was used for audio processing and playback with a custom jitter buffer, along with WebRTC/MediaDevices API for camera and microphone capture with direct access to the user's hardware. Additionally, the Canvas API was used for video frame extraction and compression for streaming, and the Geolocation API to access real-time coordinates from the user's location for precise geographic contextualization of places actually near their coordinates.

The backend was developed in Python with FastAPI to ensure high-performance asynchronous web interaction. It implements WebSockets to manage persistent connections with the Gemini Live API for real-time bidirectional communication over the WSS protocol, and uses the Google Gen AI SDK to access Gemini models. Uvicorn serves as the production ASGI server, and Jinja renders the HTML template.
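
The bidirectional flow described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the project's code: the names `pump` and `relay` are hypothetical, and the browser and model sides are represented by abstract async iterators and async send functions rather than real FastAPI WebSocket or Live API objects.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

Send = Callable[[bytes], Awaitable[None]]


async def pump(source: AsyncIterator[bytes], sink: Send) -> None:
    """Forward every chunk from source to sink until source ends."""
    async for chunk in source:
        await sink(chunk)


async def relay(
    browser_in: AsyncIterator[bytes],
    model_send: Send,
    model_in: AsyncIterator[bytes],
    browser_send: Send,
) -> None:
    """Run both directions concurrently; stop when either side closes."""
    tasks = [
        asyncio.create_task(pump(browser_in, model_send)),  # mic/camera to model
        asyncio.create_task(pump(model_in, browser_send)),  # model audio to browser
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the other direction is no longer useful
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raise any error from the finished direction
```

Running the two directions as separate tasks is what lets the agent keep speaking while new audio and video frames continue to arrive from the user.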

I completed the application deployment on Cloud Run, using a serverless Docker container architecture with auto-scaling configured according to these specifications:

gcloud run deploy app-name \
--source . \
--region us-central1 \
--revision-suffix="app-name" \
--allow-unauthenticated \
--memory=8Gi \
--cpu=4 \
--concurrency=20 \
--min-instances=2 \
--max-instances=20 \
--no-cpu-throttling \
--session-affinity \
--timeout=3600 \
--execution-environment=gen2

Note: I chose to keep a minimum number of instances that never fully "sleep", which avoids cold starts for the AI models but generates a fixed cost. If your goal is to save as much as possible, set the "--min-instances" flag to zero (--min-instances=0).

What challenges were faced during the journey?

The biggest challenge was dealing with stuttering in communication: micro-stutters dropped important parts of the message and broke the fluidity of the "Vibe" experience, which was adopted as the project's value proposition. A custom 180 ms jitter buffer solved the problems caused by variable network latency.
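
The idea behind a jitter buffer can be sketched as follows. The project's buffer runs in the browser on top of the Web Audio API; this is only a language-neutral illustration of the logic, written in Python, and not the author's implementation. The class and method names are hypothetical, and only the 180 ms target comes from the text.

```python
from collections import deque
from typing import Optional


class JitterBuffer:
    """Hold audio chunks until a target amount is queued, then play out."""

    def __init__(self, target_ms: float = 180.0) -> None:
        self.target_ms = target_ms
        self._chunks = deque()    # (chunk, duration_ms) pairs in arrival order
        self._buffered_ms = 0.0   # total duration currently queued
        self._playing = False     # False while building the cushion

    def push(self, chunk: bytes, duration_ms: float) -> None:
        """Queue a chunk that arrived from the network."""
        self._chunks.append((chunk, duration_ms))
        self._buffered_ms += duration_ms
        if self._buffered_ms >= self.target_ms:
            self._playing = True  # enough cushion to start or resume

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None while (re)buffering."""
        if not self._playing:
            return None
        if not self._chunks:
            self._playing = False  # underrun: rebuild the cushion first
            self._buffered_ms = 0.0
            return None
        chunk, duration_ms = self._chunks.popleft()
        self._buffered_ms -= duration_ms
        return chunk
```

The trade-off is a small, fixed delay before playback in exchange for absorbing late-arriving chunks, which is how a buffer of this kind smooths out variable network latency.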

Furthermore, since I chose to use prompt engineering to guide the model, its instructions became the heart of the application, and it took a long time to achieve a satisfactory result. A merely engaging description was not enough: as an absolute rule, the agent needed to use the Clock Face Method and position itself as an Orientation and Mobility (O&M) assistant for blind or visually impaired people.

Implementing visual functionalities to prove the tool's use in real time was not planned in the final version and took a lot of time during the process, but the comfort I felt evaluating the agent through the improvised panel was so great that I thought it would be interesting to share this vision with the community.

Vibe Check was conceived from the beginning as a feature to enhance accessibility in applications, and its first version was just a clickable button for the user to launch the application. However, to better showcase its potential, it evolved into a project that could become a startup.

What achievements are you proud of?

I am amazed when speakers use self-description as an accessibility practice so that visually impaired people understand who is speaking, creating a more inclusive connection at events such as hackathons, academic lectures, and innovation hubs.

Honestly, I had already given up on the project about fifty times because I couldn't achieve coherent adaptation in the agent's narrative, so naturally, I was overcome with emotion at the final result achieved. After deploying the application, I tested the context cases suggested in the technical documentation (available at: https://github.com/FredSRocha/Gemini-Live-Agent) and they performed much better than on my machine. I took a deep breath and said to myself: Wow!

Vibe Check is also a manifestation of how artificial intelligence can work hand in hand with the UN's 2030 Agenda for Sustainable Development, mainly through the following SDGs: Goal 10, Reduce inequality within and among countries (target 10.2: By 2030, empower and promote the social, economic and political inclusion of all, irrespective of age, gender, disability, race, ethnicity, origin, religion, economic or other status) [2]; Goal 11, Make cities and human settlements inclusive, safe, resilient and sustainable (target 11.2: By 2030, provide access to safe, accessible, sustainable and affordable transport systems for all, improving road safety through the expansion of public transport, with special attention to the needs of those in vulnerable situations, women, children, persons with disabilities and older persons) [3]; and, more broadly, Goal 3, Ensure healthy lives and promote well-being for all at all ages [4]. The link to Goal 3 rests on the hypothesis that traditional navigation can generate stress and that the “Vibe Experience” concept could mitigate anxiety and fear of the unknown through its welcoming narrative with safe guidance.

[2] UNITED NATIONS (UN). Sustainable Development Goal 10: Reduced inequalities. United Nations Brazil. Available at: https://brasil.un.org/pt-br/sdgs/10. Accessed on: March 13, 2026.
[3] UNITED NATIONS (UN). Sustainable Development Goal 11: Make cities and human settlements inclusive, safe, resilient and sustainable. United Nations Brazil. Available at: https://brasil.un.org/pt-br/sdgs/11. Accessed on: March 13, 2026.
[4] UNITED NATIONS (UN). Sustainable Development Goal 3: Ensure healthy lives and promote well-being for all at all ages. United Nations Brazil. Available at: https://brasil.un.org/pt-br/sdgs/3. Accessed on: March 13, 2026.

What did you learn from the challenges?

I learned not to give up, and also that using a real-time multimodal agent is radically different from text-to-text interaction, mainly because of its comfortable and fluid usability, which offers a delightful experience.

The prompt engineering for these agents must be crafted more carefully: long prompts led to misinterpretations and generic responses, and finding the cause required a laborious rereading of all the instructions. In fact, this was my biggest fear at the beginning of the project: misinterpretations and biases.

Furthermore, experimenting with the Google Cloud ecosystem was exceptional: the application's real-time streaming flow performed much better there than in my local environment, and I simply loved it!

What's next for Vibe Check?

Vibe Check was conceived as a proof of concept to demonstrate a particular vision for applying AI as a form of inclusive empowerment, but it ended up achieving a functional prototype with features that enabled real user interaction with the world around them, demonstrating how the final product can perform well thanks to the excellent execution environment in Cloud Run. I fully believe in the potential of this proposal as a future innovative feature in accessibility applications like Google Lookout.

Regarding new features, I would adopt a cardinal point system for well-lit outdoor environments and an environment memorization mode that avoids repeated requests in frequently visited places the user has identified as physically unchanged. Furthermore, I would work on language-specific fine-tuning so that integration with accessibility devices could be almost independent of user input: for example, smart canes that do not operate by voice command but use the model's vision to emit vibrations for specific alerts and converse with the user without receiving instructions from them. Hypothetically, this could convey a feeling of unconditional support, innovating as an applied autonomous functionality.

On the other hand, it is essential that this proposal be made entirely available to the community for various tests in diverse scenarios before any illusory expectations arise about a truly useful and innovative solution. The community of blind or visually impaired people should be the ones to answer this question.

THANK YOU!

GeminiLiveAgentChallenge

Comment on how the application of a rich environmental narrative (Environmental Storytelling), combined with the Clock Face Method technique, can generate a sensory immersion capable of arousing real feelings in the user? (This question was the origin of everything for me throughout the work developed in the Gemini Live Agent Challenge Hackathon through the Devpost platform).

Built With

cloud-run
css
fastapi
google
html
javascript
python

Updates

Fred Rocha started this project — Mar 16, 2026 07:23 PM EDT
