Project Story

🌟 Inspiration

In a world saturated with passive chatbots, I wanted to build a true Agent. After more than 20 years in IT, and living in Al Ruwais, Qatar, I realized that the gap between digital tools and physical reality is still wide. That inspired Aura: a proactive, vision-enabled Agentic AI that doesn't just wait for instructions but actively observes, learns, and initiates action. Whether it's a child struggling with a science book or a professional needing a cognitive partner, Aura is designed to be the bridge.

🎯 What It Does

Aura is a Modular Agentic AI that can switch its entire cognitive framework based on the user's needs.

  • Current Implementation (Teaching Mode): Aura acts as a proactive Science Tutor. By "seeing" through the camera, it recognizes diagrams (like the Earth's orbit or biological structures) and initiates educational dialogues without being prompted.

  • Multimodal Interaction: It handles real-time vision and high-fidelity audio, allowing for natural "Barge-in" conversations where students can interrupt and interact as they would with a human.

  • Cognitive Mode Switching: As seen in our advanced UI, Aura is built to transition between specialized modes: Teaching, Field, Dev/Tech, and Biz/Ops. Each mode transforms Aura into a specialist capable of handling different workflows.

🏗️ How We Built It

Aura's architecture is built for speed and modularity:

  • The Core: Powered by the Gemini 2.0 Flash Live API for sub-second multimodal responses.

  • The Infrastructure: A robust Laravel backend paired with a reactive Vue.js + Pinia frontend.

  • Agentic Architecture: We implemented a modular system where Aura can update its system instructions and grounding data on the fly. This lets it ingest verified external sources so that its advice, whether in "Teaching Mode" or "Dev Mode", stays grounded in documented material.

  • The Pipeline: We engineered a custom WebSocket pipe that streams compressed JPEGs for vision and handles raw binary PCM audio data.
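The mode switching described under "Agentic Architecture" can be pictured as a small registry keyed by mode. The following is a minimal sketch, not Aura's actual code: the names Mode, ModeConfig, MODES and configFor, the instruction strings, and the shape of the returned object are all hypothetical, and the object is not the Live API's wire format.

```typescript
// Hypothetical sketch: one config per cognitive mode. All names are illustrative.
type Mode = "teaching" | "field" | "dev" | "biz";

interface ModeConfig {
  systemInstruction: string;  // placeholder text, not Aura's real prompts
  groundingSources: string[]; // identifiers of verified documents (assumed)
}

const MODES: Record<Mode, ModeConfig> = {
  teaching: { systemInstruction: "Act as a proactive science tutor.", groundingSources: [] },
  field:    { systemInstruction: "Act as a field assistant.", groundingSources: [] },
  dev:      { systemInstruction: "Act as a dev/tech assistant.", groundingSources: [] },
  biz:      { systemInstruction: "Act as a biz/ops assistant.", groundingSources: [] },
};

// Build the config for a mode, adding extra grounding sources
// without mutating the shared registry.
function configFor(mode: Mode, extraSources: string[] = []): ModeConfig {
  const base = MODES[mode];
  return {
    systemInstruction: base.systemInstruction,
    groundingSources: [...base.groundingSources, ...extraSources],
  };
}
```

Switching modes then amounts to calling configFor with the new mode and sending the result to the model in whatever form the session setup expects.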

🚧 Challenges We Ran Into

The "Experimental" nature of the Live API presented significant engineering hurdles:

  • Binary Byte Alignment: Int16Array constructors throw a RangeError when the buffer's byte length is not a multiple of 2. We solved this by truncating each raw PCM chunk to an even byte length:

$$L_{safe} = L_{total} - (L_{total} \bmod 2)$$
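A minimal sketch of that truncation, assuming each chunk arrives as a Uint8Array of little-endian 16-bit PCM (the function name alignedInt16View is hypothetical):

```typescript
// Hypothetical helper: drop a trailing odd byte, then view the rest as Int16 samples.
function alignedInt16View(chunk: Uint8Array): Int16Array {
  // L_safe = L_total - (L_total mod 2)
  const safeLength = chunk.byteLength - (chunk.byteLength % 2);
  // slice() copies into a fresh buffer starting at offset 0, which also
  // avoids a RangeError when the incoming view has an odd byteOffset.
  const copy = chunk.slice(0, safeLength);
  // Int16Array reads in platform byte order (little-endian on common hardware).
  return new Int16Array(copy.buffer, 0, safeLength / 2);
}
```

The dropped byte is lost in this sketch; a streaming implementation might instead carry it over and prepend it to the next chunk.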

  • Signal Normalization: To play the audio in the browser's AudioContext, we had to normalize the signed 16-bit integers to floating-point samples in the range [-1, 1):

$$f(x) = \frac{x}{2^{15}} = \frac{x}{32768.0}$$
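As an illustration, the normalization can be written as a short loop. This is a sketch under the assumption that samples are already available as an Int16Array; the name pcm16ToFloat32 is hypothetical.

```typescript
// Hypothetical helper: map signed 16-bit PCM samples to floats in [-1, 1).
function pcm16ToFloat32(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] / 32768; // f(x) = x / 2^15
  }
  return out;
}
```

The resulting Float32Array can then be written into an AudioBuffer, for example with copyToChannel, before playback.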

  • The Base64 Trap: We had to bypass standard text-decoding layers and build a pure Binary Pipe to avoid InvalidCharacterError and ensure low-latency audio delivery.
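One way to keep audio out of text-decoding paths is to branch on the frame type as soon as a message arrives. The sketch below assumes a browser WebSocket with binaryType set to "arraybuffer", so binary messages arrive as ArrayBuffer and text messages as strings. The Frame type and classifyFrame function are hypothetical names, and treating text frames as control messages is an assumption about the protocol.

```typescript
// Hypothetical sketch: route frames by type so audio bytes never pass through atob().
type Frame =
  | { kind: "audio"; bytes: Uint8Array }
  | { kind: "control"; text: string };

function classifyFrame(data: ArrayBuffer | string): Frame {
  if (typeof data === "string") {
    // Text frame: left for the caller to parse (for example as JSON).
    return { kind: "control", text: data };
  }
  // Binary frame: wrap the bytes directly, with no Base64 or text decoding.
  return { kind: "audio", bytes: new Uint8Array(data) };
}
```

In a message handler this would be called as classifyFrame(event.data), with the audio branch feeding the alignment and normalization steps.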

🏆 Accomplishments That We're Proud Of

  • Successful Proactivity: Achieving a system that "senses" silence and visual context to lead a conversation.

  • Modular Multi-Agent UI: Designing a futuristic interface that supports switching between cognitive modes (Field, Dev, Biz).

  • Mastering the Live Stream: Building a stable, full-duplex communication channel that manages vision and audio simultaneously without saturation.

🧠 What We Learned

We learned that the future of AI isn't in larger models, but in Agentic Orchestration. Managing the "State" of an AI that sees and hears in real-time taught us about the importance of low-latency data handling and the power of "Implicit Prompting" (using visual cues as prompts).

🚀 What's Next for Aura

Aura is evolving from a Science Tutor into a Universal Cognitive Assistant.

  • Phase 2: Activating the Dev/Tech and Biz/Ops modes using documented external APIs and specialized knowledge bases.

  • Grounding on Verified Sources: Integrating RAG (Retrieval-Augmented Generation) so Aura can pull from technical manuals or professional documents in real-time.

  • Edge Integration: Moving Aura into wearable devices (like smart glasses) so it can provide "Agentic" help in the field, literally seeing what the professional sees.
