OmniAi: The Perception-First Personal Agent
Inspiration
Most AI tools share an invisible assumption: they wait. You send a message, they respond. Even the best voice assistants follow this rhythm — a command fires, a reply comes back. There is always a pause, and that pause communicates something fundamental: the AI is not truly present with you. It is a very sophisticated vending machine.
We wanted to break that pattern entirely.
The inspiration for OmniAi came from thinking about the moments in life when you genuinely need someone to be watching over you — not someone you have to summon, but someone already there. A grandmother squinting at the tiny text on a pill bottle late at night, not sure if she's holding the right medication. A person navigating an unfamiliar city alone after dark, wishing someone dependable had eyes on the situation. A freelancer staring at a contract clause in a language that almost sounds like English but doesn't quite make sense. Someone in a restaurant suspecting the bill doesn't add up, but feeling too awkward to pull out a calculator in front of the waiter.
None of these people want to "open a chat interface." They just need a presence. An intelligent companion that sees what they see, hears what they hear, and quietly says: "Hey, look at this."
That is OmniAi. Not an AI assistant. A Digital Life Companion.
What It Does
OmniAi is a real-time, multimodal AI agent that runs on iOS and Android. Unlike traditional voice assistants that process one request at a time, OmniAi operates as a continuously perceiving companion — streaming live camera vision and microphone audio directly into a Gemini 2.5 Flash Native Audio model, and speaking back in natural voice with zero transcription latency.
The agent operates across 12 distinct specialized modes, each representing a different "lens" through which it engages with the world:
Surroundings Narrator — For visually impaired users, OmniAi provides continuous, real-time audio descriptions of whatever the camera sees: objects, text, people, hazards. No prompt needed. Just open the mode and it speaks.
Pill Identifier — Point the camera at any medication. OmniAi reads the imprint, identifies the drug by color, shape, and markings, names it, and explains what it is, what it treats, and any critical warnings. It dynamically switches to high-resolution camera capture for this mode to ensure accurate reading of imprint codes.
Bill Auditor — Hold a receipt or restaurant check up to the camera. OmniAi reads every line item, verifies the arithmetic, checks whether service charges match what was disclosed, and calls out any discrepancies — calmly and immediately.
Scam Shield — A passive monitoring mode that listens to call audio and watches shared screen content for the hallmarks of fraud: manufactured urgency, coercive language, requests for gift cards or wire transfers, countdown timers, threatening language. When it detects manipulation, it alerts you.
Contract Reader — Feed it a legal document via screen share or camera. OmniAi identifies unusual clauses, explains what complex legal language actually means in plain terms, and flags provisions that reduce your rights or protections.
Form Coach — Watch yourself work out. OmniAi analyzes your exercise posture in real time through the camera, counts reps, identifies form issues ("your elbow is dropping on the press"), and coaches you through sets with live spoken feedback.
Price Hunter — Name an item you want to buy. OmniAi opens its built-in browser, navigates to multiple retailers autonomously, compares prices, reads reviews, and reports back — without you touching anything.
Booking Concierge — Give it a date, a destination, and a preference. It accesses booking platforms through the in-app browser, checks availability, reads options out loud, and can navigate confirmation flows autonomously.
Situational Guardian — A silent environmental monitor. The agent watches and listens without speaking, and only interrupts you — discreetly — when it detects something worth flagging: potential danger, a social situation requiring attention, or context-relevant information. It can request a "private mode" handshake that the user must grant before it speaks freely.
Proactive Storyteller — After any session, OmniAi reflects on what it observed and generates an illustrated narrative: a written report or story, accompanied by AI-generated images (via Vertex AI Imagen) and video (via Vertex AI Veo), all based on your real-world observations. Results are stored and viewable in the Topics hub.
Focus Anchor — Screen monitoring mode. OmniAi watches your shared screen and gently redirects you when it notices you drifting away from your declared task. It knows what you're working on because you told it. It's an accountability partner, not surveillance.
Caregiver Eye — Passive safety monitoring designed for elderly care or child safety contexts. The agent watches a room through the camera and listens for sounds that suggest a fall, an absence of movement for an unusual duration, or signs of distress — and can trigger a notification to a designated caregiver.
Cross-cutting capabilities across all modes:
- True Barge-in Interruption: Speak over the AI at any time. It stops instantly, clears its audio buffer, and listens. This works through on-device speaker diarization that distinguishes your voice from the AI's own playback.
- Personalized Persistent Memory: Tell OmniAi something once — a preference, an allergy, a name — and it remembers it across every future session. Memory is stored in a database and injected as context at the start of each new session.
- Autonomous Webview Agent: A full in-app browser where OmniAi acts as a co-pilot. It can open URLs, tap elements by normalized screen coordinates, scroll, type into form fields, and take screenshots of the current page to verify its actions — all through a structured tool system.
- Cross-Platform Screen Sharing: Full screen capture on both iOS and Android, allowing the agent to help with any app on your device in real time.
How We Built It
OmniAi is a full-stack real-time system built across three tightly integrated layers.
The Core Model: Gemini 2.5 Flash Native Audio
The entire project was made possible by Gemini 2.5 Flash's Native Audio capability, accessed via the Gemini Multimodal Live API. Instead of the traditional transcription pipeline (audio → text → LLM → text → speech), the model accepts raw PCM audio and JPEG image frames directly and generates raw audio output — all in one pass. This eliminates the compounding latency of separate speech-to-text and text-to-speech layers and preserves prosodic information that transcription destroys: tone, hesitation, emphasis, emotion.
The voice used across all modes is "Puck," configured via Gemini's PrebuiltVoiceConfig in the RunConfig. Response modality is set explicitly to [types.Modality.AUDIO] via RunConfig.model_construct() using Pydantic's bypass constructor — a necessary workaround to prevent ADK's type coercion from converting the enum to a string, which was causing 1007 WebSocket errors.
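In sketch form, the configuration looks roughly like this (field names follow current ADK and google.genai releases and may vary slightly by version):

```python
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

# model_construct() bypasses Pydantic validation, so the Modality enum
# survives intact instead of being coerced to a plain string — the
# coercion was producing 1007 WebSocket errors against the Live API.
run_config = RunConfig.model_construct(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=[types.Modality.AUDIO],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
        )
    ),
)
```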
The Orchestrator: Google Agent Development Kit (ADK)
All session lifecycle management runs through the Google ADK. The Runner and InMemorySessionService handle session creation, continuity, and reconnects. The LiveRequestQueue is the channel through which all input — audio blobs, image frames, and tool results — enters the model.
A critical architecture detail: each connected Socket.IO session spawns its own Python thread with a dedicated asyncio event loop. The run_loop function inside that thread continuously calls runner.run_live(...) in a while loop. When the LiveRequestQueue is closed (the mechanism for barge-in interruption), the run_live generator terminates, the loop sleeps for one second, creates a fresh LiveRequestQueue, and reconnects — automatically restoring the session for the next turn.
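A condensed sketch of that pattern, with session wiring and event routing elided (handle_event is a hypothetical stand-in for the real router, and the exact run_live keyword arguments vary across ADK versions):

```python
import asyncio
import threading

from google.adk.agents import LiveRequestQueue


def handle_event(event):
    # Hypothetical stand-in: the real router lives in omni_business.py.
    ...


def start_session_thread(runner, session, run_config, state):
    """Spawn a dedicated thread + event loop for one Socket.IO session."""

    async def run_loop():
        while state["connected"]:
            # A fresh queue per turn: closing it is the barge-in
            # mechanism, and a closed queue cannot be reused.
            state["queue"] = LiveRequestQueue()
            async for event in runner.run_live(
                session=session,
                live_request_queue=state["queue"],
                run_config=run_config,
            ):
                handle_event(event)
            # The queue was closed (interruption) or the stream ended;
            # back off briefly, then reconnect on the next iteration.
            await asyncio.sleep(1)

    loop = asyncio.new_event_loop()
    state["loop"] = loop
    threading.Thread(
        target=lambda: loop.run_until_complete(run_loop()), daemon=True
    ).start()
```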
The agent has a full tool roster dispatched by ADK: remember, open_webview, webview_tap, webview_scroll, webview_type, webview_screenshot, request_private_chat, upsert_topic_narrative, generate_image, generate_video, and several medical and research tools. Tool call events arrive as function_calls on ADK events, and are routed in omni_business.py to the correct Python handler.
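In outline, the dispatch is a name-to-handler table keyed off each event's function calls (the handler names below are illustrative stubs; the real implementations live in omni_business.py):

```python
# Hypothetical handler stubs; real handlers live in omni_business.py.
def handle_remember(**kwargs): ...
def handle_open_webview(**kwargs): ...
def handle_webview_tap(**kwargs): ...

TOOL_HANDLERS = {
    "remember": handle_remember,
    "open_webview": handle_open_webview,
    "webview_tap": handle_webview_tap,
    # ... the remaining tools register the same way
}

def dispatch_tools(event):
    # ADK events expose pending tool invocations as function_calls.
    for call in event.get_function_calls():
        handler = TOOL_HANDLERS.get(call.name)
        if handler is not None:
            handler(**(call.args or {}))
```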
Persistent Memory
When the remember tool is called, MemoryBusiness.save_user_memory() writes the content to MySQL via SQLAlchemy. At the start of every new session (not every reconnect — only once per logical user connection, tracked by a memories_sent flag that survives reconnects), all stored memories are injected into the LiveRequestQueue as a system-level user message. This means OmniAi truly knows you across sessions without needing any in-context repetition.
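A minimal sketch of the injection guard, assuming the memories have already been loaded from MySQL (maybe_inject_memories is an illustrative name):

```python
from google.genai import types

def maybe_inject_memories(state, queue, memories):
    """Inject stored memories once per logical connection, not per reconnect."""
    if state.get("memories_sent"):
        return
    if memories:
        text = "Known facts about this user:\n" + "\n".join(
            f"- {m}" for m in memories
        )
        queue.send_content(
            types.Content(role="user", parts=[types.Part(text=text)])
        )
    # Survives reconnects because `state` is scoped to the Socket.IO
    # session, not to the (repeatedly recreated) LiveRequestQueue.
    state["memories_sent"] = True
```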
The Autonomous Browser
WebviewBusiness maps tool calls from the agent to Socket.IO commands fired at the Flutter client. The Flutter WebviewController executes the actual DOM interactions using flutter_inappwebview. When the agent calls webview_screenshot, the client captures a JPEG of the current browser page and streams it back to the backend as a standard image/jpeg blob via the LiveRequestQueue, grounding the agent's next action in what it actually sees.
Coordinates for taps and scrolls use a normalized (0.0–1.0) coordinate system, computed relative to the visible webview frame, making the tool calls resolution-independent across different screen sizes.
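A sketch of the backend side of one such tool, assuming a webview_command Socket.IO event consumed by the Flutter client (the event name and payload shape are illustrative):

```python
from flask_socketio import SocketIO

def handle_webview_tap(socketio: SocketIO, session_id: str, x: float, y: float):
    """Relay an agent tap to the Flutter client in normalized coordinates.

    x and y are in [0.0, 1.0] relative to the visible webview frame; the
    client multiplies by its own frame size, so the same tool call works
    at any screen resolution.
    """
    assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
    socketio.emit(
        "webview_command",
        {"action": "tap", "x": x, "y": y},
        to=session_id,
    )
```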
The Mobile App: Flutter
The Flutter app runs the entire perception pipeline concurrently:
- Camera frames are captured by a native ImageStream callback, converted from YUV to JPEG in a background Dart Isolate using compute() to avoid blocking the UI thread, and the resulting bytes are sent via the socket.
- Microphone audio is captured as raw PCM at 16kHz mono via the record package with hardware echoCancel: true enabled at the capture layer.
- Frame and audio streaming are gated by mode — when _isTurnActive is true (the agent is speaking), camera frame transmission pauses to prevent the model from restarting on incoming visual input mid-speech.
- Audio playback uses just_audio with a ConcatenatingAudioSource playlist. Incoming PCM chunks are buffered into 48KB segments (approximately 1 second), dynamically wrapped in a WAV header by StaticWavSource, and added to the playlist live. A _playerWatchdogTimer fires every 500ms to catch and recover from edge cases where the player stalls in completed or idle states despite more audio being queued.
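The WAV wrapping deserves a note: just_audio needs a container, so each raw PCM segment gets a standard 44-byte RIFF header before entering the playlist. The client does this in Dart via StaticWavSource; the sketch below shows the same header layout in Python, assuming Gemini's 24kHz, 16-bit mono output — which is exactly why 48KB comes out to roughly one second:

```python
import struct

def wrap_pcm_in_wav(pcm: bytes, sample_rate: int = 24000,
                    channels: int = 1, bits: int = 16) -> bytes:
    """Prepend a standard 44-byte RIFF/WAVE header to a raw PCM segment."""
    byte_rate = sample_rate * channels * bits // 8   # 48,000 B/s at defaults
    block_align = channels * bits // 8
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate,       # 1 = uncompressed PCM
        byte_rate, block_align, bits,
        b"data", len(pcm),
    )
    return header + pcm
```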
The Backend: Python (Flask + Flask-SocketIO)
Flask handles REST endpoints and Socket.IO connectivity. Each session is thread-isolated with its own asyncio loop and LiveRequestQueue. The send_input method uses loop.call_soon_threadsafe() to safely push data from the synchronous Socket.IO thread into the async Gemini session loop.
Audio arrives as base64-encoded PCM from the mobile client, is decoded, wrapped in a types.Blob with mime_type="audio/pcm", and pushed into the queue via queue.send_realtime(). Camera frames arrive the same way, decoded and wrapped as image/jpeg blobs sent via queue.send_content().
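Sketched out, the thread-safe handoff looks like this (state is a hypothetical per-session dict holding the loop and current queue):

```python
import base64

from google.genai import types

def send_input(state, b64_audio: str):
    """Push mic audio from a Socket.IO handler thread into the session loop."""
    pcm = base64.b64decode(b64_audio)
    blob = types.Blob(data=pcm, mime_type="audio/pcm")

    def safe_send():
        queue = state.get("queue")
        if queue is not None:
            queue.send_realtime(blob)

    # Socket.IO handlers run on their own threads; touching the asyncio
    # queue directly from here would corrupt loop state, so the send is
    # marshalled onto the session's event loop.
    state["loop"].call_soon_threadsafe(safe_send)
```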
Outgoing audio from Gemini arrives as inline_data parts on ADK events. Each chunk is immediately base64-encoded and emitted to the client via Socket.IO's ai_response_audio event with a sequential index field. The client reassembles chunks in order using a _pendingAudioChunks map to handle out-of-order delivery.
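The reassembly logic is simple but easy to get subtly wrong; here it is sketched in Python for clarity (the production version is Dart on the client):

```python
def make_reassembler(play_chunk):
    """Deliver audio chunks to play_chunk strictly in index order."""
    pending = {}      # index -> bytes, the _pendingAudioChunks analogue
    next_index = 0

    def on_chunk(index: int, data: bytes):
        nonlocal next_index
        pending[index] = data
        # Drain every consecutive chunk we now have buffered.
        while next_index in pending:
            play_chunk(pending.pop(next_index))
            next_index += 1

    return on_chunk
```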
Deployment: Google Cloud Platform
The backend runs on a GCP VM instance behind Nginx (WebSocket proxy pass with appropriate upgrade headers) and Gunicorn (gthread workers, 4 threads). Vertex AI handles image generation (Imagen) and video generation (Veo) for the Proactive Storyteller mode. Generated assets are stored in Google Cloud Storage and their public URLs returned to the client.
Challenges We Ran Into
1. The Acoustic Echo Loop
The first full end-to-end test revealed a spectacular failure mode: OmniAi heard its own voice through the phone microphone, recognized the audio as speech, responded to itself, and within fifteen seconds was locked in a self-reinforcing loop, holding an increasingly confused conversation with its own output.
We broke the loop with what we called PCM Ducking. When audio playback is active, a reference-aware system tracks the RMS amplitude of what's being played back (_currentOutputLevel) and uses it to calibrate the expected echo signature — the ratio of microphone capture to speaker output. Microphone input is not literally muted, but the barge-in detection threshold is adjusted upward so that the AI's own voice (producing an echo with a predictable amplitude ratio) doesn't trigger a stop signal. We also enabled hardware-level AEC by configuring the AVAudioSession to .voiceChat mode, which activates the dedicated echo cancellation hardware path on iPhone.
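In outline, the ducking check looks like this (the production logic lives in the Dart capture path; the ratio and floor constants below are illustrative, not the tuned values):

```python
import math
import struct

ECHO_RATIO = 0.35        # illustrative: expected mic/speaker amplitude ratio
BASE_THRESHOLD = 900.0   # illustrative: barge-in RMS floor in silence

def rms(pcm16: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    pcm16 = pcm16[: len(pcm16) // 2 * 2]   # guard against odd-length frames
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def is_barge_in(mic_frame: bytes, current_output_level: float) -> bool:
    """Raise the detection floor by the predicted echo while audio plays."""
    threshold = BASE_THRESHOLD + ECHO_RATIO * current_output_level
    return rms(mic_frame) > threshold
```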
2. Audio Engine Stalls and Reset Storms
Rapid conversational turns — short question, short answer, short question — caused the just_audio engine to get into bad states. A stop() followed quickly by a play() could leave the player in an undefined internal state, producing silence with no error.
We implemented two defenses. First, an Async Modification Queue: all playlist operations (add, clear, pause, play, seek) are chained through a _playlistQueue Future, ensuring they execute serially even when triggered from concurrent event handlers. Second, a Player Watchdog Timer: a periodic timer at 500ms intervals checks whether the player is in a completed or idle state while a turn is still active and pending audio exists in the playlist. If so, it forces a seek to the correct segment index and re-triggers play(). This effectively handles the edge case where the platform's native audio engine reports completion before the next segment is loaded.
We also deliberately use pause() instead of stop() during interruptions — stop() disposes the platform player, which is expensive to reinitialize. pause() keeps the hardware pipeline warm while making it instantly interruptible.
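Both defenses are Dart in production; the sketch below shows the same two patterns in Python — serialize mutations through one gate, and let a periodic watchdog kick a stalled player (the player object here is a hypothetical stand-in for the just_audio state):

```python
import asyncio

class SerialQueue:
    """Run playlist mutations strictly one at a time.

    The Dart client gets the same effect by chaining every operation
    onto a single _playlistQueue Future.
    """

    def __init__(self):
        self._lock = asyncio.Lock()

    async def run(self, op):
        async with self._lock:
            return await op()


async def player_watchdog(player, interval: float = 0.5):
    """Recover when the engine reports completion with audio still queued.

    `player` is a hypothetical object exposing just_audio-like state;
    the real watchdog is a Dart Timer firing every 500ms.
    """
    while True:
        await asyncio.sleep(interval)
        if (player.turn_active and player.has_pending_audio
                and player.state in ("completed", "idle")):
            await player.seek(player.next_segment_index)
            await player.play()
```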
3. The Barge-In (Interruption) Problem
This was the deepest challenge in the entire project, involving three simultaneous systems that had to coordinate within milliseconds.
Detection — When the AI is speaking, the microphone hears a mixture of the AI's audio (via speaker bleed) and any incoming voice. A simple volume threshold would constantly false-positive on the AI's own audio. We integrated sherpa_onnx's SpeakerEmbeddingExtractor for on-device speaker diarization. During the first three seconds of active user speech in a session (before OmniAi sends its first response), we buffer audio and extract a 256-dimensional speaker embedding (using a ResNet34 model trained on VoxCeleb) as a "voice anchor" — a biometric fingerprint of the user's voice. This anchor is persisted to the backend database via an API call, so it survives across sessions. When OmniAi is speaking and a loud audio anomaly is detected, we extract a transient embedding from the current mic buffer and compute cosine similarity against the stored anchor. A similarity score above 0.75 confirms it's the user speaking, and barge-in is triggered. Scores below that threshold are treated as TV audio, background voices, or noise.
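The verification step itself reduces to a cosine similarity test against the stored anchor (a minimal numpy sketch; the 256-dim embeddings come from the on-device sherpa_onnx extractor):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_user_speaking(transient: np.ndarray, anchor: np.ndarray) -> bool:
    """Compare a transient embedding from the current mic buffer against
    the stored voice anchor; only the enrolled user clears the threshold."""
    return cosine_similarity(transient, anchor) >= SIMILARITY_THRESHOLD
```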
Stopping Generation — Simply sending a stop message to Gemini doesn't work. The model ignores it and continues generating. The only reliable mechanism we found to halt output is to close the LiveRequestQueue entirely. This terminates the run_live() generator on the Python side, immediately stopping all audio event emission. The background thread then sleeps one second and creates a fresh queue for the next turn.
Clearing the Buffer — By the time barge-in fires, the backend may have already emitted several seconds of audio that is sitting in the client's _pendingAudioChunks map or _audioBuffer. We implemented a cancel timestamp (_lastCancelTime) that is stamped the moment manualInterrupt() fires. Any incoming audio chunk — including chunk index 0 from a potential immediately reconnected response — is discarded if it arrives within 2 seconds of the cancel timestamp. The playlist is synchronously cleared and the audio player is paused in the interrupt handler.
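A minimal sketch of the shield (Dart in production; the window matches the 2-second figure above):

```python
import time

CANCEL_SHIELD_SECONDS = 2.0

class CancelShield:
    """Discard audio that was already in flight when barge-in fired."""

    def __init__(self):
        self._last_cancel_time = float("-inf")   # _lastCancelTime analogue

    def on_interrupt(self):
        self._last_cancel_time = time.monotonic()

    def should_discard(self) -> bool:
        # Applies to every incoming chunk inside the window — even
        # index 0 of an immediately reconnected response.
        return time.monotonic() - self._last_cancel_time < CANCEL_SHIELD_SECONDS
```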
4. Cross-Platform Screen Sharing
Getting reliable, low-latency screen frame streaming on both iOS and Android required writing platform channel code at the OS level.
On Android, we used the Media Projection API via a foreground service, exposing a frame stream through a platform channel to Flutter. Frame rate and JPEG compression were tuned to balance quality against WebSocket bandwidth — too many large frames overwhelm the socket, too few leave the agent contextually blind.
On iOS, ReplayKit's broadcast extension mechanism requires the capture logic to run in a separate process. Frames from that process are sent to the main app via an App Group shared container and a binary socket. This inter-process communication path requires precise setup of entitlements, app group identifiers, and bundle IDs — all of which are poorly documented in the context of Flutter platform channels.
5. Duplicate Speech / Ghost Audio
A subtle bug caused OmniAi to sometimes repeat the last sentence of a response. The root cause was a double-emission: ADK sends audio as streaming deltas during generation, and then some implementations also buffer and re-emit at turn completion. We traced this through the event handler and confirmed that audio_buffer was being flushed both incrementally (per chunk) and again at turn_complete. The fix was to emit chunks immediately as they arrive and explicitly NOT flush the audio_buffer at turn completion — the buffer exists only to aggregate small chunks into playlist-safe segments (≥48KB), not as a second delivery mechanism.
6. Session Isolation Under Concurrency
Because each Socket.IO session gets its own thread and asyncio loop, and because Flask-SocketIO's event handlers run in a separate thread pool, passing data between them safely required careful use of loop.call_soon_threadsafe(). Any direct call to queue.send_realtime() from a Socket.IO handler thread would corrupt the event loop state. All input delivery is wrapped in a safe_send closure dispatched through the correct session's loop via call_soon_threadsafe.
Accomplishments That We're Proud Of
True, Speaker-Verified Barge-in — The interruption system is the technical achievement we're most proud of. It combines on-device neural speaker embedding (ResNet34/VoxCeleb), reference-aware acoustic fingerprinting, a cancel shield window, and backend stream lifecycle management into a barge-in that is both reliable and user-identity-aware. Background TV audio and other voices don't trigger it. Your voice does.
Zero-Transcription Multimodal Pipeline — By using Gemini's Native Audio model, we eliminated the entire transcription layer from the perception pipeline. Audio in, audio out, vision grounded in real time. The latency reduction is perceptible even on a mobile connection — conversation flows naturally rather than feeling request-response.
Persistent Cross-Session Memory — OmniAi genuinely remembers you. The remember tool → MySQL → session injection pipeline means a user who tells OmniAi their name, preferences, or health information in one session will have that context automatically available in every future session, without re-stating anything.
Autonomous Browser Co-Pilot — The webview tool system — webview_tap, webview_scroll, webview_type, webview_screenshot — creates a genuine browser automation loop where the agent can navigate real websites, read page content through screenshots, fill forms, and verify results. This is not mocked or simulated against a fixed set of sites. It works on any URL.
12 Distinct Operational Modes, One Continuous Session — Switching modes doesn't restart the session, reconnect to the model, or reset memory. It's a context shift within a continuous conversation. The same agent, the same session, the same long-term memory — just a different lens.
Reliable Cross-Platform Screen Sharing — Getting high-framerate, low-latency screen capture working on both iOS and Android, through Flutter platform channels, to a streaming AI pipeline, is a genuinely hard systems problem. It works reliably.
Hardware-Software Audio Synergy — The combination of .voiceChat AVAudioSession mode (for hardware AEC), playAndRecord category with defaultToSpeaker and allowBluetooth options, a 5.0x software gain stage on received audio (tuned through iterative testing on real hardware), and the Player Watchdog produces audio output that is clear, loud, and artifact-free on real iPhones in real environments.
What We Learned
Real-time is a hardware problem, not a model problem. Most of the hardest challenges we encountered had nothing to do with AI. They were about audio hardware lifecycle, thread safety, PCM buffer alignment, AVAudioSession category interactions, and native platform channel implementation. Mastering these layers was just as important as any prompt engineering.
The only reliable way to stop Gemini from generating is to close the session. We tried everything else first. Sending stop tokens, sending empty content, sending interrupt signals. None of it reliably halted generation. Closing the LiveRequestQueue was the only mechanism that actually worked, and understanding this completely reshaped how we designed the barge-in architecture.
Identity-aware barge-in requires biometrics, not just volume. Our initial interruption system was purely volume-threshold based. It worked in a quiet room and failed everywhere else. Adding speaker diarization — verifying that the loud audio is actually the user's voice via cosine similarity against a stored embedding — made the system robust to real-world noise environments.
Presence requires memory. We initially shipped the memory system as a "nice to have." After testing, it became clear that without it, OmniAi felt like meeting a stranger every session. With it, the character of the interaction changed fundamentally. Memory is what creates the sense that the agent is actually a companion and not just a tool.
Audio buffer management is its own discipline. Handling streaming PCM — chunk ordering, buffer sizing, playlist safety, WAV header injection, gain staging, hardware warm-pathing — requires a depth of audio engineering knowledge that is not typically part of an app developer's toolkit. We had to develop that knowledge from scratch.
The ADK is genuinely worth using. We initially managed the Gemini Live API session manually (no ADK). It worked, but maintaining session state, reconnects, and tool dispatch by hand introduced bugs that were difficult to trace. Switching to ADK's Runner abstracted away most of this plumbing and allowed us to focus on product.
What's Next for OmniAi
Local RAG Integration — OmniAi currently knows what you tell it and what it observes. The next step is giving it access to your personal documents — medical records, lease agreements, financial statements, technical manuals — through a local Retrieval Augmented Generation pipeline. The agent would be able to answer questions grounded in your actual files, not just its training knowledge.
Wearable Hardware — The smartphone is the right first body for OmniAi, but not the ideal one. The vision is smart glasses with a forward-facing camera and an earpiece — a truly ambient companion that watches the world with you without requiring you to hold anything. The architecture (audio stream in, audio stream out, vision stream in) is already hardware-agnostic. The Flutter app is the layer that would change.
Proactive Emergency Agency — Currently OmniAi monitors and alerts. The next step is autonomous action in genuine emergencies. Caregiver Eye detecting a fall with no subsequent movement should be able to initiate a call to a designated contact without waiting for user confirmation. Scam Shield recognizing active financial fraud should be able to interrupt the call. OmniAi should be able to act, not just report.
Multi-User Memory Graphs — Shared context between trusted users. A caregiver and the person they care for sharing situational awareness through linked OmniAi sessions. A household where multiple users' memories and preferences are appropriately scoped and shared. The underlying memory architecture supports this; the API and permissions layer needs to be built.
Ecosystem of Third-Party Mode Plugins — The 12 current modes demonstrate the pattern. The architecture supports many more. A marketplace model where developers can define prompt contexts, tool sets, and visual UI configurations for specific use cases — and users can install them like apps — would unlock OmniAi as a platform rather than just a product.
OmniAi represents a fundamental shift in how we think about AI: from assistants that respond to companions that perceive.