V1.0 Submitted: Engineering AgriLive for the Real World!

Hey everyone! I just officially hit "Submit" on AgriLive: Multimodal Farm Assistant for the Gemini Live Agent Challenge, and I wanted to share a quick look under the hood at how this project evolved from a raw concept into a production-ready application.

When I started, the goal was simple: help farmers in regions like Kerala fight climate volatility without forcing them to navigate complex text menus. But building a real-time, voice-and-vision AI over intercontinental networks? That required some serious engineering pivots.

The Evolution & Key Features

  • From Text to Native Audio: We completely ditched the text-box paradigm. AgriLive now uses gemini-live-2.5-flash-native-audio over WebSockets for a continuous, bidirectional, and empathetic voice conversation.
  • The "Walkie-Talkie" Protocol: Vertex AI expects a continuous audio stream and will drop the connection if the user goes silent. My favorite hack of the weekend? Building a "Walkie-Talkie" mode that dynamically streams an array of zeros (pure silence) to Google's servers whenever the AI is speaking, keeping the WebSocket completely stable!
  • Beating Ocean Latency: Streaming raw audio packets from India to the us-central1 servers caused brutal audio stuttering. I engineered a custom frontend Jitter Buffer with a "fast-track re-entry" mechanism to ensure the 24kHz PCM audio playback stays incredibly smooth, even on weak rural networks.
  • Concurrent Vision Agent: While the user is talking, they can snap a picture of a diseased crop. A backend cascading fallback engine routes the image to gemini-2.5-flash, enforcing a strict Pydantic Structured Output to guarantee the UI gets a perfectly parsed JSON diagnosis every single time.
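The "Walkie-Talkie" keep-alive boils down to generating frames of zero-valued PCM and pushing them upstream while the mic is muted. Here's a minimal backend sketch of that idea; the `session.send_realtime_input` call and the 16 kHz input rate mirror the google-genai Live API, but treat the exact signature and mime type as assumptions rather than AgriLive's actual code:

```python
import asyncio

SAMPLE_RATE = 16000  # Hz expected by the Live API input (assumption)
FRAME_MS = 20        # send one short frame per tick

def make_silence_frame(ms: int = FRAME_MS, rate: int = SAMPLE_RATE) -> bytes:
    """One frame of 16-bit mono PCM silence: every sample is zero."""
    samples = rate * ms // 1000
    return b"\x00\x00" * samples  # 2 bytes per 16-bit sample

async def keep_alive(session, speaking: asyncio.Event) -> None:
    """While the AI is speaking (user mic muted), keep streaming silence
    so the server never sees a gap and the WebSocket stays open."""
    while speaking.is_set():
        # Hypothetical call shape; swap in your session's real send method.
        await session.send_realtime_input(
            audio={"data": make_silence_frame(),
                   "mime_type": "audio/pcm;rate=16000"}
        )
        await asyncio.sleep(FRAME_MS / 1000)
```

The key design point is that silence is still valid audio: from the server's perspective the stream never stops, so no idle-timeout fires.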
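For the vision agent, the "strict Pydantic Structured Output" pattern looks roughly like this. The field names below are illustrative, not AgriLive's real schema, and the commented-out `generate_content` call follows the google-genai SDK's `response_schema` option as an assumption:

```python
from pydantic import BaseModel, Field

class CropDiagnosis(BaseModel):
    # Illustrative fields, not the actual AgriLive schema.
    disease: str
    confidence: float = Field(ge=0.0, le=1.0)
    treatment: str

# Passing the model as `response_schema` asks Gemini to emit JSON that
# validates against it (google-genai SDK; details are an assumption):
#
# response = client.models.generate_content(
#     model="gemini-2.5-flash",
#     contents=[image_part, "Diagnose this crop."],
#     config={"response_mime_type": "application/json",
#             "response_schema": CropDiagnosis},
# )
# diagnosis = CropDiagnosis.model_validate_json(response.text)

# Validation guarantees the UI always receives well-formed fields:
raw = '{"disease": "leaf blight", "confidence": 0.92, "treatment": "copper fungicide"}'
diagnosis = CropDiagnosis.model_validate_json(raw)
```

Because validation happens server-side before the response reaches the frontend, a malformed model reply raises immediately and can be retried by the fallback engine instead of breaking the UI.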

Snippet Spotlight: The Fast-Track Jitter Buffer

Here is a peek at the logic that stitches delayed audio packets back together mid-sentence without forcing the browser to wait for a full buffer stockpile:

// Push incoming Float32 PCM data to the jitter buffer
audioQueue.push(float32);

// Fast-track re-entry: If the network lags but the AI is already mid-sentence,
// bypass the buffer threshold and play the audio immediately!
const isMidSentence = btnStart.classList.contains("speaking");

if (!isPlaybackStarted && (audioQueue.length >= JITTER_BUFFER_THRESHOLD || isMidSentence)) {
    startPlaybackLoop();
}
