Inspiration

Every hedge fund has a team of analysts watching screens and calling out insights in real-time. They see a pattern forming, check the technicals, glance at the options flow, and tell the trader exactly what's happening. Retail traders? They get a text box.

When Google announced the Gemini Live API — real-time bidirectional audio and video streaming with AI — we realized we could close that gap. Not with another chatbot, but with something that actually watches your screen and talks to you like a colleague on the trading desk.

What it does

ORÁCULO is a real-time voice-and-vision market intelligence agent that:

  • Sees your trading charts via screen share — identifies candlestick patterns, support/resistance levels, trend direction
  • Hears your questions naturally — handles interruptions, follow-ups, and topic changes
  • Speaks institutional-grade analysis — specific price levels, risk context, actionable observations
  • Pulls live data on demand — four async function-calling tools: stock quotes, technical indicators (RSI, MACD, Bollinger Bands), market news with sentiment, and options snapshots (put/call ratio, max pain, top OI strikes)

How we built it

Backend: Python FastAPI server on Google Cloud Run. The Google GenAI SDK manages Gemini Live API sessions over WebSocket. The backend bridges browser audio/video to Gemini and executes function calls when the model requests market data.

Frontend: Vanilla JavaScript with Web Audio API. An AudioWorklet captures mic input at 16kHz PCM. Gapless audio playback at 24kHz. Screen/camera frames captured at 1 FPS via canvas extraction. Real-time waveform visualization using AnalyserNode — blue when the user speaks, gold when Oráculo responds.

Tools: Four async function-calling tools with dual-source architecture (Alpha Vantage primary, yfinance fallback). Technical indicators use 7 parallel API fetches via asyncio.gather (~300ms vs ~1.5s serial). All values pre-formatted for voice delivery with interpretation labels.
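The parallel-fetch pattern is straightforward to sketch. Here `fetch_indicator` is a hypothetical stand-in for one Alpha Vantage round trip (the real tool falls back to yfinance); the point is that `asyncio.gather` makes total latency roughly one round trip instead of seven:

```python
import asyncio

# Hypothetical per-indicator fetcher; the real tool does an HTTP request
# to Alpha Vantage (yfinance on failure). Here we just simulate the I/O.
async def fetch_indicator(name: str) -> tuple[str, float]:
    await asyncio.sleep(0.01)  # stand-in for one network round trip
    return name, 42.0

INDICATORS = ["RSI", "MACD", "BBANDS", "SMA", "EMA", "STOCH", "ADX"]

async def fetch_all() -> dict[str, float]:
    # All seven requests run concurrently; gather preserves order.
    results = await asyncio.gather(*(fetch_indicator(n) for n in INDICATORS))
    return dict(results)

data = asyncio.run(fetch_all())
```

With serial awaits the same loop would pay seven round trips back to back, which is where the ~1.5s figure comes from.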

Infrastructure: Cloud Run with 1-hour WebSocket timeout and session affinity, Cloud Firestore for session logging, Cloud Build + Artifact Registry for CI/CD, Secret Manager for API keys.

Hardening: Context window compression (extends sessions from 2 min to unlimited), session resumption (survives WebSocket resets), producer-consumer queues with backpressure, input sanitization, rate limiting, security headers.

Challenges we ran into

Audio pipeline complexity — Gemini expects 16kHz PCM input and produces 24kHz output, but browsers run at 48kHz. Getting clean resampling, gapless playback, and instant barge-in handling required careful AudioWorklet + scheduling logic.
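The rate conversion itself is the easy part, since 48000 is exactly 3 × 16000. A minimal Python sketch of the idea (the real implementation lives in a JavaScript AudioWorklet, and a production resampler would use a proper FIR low-pass rather than the crude 3-sample average below):

```python
def downsample_48k_to_16k(samples: list[float]) -> list[float]:
    # 48000 / 16000 == 3, so keep one sample per group of three.
    # Averaging each group acts as a crude anti-aliasing filter.
    out = []
    for i in range(0, len(samples) - len(samples) % 3, 3):
        out.append((samples[i] + samples[i + 1] + samples[i + 2]) / 3.0)
    return out

pcm_48k = [0.0, 0.3, 0.6] * 160          # 480 samples = 10 ms at 48 kHz
mono_16k = downsample_48k_to_16k(pcm_48k)  # 160 samples = 10 ms at 16 kHz
```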

Function calling is manual in Live API — Unlike the standard Gemini API, the Live API doesn't support automatic function calling. We built our own tool_call → execute → FunctionResponse pipeline, preserving the critical id field that links responses to calls.
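The dispatch loop reduces to a small router. The message shapes below only loosely mirror the Live API's tool_call and FunctionResponse payloads (check the SDK types for the real ones), but they show the one rule that matters: echo `id` back unchanged.

```python
import asyncio

# Registry of async tool implementations (names are ours, not the SDK's).
async def get_stock_quote(symbol: str) -> dict:
    return {"symbol": symbol, "price": "one eighty-seven dollars"}

TOOLS = {"get_stock_quote": get_stock_quote}

async def handle_tool_call(function_calls: list[dict]) -> list[dict]:
    """Execute each requested tool and build FunctionResponse payloads.

    The `id` field must be copied verbatim -- it is what links each
    response back to the call that requested it.
    """
    responses = []
    for fc in function_calls:
        result = await TOOLS[fc["name"]](**fc["args"])
        responses.append({"id": fc["id"], "name": fc["name"], "response": result})
    return responses

calls = [{"id": "call-1", "name": "get_stock_quote", "args": {"symbol": "NVDA"}}]
out = asyncio.run(handle_tool_call(calls))
```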

Voice-friendly data formatting — Raw numbers like 72.3456 become "seventy-two point three four five six" when spoken. We learned to format everything in the tool itself with interpretation labels so Gemini speaks naturally.
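The pattern is simple: round inside the tool and attach the interpretation. A sketch for RSI (the function name and thresholds here are illustrative, though 70/30 are the conventional overbought/oversold levels):

```python
def format_rsi_for_voice(rsi: float) -> str:
    # Round aggressively and attach a label so the model says
    # "RSI is 72, which is overbought" instead of reading 72.3456 aloud.
    label = "overbought" if rsi >= 70 else "oversold" if rsi <= 30 else "neutral"
    return f"RSI is {rsi:.0f}, which is {label}"

spoken = format_rsi_for_voice(72.3456)
```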

Session duration limits — Without context window compression, audio+video sessions terminate after ~2 minutes. Discovering and implementing ContextWindowCompressionConfig with SlidingWindow was critical for demo-length conversations.
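The fix is a few lines of session config. This fragment follows the google-genai SDK's type names as we used them, but field names can shift between SDK versions, and the model name is illustrative:

```python
from google.genai import types

# Sliding-window compression keeps the effective context bounded so
# audio+video sessions are no longer cut off after ~2 minutes.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    context_window_compression=types.ContextWindowCompressionConfig(
        sliding_window=types.SlidingWindow(),
    ),
)

# async with client.aio.live.connect(model="gemini-2.0-flash-live-001",
#                                    config=config) as session:
#     ...
```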

Accomplishments we're proud of

  • A working real-time voice + vision agent that genuinely feels like talking to a market analyst
  • Clean barge-in handling — interrupt mid-sentence and the agent pivots naturally
  • Max pain calculation in the options tool — institutional-level analysis most retail tools don't offer
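Max pain is the expiry price at which the total payout to option holders is smallest: for a candidate price S, calls pay max(S − K, 0) and puts pay max(K − S, 0), each weighted by open interest. A self-contained sketch of the calculation (toy numbers, and like most implementations we only evaluate candidates at the listed strikes):

```python
def max_pain(strikes: list[float], call_oi: list[int], put_oi: list[int]) -> float:
    """Return the strike that minimizes total option-holder payout at expiry."""
    def total_payout(expiry_price: float) -> float:
        calls = sum(oi * max(expiry_price - k, 0) for k, oi in zip(strikes, call_oi))
        puts = sum(oi * max(k - expiry_price, 0) for k, oi in zip(strikes, put_oi))
        return calls + puts
    return min(strikes, key=total_payout)

# Toy chain: heavy put OI below 100, heavy call OI above.
strikes = [90, 95, 100, 105, 110]
call_oi = [10, 20, 50, 200, 150]
put_oi  = [150, 200, 50, 20, 10]
pin = max_pain(strikes, call_oi, put_oi)
```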
  • Full CI/CD pipeline with cloudbuild.yaml that builds, pushes, and deploys in one command
  • 42/42 checks passing on our internal hackathon compliance audit

What we learned

  • The Gemini Live API is remarkably capable for real-time multimodal interaction — latency is genuinely conversational
  • System prompt engineering for voice agents is fundamentally different from text — you must control response length, specify tool usage rules, and encode speaking patterns
  • Pre-formatting tool responses for voice (not just data) is a design pattern worth sharing widely
  • Producer-consumer queues with backpressure are essential for real-time streaming — drop frames rather than lag
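The drop-frames-don't-lag rule from the last point can be sketched with a bounded `asyncio.Queue`: when the consumer falls behind, the producer evicts the oldest frame instead of blocking, since stale video is worse than missing video for a real-time agent.

```python
import asyncio

async def enqueue_frame(queue: asyncio.Queue, frame: bytes) -> None:
    """Producer side: never block on a slow consumer.

    If the queue is full, drop the oldest frame to make room.
    """
    if queue.full():
        queue.get_nowait()   # discard the stalest frame
    queue.put_nowait(frame)

async def demo() -> list[bytes]:
    q: asyncio.Queue = asyncio.Queue(maxsize=2)
    for frame in (b"f1", b"f2", b"f3"):   # producer outpaces consumer
        await enqueue_frame(q, frame)
    return [q.get_nowait(), q.get_nowait()]

frames = asyncio.run(demo())   # oldest frame (f1) was dropped
```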

What's next for ORÁCULO

  • Real-time GEX (Gamma Exposure) calculations for dealer positioning awareness
  • Multi-ticker watchlist with proactive alerts
  • Session memory via Firestore to remember trader preferences across sessions
  • Mobile-optimized UI for on-the-go market monitoring

Built With

  • alpha-vantage
  • fastapi
  • google-cloud-build
  • google-cloud-firestore
  • google-cloud-run
  • google-gemini-live-api
  • google-genai-sdk
  • javascript
  • python
  • web-audio-api
  • websockets
  • yfinance