Inspiration
Every hedge fund has a team of analysts watching screens and calling out insights in real time. They see a pattern forming, check the technicals, glance at the options flow, and tell the trader exactly what's happening. Retail traders? They get a text box.
When Google announced the Gemini Live API — real-time bidirectional audio and video streaming with AI — we realized we could close that gap. Not with another chatbot, but with something that actually watches your screen and talks to you like a colleague on the trading desk.
What it does
ORÁCULO is a real-time voice-and-vision market intelligence agent that:
- Sees your trading charts via screen share — identifies candlestick patterns, support/resistance levels, trend direction
- Hears your questions naturally — handles interruptions, follow-ups, and topic changes
- Speaks institutional-grade analysis — specific price levels, risk context, actionable observations
- Pulls live data on demand — 4 function calling tools: stock quotes, technical indicators (RSI, MACD, Bollinger Bands), market news with sentiment, and options snapshots (put/call ratio, max pain, top OI strikes)
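Each of the four tools is exposed to the model as a function declaration in the session config. A minimal sketch of how one such declaration might look; the name, description, and schema fields here are illustrative, and the exact field casing depends on which SDK surface you use:

```python
# Illustrative declaration for a quote tool, written as the plain
# JSON-schema-style dict the Gemini function-calling format expects.
# This is a sketch, not ORÁCULO's actual declaration.
get_stock_quote_declaration = {
    "name": "get_stock_quote",
    "description": "Fetch the latest price, change, and volume for a ticker.",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {
                "type": "string",
                "description": "Ticker symbol, e.g. AAPL or SPY.",
            },
        },
        "required": ["symbol"],
    },
}

# All four declarations are grouped under a single tools entry in the
# session config, so the model can pick whichever tool fits the question.
tools = [{"function_declarations": [get_stock_quote_declaration]}]
```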
How we built it
Backend: Python FastAPI server on Google Cloud Run. The Google GenAI SDK manages Gemini Live API sessions over WebSocket. The backend bridges browser audio/video to Gemini and executes function calls when the model requests market data.
Frontend: Vanilla JavaScript with Web Audio API. An AudioWorklet captures mic input at 16kHz PCM. Gapless audio playback at 24kHz. Screen/camera frames captured at 1 FPS via canvas extraction. Real-time waveform visualization using AnalyserNode — blue when the user speaks, gold when Oráculo responds.
Tools: Four async function-calling tools with dual-source architecture (Alpha Vantage primary, yfinance fallback). Technical indicators use 7 parallel API fetches via asyncio.gather (~300ms vs ~1.5s serial). All values pre-formatted for voice delivery with interpretation labels.
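The parallel-fetch pattern can be sketched with stubbed coroutines standing in for the real Alpha Vantage requests; the indicator names and sleep duration are placeholders:

```python
import asyncio
import time

# Stand-in for one HTTP round trip to an indicator endpoint.
async def fetch_indicator(name: str) -> tuple[str, float]:
    await asyncio.sleep(0.05)  # simulated network latency
    return name, 42.0  # dummy value

async def fetch_all_indicators() -> dict[str, float]:
    names = ["RSI", "MACD", "BBANDS", "SMA", "EMA", "ADX", "ATR"]
    # gather() runs all seven coroutines concurrently, so total latency
    # is roughly one round trip instead of seven back to back.
    results = await asyncio.gather(*(fetch_indicator(n) for n in names))
    return dict(results)

start = time.perf_counter()
indicators = asyncio.run(fetch_all_indicators())
elapsed = time.perf_counter() - start
# Concurrent: about one round trip total; serial would be about seven.
```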
Infrastructure: Cloud Run with 1-hour WebSocket timeout and session affinity, Cloud Firestore for session logging, Cloud Build + Artifact Registry for CI/CD, Secret Manager for API keys.
Hardening: Context window compression (extends sessions from 2 min to unlimited), session resumption (survives WebSocket resets), producer-consumer queues with backpressure, input sanitization, rate limiting, security headers.
Challenges we ran into
Audio pipeline complexity — Gemini expects 16kHz PCM input and produces 24kHz output, but browsers run at 48kHz. Getting clean resampling, gapless playback, and instant barge-in handling required careful AudioWorklet + scheduling logic.
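The sample-rate arithmetic itself is simple. The real pipeline lives in a JavaScript AudioWorklet, but a Python sketch of the 48 kHz to 16 kHz step and the float-to-PCM16 conversion (with the anti-aliasing filter a production resampler needs omitted) looks like:

```python
import struct

def downsample_48k_to_16k(samples: list[float]) -> list[float]:
    # 48 kHz -> 16 kHz is an exact 3:1 ratio, so after low-pass
    # filtering (omitted here) each group of three samples can simply
    # be averaged into one output sample.
    out = []
    for i in range(0, len(samples) - 2, 3):
        out.append((samples[i] + samples[i + 1] + samples[i + 2]) / 3.0)
    return out

def float_to_pcm16(samples: list[float]) -> bytes:
    # Browsers give float samples in [-1.0, 1.0]; Gemini expects
    # little-endian signed 16-bit PCM.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
```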
Function calling is manual in Live API — Unlike the standard Gemini API, the Live API doesn't support automatic function calling. We built our own tool_call → execute → FunctionResponse pipeline, preserving the critical id field that links responses to calls.
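A minimal sketch of that pipeline, using plain dicts in place of the SDK's message types (the real code sends the response back over the live session; the tool stub and message shapes here are simplified for illustration):

```python
import asyncio

# Illustrative tool registry; the stub mirrors one of the four real tools.
async def get_stock_quote(symbol: str) -> dict:
    return {"symbol": symbol, "price": 187.42}  # stub result

TOOLS = {"get_stock_quote": get_stock_quote}

async def handle_tool_call(tool_call: dict) -> dict:
    """Execute each requested function and build the FunctionResponse
    payload. The id field must be echoed back unchanged; it is how
    Gemini matches a response to the call that produced it."""
    responses = []
    for fc in tool_call["function_calls"]:
        result = await TOOLS[fc["name"]](**fc["args"])
        responses.append({
            "id": fc["id"],      # critical: preserve the call id
            "name": fc["name"],
            "response": result,
        })
    return {"function_responses": responses}

# Example: the model requests a quote; we execute and reply.
incoming = {"function_calls": [
    {"id": "call-1", "name": "get_stock_quote", "args": {"symbol": "SPY"}},
]}
reply = asyncio.run(handle_tool_call(incoming))
```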
Voice-friendly data formatting — Raw numbers like 72.3456 become "seventy-two point three four five six" when spoken. We learned to format everything in the tool itself with interpretation labels so Gemini speaks naturally.
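A small example of the pattern; the rounding and the 70/30 interpretation thresholds are the conventional RSI bands, used here for illustration:

```python
def format_rsi_for_voice(rsi: float) -> str:
    # Round aggressively and attach an interpretation label so the
    # model says "RSI is 72, overbought" instead of spelling out
    # "seventy-two point three four five six".
    value = round(rsi)
    if value >= 70:
        label = "overbought"
    elif value <= 30:
        label = "oversold"
    else:
        label = "neutral"
    return f"RSI is {value}, {label}"
```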
Session duration limits — Without context window compression, audio+video sessions terminate after ~2 minutes. Discovering and implementing ContextWindowCompressionConfig with SlidingWindow was critical for demo-length conversations.
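A sketch of the relevant session config, assuming the google-genai Python SDK's type names; the token thresholds below are illustrative, not the values we shipped:

```python
from google.genai import types

# Enabling sliding-window compression keeps the context under a token
# budget, which is what lifts the ~2-minute cap on audio+video sessions.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=25600,  # compress once context exceeds this
        sliding_window=types.SlidingWindow(target_tokens=12800),
    ),
)
```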
Accomplishments we're proud of
- A working real-time voice + vision agent that genuinely feels like talking to a market analyst
- Clean barge-in handling — interrupt mid-sentence and the agent pivots naturally
- Max pain calculation in the options tool — institutional-level analysis most retail tools don't offer
- Full CI/CD pipeline with cloudbuild.yaml that builds, pushes, and deploys in one command
- 42/42 checks passing on our internal hackathon compliance audit
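The max pain calculation itself is standard: find the expiry price at which the total intrinsic value paid out to option holders is smallest. A self-contained sketch of that definition (the open-interest numbers are made up, and the real tool works from a live options chain):

```python
def max_pain(calls: dict[float, int], puts: dict[float, int]) -> float:
    """Return the strike that minimizes total payout to option holders.
    calls and puts map strike -> open interest."""
    strikes = sorted(set(calls) | set(puts))

    def total_payout(settle: float) -> float:
        # Calls pay out when the settle price is above their strike,
        # puts when it is below; weight each by open interest.
        call_pay = sum(oi * max(0.0, settle - k) for k, oi in calls.items())
        put_pay = sum(oi * max(0.0, k - settle) for k, oi in puts.items())
        return call_pay + put_pay

    return min(strikes, key=total_payout)

# Heavy put OI at 95 and call OI at 105 pin max pain at the middle strike.
calls = {95: 100, 100: 500, 105: 2000}
puts = {95: 2000, 100: 500, 105: 100}
```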
What we learned
- The Gemini Live API is remarkably capable for real-time multimodal interaction — latency is genuinely conversational
- System prompt engineering for voice agents is fundamentally different from text — you must control response length, specify tool usage rules, and encode speaking patterns
- Pre-formatting tool responses for voice (not just data) is a design pattern worth sharing widely
- Producer-consumer queues with backpressure are essential for real-time streaming — drop frames rather than lag
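The drop-rather-than-lag rule can be sketched with asyncio.Queue; the frame payloads here are illustrative:

```python
import asyncio

def enqueue_frame(queue: asyncio.Queue, frame: bytes) -> bool:
    """Producer side: never block on a slow consumer. When the queue
    is full, discard the stalest frame so the stream stays real-time;
    a late frame is worse than a missing one."""
    try:
        queue.put_nowait(frame)
        return True
    except asyncio.QueueFull:
        try:
            queue.get_nowait()  # drop the oldest frame
        except asyncio.QueueEmpty:
            pass
        queue.put_nowait(frame)
        return False

async def demo() -> list[bytes]:
    q: asyncio.Queue = asyncio.Queue(maxsize=2)
    for i in range(4):
        enqueue_frame(q, f"frame-{i}".encode())
    # Only the two newest frames survive.
    return [q.get_nowait() for _ in range(q.qsize())]

kept = asyncio.run(demo())
```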
What's next for ORÁCULO
- Real-time GEX (Gamma Exposure) calculations for dealer positioning awareness
- Multi-ticker watchlist with proactive alerts
- Session memory via Firestore to remember trader preferences across sessions
- Mobile-optimized UI for on-the-go market monitoring
Built With
- alpha-vantage
- fastapi
- google-cloud-build
- google-cloud-firestore
- google-cloud-run
- google-gemini-live-api
- google-genai-sdk
- javascript
- python
- web-audio-api
- websockets
- yfinance