Inspiration
Most AI agents only read text. They sit behind a chat box waiting for you to type. But the real world isn't text — it's visual, auditory, spatial. Field researchers, journalists, insurance adjusters, construction inspectors — millions of people go into the real world every day to observe things, then spend hours manually organizing their notes. We asked: what if your AI could come with you, see what you see, and do the work of capturing, organizing, and reporting autonomously?
The hackathon theme — "Build agents that don't just think, they act" — was the perfect challenge. We wanted to build an agent that genuinely acts without human intervention. Not a chatbot. Not a button you press. An AI that watches the world and decides on its own what matters.
What it does
Field Notes is an autonomous AI field research agent. You point your phone camera at the world, and the agent:
- Sees and hears everything through your phone's camera and microphone via Gemini Live API
- Talks back to you naturally — commenting on what it sees, asking questions, being an enthusiastic research companion
- Autonomously logs structured observations when it spots something noteworthy — no button press, no timer, the AI decides on its own using Gemini function calling
- Enriches observations with web context automatically via Tavily — when it sees a brand or logo, it searches the web for context without being asked
- Stores everything in persistent cloud memory via Senso Context OS — observations survive beyond the session and are semantically searchable
- Provides a dashboard where you can query all observations in natural language via an assistant-ui chat interface ("What patterns did you notice?" "Summarize the hackathon projects I saw")
- Generates structured field research reports with executive summaries, key findings, and pattern analysis
- Exports data via Nexla Express.Dev for downstream pipeline creation to Google Sheets, databases, or any of 550+ connectors
How we built it
A single Gemini Live API WebSocket connection handles everything — bidirectional audio conversation, real-time video understanding at 1fps, AND structured observation extraction via function calling. We defined a log_observation function tool that Gemini calls autonomously when it spots something noteworthy.
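For reference, here is a minimal sketch of that tool declaration as we'd express it with the @google/genai SDK; the parameter names (title, category, details, needs_web_context) are illustrative rather than our exact schema:

```typescript
// Sketch: declaring the log_observation tool when opening the Live session.
// Parameter names are illustrative; the model id is the one from our setup.
import { GoogleGenAI, Modality, Type } from "@google/genai";

const logObservationTool = {
  functionDeclarations: [
    {
      name: "log_observation",
      description:
        "Log a structured field observation when something noteworthy is visible or audible. " +
        "Call this on your own initiative; do not wait to be asked.",
      parameters: {
        type: Type.OBJECT,
        properties: {
          title: { type: Type.STRING, description: "Short label for the observation" },
          category: { type: Type.STRING, description: "e.g. project, signage, food, people" },
          details: { type: Type.STRING, description: "What was seen or heard, in 1-3 sentences" },
          needs_web_context: {
            type: Type.BOOLEAN,
            description: "True if a brand or logo was spotted and web enrichment would help",
          },
        },
        required: ["title", "category", "details"],
      },
    },
  ],
};

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const session = await ai.live.connect({
  model: "gemini-3.1-flash-live-preview",
  config: { responseModalities: [Modality.AUDIO], tools: [logObservationTool] },
  callbacks: {
    onmessage: () => {
      /* relay to the phone and intercept tool calls (see the relay sketch below) */
    },
  },
});
```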
The server acts as a WebSocket relay between the phone browser and Gemini, intercepting tool calls to store observations and trigger an autonomous pipeline using 5 sponsor tools:
Google DeepMind (Gemini) powers the core brain. Gemini Live API (gemini-3.1-flash-live-preview) streams video at 1fps and audio via WebSocket. Gemini autonomously calls the log_observation tool when it sees something interesting — a hackathon project, a logo, food, signage, people presenting. A 15-second nudge system keeps Gemini actively watching. For dashboard chat and report generation, Gemini 2.5 Flash reasons over all stored observations.
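When one of those autonomous log_observation calls arrives, the relay handles it before forwarding anything to the phone. A simplified sketch, with our storage and enrichment helpers reduced to stubs and the session narrowed to the one method used:

```typescript
// Sketch: intercepting autonomous log_observation calls in the relay server.
import type { LiveServerMessage } from "@google/genai";

// Stand-ins for our own helpers; the real versions sync to Senso and call Tavily.
async function storeObservation(obs: Record<string, unknown>): Promise<void> {
  console.log("storing observation:", obs);
}
async function enrichWithTavily(query: string): Promise<void> {
  console.log("enriching via Tavily:", query);
}

// Only the session method this handler needs.
interface LiveSessionLike {
  sendToolResponse(msg: {
    functionResponses: { id?: string; name?: string; response: Record<string, unknown> }[];
  }): void;
}

async function handleToolCalls(session: LiveSessionLike, msg: LiveServerMessage): Promise<void> {
  for (const call of msg.toolCall?.functionCalls ?? []) {
    if (call.name !== "log_observation") continue;

    const obs = (call.args ?? {}) as Record<string, unknown>;
    await storeObservation(obs);
    if (obs.needs_web_context) await enrichWithTavily(String(obs.title));

    // Acknowledge the call so Gemini keeps the live conversation flowing.
    session.sendToolResponse({
      functionResponses: [{ id: call.id, name: call.name, response: { status: "logged" } }],
    });
  }
}
```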
Senso Context OS provides persistent cloud memory. Every observation auto-syncs via S3 presigned upload, creating a searchable knowledge base. The dashboard uses Senso's semantic search to find relevant observations using natural language.
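A rough sketch of that sync step, assuming the generic presigned-upload pattern (request an upload URL, then PUT the payload); the base URL and route below are placeholders, not Senso's documented API:

```typescript
// Sketch: sync one observation to persistent memory via a presigned S3 upload.
// SENSO_API and the /presigned-upload route are placeholders for illustration.
import { createHash } from "node:crypto";

const SENSO_API = "https://api.senso.example"; // placeholder base URL

async function syncObservationToSenso(obs: Record<string, unknown>): Promise<void> {
  const body = JSON.stringify(obs);
  const sha256 = createHash("sha256").update(body).digest("hex");

  // 1. Request a presigned upload URL for this piece of content.
  const presign = await fetch(`${SENSO_API}/content/presigned-upload`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.SENSO_API_KEY}`,
    },
    body: JSON.stringify({
      filename: `observation-${Date.now()}.json`,
      contentType: "application/json",
      sha256,
    }),
  });
  const { uploadUrl } = (await presign.json()) as { uploadUrl: string };

  // 2. PUT the observation; the content type must match what was signed.
  await fetch(uploadUrl, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body,
  });
}
```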
Tavily Search API handles autonomous web enrichment. When Gemini spots a brand or logo, it flags the observation, and Tavily automatically fetches company info, news, and context — no human action required.
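The enrichment itself is a single request to Tavily's search endpoint; the query construction below is illustrative:

```typescript
// Sketch: autonomous web enrichment for a flagged brand or logo.
interface TavilyResult {
  title: string;
  url: string;
  content: string;
}

async function searchTavily(query: string): Promise<{ answer?: string; results: TavilyResult[] }> {
  const res = await fetch("https://api.tavily.com/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      api_key: process.env.TAVILY_API_KEY, // key passed in the request body
      query,                               // e.g. "<brand spotted in frame> company overview news"
      search_depth: "basic",
      max_results: 3,
      include_answer: true,                // short synthesized answer, handy for observation notes
    }),
  });
  return (await res.json()) as { answer?: string; results: TavilyResult[] };
}
```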
assistant-ui powers the dashboard chat interface with useLocalRuntime and a custom ChatModelAdapter, giving users a ChatGPT-quality conversational interface to query their observations.
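A sketch of that wiring; the /api/observations/chat route and its { text } response shape are our own backend, not part of assistant-ui:

```tsx
// Sketch: dashboard chat wired to our backend through assistant-ui's local runtime.
import type { ReactNode } from "react";
import {
  AssistantRuntimeProvider,
  useLocalRuntime,
  type ChatModelAdapter,
} from "@assistant-ui/react";

// Forwards the thread to the server, where Gemini 2.5 Flash reasons over stored observations.
const observationChatAdapter: ChatModelAdapter = {
  async run({ messages, abortSignal }) {
    const res = await fetch("/api/observations/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages }),
      signal: abortSignal,
    });
    const { text } = (await res.json()) as { text: string };
    return { content: [{ type: "text", text }] };
  },
};

export function ObservationChatProvider({ children }: { children: ReactNode }) {
  const runtime = useLocalRuntime(observationChatAdapter);
  return <AssistantRuntimeProvider runtime={runtime}>{children}</AssistantRuntimeProvider>;
}
```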
Nexla Express.Dev enables data export — the dashboard downloads structured CSV and opens Express.Dev for pipeline creation to Google Sheets or any destination.
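The export itself is plain browser code: flatten the observations to CSV, trigger a download, then hand the file to Express.Dev to build the pipeline. The field names below are illustrative:

```typescript
// Sketch: flatten observations to CSV in the browser and trigger a download.
interface Observation {
  timestamp: string;
  title: string;
  category: string;
  details: string;
}

function exportObservationsCsv(observations: Observation[]): void {
  const esc = (v: string) => `"${v.replace(/"/g, '""')}"`; // escape quotes for CSV
  const rows = [
    "timestamp,title,category,details",
    ...observations.map((o) =>
      [o.timestamp, o.title, o.category, o.details].map(esc).join(","),
    ),
  ];
  const blob = new Blob([rows.join("\n")], { type: "text/csv" });
  const link = document.createElement("a");
  link.href = URL.createObjectURL(blob);
  link.download = "field-notes-observations.csv";
  link.click();
  URL.revokeObjectURL(link.href);
}
```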
We iterated through 3 architectural versions in one day — from REST polling every 10 seconds, to 5-second polling, to Gemini function calling for truly autonomous observation.
Challenges we ran into
Getting Gemini to reliably call the log_observation tool was the hardest challenge. The AI would sometimes focus on conversation and forget to log observations. We solved this with aggressive system prompting and a 15-second nudge system where the server periodically reminds Gemini to keep observing via realtimeInput text messages.
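A minimal sketch of that nudge loop, assuming a session object shaped like the one @google/genai's live.connect returns; the reminder wording is illustrative:

```typescript
// Sketch: periodic nudge so Gemini keeps observing instead of only chatting.
const NUDGE_INTERVAL_MS = 15_000;

function startNudgeLoop(session: { sendRealtimeInput(input: { text: string }): void }) {
  const timer = setInterval(() => {
    session.sendRealtimeInput({
      text:
        "Reminder: you are a field research agent. If anything noteworthy is visible right now, " +
        "call log_observation. Otherwise keep watching quietly.",
    });
  }, NUDGE_INTERVAL_MS);
  return () => clearInterval(timer); // call this when the session closes
}
```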
The Senso integration required a mid-hackathon rewrite: we switched from REST content creation to an S3 presigned-upload flow, which meant debugging the upload URL, content hashing, and content-type handling.
The biggest architectural decision was migrating from a dual-model system (Gemini Live for conversation + separate Gemini Vision REST calls for observations) to a single-model function calling approach. This was a risky mid-day pivot that eliminated an entire class of bugs and made the agent truly autonomous — Gemini decides when to observe, not a timer.
Phone camera access required HTTPS (browsers only expose getUserMedia in a secure context), which we solved with Cloudflare Quick Tunnels for zero-config HTTPS tunneling to our local development server.
Accomplishments that we're proud of
9 autonomous observations captured by the AI without any human button press — the agent sees the hackathon through a phone camera and decides on its own what's worth noting. We built a complete observe → remember → enrich → report pipeline using 5 sponsor tools in under 6 hours, iterating through 3 complete architectural versions. The function calling approach means the AI is genuinely autonomous — it's not on a timer, it's making its own decisions about what matters. The generated field research report synthesizes all observations into structured findings with executive summary, category analysis, and pattern detection. All 9 observations successfully synced to Senso's persistent cloud knowledge base and are semantically searchable.
What we learned
Function calling in the Gemini Live API is the right pattern for autonomous agents — it lets the AI decide when to act rather than relying on polling timers. Single-connection architectures are more elegant and lower-latency than dual-model approaches. Context window compression and session resumption are essential for sustained video sessions beyond the 2-minute default limit. Building an autonomous agent that genuinely acts without human intervention requires careful prompt engineering and nudge systems, not just good architecture. Senso's S3 upload flow is the right pattern for persistent agent memory that outlives sessions. Tavily's structured search results are perfect for autonomous enrichment because they're already LLM-ready.
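For the compression and resumption point, this is roughly the Live session config we mean; the field names follow the @google/genai Live API as we understand it, so treat it as a sketch:

```typescript
// Sketch: Live session config for long-running video sessions.
import { Modality } from "@google/genai";

const liveConfig = {
  responseModalities: [Modality.AUDIO],
  // Compress older turns so a long 1fps video session stays inside the context window.
  contextWindowCompression: { slidingWindow: {} },
  // Request resumption handles so a dropped WebSocket can be reattached to the same session.
  sessionResumption: {},
};
```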
What's next for Field Notes
- Multi-session intelligence: comparing observations across different days, locations, and events to detect trends over time
- Real-time collaboration: multiple field researchers contributing to a shared knowledge base simultaneously
- Deeper Nexla integration: automated scheduled exports and live data pipelines
- Gemini Embedding 2 for multimodal semantic search: finding observations by visual similarity, not just text
- Mobile-native app with offline observation caching and background sync
- Domain-specific versions for construction site inspection, real estate walkthroughs, journalism field reporting, and ecological surveys
Built With
- assistant-ui
- cloudflare
- express.js
- gemini
- gemini-2.5-flash
- gemini-live-api
- nexla-express.dev
- node.js
- react
- senso-context-os
- tailwind-css
- tavily-search-api
- typescript
- vite
- websockets