Mission Control

Inspiration

We were inspired by a harsh reality: millions of people are locked out of the digital world.

The problem statement from Oracle, in the context of our project: An elderly grandparent can't figure out how to email their family. A visually impaired person can't read the tiny text on a grocery delivery website. A caregiver is too exhausted to sit at a computer and manage their parent's appointments. A student in crisis can't even bring themselves to search the web for resources.

These aren't edge cases. They're the everyday reality for millions of people. The web was built without them in mind.

Traditional browser automation doesn't help because it fails silently. Hit a login wall? It stops. Need a CAPTCHA? Done. Encounter an unclear form? Game over. All-or-nothing automation is useless when you're the one who needs help.

We asked: What if the AI could just call you? Not to replace humans, but to partner with them. When it gets stuck, it reaches out. You speak naturally. It listens. It continues. The human stays in the loop: by voice, not by screen.

That's Mission Control.

What it does

Mission Control is a voice-interactive browser automation workflow designed for accessibility and human-centered AI.

The Flow:

You describe a task via the web UI: "Please update my status on Facebook"
The AI agent autonomously navigates real websites: searching, clicking, reading, interpreting
If it gets stuck (login wall, CAPTCHA, ambiguous form), your phone rings
The agent asks: "What's your password?"
You speak naturally: "my-password-123"
Whisper transcribes your speech to text
The agent logs in and continues automatically
Task complete ✅
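
The flow above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the project's actual code: Step, run_task, and the act and ask_user_by_phone callables are hypothetical names standing in for the browser agent and the Twilio/Whisper round trip.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """Result of one agent step (hypothetical shape, for illustration only)."""
    done: bool = False
    question: Optional[str] = None  # set when the agent needs the human


def run_task(
    task: str,
    act: Callable[[str, Optional[str]], Step],
    ask_user_by_phone: Callable[[str], str],
    max_steps: int = 20,
) -> bool:
    """Drive the agent until it finishes, phoning the user whenever it is stuck."""
    answer: Optional[str] = None
    for _ in range(max_steps):
        step = act(task, answer)   # agent takes one browser action
        answer = None              # each spoken answer is consumed by one step
        if step.done:
            return True
        if step.question:          # stuck: hand the question to the human
            answer = ask_user_by_phone(step.question)
    return False                   # gave up after max_steps
```

The step limit is an assumption added here so that a confused agent cannot loop, or call the user, indefinitely.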

How we built it

Tech Stack:

Browser Agent: browser-use 0.12 + Playwright (web automation)
LLM: OpenAI GPT-5.2 (decision making & understanding)
Voice: Twilio Programmable Voice (outbound calls)
Speech-to-Text: OpenAI Whisper (transcription)
Backend: FastAPI + Flask (orchestration & webhooks)
Frontend: React 19 + TypeScript (web UI)

Architecture:

Browser Agent Layer: Uses browser-use with Playwright to autonomously navigate websites. The agent can click, type, scroll, read page content, and understand UI elements.
LLM Layer: GPT-5.2 powers decision-making. It analyzes page state, determines next actions, and detects when the agent is stuck. Stuck detection uses:
- Keyword matching (password, CAPTCHA, confirm, verify, etc.)
- UI pattern recognition (login forms, modals, error messages)
- LLM fallback analysis for complex scenarios
Voice Layer: When the agent is stuck, a Flask webhook triggers Twilio to call the user. The user's response is recorded, transcribed with Whisper, and fed back to the agent.
Frontend Layer: The React UI lets users input tasks, watch the agent work in real time, and receive notifications when they're called.
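
For the voice layer, Twilio needs call instructions (TwiML) telling it to speak the agent's question and record the reply. The sketch below builds that document with Python's standard XML library, using TwiML's Say and Record verbs. The function name and the webhook URL are placeholders; the project's real routes are not shown in this write-up.

```python
import xml.etree.ElementTree as ET


def build_question_twiml(question: str, recording_webhook: str,
                         max_seconds: int = 30) -> str:
    """Return TwiML that speaks `question`, then records the caller's answer.

    `recording_webhook` is a placeholder for the backend route that should
    receive the finished recording.
    """
    response = ET.Element("Response")
    ET.SubElement(response, "Say").text = question
    ET.SubElement(
        response,
        "Record",
        action=recording_webhook,     # Twilio requests this URL when recording ends
        maxLength=str(max_seconds),   # upper bound on recording length, in seconds
        timeout="5",                  # stop after 5 seconds of silence
        playBeep="true",
    )
    return ET.tostring(response, encoding="unicode")
```

With Twilio's Python helper library, a string like this can be passed as the twiml argument when creating an outbound call. The 5-second silence timeout and 30-second cap are illustrative values, not the project's settings.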

Development Process:

Started with autonomous browsing (can the agent navigate at all?)
Implemented stuck detection (how do we know when to call?)
Integrated Twilio voice calls (can we actually reach the user?)
Built speech-to-text pipeline (can we understand what they say?)
Created the feedback loop (does the agent resume correctly?)
Built the React UI for monitoring and control

Challenges we ran into

Stuck Detection is Hard
- How do you programmatically know an agent is stuck? There's no universal signal.
- Solution: Multi-layered approach combining keyword detection, UI pattern matching, and LLM analysis
- Challenge: Avoiding false positives (calling the user when they're not needed) and false negatives (agent spinning its wheels without asking)
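
A minimal sketch of the multi-layered idea, assuming the layers run from cheapest to most expensive. The keyword list, the password-field regex, and the llm_check callable are all illustrative stand-ins; the project's actual rules and prompts are not shown here.

```python
import re
from typing import Callable, Optional

# Illustrative keyword list; the project's real list is not shown.
STUCK_KEYWORDS = ("password", "captcha", "confirm", "verify")

# Illustrative UI pattern: a password input field in the page's HTML.
PASSWORD_INPUT = re.compile(r"<input[^>]*type=[\"']?password", re.IGNORECASE)


def detect_stuck(
    page_text: str,
    page_html: str,
    llm_check: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Return the name of the layer that flagged the page, or None if not stuck."""
    lowered = page_text.lower()
    if any(word in lowered for word in STUCK_KEYWORDS):
        return "keyword"
    if PASSWORD_INPUT.search(page_html):
        return "ui_pattern"
    # Slowest and most expensive layer, so it runs last and only if provided.
    if llm_check is not None and llm_check(page_text):
        return "llm"
    return None
```

The keyword layer is deliberately crude: a page that says "Order confirmed" would match "confirm" and trigger a call, which is exactly the false-positive risk described above.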
Real Websites Are Messy
- Every website structures its login differently. Some use forms, some use custom JS, some have multiple steps.
- Playwright can find elements, but understanding intent is the real challenge
- Solution: Rely on GPT-5.2's vision capabilities to interpret page layouts, not just DOM structure
Voice Transcription Context
- Whisper handles ordinary speech well but can mis-transcribe answers that lack context (especially passwords, which don't sound like natural language when spoken)
- Solution: The agent passes context about the question it asked to the Whisper API
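
One way to pass that context is the optional prompt parameter of OpenAI's transcription endpoint. The sketch below assumes the official openai Python client and the whisper-1 model; the helper names and the prompt wording are hypothetical.

```python
from typing import Any, BinaryIO


def build_transcription_prompt(question: str) -> str:
    """Describe what the caller was asked so the model can bias its output."""
    return (
        f"The caller is answering this question: {question} "
        "The answer may contain letters, digits, and symbols spelled out loud."
    )


def transcribe_answer(client: Any, audio_file: BinaryIO, question: str) -> str:
    """Transcribe a recorded answer, passing the question as context.

    `client` is expected to be an `openai.OpenAI` instance; it is passed in
    so the function can be exercised with a stand-in object.
    """
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt=build_transcription_prompt(question),
    )
    return result.text.strip()
```

The prompt only nudges the transcription, so a confirmation step is still a sensible safeguard for anything as exact as a password.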
Latency & User Experience
- Making an outbound call takes time. The user might not be ready, or their phone might be off or out of battery.
- Solution: The agent composes a clear question first, gives the user time to answer, and handles silence and errors gracefully
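
Handling silence and failed calls can be as simple as a bounded retry. In this sketch, place_call is a hypothetical callable that returns the transcript, or None or empty text when the user did not pick up or said nothing; the retry count is an assumed default.

```python
from typing import Callable, Optional


def ask_with_retries(
    place_call: Callable[[str], Optional[str]],
    question: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask `question` by phone, retrying on no-answer, silence, or call errors."""
    for _ in range(max_attempts):
        try:
            transcript = place_call(question)
        except Exception:
            transcript = None      # treat a failed call like silence
        if transcript and transcript.strip():
            return transcript.strip()
    return None                    # caller decides how to fail the task
```

Returning None rather than raising leaves the agent free to pause the task or notify the user through the web UI instead.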
Privacy & Trust
- Users sharing passwords over voice is risky. Storing credentials is risky.
- Solution: Never log credentials, delete voice recordings after transcription, use HTTPS only
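
Deleting recordings after transcription is easiest to guarantee with a try/finally block. The sketch assumes the recording has been downloaded to a local file; transcribe is a hypothetical callable, and the transcript is returned without being logged.

```python
import os
from typing import Callable


def transcribe_and_delete(recording_path: str,
                          transcribe: Callable[[str], str]) -> str:
    """Transcribe a local recording, then remove the file even on failure."""
    try:
        return transcribe(recording_path)
    finally:
        # Runs whether transcription succeeded or raised.
        if os.path.exists(recording_path):
            os.remove(recording_path)
```

This covers only the local copy. A recording stored on Twilio's side would need to be deleted separately through Twilio's API.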

Accomplishments that we're proud of

✅ Built a working human-AI voice loop — This is genuinely novel. Most automation is all-or-nothing. We created a system that actively calls for help.

✅ Real browser automation — Not just a chatbot. The agent actually navigates real websites, fills real forms, and reads real page content.

✅ Multi-step interactions — The agent can handle 5+ step tasks that require human input mid-way through, not just at the beginning.

✅ Graceful error handling — If transcription fails, if the user doesn't pick up, if the agent misunderstands—there's a recovery path.

✅ Clean architecture — Separated concerns: agent layer, voice layer, UI layer. Easy to test, extend, and understand.

What we learned

LLMs are surprisingly good at "stuck detection": We initially thought we'd need complex heuristics. It turns out GPT-5.2 can judge "is this page asking for user input?" reliably with just a few examples.
Twilio is powerful but needs careful handling: Outbound calling is simple API-wise but tricky operationally (calling at the right time, handling failed calls, managing phone numbers).
Accessibility benefits everyone: The interface isn't just for people with disabilities. Someone busy, someone with their hands full, or someone who is just feeling lazy benefits too.
Privacy > Feature completeness: We could store user preferences, credentials, history. But we chose not to. The constraint actually made the product better: simpler, faster, more trustworthy.
Testing voice systems is hard: Unit tests work fine. Integration tests with real Twilio calls? Much harder. We learned to mock heavily and deploy small.
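
As an illustration of "mock heavily", the sketch below tests a thin calling wrapper without touching the network, using the standard library's unittest.mock. The notify_user function is hypothetical; it assumes the Twilio Python client's calls.create method with its to, from_, and twiml arguments. The phone numbers and SID are made up.

```python
from unittest.mock import MagicMock


def notify_user(twilio_client, to_number: str, from_number: str, twiml: str) -> str:
    """Place an outbound call and return its SID (thin wrapper, for illustration)."""
    call = twilio_client.calls.create(to=to_number, from_=from_number, twiml=twiml)
    return call.sid


def test_notify_user_places_one_call():
    fake_client = MagicMock()                      # stands in for twilio.rest.Client
    fake_client.calls.create.return_value.sid = "CA123"

    sid = notify_user(fake_client, "+15550000001", "+15550000002", "<Response/>")

    assert sid == "CA123"
    fake_client.calls.create.assert_called_once_with(
        to="+15550000001", from_="+15550000002", twiml="<Response/>"
    )
```

A test like this checks that the call is requested with the right arguments; whether the phone actually rings still needs a small number of real integration runs.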

What's next for Mission Control

Phase 1: Multi-turn Voice Conversations (In Progress)

Currently: Agent asks one question per call
Next: Back-and-forth dialogue during a single call (more natural, fewer interruptions)
Example: Agent asks for email, user says it, agent asks to confirm, user confirms—all in one call

Phase 2: User Profiles & Memory