🐢 Posture Police

The Live Agent That Roasts You Back to Health

A hands-free, sassy AI desk companion, powered by the Gemini Live API, that doesn't remind you to sit up. It argues with you until you do.

💡 Inspiration

I'm a family medicine physician.

Every day in clinic, I see the same pattern: office workers and engineers walking in with cervical pain, stiff shoulders, chronic headaches. Different symptoms. Same root cause — hours of staring at screens, unconsciously collapsing into "tech neck."

The numbers are staggering:

Statistic	Source
80.8% of office workers have experienced work-related musculoskeletal symptoms — neck (58.6%), lower back (52.5%), and shoulders (37.4%) most commonly	Journal of Occupational Health, 2025
67.9% of computer-using office workers report neck discomfort; 66.3% report back discomfort within any given week	Occupational Medicine, 2023
50–60% of heavy device users show measurable forward head posture, consistently linked to longer daily screen use	Multiple peer-reviewed studies, 2022–2023
A 60° head tilt loads approximately 27 kg of force on the cervical spine — like carrying a small child on your neck all day	Hansraj, Surgical Technology International, 2014

One of the most striking patterns in digital-age health is that the very people building and using technology the most are often the ones most affected by it. According to Stack Overflow's 2023 Developer Survey, the average developer spends 8.8 hours/day at a screen, with 84% barely leaving their seat — and this pattern extends far beyond developers. Designers, data analysts, remote workers, students, gamers — anyone logging long hours at a screen is part of the same risk profile.

WHO estimates that around 1.71 billion people worldwide live with musculoskeletal conditions. Neck pain alone affects approximately 222 million people globally — one of the leading causes of disability across regions. This isn't a niche problem. It's a global epidemic hiding in plain sight, one bad chair at a time.

As a physician, I started noticing a consistent clinical picture: forward head posture, rounded shoulders, text-neck curvature. Not in one profession — across all of them. And the trend is accelerating.

Traditional posture reminder apps have a fatal flaw: the human brain habituates to repeated alerts.

Research shows repetitive notifications are ignored 90% of the time within 14 days (Behavioral Science & Policy, 2021). A popup. A chime. Dismissed in 0.3 seconds. Back to slouching.

But the same research tells us something else: emotionally responsive voice agents increase health behavior compliance by 42% (NTT Human Interface Labs, 2023). Affective feedback maintains habits at 3× the rate of rational reminders (Stanford Fogg Behavior Model, 2020).

That reframed the problem entirely:

What if your computer didn't remind you — what if it argued with you?

Posture Police isn't another notification you ignore. It's a Live Agent with attitude — one that roasts you, opens a banter window, listens to your excuses, and fires back harder.

As a physician, I know that getting people to sustain health behaviors is ten times harder than designing them. This project is our attempt to translate clinical behavior-change logic directly into Live Agent architecture.

But beyond the science — there's a feeling.

There's a particular feeling you get when the AI roasts you.

It stings. And then — strangely — it makes you smile.

Because somewhere in the middle of staring at a screen for eight hours, feeling invisible and exhausted, something actually noticed you. Something cared enough to be annoying about it.

It's sarcastic. It's relentless. It has absolutely no sympathy for your excuses.

But it's paying attention to your back, your spine, your future self — the one who doesn't want a walking frame at sixty, or a steel rod holding their vertebrae together.

"An AI roasting me is still better than spinal fusion surgery."

That's not just a feature. That's Tough Love through Technology — the oldest human dynamic, finally running on Gemini Live API.

When something cares enough to be rude to you, you stop feeling like a passive user staring at a screen. You become a participant. You talk back. You sit up. You feel, for a moment, like something in your workday actually noticed you were there.

That feeling — that's what we were building toward.

🎭 What It Does

Posture Police is a hands-free real-time Live Agent that turns posture correction from a chore into a verbal duel.

👁️ Edge Vision — Privacy-First Posture Detection

Runs entirely in your browser using TensorFlow.js + MediaPipe BlazePose.

Calculates a dynamic Neck Height / Face Width posture ratio in real time. Zero video leaves your device. Millisecond latency. Full privacy.
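As a rough illustration of the ratio, here is a minimal sketch. The write-up does not say which landmarks feed the calculation, so the choice below (ear and shoulder midpoints for neck height, ear-to-ear distance for face width) and the `minScore` cutoff are assumptions. The keypoint names follow the BlazePose naming used by the TensorFlow.js pose-detection models.

```javascript
// Sketch only: the landmark choice and minScore are assumptions, not the project's code.
// Each keypoint is expected to look like { name, x, y, score } in image pixels.
function postureRatio(keypoints, minScore = 0.5) {
  const byName = Object.fromEntries(keypoints.map((k) => [k.name, k]));
  const needed = ["left_ear", "right_ear", "left_shoulder", "right_shoulder"];
  for (const name of needed) {
    const k = byName[name];
    // Refuse to guess when a landmark is missing or low-confidence.
    if (!k || (k.score ?? 1) < minScore) return null;
  }
  const earMidY = (byName.left_ear.y + byName.right_ear.y) / 2;
  const shoulderMidY = (byName.left_shoulder.y + byName.right_shoulder.y) / 2;
  const neckHeight = shoulderMidY - earMidY; // image y grows downward
  const faceWidth = Math.hypot(
    byName.left_ear.x - byName.right_ear.x,
    byName.left_ear.y - byName.right_ear.y
  );
  if (faceWidth === 0) return null;
  return neckHeight / faceWidth;
}
```

Under these assumptions, leaning toward the screen lowers the head (smaller neck height) and enlarges the face in frame (larger face width), so the ratio falls as posture collapses.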

📐 Temporal Accumulator — No False Accusations

We scrapped instant-trigger logic. The agent only fires after 5 continuous seconds of slouching, with a progressive badge warning system:

🟠 Adjusting (1–3s)  →  🟡 Warning (3–5s)  →  🔴 Slouch Detected (5s+)

Stretch both arms up? Stretch Mode kicks in — detection pauses automatically. Face occluded? Occlusion Guard freezes the timer. No innocent bystanders.
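The accumulator described above can be sketched as a small class. The 1 s, 3 s and 5 s thresholds come from the badge progression; the class name, the method signature, and the choice to reset the timer during Stretch Mode are assumptions made for illustration.

```javascript
// Sketch only: names and the stretch-resets-timer behavior are assumptions.
class SlouchAccumulator {
  constructor(triggerMs = 5000) {
    this.triggerMs = triggerMs;
    this.elapsedMs = 0;
  }

  // Fold one detection frame into the timer and return the badge label.
  update({ slouching, occluded = false, stretching = false }, dtMs) {
    if (stretching) {
      this.elapsedMs = 0; // Stretch Mode: detection is paused
      return "Stretch Mode";
    }
    if (occluded) return this.badge(); // Occlusion Guard: freeze, do not reset
    this.elapsedMs = slouching ? this.elapsedMs + dtMs : 0;
    return this.badge();
  }

  badge() {
    if (this.elapsedMs >= this.triggerMs) return "Slouch Detected";
    if (this.elapsedMs >= 3000) return "Warning";
    if (this.elapsedMs >= 1000) return "Adjusting";
    return "Upright";
  }
}
```

Because any upright frame resets the timer to zero, only continuous slouching can reach the trigger, which is what rules out instant false accusations.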

🔥 Gemini Live — Real-Time Roast Generation

Once slouch is confirmed, the system fires a posture_event over WebSocket to the Cloud Run backend, which relays it to Gemini Live API. The response is a sharp, unique roast — generated fresh every time:

"Is your spine made of jelly?"
"Sit up straight — are you auditioning to be the Leaning Tower of Pisa?"
"Your spine doesn't have a vacation policy."

No pre-written lines. No repetition. Every roast is live.

🎤 Hands-Free Banter Engine — Talk Back. Get Roasted Harder.

After the roast, the agent automatically opens a 10-second live listening window.

No buttons. No typing. Just speak:

"But I'm tired…" → Gemini fires back instantly.
"I was just stretching!" → Gemini doesn't buy it.
"Okay fine, I sat up." → Gemini verifies your posture and shifts to praise mode.

This is not a chatbot. It's a live verbal sparring partner.

✨ Reinforcement Loop — Punishment AND Reward

When you claim compliance, the system verifies it in real time. If your posture ratio genuinely recovers above baseline, the agent shifts tone:

"Good. Hold that for 30 seconds."
"Now THAT'S what a spine looks like."

Confirmed? You earn a 30-second Grace Period — a brief reprieve before the watch resumes. This closes the behavioral loop: not endless punishment, but a responsive coach that reacts to intent and recovery.
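A minimal sketch of the verify-then-reprieve step, assuming "recovers above baseline" means the current ratio is at least the calibrated baseline. The class name, method names, and the `recoveryFactor` knob are hypothetical; only the 30-second duration comes from the description above.

```javascript
// Sketch only: GracePeriod and recoveryFactor are illustrative names.
class GracePeriod {
  constructor(durationMs = 30000) {
    this.durationMs = durationMs;
    this.untilMs = 0;
  }

  // Call when the user claims compliance. Starts the grace period only if
  // the measured ratio has actually recovered relative to the baseline.
  verifyRecovery(currentRatio, baselineRatio, nowMs, recoveryFactor = 1.0) {
    const recovered = currentRatio >= baselineRatio * recoveryFactor;
    if (recovered) this.untilMs = nowMs + this.durationMs;
    return recovered;
  }

  // While active, posture events should not trigger new roasts.
  isActive(nowMs) {
    return nowMs < this.untilMs;
  }
}
```

The claim alone never starts the reprieve; the measured posture has to back it up first.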

📴 Offline Graceful Degradation — Never Goes Silent

Network down? API unreachable? The posture detection keeps running. The system automatically switches to a local roast library, so core functionality never drops. This isn't an afterthought. It's a deliberate engineering choice.
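One way to picture the switch is the sketch below. The function name, the callback parameters, and the `posture_event` payload shape are assumptions, and the local library here simply reuses the sample roasts quoted earlier as placeholders.

```javascript
// Sketch only: deliverRoast and its parameters are hypothetical.
const LOCAL_ROASTS = [
  "Is your spine made of jelly?",
  "Your spine doesn't have a vacation policy.",
];

function deliverRoast({ liveReady, sendPostureEvent, playLocal, random = Math.random }) {
  if (liveReady) {
    try {
      // Payload shape is an assumption; the source only names the event.
      sendPostureEvent({ type: "posture_event", state: "slouch" });
      return "live";
    } catch (err) {
      // Socket died between the readiness check and the send: fall through.
    }
  }
  const line = LOCAL_ROASTS[Math.floor(random() * LOCAL_ROASTS.length)];
  playLocal(line);
  return "local";
}
```

Detection itself never depends on the network in this design; only the source of the roast changes.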

🎨 Live State Badge System — Transparent Agent Logic

A floating glass-panel badge displays four real-time status indicators, each with its own set of states:

Badge	States
🟢 Posture	Upright / Slouching / Stretch Mode
🟡 Banter	Idle / Listening / Responding / Cooldown
🎤 Audio	Live Mic / Text Fallback / Blocked
👁️ Vision	OK / Occluded / Lost

Judges can watch the agent's internal state machine in action — live.

📍 Personal Baseline Calibration — Built for Every Body

No two people sit the same way. A 190 cm developer and a 155 cm physician sitting at the same desk have completely different natural posture ratios.

Posture Police doesn't judge you against a universal standard. It judges you against yourself.

Hit Calibrate once while sitting upright — the system locks your personal Neck Height / Face Width baseline. From that moment, every detection is relative to your body, your chair, your setup. Three people sharing one workstation? Each calibrates once. Each gets a personalised experience.
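A sketch of what per-user calibration could look like: collect a few ratio samples while the user sits upright, take the median as the baseline, then flag a slouch when the live ratio drops a set fraction below it. The median and the 15% `dropFraction` are illustrative assumptions, not values from the project.

```javascript
// Sketch only: the median and the 15% drop threshold are assumptions.
function calibrate(samples) {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // The median resists a stray bad frame better than the mean.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function isSlouching(ratio, baseline, dropFraction = 0.15) {
  return ratio < baseline * (1 - dropFraction);
}
```

Because the threshold scales with the baseline, the same code serves a tall user far from the camera and a shorter user close to it.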

This was one of the hardest design problems to solve — and one of the most important.

🔲 Minimized Mode — It Never Stops Watching

Collapse the interface to a small corner widget and get back to work.

The camera keeps running. The posture detection keeps running. The AI keeps watching.

Slouch while writing code? It will interrupt you. Slouch during a Slack conversation? It will interrupt you. Think you can hide because the window is small?

You can't.

This is the difference between a tool you open and a companion that stays. A real desk agent doesn't need your full attention to do its job. It just needs you to still be sitting there — which, unfortunately for your spine, you are.

📷 Smart Camera Filter — No Virtual Camera Hijacking

Many developers run virtual cameras (OBS, phone mirrors, third-party overlays). Posture Police automatically blacklists virtual and software-based sources, locking onto the physical built-in camera.
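The filtering step might look like the sketch below, applied to the list returned by `navigator.mediaDevices.enumerateDevices()`. The pattern list is an assumption (the write-up does not publish its blacklist), and device labels are only populated once camera permission has been granted.

```javascript
// Sketch only: the label patterns are illustrative, not the project's list.
const VIRTUAL_CAMERA_PATTERN = /\b(obs|virtual|droidcam|iriun|epoccam|manycam)\b/i;

// devices: array of { kind, label, deviceId } as returned by enumerateDevices().
function pickPhysicalCamera(devices) {
  const cameras = devices.filter((d) => d.kind === "videoinput");
  const physical = cameras.filter((d) => !VIRTUAL_CAMERA_PATTERN.test(d.label));
  // Prefer a physical camera; fall back to any camera rather than none.
  return physical[0] ?? cameras[0] ?? null;
}
```

The returned `deviceId` can then be passed as a constraint to `getUserMedia` so the stream opens on that specific camera.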

You get the right camera. Every time. Without thinking about it.

🌐 Bilingual — English & Chinese, Automatically

Posture Police speaks your language.

The agent defaults to English — making it immediately accessible to international users. The moment it detects you speaking Traditional Chinese, it switches seamlessly. Switch back to English? It follows.

Want to try the bilingual experience? Talk back in Chinese.

🏗️ How We Built It

Architecture: Edge AI × Cloud LLM

We designed a hybrid architecture that keeps privacy on-device while pushing real-time intelligence to the cloud:

┌─────────────────────────────────┐
│         Browser (Frontend)       │
│  TensorFlow.js · Web Audio API   │
│  MediaPipe BlazePose · Web Speech│
└──────────────┬──────────────────┘
               │ WSS (posture_event / PCM audio)
               ▼
┌─────────────────────────────────┐
│    Cloud Run · us-central1       │
│    Node.js WebSocket Proxy       │
│    Workload Identity Auth        │
└──────────────┬──────────────────┘
               │ WSS (BidiGenerateContent)
               ▼
┌─────────────────────────────────┐
│    Vertex AI · Gemini Live API   │
│  gemini-live-2.5-flash-native-audio │
│  Real-time bidirectional audio   │
└─────────────────────────────────┘

⚠️ This is not a request-response architecture.

Most "AI voice apps" are glorified chatbots — ask, wait, reply, repeat. Posture Police uses Gemini Live API's BidiGenerateContent bidirectional stream. The AI can interrupt you. You can interrupt the AI. The rhythm is real human conversation — not scripted turn-taking. That's what makes it a Live Agent, not a voice wrapper.

Frontend — Firebase Hosting

TensorFlow.js + MediaPipe BlazePose — all vision processing runs locally, zero video upload
Web Audio API — receives Gemini's PCM stream, decodes and plays in real time
Web Speech API — captures banter input, streams PCM audio to backend
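For the playback side, a hedged sketch of the decode step: it assumes the PCM stream is 16-bit signed, little-endian, mono (the write-up only states 24kHz PCM), and converts raw bytes into the Float32 samples that Web Audio buffers expect. The function name is hypothetical.

```javascript
// Sketch only: assumes 16-bit signed little-endian mono samples.
function pcm16ToFloat32(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Scale the int16 range [-32768, 32767] into [-1, 1).
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```

In the browser, the result could be copied into an `AudioBuffer` created with `audioCtx.createBuffer(1, samples.length, 24000)` via `copyToChannel`, then played through an `AudioBufferSourceNode`.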

Backend — Google Cloud Run

Node.js + Express WebSocket proxy server
Workload Identity Federation — API credentials never exposed to frontend
Receives posture_event from frontend → relays to Gemini Live API
Streams Gemini's PCM audio response back to frontend in real time

Gemini Live API — Vertex AI

Parameter	Value
Model	`gemini-live-2.5-flash-native-audio`
Voice	Aoede (female)
Output format	24kHz PCM
Protocol	`BidiGenerateContent` bidirectional WebSocket
Persona	Sarcastic posture enforcer, bilingual (English default, switches to Traditional Chinese on detection)
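To make the table concrete, here is a sketch of how those parameters might be assembled into the first message sent on the BidiGenerateContent stream. The field names reflect my reading of the Live API setup message and should be checked against the current documentation; the function name, project, location, and persona text are placeholders, not the project's real values.

```javascript
// Sketch only: verify field names against the current Live API reference.
function buildSetupMessage({ project, location, model, voiceName, persona }) {
  return {
    setup: {
      model: `projects/${project}/locations/${location}/publishers/google/models/${model}`,
      generation_config: {
        response_modalities: ["AUDIO"],
        speech_config: {
          voice_config: { prebuilt_voice_config: { voice_name: voiceName } },
        },
      },
      system_instruction: { parts: [{ text: persona }] },
    },
  };
}

const setupMessage = buildSetupMessage({
  project: "my-project", // placeholder
  location: "us-central1",
  model: "gemini-live-2.5-flash-native-audio",
  voiceName: "Aoede",
  persona: "You are a sarcastic posture enforcer.", // placeholder prompt
});
```

Building the message from an object and serializing it with `JSON.stringify` avoids hand-editing raw JSON in a terminal editor.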

Conversation State Machine

To prevent posture detection, speech recognition, and LLM responses from colliding in real time, we implemented a strict state machine:

TURN.IDLE
    │
    ▼ slouch detected
TURN.ROASTING
    │
    ▼ roast delivered
TURN.LISTENING  ←── 10s banter window
    │
    ▼ user speaks
TURN.RESPONDING
    │
    ▼ + 2s Banter Lock cooldown
TURN.IDLE

Banter Lock prevents posture events from retriggering mid-conversation. The One-Roast-Per-Round rule prevents chaotic looping. The result: an agent that feels intentional, not reactive.
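The diagram above maps naturally onto a transition table. In this sketch the state names come from the diagram, while the event names and the `window_expired` path (what happens if nobody speaks during the banter window) are assumptions.

```javascript
// Sketch only: event names and the window_expired transition are assumptions.
const TURN = Object.freeze({
  IDLE: "IDLE",
  ROASTING: "ROASTING",
  LISTENING: "LISTENING",
  RESPONDING: "RESPONDING",
});

const TRANSITIONS = {
  [TURN.IDLE]: { slouch_detected: TURN.ROASTING },
  [TURN.ROASTING]: { roast_delivered: TURN.LISTENING },
  [TURN.LISTENING]: { user_spoke: TURN.RESPONDING, window_expired: TURN.IDLE },
  [TURN.RESPONDING]: { cooldown_done: TURN.IDLE },
};

// Events that are not valid in the current state are ignored, which is
// what keeps a fresh slouch from restarting a roast mid-conversation.
function nextTurn(state, event) {
  return TRANSITIONS[state]?.[event] ?? state;
}
```

Keeping every legal transition in one table makes the lock behavior auditable: a posture event can only ever move the agent out of IDLE.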

🛡️ Zero-Friction by Design — Protection Without Login Walls

We made a deliberate decision: no login, no CAPTCHA, no account required.

A posture app that makes you sign up before it roasts you has already lost. The entire value proposition is instant, frictionless access — open the browser, calibrate once, get roasted. Friction kills health tool adoption before it starts.

But "no login" doesn't mean "no protection." We built a backend defense layer that protects the service without touching the user experience:

Layer	Protection	Why
Cloud Run	Max 1 instance, 5 concurrent sessions	Hard cap on API exposure
Backend	Max 2 simultaneous connections per IP	Allows reconnection buffer
Backend	Max 60 minutes cumulative per IP per day	Fair daily usage cap
Backend	60 messages/second rate limit	Handles TensorFlow.js per-frame events
Backend	120s idle timeout → auto-disconnect	Clears ghost connections silently
Session	8-minute warning → 10-minute hard cutoff	Graceful end, not hard crash
GCP	$5/month budget alert	Human-in-the-loop cost control

The result: a service that's open to everyone, but not exploitable by anyone. Users never see the protection layer — they just see an app that works.
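As an illustration of the two per-IP rows in the table (2 simultaneous connections, 60 cumulative minutes per day), here is a minimal in-memory sketch. The class and method names are hypothetical, the daily window is assumed to reset on the UTC date, and a single Cloud Run instance is assumed so that in-memory state is enough.

```javascript
// Sketch only: in-memory, single-instance, UTC-day reset are all assumptions.
class IpUsageTracker {
  constructor({ maxConnections = 2, maxDailyMs = 60 * 60 * 1000 } = {}) {
    this.maxConnections = maxConnections;
    this.maxDailyMs = maxDailyMs;
    this.byIp = new Map();
  }

  entry(ip, nowMs) {
    const day = new Date(nowMs).toISOString().slice(0, 10);
    let e = this.byIp.get(ip);
    if (!e || e.day !== day) {
      // New day: keep the open-connection count, reset the usage budget.
      e = { day, open: e ? e.open : 0, usedMs: 0 };
      this.byIp.set(ip, e);
    }
    return e;
  }

  // Returns true and counts the connection if the IP is within both limits.
  tryOpen(ip, nowMs = Date.now()) {
    const e = this.entry(ip, nowMs);
    if (e.open >= this.maxConnections || e.usedMs >= this.maxDailyMs) return false;
    e.open += 1;
    return true;
  }

  // Call on disconnect with the session's duration.
  close(ip, sessionMs, nowMs = Date.now()) {
    const e = this.entry(ip, nowMs);
    e.open = Math.max(0, e.open - 1);
    e.usedMs += sessionMs;
  }
}
```

Counting cumulative minutes rather than connection attempts means a client that reconnects several times after a dropped socket is not penalized for it.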

The best security is the kind users never notice.

⚔️ Challenges We Ran Into

🔴 Challenge 1 — Gemini Live API: Persistent 1008 Errors

Problem: Every backend connection attempt to Gemini Live API closed immediately with error code 1008 — no further explanation.

Root cause: Two simultaneous issues:

Invisible non-ASCII characters had been injected into the JSON setup message during Cloud Shell editing — causing silent API validation failure
Our model name gemini-2.0-flash-live-001 had been deprecated — the API rejected it without a clear error

Solution: Used Python to read and print the raw server.js setup message byte-by-byte, identified and stripped the invisible characters. Verified the current active model name against official documentation: gemini-live-2.5-flash-native-audio. Locked it in FACTS.md to prevent future regression.
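The original check was done in Python; the same idea can be sketched in JavaScript so it can live next to server.js. The function name is hypothetical. It reports every character outside printable ASCII (other than tab and newline characters) together with its position, which is usually enough to spot a stray zero-width space or smart quote.

```javascript
// Sketch only: a JavaScript equivalent of the byte-by-byte inspection.
function findNonAscii(text) {
  const hits = [];
  let index = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const isAllowedControl = ch === "\n" || ch === "\r" || ch === "\t";
    if (cp > 0x7e || (cp < 0x20 && !isAllowedControl)) {
      hits.push({
        index, // UTF-16 offset into the original string
        codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      });
    }
    index += ch.length;
  }
  return hits;
}
```

Running this over the serialized setup message before sending it turns an opaque 1008 close into a concrete list of offending positions. A persona prompt that intentionally contains non-ASCII text would also be reported, so the output is a list to inspect rather than a pass/fail gate.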

Result: Connection success rate went from 0% to stable. Zero 1008 errors since.

🔇 Challenge 2 — Audio Data Arriving, But No Sound Playing

Problem: Cloud Run logs confirmed Gemini was returning audio data. The frontend AudioContext received… nothing.

Root cause: Node.js WebSocket libraries such as ws hand incoming messages to the relay as Buffer objects. Forwarding that Buffer unchanged sent a binary frame, so the frontend received binary data instead of a string. JSON.parse() failed silently, and the entire audio payload was discarded.

Solution:

// One line. That's it.
frontendWs.send(data.toString());

Sound appeared immediately.

Lesson: In cross-environment WebSocket relay, Buffer/String boundaries are nearly always a trap. Always verify what your receiver actually receives.
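A slightly more defensive version of that one-liner is sketched below. It assumes, as the fix above implies, that every relayed Gemini message is JSON text, and it covers the payload shapes a Node.js WebSocket library such as ws can deliver (Buffer, ArrayBuffer, or an array of Buffers). The function name is hypothetical.

```javascript
// Sketch only: normalizes a relayed payload to a UTF-8 string before sending.
function toTextFrame(data) {
  if (typeof data === "string") return data;
  if (Buffer.isBuffer(data)) return data.toString("utf8");
  if (data instanceof ArrayBuffer) return Buffer.from(data).toString("utf8");
  if (Array.isArray(data)) return Buffer.concat(data).toString("utf8");
  throw new TypeError("Unsupported WebSocket payload type");
}
```

In the relay, `frontendWs.send(toTextFrame(data))` would then always produce a text frame, so the browser's `JSON.parse(event.data)` receives a string.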

⏱️ Challenge 3 — Gemini Session Dying After ~60 Seconds

Problem: Gemini Live API sessions closed automatically after roughly one minute (clean 1000 close), forcing reconnection and breaking the experience.

Root cause: Gemini Live API requires a continuous audio input stream to keep the session alive. Our microphone only streamed during active banter windows — leaving the session starved of input during idle detection periods.

Solution: Start continuous microphone streaming immediately on launch, not only when banter mode opens. Session lifespan extended from ~60 seconds to 5+ minutes — covering the full demo window comfortably.

🔌 Challenge 4 — Silent Disconnection With No Recovery

Problem: When the backend WebSocket dropped entirely (not just Gemini closing), the frontend logged the disconnect and gave up — no retry, no recovery.

Root cause: We handled gemini_closed events with reconnection logic, but forgot liveWs.onclose — the case where the entire backend connection drops.

Solution:

liveWs.onclose = () => {
    log("🔌 Backend connection lost");
    isLiveReady = false;
    setTimeout(connectToLiveBackend, 3000); // Auto-retry after 3s
};

Result: The agent now self-heals from disconnection without user intervention.

📚 What We Learned

Technical

State control is everything. Real-time multimodal agents don't fail because of bad APIs — they fail because async events collide without a disciplined state machine.
"Works locally" is the starting point, not the finish line. Deployment latency amplifies every timing assumption you made in a zero-latency local environment. Ship early. Ship often.
Buffer/String boundaries in WebSocket relay are a near-universal trap. Always verify the actual type of what your receiver gets — not what you assume it sends.
Production-grade cloud architecture isn't over-engineering for a hackathon. Cloud Run + Vertex AI + Workload Identity is what makes a demo stable, secure, and genuinely live — not demo-mode theatre.
Browser compatibility is not optional. Web Speech API behaves differently across browsers: Chrome is stable, while Edge returns spurious network errors after conversation rounds. Test on your target browser early. When in doubt, surface a browser recommendation at launch.
Your own protection layer can become your worst enemy. Rate limiting by connection count per hour sounds safe — until your own reconnection logic triggers it, locking out legitimate users in a silent loop. Design protection around cumulative usage, not connection frequency.

Design

Behavior change requires a closed loop. Punishment alone doesn't work. The full cycle — roast → listen → verify → reinforce → grace period — is what creates sustainable impact.
Visualizing internal state builds user trust. The live badge system turned invisible agent logic into observable behavior — for both judges and users.
Friction is the enemy of health tools. No login. No CAPTCHA. No account. Open the browser, calibrate once, get roasted. Every extra step is a dropout point.
As a physician: designing health interventions is easy. Getting people to sustain them is the hard problem. Grace Period, praise mode, progressive warnings — these aren't UI embellishments. They're clinical behavior-change logic translated directly into agent design.

🚀 What's Next

Near-term

🎭 Custom Personas — Drill Sergeant / Disappointed Asian Parent / Passive-Aggressive Coworker. Choose your own punishment style.
📊 Weekly Posture Report — "You spent 47% of this week impersonating a shrimp."

Mid-term

🧑‍💻 IDE Integration Mode — Slouch too long, VS Code pauses typing. Sit up to resume. Productivity contingent on posture.
📱 Multi-screen Coverage — Multiple cameras, no escape routes.

Long-term

🏥 Enterprise Wellness — Anonymized team posture health reports, integrated with HR systems.
🌐 Multilingual Roasting — Already bilingual (English + Traditional Chinese). Next: Cantonese, Japanese, Spanish. Every language has its own philosophy of sarcasm.

Posture Police — because your spine deserves to be taken seriously, even if the method is getting verbally destroyed by an AI.

Built with Google Cloud Run · Vertex AI · Gemini Live API · Firebase Hosting · TensorFlow.js

This project was created for the purposes of entering the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge

Built With

firebase-hosting
gemini-live-api
google-cloud
google-cloud-run
tensorflow.js
vertex-ai

Submitted to

Gemini Live Agent Challenge

Created by

Describe Your Contribution

My background is in family medicine, not software engineering. I came into this hackathon as someone who understands why people fail at health behavior change — not someone who knows how to build a WebSocket proxy on Cloud Run.
That gap turned out to be the most valuable thing I brought to this project.
Traditional posture apps fail because they treat the problem as an information problem — "you don't know you're slouching, so we'll tell you." But that's not the real problem. Everyone knows slouching is bad. The real problem is that humans don't respond to passive, emotionally neutral reminders. We habituate. We dismiss. We forget within seconds.
What actually changes behavior is friction, emotion, and social accountability. In clinical terms: you need a feedback loop that triggers an emotional response strong enough to interrupt the habit.
That's the insight behind Posture Police. Not "remind the user more often." But "make the reminder impossible to ignore — by making it argue back."
The sarcasm isn't a gimmick. It's a deliberate design choice grounded in behavioral science. Emotional responses are harder to dismiss than neutral ones. And when the AI opens a banter window and listens to your excuses — then fires back harder — it creates a micro social dynamic that a popup notification can never replicate. You can't "dismiss" something that's already talking.
On the technical side, I learned something that no tutorial teaches you: local works, deployed breaks. Every timing assumption I made in a zero-latency local environment became a bug the moment it hit Cloud Run. The 60-second Gemini session timeout. The Buffer-to-String WebSocket relay issue. The microphone only streaming during banter mode. None of these existed locally. All of them appeared the moment the system was real.
Building on Google Cloud forced me to think about architecture the way a production engineer would — not just "does it work" but "does it stay working, at scale, without exposing credentials, with real network latency." Workload Identity instead of JSON keys. Continuous mic streaming to keep sessions alive. A state machine strict enough to prevent async collisions between vision, voice, and LLM responses.
I came in as a physician. I'm leaving as someone who understands, in their bones, why deployment is a different discipline from development.
And somewhere along the way, the AI told me my spine had no vacation policy — and I sat up straighter.

Chloe Kao
I did not enter this space through an IDE. I entered it through a dialogue box.