Rhythm - رِتــم

"Home View" where the user can set his goal and start the run!
"Run View" where the user can see his live vitals and can interact with the coach (AI Agent)!
Server logs from the Cloud Run revision where the backend was deployed.

Inspiration

At the end of January 2026, I stood at the starting line of the Riyadh Marathon — my very first marathon, ever.

I wasn't a trained runner. I had signed up on impulse, undertrained, and had no strategy whatsoever. No coach. No idea what my target pace should be, when to push, or when to hold back. I was surrounded by thousands of people who clearly knew what they were doing, and I did not.

I finished. But somewhere around kilometer 30, when my legs were fading and I had no clue if I was going too fast or falling apart, I had one very clear thought:

Why don't I have someone in my ear right now?

Not a playlist. Not a training plan I'd read the night before. A real voice — one that knows my current pace, knows how far I've gone, and can tell me right now whether to push harder or back off.

That was the spark for Rhythm.

What it does

Rhythm is a real-time AI running coach that talks to you while you run.

You speak naturally — "How's my pace?" or "I'm feeling tired, should I slow down?" — and Rhythm responds instantly through your earbuds in natural speech, with full awareness of your current run data.

Live voice coaching powered by Gemini 2.5 Flash native audio
HealthKit integration gives the coach real-time access to your pace, distance, and heart rate
Conversational memory within a session — the coach remembers what you said earlier in the run
Two speech modes — push-to-talk for deliberate questions, always-on for natural mid-run conversation
Proactive coaching — the coach doesn't just answer questions, it intervenes when your pace drifts or you hit a milestone
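As a rough illustration of the proactive part, a pace-drift check could look like the sketch below. This is not Rhythm's actual logic: the `PaceCheck` type, the 5% tolerance band, and the cue strings are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaceCheck:
    target_sec_per_km: float  # goal pace (hypothetical field)
    tolerance: float = 0.05   # assumed 5% band around the target


def pace_cue(check: PaceCheck, current_sec_per_km: float) -> Optional[str]:
    """Return a short cue when the pace leaves the tolerance band, else None."""
    drift = (current_sec_per_km - check.target_sec_per_km) / check.target_sec_per_km
    if drift > check.tolerance:
        return "slower than target"  # more seconds per km means slower
    if drift < -check.tolerance:
        return "faster than target"
    return None  # inside the band: the coach stays quiet
```

With a 6:00 min/km target (360 seconds), a current pace of 400 seconds per km would trigger a cue, while 365 would not.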

The full round-trip — you speak, Gemini processes, you hear the response — targets under 2 seconds:

$$\Delta t_{\text{total}} = \Delta t_{\text{STT}} + \Delta t_{\text{network}} + \Delta t_{\text{Gemini}} < 2\text{s}$$
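The inequality above can be checked mechanically once per-stage timings are available. The sketch below only restates the formula; the function name is hypothetical and any numbers passed in would be measurements you collect yourself, not figures from this project.

```python
BUDGET_S = 2.0  # end-to-end target stated above


def within_budget(stt_s: float, network_s: float, gemini_s: float,
                  budget_s: float = BUDGET_S) -> bool:
    """True when the summed stage latencies stay strictly under the budget."""
    return stt_s + network_s + gemini_s < budget_s
```

For example, stages of 0.25 s, 0.25 s, and 1.0 s sum to 1.5 s and pass, while 0.5 s, 0.5 s, and 1.0 s land exactly on the 2-second limit and fail the strict inequality.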

How we built it

The architecture is a three-layer pipeline:

$$\text{Voice} \xrightarrow{\text{on-device STT}} \text{Text} \xrightarrow{\text{WebSocket}} \text{Cloud Run} \xrightarrow{\text{Gemini Live}} \text{PCM Audio} \xrightarrow{\text{WebSocket}} \text{Earbuds}$$

iOS frontend (Swift): SFSpeechRecognizer handles on-device transcription, so no audio ever leaves the phone. Final transcripts are sent to the backend as clean text via WebSocket. Gemini's PCM audio response streams back and plays through AVAudioEngine. Live run telemetry (pace, distance, heart rate) is pulled from HealthKit and embedded into every coaching context.
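To make "embedded into every coaching context" concrete, here is a minimal sketch of how a backend could fold telemetry into the text for one turn. The JSON schema (`transcript`, `telemetry`, `pace_sec_per_km`, `distance_km`, `heart_rate_bpm`) and the function name are assumptions for illustration; the real message format is not shown in this write-up.

```python
import json


def build_coach_turn(raw_message: str) -> str:
    """Turn one client message into the text sent to the model for this turn."""
    msg = json.loads(raw_message)
    t = msg["telemetry"]  # hypothetical field names throughout
    pace_min, pace_sec = divmod(int(t["pace_sec_per_km"]), 60)
    context = (
        f"[run data] pace {pace_min}:{pace_sec:02d} min/km, "
        f"distance {t['distance_km']:.2f} km, "
        f"heart rate {t['heart_rate_bpm']} bpm"
    )
    # Telemetry goes first so the model reads the run state before the question.
    return f"{context}\n[runner] {msg['transcript']}"
```

Sending fresh numbers with each utterance means the coach never has to reason from stale run data.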

Backend (Python + FastAPI): Deployed on Google Cloud Run, the backend maintains a persistent WebSocket session per run. live_router.py bridges the iOS client and the Gemini Live API using three concurrent asyncio coroutines: receiving from the client, streaming audio back, and a heartbeat that keeps the Gemini connection alive during silent stretches.

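import asyncio

async def run_session(websocket, live_session):
    # Run all three loops concurrently for the lifetime of the session.
    await asyncio.gather(
        receive_from_client(websocket, live_session),  # client text -> Gemini
        send_to_client(websocket, live_session),       # Gemini PCM audio -> client
        heartbeat(live_session),                       # keepalive during silence
    )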
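One design point worth noting: with default settings, asyncio.gather keeps waiting on the remaining coroutines when one of them returns, and it does not cancel the others if one raises. Whether the real code handles teardown elsewhere is not shown here, so the following is only a sketch of one alternative, with a hypothetical helper name, that ends the whole session as soon as any loop exits.

```python
import asyncio
from typing import Awaitable, Iterable


async def run_until_first_exit(coros: Iterable[Awaitable[None]]) -> None:
    """Run coroutines together; when any one finishes, cancel the rest."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            task.result()  # re-raise any exception from the finished task
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Wait for cancellations to settle so nothing is left dangling.
        await asyncio.gather(*tasks, return_exceptions=True)
```

Used in place of the gather call, this would stop the heartbeat and the audio stream when the client loop ends, for example after the runner closes the app.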

AI layer (Gemini 2.5 Flash Native Audio): The Gemini Live API (v1alpha) handles reasoning and speech in a single step, with no separate TTS pipeline. The Fenrir voice was chosen for its clear, energetic delivery, which is critical when a runner is physically exerting themselves and needs coaching cues to be immediately intelligible.

Challenges we ran into

Audio flooding killed the event loop

The first architecture streamed raw PCM directly from the iPhone microphone to the backend. iOS captures audio at high frequency — hundreds of small chunks per second, each sent as a WebSocket frame. The asyncio event loop was saturated with I/O, unable to respond to keepalive pings, and the Gemini connection kept dropping.

Buffering and batching didn't help — the problem was the volume itself. The fix was moving STT entirely on-device. SFSpeechRecognizer transcribes locally and sends one clean text message per utterance. The event loop went from thousands of ops per second to near-idle between turns.
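The scale of the problem is easy to estimate with back-of-the-envelope arithmetic. The helpers below are hypothetical, and the sample rate and chunk size in the usage note are illustrative placeholders, not values measured from the app.

```python
def ws_frames_per_second(sample_rate_hz: int, samples_per_chunk: int) -> float:
    """WebSocket frames per second if every captured chunk is sent as one frame."""
    return sample_rate_hz / samples_per_chunk


def frames_for_run(frames_per_sec: float, run_minutes: float) -> int:
    """Total frames the event loop would handle over a run of the given length."""
    return round(frames_per_sec * run_minutes * 60)
```

Assuming, purely for illustration, 16 kHz audio sent in 64-sample chunks, that is 250 frames per second, or 750,000 frames over a 50-minute run. The text-only design replaces all of that with one message per utterance.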

Cloud Run is hostile to persistent connections

Cloud Run's defaults (scale-to-zero and a 300-second request timeout) silently kill any long-lived WebSocket session. A 10 km run takes 50–60 minutes, but sessions were dropping after about 5 minutes, which matches the 300-second timeout.

gcloud run deploy running-coach-live-api \
  --min-instances=1 \
  --timeout=3600

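Both flags are non-negotiable for any live agent on Cloud Run. Note that 3600 seconds is also the maximum request timeout Cloud Run allows, so a run longer than an hour would still need the client to reconnect.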

Gemini Live drops during silence

Even with Cloud Run fixed, the Gemini Live WebSocket would time out during quiet stretches between questions. The SDK doesn't handle silence-period keepalives transparently. The solution was a heartbeat coroutine sending a transport-level ping every 8 seconds — because real runs have long quiet stretches, and the architecture has to survive them.
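The heartbeat idea can be sketched as a small loop. This is a generic version under stated assumptions, not the code from live_router.py: `heartbeat_loop` and `send_ping` are hypothetical names, and how to reach the underlying connection's ping from the Gemini SDK is SDK-specific and not shown.

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat_loop(send_ping: Callable[[], Awaitable[None]],
                         interval_s: float = 8.0) -> None:
    """Call send_ping every interval_s seconds until cancelled or a ping fails."""
    while True:
        await asyncio.sleep(interval_s)
        # An exception here ends the loop, which surfaces the dropped connection.
        await send_ping()
```

The 8-second default mirrors the interval described above. Passing the ping in as a callable keeps the loop easy to test with a fake connection.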

Accomplishments that we're proud of

True real-time conversation — not push-to-talk with a loading spinner, but a flowing exchange that feels like talking to a person
Zero audio on the network — the on-device STT architecture is the right solution, not a workaround; it's faster, more private, and more stable
Full-stack solo build — iOS frontend, Python backend, GCP infrastructure, and Gemini Live integration, built and shipped in days
Closing the marathon loop: the app works well enough that I'd have wanted it at kilometer 30 in Riyadh

What we learned

Building a live agent is fundamentally different from building a chatbot. With a chatbot, a two-second response is fine. With a live agent running alongside someone mid-stride at race pace, two seconds is the difference between useful and useless.

The hardest problems were not AI problems — they were infrastructure and I/O problems. Gemini performed beautifully once it received a clean, stable connection. Getting that clean, stable connection was most of the work.

We also learned that the WebSocket is not an abstraction — it is a physical pipe with a lifetime, and you are responsible for keeping it alive. The SDK handles a lot, but not everything. Sometimes you need to go one layer deeper.

And a personal one: kilometer 30 of a marathon is where the race actually begins. Everything before it is just preparation.

What's next for Rhythm - رِتــم

Interval training programs — structured workouts where the coach actively manages phases (warm-up, intervals, recovery) rather than just answering questions
Race strategy mode — before a race, you set a goal time; the coach paces you dynamically across the full distance
Post-run debrief — a conversational summary after every run, with the coach remembering your session history across weeks
Arabic language support — full coaching in Arabic, starting with Saudi dialect, for the running community that inspired the whole project
App Store launch — making Rhythm available to every runner who has ever wished they had someone in their ear at kilometer 30