Inspiration

I like to consider myself a "pretty active person", and as a "pretty active person", it feels like new problems show up all over my body constantly (recently, my hip has been killing me). As we age, it gets harder and harder to deal with these issues, and we become more and more at risk because of them. During this hackathon, I developed PhysiHow to tackle exactly this problem.

PhysiHow was built on openly available physiotherapy exercises for people with osteoarthritis @ link. PhysiHow lets users choose from a selection of these exercises and receive personalized, AI-powered guidance on how to perform them correctly. That matters: poor posture increases injury risk by 70% in demographics such as seniors. AI in health care is a risky area, but PhysiHow's AI coach is fed the key exercise information directly and actively avoids giving users medical advice.

How we built it

PhysiHow was built on top of Gemini 2.5 Flash Native Audio, Google's real-time multimodal Live API model. Rather than fine-tuning, the model is guided by a structured system prompt containing the full exercise instructions sourced from the University of Melbourne CHESM knee and hip osteoarthritis video library, giving it grounded, exercise-specific knowledge from the session start.
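As a rough sketch of how that grounding works (assuming the google-genai Python SDK; the catalog file, prompt wording, and exact model ID here are illustrative, not the real values):

```python
import json
from google import genai
from google.genai import types

MODEL = "gemini-2.5-flash-preview-native-audio-dialog"  # exact Live model ID may differ

client = genai.Client()  # reads the API key from the environment

# Hypothetical local copy of the CHESM exercise instructions.
with open("exercises.json") as f:
    exercises = json.load(f)

SYSTEM_PROMPT = (
    "You are a physiotherapy exercise coach. Guide the user through the "
    "exercises below using only these instructions. Never give medical advice.\n\n"
    + "\n\n".join(f"{e['name']}:\n{e['instructions']}" for e in exercises)
)

CONFIG = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=SYSTEM_PROMPT,
)

async def run_coach_session():
    # Every session starts with the full exercise library already in context.
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        ...  # stream mic audio and camera frames in, play coach audio out
```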

The architecture is a FastAPI WebSocket backend paired with a TypeScript/Vite frontend. The frontend simultaneously streams 16 kHz PCM microphone audio to the backend every 200 ms and sends a camera frame every second, both forwarded to Gemini Live via send_realtime_input. The model processes video at 1 fps and responds with 24 kHz PCM audio, which the frontend schedules gaplessly using the Web Audio API.
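A condensed sketch of that relay loop on the backend, continuing the snippet above (the WebSocket message shape is an assumption; `client`, `MODEL`, and `CONFIG` come from the previous sketch, and exact `send_realtime_input` keyword names may differ by SDK version):

```python
import asyncio
import base64
from fastapi import FastAPI, WebSocket
from google.genai import types

app = FastAPI()

@app.websocket("/ws")
async def relay(ws: WebSocket):
    await ws.accept()
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:

        async def uplink():
            # The browser sends 200 ms PCM chunks and ~1 fps JPEG frames as JSON.
            while True:
                msg = await ws.receive_json()
                if msg["type"] == "audio":
                    await session.send_realtime_input(
                        audio=types.Blob(data=base64.b64decode(msg["data"]),
                                         mime_type="audio/pcm;rate=16000"))
                elif msg["type"] == "frame":
                    await session.send_realtime_input(
                        video=types.Blob(data=base64.b64decode(msg["data"]),
                                         mime_type="image/jpeg"))

        async def downlink():
            # Gemini answers with 24 kHz PCM; forward it for Web Audio playback.
            async for response in session.receive():
                if response.data:
                    await ws.send_bytes(response.data)

        await asyncio.gather(uplink(), downlink())
```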

A big problem with using this model in an application is its tendency to hallucinate things it can't see; Gemini would happily describe the user's form even when it had no video context. I addressed this with a system prompt constraint requiring the model to only assert observations it can clearly see and to acknowledge uncertainty rather than invent corrections. Surprisingly, this simple change alone produced a significant improvement over the default behavior.
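The wording below is an approximation of that constraint (not the exact prompt text), appended to the system prompt from earlier:

```python
# Illustrative wording only; the real constraint is phrased differently.
GROUNDING_RULES = (
    "\n\nOnly comment on the user's form when the relevant body part is clearly "
    "visible in the current video. If the view is missing, blocked, or unclear, "
    "say so and ask the user to adjust the camera instead of guessing or "
    "inventing a correction."
)

SYSTEM_PROMPT += GROUNDING_RULES
```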

How was tool calling implemented? (Railtracks)

The coach acts as an agent with access to three main tools: a timer, session recording, and exercise suggestions.

Railtracks was used to establish a session recording agent and an agent capable of suggesting exercises. Each agent is defined as an rt.agent_node backed by Gemini 2.5 Flash, and invoked asynchronously via await rt.call(agent, prompt) inside FastAPI POST endpoints (/api/suggest-exercise and /api/compile-session), keeping them compatible with FastAPI's async event loop without blocking.
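A rough sketch of that wiring, using the identifiers named above (the exact `rt.agent_node` parameters are assumptions on my part; only `rt.agent_node`, `rt.call`, the endpoint paths, and the Gemini 2.5 Flash backing model come from the description):

```python
import railtracks as rt
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Parameter names below are assumptions based on the description above.
session_notes_agent = rt.agent_node(
    name="session_notes_agent",
    system_message=("Compile a physiotherapy session transcript into structured "
                    "Markdown notes with standardized clinical headings."),
    llm="gemini-2.5-flash",
)

class SessionPayload(BaseModel):
    transcript: str
    notes: str = ""

@app.post("/api/compile-session")
async def compile_session(payload: SessionPayload):
    # rt.call is awaited directly, so the handler never blocks FastAPI's event loop.
    result = await rt.call(
        session_notes_agent,
        f"Transcript:\n{payload.transcript}\n\nAdditional notes:\n{payload.notes}",
    )
    return {"markdown": str(result)}
```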

The exercise suggestion agent is equipped with a get_exercises function node — a Railtracks @rt.function_node that reads the local exercise catalog and returns each exercise's name and description. The session notes agent receives the full session transcript and any additional user notes, then compiles them into a structured Markdown document with standardized clinical headings.
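The catalog tool for the suggestion agent looks roughly like this (the file name, return shape, and agent parameters are illustrative; only the `@rt.function_node` decorator and the tool's purpose come from the description above):

```python
import json
import railtracks as rt

@rt.function_node
def get_exercises() -> list[dict]:
    """Return the name and description of every exercise in the local catalog."""
    with open("exercises.json") as f:  # hypothetical catalog file
        return [{"name": e["name"], "description": e["description"]}
                for e in json.load(f)]

# The suggestion agent receives the function node as a tool, so it can consult
# the catalog before recommending anything. Parameter names are assumptions.
exercise_suggestion_agent = rt.agent_node(
    name="exercise_suggestion_agent",
    tool_nodes=[get_exercises],
    system_message=("Suggest one exercise from the catalog that fits what the "
                    "user describes, and explain briefly why."),
    llm="gemini-2.5-flash",
)
```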

Challenges we ran into

This project went through multiple iterations in a short time before arriving at its current architecture.

The first approach used Dynamic Time Warping (DTW) against pose keypoint templates extracted with MediaPipe, measuring how closely a sequence of body positions matches a reference. This was initially promising because it avoided training a deep learning model on our limited set of exercise videos; the technique is often used in similarly data-constrained situations. However, the model was too simple to distinguish the slow, controlled movements of physiotherapy exercises from one another. Knee extensions and hip abductions look nearly identical at the 2D keypoint level, and DTW has no notion of depth or viewing angle, so a patient filmed from the side versus the front produces completely different keypoint sequences for the same exercise. I also experimented with long short-term memory (LSTM) sequence classifiers and with VideoMAE, a transformer trained for action recognition, but both ran into problems with precision and the limited dataset.
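For context, the heart of that first approach looked roughly like this (a minimal DTW sketch over flattened 2D MediaPipe keypoints, not the original code):

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """DTW distance between two pose sequences of shape (frames, keypoints * 2)."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Per-frame pose distance, then the cheapest alignment of the frames so far.
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def classify(user_seq: np.ndarray, templates: dict[str, np.ndarray]) -> str:
    # Pick the exercise whose reference sequence warps most cheaply onto the user's.
    return min(templates, key=lambda name: dtw_distance(user_seq, templates[name]))
```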

Physiotherapy exercise datasets are sparse: there are many different exercises, but getting several videos of the same exercise performed from multiple angles, at different speeds, or by patients with limited mobility wasn't possible, and the previous methods needed more data to generalize. Switching to a multimodal LLM sidestepped the problem entirely: instead of trying to classify what the user is doing, the application can now explain, correct, and encourage in the user's native language.

Accomplishments that we're proud of + What we learned

During this hackathon, I learned multiple computer vision techniques and processes, and became all too familiar with their shortcomings. Luckily, multimodal LLMs filled the gap I ran into, and I feel like I could genuinely use this app to learn new exercises.

What's next for PhysiHow

Classifying the exercise a user performs as an onboarding step was the initial goal: users could almost "Shazam" an exercise they already knew and learn how to do it correctly. This was sadly out of scope for this project, but it has potential to be added in the future.
