Inspiration

Learning yoyo tricks is notoriously difficult. Static YouTube tutorials force you to stop, rewind, and rewatch while holding a spinning object. We wanted to build a coach with eyes, a voice, and a memory. Instead of an AI that constantly 'nags' you, we built a Film Room Review system - the way Olympic athletes train - where a master watches your entire attempt and delivers deep, biomechanical feedback only when you are ready to listen.

What it does

Yoyo-Sensei Sync is a multimodal AI coaching platform. It features a unique 'Record-and-Review' flow:

  • Step into the ring: record a 3-10 second clip of your attempt
  • Multimodal Analysis: Gemini 2.0 watches the tape using temporal sequence reasoning to analyze the speed, momentum, and string topology
  • Spirit verdict: your chosen 'Spirit Coach' (such as a cocky, Gojo Satoru-inspired persona) delivers a high-fidelity verbal verdict via ElevenLabs
  • Proof of Mastery: successfully landing 'hard' tricks triggers an immutable transaction on the Solana blockchain, awarding you a digital trophy

How we built it

  • The Brain (Gemini 2.0 Flash): We moved beyond single-frame analysis. We implemented a Temporal Buffer Engine that samples frames at 3 FPS and sends the entire sequence to Gemini. This allows the AI to see the physics of the yoyo's movement over time.
  • The Voice (ElevenLabs Flash v2.5): To achieve premium synchronization, we used ElevenLabs' WebSocket API. The Sensei’s voice is tailored to the specific biomechanical failure detected in the video review.
  • The Chain (Solana Devnet): We integrated a Proof-of-Mastery engine that mints achievement records on the Solana ledger, turning physical skill into a verifiable digital asset.
  • The Analytics (Pandas): A "Grind Chart" tracks every attempt, success rate, and practice density, providing a data-driven path to Black Belt status.
  • The Interface: A premium UI built in Streamlit, featuring glassmorphism, custom CSS, and a "Mastery Scroll" library of professional tricks.
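The frame-sampling half of the Temporal Buffer Engine can be sketched roughly as follows (function and parameter names are ours; the write-up only specifies the 3 FPS rate):

```python
def sample_frame_indices(total_frames: int, video_fps: float, target_fps: float = 3.0) -> list[int]:
    """Pick evenly spaced frame indices so the buffer approximates `target_fps`.

    For example, a 5-second clip at 30 FPS (150 frames) yields 15 indices,
    matching the "15+ frames" payload size mentioned below.
    """
    if video_fps <= 0 or total_frames <= 0:
        return []
    # e.g. 30 FPS source / 3 FPS target -> keep every 10th frame
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))
```

With OpenCV (or any decoder), the selected frames would then be read out of the clip and packed into a single multimodal request so the model sees the whole motion, not one snapshot.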
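A "Grind Chart" aggregation of this kind might look like the sketch below (the column names and sample data are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical attempt log: one row per recorded attempt
attempts = pd.DataFrame({
    "day": ["2024-06-01", "2024-06-01", "2024-06-02", "2024-06-02", "2024-06-02"],
    "trick": ["Trapeze", "Trapeze", "Double or Nothing", "Trapeze", "Double or Nothing"],
    "landed": [False, True, False, True, True],
})

# Practice density (attempts per day) and success rate per day
grind = attempts.groupby("day").agg(
    attempts=("landed", "size"),
    success_rate=("landed", "mean"),
)
print(grind)
```

From a table like this, a bar chart of `attempts` against a line of `success_rate` gives the data-driven progress view described above.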

Challenges we ran into

  • Temporal Context: Sending a single photo wasn't enough to distinguish a "1.5 Mount" from a "Double or Nothing." We had to pivot our entire architecture to Sequential Frame Buffering, managing memory across the Streamlit event loop while preparing high-density multimodal payloads.
  • The "Sync" War: We realized constant real-time feedback was distracting and caused audio overlap. We engineered a Session-Based Trigger that perfectly synchronizes the Wisdom Log with the ElevenLabs voice-stream at the end of each attempt.
  • Multimodal Latency: Processing 15+ frames simultaneously is heavy. We implemented aggressive image normalization and downsampling (320p) to keep the "Verdict" time under 3 seconds for a full 5-second video clip.
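Stripped of Streamlit specifics, the session-based trigger boils down to a one-shot latch per attempt: the verdict fires exactly once, and only after analysis completes. A minimal sketch with hypothetical names:

```python
class VerdictLatch:
    """Fires the coach's verdict at most once per recorded attempt,
    so overlapping audio streams cannot occur."""

    def __init__(self) -> None:
        self._delivered_for = None  # id of the last attempt we spoke about

    def should_speak(self, attempt_id: str, analysis_done: bool) -> bool:
        # Speak only when the analysis is finished AND we haven't
        # already delivered a verdict for this attempt.
        if analysis_done and attempt_id != self._delivered_for:
            self._delivered_for = attempt_id
            return True
        return False
```

In a Streamlit app, state like this would typically live in `st.session_state` so it survives the script reruns that fire on every interaction.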
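The downsampling pass is essentially a resize-to-320-pixels-wide plus a lossy re-encode before each frame joins the payload. A minimal Pillow sketch (the JPEG quality setting is our assumption):

```python
import io
from PIL import Image

def normalize_frame(frame: Image.Image, target_width: int = 320, quality: int = 80) -> bytes:
    """Downsample a frame to ~320p width and re-encode as JPEG
    to shrink the multimodal payload."""
    scale = target_width / frame.width
    resized = frame.resize((target_width, max(1, round(frame.height * scale))))
    buf = io.BytesIO()
    resized.convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

Applied to each of the 15+ sampled frames, this keeps the request small enough to hit the sub-3-second verdict budget.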

Accomplishments that we're proud of

  • Video-Sequence Reasoning: Successfully utilizing Gemini 2.0's ability to reason across a series of images rather than just static inputs.
  • The "Film Room" UX: Creating a zero-friction interface where a user can record, perform, and receive a mentor's verdict without ever touching a keyboard.
  • On-Chain Achievements: Building a bridge between physical yoyo play and the Solana ecosystem.

What we learned

We learned that in AI coaching, timing is everything. Great mentors don't talk while the student is spinning; they watch the whole movement and then speak with authority. We also mastered the complexities of Asynchronous Frame Buffering in Python.

What's next for YOYO

  • AR Overlay: Using the Gemini analysis to draw the "correct" string path over the user's video replay.
  • Solana cNFTs: Minting full compressed NFTs with actual trick footage metadata.
  • Global Dojo: A leaderboard where users can compare their "Grind Density" charts.
