Inspiration
Learning yoyo tricks is notoriously difficult. Static YouTube tutorials require you to stop, rewind, and rewatch while holding a spinning object. We wanted to build a coach with eyes, a voice, and a memory. Instead of an AI that 'nags' you constantly, we built a Film Room Review system - the way Olympic athletes train - where a master watches your entire attempt and delivers deep, biomechanical feedback only when you are ready to listen.
What it does
Yoyo-Sensei Sync is a multimodal AI coaching platform. It features a unique 'Record-and-Review' flow:
- Step into the ring: record a 3-10 second clip of your attempt
- Multimodal Analysis: Gemini 2.0 watches the tape using temporal sequence reasoning to analyze the speed, momentum, and string topology
- Sprint verdict: your chosen 'Spirit Coach' (like the cocky, Gojo Satoru-inspired sensei) delivers a high-fidelity verbal verdict via ElevenLabs
- Proof of Mastery: successfully landing 'hard' tricks triggers an immutable transaction on the Solana blockchain, awarding you a digital trophy
How we built it
- The Brain (Gemini 2.0 Flash): We moved beyond single-frame analysis. We implemented a Temporal Buffer Engine that samples frames at 3 FPS and sends the entire sequence to Gemini. This allows the AI to see the physics of the yoyo's movement over time.
- The Voice (ElevenLabs Flash v2.5): To achieve premium synchronization, we used ElevenLabs' WebSocket API. The Sensei’s voice is tailored to the specific biomechanical failure detected in the video review.
- The Chain (Solana Devnet): We integrated a Proof-of-Mastery engine that mints achievement records on the Solana ledger, turning physical skill into a verifiable digital asset.
- The Analytics (Pandas): A "Grind Chart" tracks every attempt, success rate, and practice density, providing a data-driven path to Black Belt status.
- The Interface: A premium UI built in Streamlit, featuring glassmorphism, custom CSS, and a "Mastery Scroll" library of professional tricks.
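The 3 FPS sampling at the heart of the Temporal Buffer Engine boils down to picking evenly spaced frame indices from the recorded clip. Here is a minimal sketch of that idea (the `sample_indices` helper and its signature are our illustration, not the project's actual API):

```python
import math

def sample_indices(total_frames: int, video_fps: float,
                   target_fps: float = 3.0) -> list[int]:
    """Pick frame indices so a clip is sampled at roughly target_fps.

    The step between sampled frames is video_fps / target_fps; e.g. a
    30 FPS clip sampled at 3 FPS keeps every 10th frame.
    """
    step = video_fps / target_fps
    count = math.ceil(total_frames / step)
    return [min(total_frames - 1, round(i * step)) for i in range(count)]

# A 5-second clip at 30 FPS yields 15 frames: 0, 10, 20, ..., 140.
print(sample_indices(150, 30.0))
```

For a 5-second clip this selects about 15 frames, matching the payload size discussed in the Challenges section below; each selected frame would then be decoded (e.g. with OpenCV) and batched into the multimodal request to Gemini.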
Challenges we ran into
- Temporal Context: Sending a single photo wasn't enough to distinguish a "1.5 Mount" from a "Double or Nothing." We had to pivot our entire architecture to Sequential Frame Buffering, managing memory across the Streamlit event loop while preparing high-density multimodal payloads.
- The "Sync" War: We realized constant real-time feedback was distracting and caused audio overlap. We engineered a Session-Based Trigger that perfectly synchronizes the Wisdom Log with the ElevenLabs voice-stream at the end of each attempt.
- Multimodal Latency: Processing 15+ frames simultaneously is heavy. We implemented aggressive image normalization and downsampling (320p) to keep the "Verdict" time under 3 seconds for a full 5-second video clip.
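The 320p downsampling mentioned above amounts to shrinking each frame's shorter side to 320 px while preserving aspect ratio before encoding. A minimal sketch of that dimension math (the helper name is hypothetical; the real pipeline would apply it via OpenCV or Pillow resize calls):

```python
def downscale_to_320p(width: int, height: int,
                      target_short_side: int = 320) -> tuple[int, int]:
    """Compute output dimensions that shrink the shorter side to
    target_short_side px, preserving aspect ratio. Never upscales."""
    short = min(width, height)
    if short <= target_short_side:
        return width, height
    scale = target_short_side / short
    return round(width * scale), round(height * scale)

# A 1080p frame shrinks to 569x320 before being sent to the model.
print(downscale_to_320p(1920, 1080))
```

Shrinking 15+ frames this way cuts the multimodal payload dramatically, which is what keeps the end-to-end verdict under the latency budget described above.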
Accomplishments that we're proud of
- Video-Sequence Reasoning: Successfully utilizing Gemini 2.0's ability to reason across a series of images rather than just static inputs.
- The "Film Room" UX: Creating a zero-friction interface where a user can record, perform, and receive a mentor's verdict without ever touching a keyboard.
- On-Chain Achievements: Building a bridge between physical yoyo play and the Solana ecosystem.
What we learned
We learned that in AI coaching, timing is everything. Great mentors don't talk while the student is spinning; they watch the whole movement and then speak with authority. We also mastered the complexities of Asynchronous Frame Buffering in Python.
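The asynchronous frame buffering pattern can be sketched as an `asyncio.Queue` producer/consumer pair: frames stream into a bounded buffer during the attempt, and the full ordered sequence is drained once the clip ends. This is a simplified illustration with our own function names, not the project's exact code:

```python
import asyncio

async def capture_frames(queue: asyncio.Queue, frames: list) -> None:
    """Producer: push frames onto the bounded buffer as they arrive."""
    for frame in frames:
        await queue.put(frame)
    await queue.put(None)  # sentinel marking the end of the clip

async def buffer_clip(frames: list) -> list:
    """Consumer: drain the queue into an ordered buffer for one attempt."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=16)
    producer = asyncio.create_task(capture_frames(queue, frames))
    buffered = []
    while (frame := await queue.get()) is not None:
        buffered.append(frame)
    await producer
    return buffered

# Frames come out in capture order, ready for the end-of-attempt verdict.
print(asyncio.run(buffer_clip(["f1", "f2", "f3"])))
```

The bounded queue keeps memory in check while recording, and draining only after the sentinel arrives mirrors the "watch the whole movement, then speak" timing we describe above.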
What's next for YOYO
- AR Overlay: Using the Gemini analysis to draw the "correct" string path over the user's video replay.
- Solana cNFTs: Minting full compressed NFTs with actual trick footage metadata.
- Global Dojo: A leaderboard where users can compare their "Grind Density" charts.
Built With
- asyncio
- css
- elevenlabs
- gemini
- javascript
- numpy
- opencv
- pandas
- pillow
- python
- solana
- streamlit
- websockets