Inspiration
Learning yoyo tricks is notoriously difficult. Static YouTube tutorials require you to stop, rewind, and rewatch while holding a spinning object. We wanted to build a coach with eyes, a voice, and a memory. Instead of an AI that 'nags' you constantly, we built a Film Room Review system - the way Olympic athletes train - where a master watches your entire attempt and delivers deep, biomechanical feedback only when you are ready to listen.
What it does
Yoyo-Sensei Sync is a multimodal AI coaching platform. It features a unique 'Record-and-Review' flow:
- Step into the ring: record a 3-10 second clip of your attempt
- Multimodal Analysis: Gemini 2.0 watches the tape using temporal sequence reasoning to analyze the speed, momentum, and string topology
- Sprint verdict: your chosen 'Spirit Coach' (like the cocky, Gojo Satoru-inspired sensei) delivers a high-fidelity verbal verdict via ElevenLabs
- Proof of Mastery: successfully landing 'hard' tricks triggers an immutable transaction on the Solana blockchain, awarding you a digital trophy
How we built it
- The Brain (Gemini 2.0 Flash): We moved beyond single-frame analysis. We implemented a Temporal Buffer Engine that samples frames at 3 FPS and sends the entire sequence to Gemini. This allows the AI to see the physics of the yoyo's movement over time.
- The Voice (ElevenLabs Flash v2.5): To achieve premium synchronization, we used ElevenLabs' WebSocket API. The Sensei’s voice is tailored to the specific biomechanical failure detected in the video review.
- The Chain (Solana Devnet): We integrated a Proof-of-Mastery engine that mints achievement records on the Solana ledger, turning physical skill into a verifiable digital asset.
- The Analytics (Pandas): A "Grind Chart" tracks every attempt, success rate, and practice density, providing a data-driven path to Black Belt status.
- The Interface: A premium UI built in Streamlit, featuring glassmorphism, custom CSS, and a "Mastery Scroll" library of professional tricks.
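The 3 FPS sampling at the heart of the Temporal Buffer Engine boils down to picking evenly spaced frame indices from the recorded clip. Here is a minimal sketch of that idea (the `sample_indices` helper and its signature are our illustration, not the project's actual API):

```python
import math

def sample_indices(total_frames: int, video_fps: float,
                   target_fps: float = 3.0) -> list[int]:
    """Pick frame indices so a clip is sampled at roughly target_fps.

    The step between sampled frames is video_fps / target_fps; e.g. a
    30 FPS clip sampled at 3 FPS keeps every 10th frame.
    """
    step = video_fps / target_fps
    count = math.ceil(total_frames / step)
    return [min(total_frames - 1, round(i * step)) for i in range(count)]

# A 5-second clip at 30 FPS yields 15 frames: 0, 10, 20, ..., 140.
print(sample_indices(150, 30.0))
```

For a 5-second clip this selects about 15 frames, matching the payload size discussed in the Challenges section below; each selected frame would then be decoded (e.g. with OpenCV) and batched into the multimodal request to Gemini.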
Challenges we ran into
- Temporal Context: Sending a single photo wasn't enough to distinguish a "1.5 Mount" from a "Double or Nothing." We had to pivot our entire architecture to Sequential Frame Buffering, managing memory across the Streamlit event loop while preparing high-density multimodal payloads.
- The "Sync" War: We realized constant real-time feedback was distracting and caused audio overlap. We engineered a Session-Based Trigger that perfectly synchronizes the Wisdom Log with the ElevenLabs voice-stream at the end of each attempt.
- Multimodal Latency: Processing 15+ frames simultaneously is heavy. We implemented aggressive image normalization and downsampling (320p) to keep the "Verdict" time under 3 seconds for a full 5-second video clip.
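The 320p downsampling mentioned above amounts to shrinking each frame's shorter side to 320 px while preserving aspect ratio before encoding. A minimal sketch of that dimension math (the helper name is hypothetical; the real pipeline would apply it via OpenCV or Pillow resize calls):

```python
def downscale_to_320p(width: int, height: int,
                      target_short_side: int = 320) -> tuple[int, int]:
    """Compute output dimensions that shrink the shorter side to
    target_short_side px, preserving aspect ratio. Never upscales."""
    short = min(width, height)
    if short <= target_short_side:
        return width, height
    scale = target_short_side / short
    return round(width * scale), round(height * scale)

# A 1080p frame shrinks to 569x320 before being sent to the model.
print(downscale_to_320p(1920, 1080))
```

Shrinking 15+ frames this way cuts the multimodal payload dramatically, which is what keeps the end-to-end verdict under the latency budget described above.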
Accomplishments that we're proud of
- Video-Sequence Reasoning: Successfully utilizing Gemini 2.0's ability to reason across a series of images rather than just static inputs.
- The "Film Room" UX: Creating a zero-friction interface where a user can record, perform, and receive a mentor's verdict without ever touching a keyboard.
- On-Chain Achievements: Building a bridge between physical yoyo play and the Solana ecosystem.
What we learned
We learned that in AI coaching, timing is everything. Great mentors don't talk while the student is spinning; they watch the whole movement and then speak with authority. We also mastered the complexities of Asynchronous Frame Buffering in Python.
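The asynchronous frame buffering pattern can be sketched as an `asyncio.Queue` producer/consumer pair: frames stream into a bounded buffer during the attempt, and the full ordered sequence is drained once the clip ends. This is a simplified illustration with our own function names, not the project's exact code:

```python
import asyncio

async def capture_frames(queue: asyncio.Queue, frames: list) -> None:
    """Producer: push frames onto the bounded buffer as they arrive."""
    for frame in frames:
        await queue.put(frame)
    await queue.put(None)  # sentinel marking the end of the clip

async def buffer_clip(frames: list) -> list:
    """Consumer: drain the queue into an ordered buffer for one attempt."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=16)
    producer = asyncio.create_task(capture_frames(queue, frames))
    buffered = []
    while (frame := await queue.get()) is not None:
        buffered.append(frame)
    await producer
    return buffered

# Frames come out in capture order, ready for the end-of-attempt verdict.
print(asyncio.run(buffer_clip(["f1", "f2", "f3"])))
```

The bounded queue keeps memory in check while recording, and draining only after the sentinel arrives mirrors the "watch the whole movement, then speak" timing we describe above.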
What's next for YOYO
- AR Overlay: Using the Gemini analysis to draw the "correct" string path over the user's video replay.
- Solana cNFTs: Minting full compressed NFTs with actual trick footage metadata.
- Global Dojo: A leaderboard where users can compare their "Grind Density" charts.
Built With
- asyncio
- css
- elevenlabs
- gemini
- javascript
- numpy
- opencv
- pandas
- pillow
- python
- solana
- streamlit
- websockets