Inspiration

Violin practice is brutally honest but often frustratingly vague: you feel something is off, you replay it twenty times, and you still can’t tell what to fix. We were inspired by the gap between how a master teacher hears a performance and how most students practice alone. We wanted to give learners that teacher-level clarity every time they play, using modern AI tools—not just to score them, but to turn their playing into precise, actionable feedback.

What it does

YATA_vioLin takes a master “good” performance and a student “imperfect” performance of the same passage and compares them at the note level. It transcribes audio into MIDI with a state-of-the-art violin transcription model, aligns the two takes, pairs each reference note to a student note using a custom DTW-based matcher, and scores musically relevant deviations: wrong pitches, late or early attacks, dragged or clipped releases. Instead of overreacting to tiny accumulated drift, it finds continuous patches where the playing truly breaks down, then generates a clean report, a visual timeline, and an LLM-written coaching paragraph that tells you exactly where and how to practice next.
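The note-level deviations described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Note` fields mirror standard MIDI note attributes, while the tolerance values and field names are invented for the example.

```python
# Hedged sketch: score musically relevant deviations for one matched
# (reference, student) note pair. Thresholds here are illustrative only.
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int      # MIDI pitch number
    onset: float    # seconds
    offset: float   # seconds

def score_pair(ref: Note, stu: Note, onset_tol=0.05, offset_tol=0.08):
    """Return a dict of deviations for one matched note pair."""
    return {
        "wrong_pitch": stu.pitch != ref.pitch,
        "onset_delta": stu.onset - ref.onset,      # >0 late, <0 early
        "offset_delta": stu.offset - ref.offset,   # >0 dragged, <0 clipped
        "late": stu.onset - ref.onset > onset_tol,
        "clipped": ref.offset - stu.offset > offset_tol,
    }

ref = Note(pitch=69, onset=1.00, offset=1.50)   # A4, held half a second
stu = Note(pitch=69, onset=1.12, offset=1.38)   # late attack, clipped release
print(score_pair(ref, stu))
```

Per-pair dicts like this are what the later stages aggregate into severity metrics and, eventually, the written report.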

How we built it

We built an end-to-end pipeline: audio is first converted to a standard format, then passed through a state-of-the-art open-source deep neural network (MUSC, https://github.com/MTG/violin-transcription) for violin transcription to obtain structured MIDI notes. We parse each note into features (pitch, onset, offset, duration) and run a Dynamic Time Warping–based alignment tailored to musical phrases, followed by one-to-one greedy matching that can tolerate missing or extra notes. On top of the matched pairs, we compute severity metrics and detect “error patches” where timing and pitch simultaneously break down. Finally, we wrap everything in a Gradio interface with visual plots and summaries, and use LangChain to call a large language model (Gemini 2.0 via LangChain) that turns the raw error report into friendly, teacher-style feedback.
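The alignment step can be illustrated with a minimal DTW over (pitch, onset) pairs. The cost weights, gap penalty, and function name here are our own invention for the sketch, not the project's tuned values; the key idea from the text is that insertion/deletion moves in the DP table absorb missing or extra notes so the alignment does not collapse.

```python
# Minimal DTW note alignment sketch, assuming a cost that mixes pitch and
# onset distance. Weights and the gap penalty are illustrative only.
def dtw_align(ref, stu, w_pitch=1.0, w_onset=2.0, gap=1.5):
    """ref, stu: lists of (pitch, onset). Returns matched (i, j) index pairs."""
    n, m = len(ref), len(stu)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:  # match ref[i] with stu[j]
                c = (w_pitch * abs(ref[i][0] - stu[j][0])
                     + w_onset * abs(ref[i][1] - stu[j][1]))
                D[i + 1][j + 1] = min(D[i + 1][j + 1], D[i][j] + c)
            if i < n:            # student skipped ref[i] (missing note)
                D[i + 1][j] = min(D[i + 1][j], D[i][j] + gap)
            if j < m:            # student played an extra note
                D[i][j + 1] = min(D[i][j + 1], D[i][j] + gap)
    # backtrack to recover one-to-one note pairings
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        c = (w_pitch * abs(ref[i - 1][0] - stu[j - 1][0])
             + w_onset * abs(ref[i - 1][1] - stu[j - 1][1]))
        if abs(D[i][j] - (D[i - 1][j - 1] + c)) < 1e-9:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif abs(D[i][j] - (D[i - 1][j] + gap)) < 1e-9:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

ref = [(69, 0.0), (71, 0.5), (72, 1.0)]
stu = [(69, 0.02), (72, 1.05)]          # middle note missing
print(dtw_align(ref, stu))              # → [(0, 0), (2, 1)]
```

Because the gap moves carry a fixed cost, a dropped note is absorbed as one skip rather than dragging every later note out of alignment.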

Challenges we ran into

The hardest part was separating “harmless drift” from “real errors” in a way that felt musical instead of purely statistical. A student take can be mostly correct but contain a few disastrous moments, and naive DTW will smear those mistakes across the whole piece. We also had to design matching logic that gracefully handles missing and extra notes without collapsing the alignment, and tune thresholds so the system behaves like a good teacher: firm and explicit about true breakdowns, quiet about noise and tiny, inevitable imperfections.
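The "firm about breakdowns, quiet about noise" behavior can be sketched as a run-length rule over per-note severities: an isolated spike is ignored, while a sustained run above threshold becomes a reported patch. The severity scale, threshold, and minimum run length below are invented for illustration.

```python
# Hedged sketch of error-patch detection: flag only continuous runs of
# matched notes whose severity stays high. Threshold values are illustrative.
def find_error_patches(severities, threshold=0.5, min_len=3):
    """Return (start, end) index ranges where severity >= threshold
    holds for at least min_len consecutive notes."""
    patches, start = [], None
    for i, s in enumerate(severities):
        if s >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                patches.append((start, i - 1))
            start = None
    if start is not None and len(severities) - start >= min_len:
        patches.append((start, len(severities) - 1))
    return patches

# one isolated spike (ignored) and one real breakdown (reported)
sev = [0.1, 0.9, 0.1, 0.2, 0.8, 0.7, 0.9, 0.6, 0.1]
print(find_error_patches(sev))  # → [(4, 7)]
```

Tuning `threshold` and `min_len` is exactly the judgment call described above: too low and the system nags about inevitable imperfections, too high and it misses real trouble spots.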

Accomplishments that we're proud of

We’re proud that our custom DTW + greedy matching pipeline consistently produces stable note pairings even on messy, real student performances. Our error-patch detection surfaces what a musician would actually call a mistake—clusters of wrong pitch and timing—rather than random spikes in a metric. On top of that, we successfully combined three layers of modern AI in one flow: a SOTA deep learning model for transcription, a bespoke alignment/scoring algorithm, and an LLM that translates raw numbers into human, pedagogical language. The result is a polished demo that goes from upload to interpretable, teacher-like feedback in one pass.

What we learned

We learned that musical feedback is not just about counting differences; it’s about ranking them by relevance and context. The same deviation can be trivial in a rubato phrase and catastrophic in a metronomic passage, so our metrics and thresholds had to reflect musical intuition, not just math. We also saw how powerful it is to combine symbolic analysis with an LLM: when the underlying scores are interpretable, the language model can produce feedback that users immediately recognize as matching what they hear and feel.

What's next for YATA_vioLin

Next, we want to fully embrace audio-in/audio-out so any student can just record on a phone and get the same level of structured feedback, without touching MIDI. We plan to add style-aware analysis for intonation trends and phrasing stability, personalize thresholds by skill level, and track progress over time so the system can recommend targeted drills rather than just pointing out errors.

Built With

  • gradio
  • huggingface
  • langchain
  • python
  • torch