Inspiration

As a piano tutor, I have found that higher-level students often need a guide more than a teacher. They are no longer learning symbols or notes; they are learning where their playing goes wrong. So why not build software that does that? I wanted to build something closer to a real piano teacher than a generic score checker.

Most practice tools only tell you whether a note was right or wrong. I wanted a system that could also answer:

  • Did I play the right notes, at the right time?
  • Was my tempo stable?
  • Did my posture look healthy while I was playing?
  • What should I fix next, in plain language?

That idea became Doceo: a performance tutor that combines audio analysis, video analysis, and an AI-generated coaching layer.

What it does

Doceo takes in:

  • a reference MIDI file
  • a performance video

Then it:

  1. extracts audio from the video
  2. transcribes the performance into note events
  3. aligns the played notes against the reference score
  4. analyzes posture from sampled video frames
  5. generates feedback and drills
  6. renders the result in a web interface

Output:

  • score annotations
  • piano-roll fallback
  • video playback with pose overlay
  • reference audio playback
  • AI tutor feedback

How we built it

Frontend

The frontend, built with Next.js and React, lets the user:

  • upload a reference MIDI file
  • upload a performance video
  • run the analysis
  • view the results

The results page shows:

  • the annotated score
  • a piano-roll fallback
  • A/B playback with the user video and synthesized reference audio
  • posture overlay on top of the video
  • focus areas and drills
  • AI tutor feedback

Backend

The backend lives in api/ and is built with FastAPI.

Pipeline:

  • /midi parses the reference MIDI, exports MusicXML, and synthesizes a reference audio track
  • /video stores the performance video and extracts a mono WAV file with ffmpeg
  • /analyze transcribes the extracted audio with Basic Pitch
  • /align matches the played notes to the reference score with dynamic time warping (DTW)
  • /pose samples video frames with OpenCV + MediaPipe Pose/Hands
  • /tutor feeds the analysis into Gemini and ElevenLabs for spoken coaching
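
As an illustration of the /video step, here is a minimal sketch of how a mono WAV track could be pulled out of a video with ffmpeg. The function names, the file names in the usage note, and the 22050 Hz sample rate are assumptions made for this example, not the project's actual code.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, wav_path: str,
                          sample_rate: int = 22050) -> list[str]:
    """Build an ffmpeg command that writes a mono WAV track from a video.

    The 22050 Hz default is an assumption for this sketch.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # 16-bit PCM audio
        wav_path,
    ]


def extract_mono_wav(video_path: str, wav_path: str,
                     sample_rate: int = 22050) -> Path:
    """Run ffmpeg and return the path of the extracted WAV file."""
    command = build_extract_command(video_path, wav_path, sample_rate)
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(command, check=True, capture_output=True)
    return Path(wav_path)
```

Calling `extract_mono_wav("take.mp4", "take.wav")` would require ffmpeg on the PATH. Keeping the command builder separate makes it possible to inspect the arguments without running anything.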

Core Libraries

  • FastAPI for the API layer
  • Next.js and React for the frontend
  • ffmpeg for video-to-audio extraction
  • basic-pitch for audio-to-note transcription
  • pretty_midi and music21 for MIDI/score handling
  • fastdtw for note alignment
  • OpenCV for frame sampling
  • MediaPipe for pose and hand landmark detection
  • Gemini for tutor script generation
  • ElevenLabs for voice output

Challenges we ran into

Converting Video to MIDI With High Accuracy

This was the hardest part of the project.

The raw audio coming from a performance video is messy:

  • room noise leaks into the signal
  • pedal resonance blurs note boundaries
  • timing is not perfectly quantized
  • transcribers can miss short notes or create duplicates
  • performance tempo can drift compared with the reference

To make the transcription usable, I had to add multiple cleanup layers:

  • extract a clean mono track from the video
  • run Basic Pitch to get candidate note events
  • smooth and merge near-duplicate transcribed notes
  • estimate timing offset against synthesized reference audio
  • reference-guide the note cleanup using the score itself
  • write both raw and cleaned MIDI outputs for debugging
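
The duplicate-merging layer above can be sketched roughly as follows. This is a simplified illustration, not the exact cleanup used in the project: the `Note` class, the `merge_near_duplicates` name, and the 50 ms gap threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int    # MIDI pitch number
    start: float  # onset time in seconds
    end: float    # release time in seconds


def merge_near_duplicates(notes: list[Note], max_gap: float = 0.05) -> list[Note]:
    """Merge same-pitch notes whose gap (or overlap) is at most max_gap seconds."""
    merged: list[Note] = []
    last_by_pitch: dict[int, Note] = {}
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        previous = last_by_pitch.get(note.pitch)
        if previous is not None and note.start - previous.end <= max_gap:
            # Treat this fragment as a continuation of the previous note.
            previous.end = max(previous.end, note.end)
        else:
            # Copy the note so the caller's input list is left untouched.
            current = Note(note.pitch, note.start, note.end)
            merged.append(current)
            last_by_pitch[note.pitch] = current
    return merged
```

The threshold is a trade-off: too small and split notes stay duplicated, too large and genuinely repeated notes collapse into one. Checking the result against the reference score is one way to catch the second case.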

Even after transcription, note alignment still needed a second pass. I used DTW to compare the played note stream against the reference, then labeled each event as correct, wrong pitch, missed, extra, early, late, or on-time.

That combination of transcription plus alignment is what makes the output feel coherent instead of noisy.
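
To make the alignment pass concrete, here is a minimal sketch of DTW-based labeling. The project uses fastdtw; this example swaps in a plain quadratic DTW in pure Python so it has no dependencies. The `Event` class, the cost function, the 80 ms tolerance, and the way labels are collapsed into one tag per reference note are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    pitch: int    # MIDI pitch number
    onset: float  # onset time in seconds


def local_cost(ref: Event, played: Event, pitch_penalty: float = 1.0) -> float:
    """Onset distance, plus a fixed penalty when the pitches differ."""
    penalty = 0.0 if ref.pitch == played.pitch else pitch_penalty
    return abs(ref.onset - played.onset) + penalty


def dtw_path(ref: list[Event], played: list[Event]) -> list[tuple[int, int]]:
    """Return the minimum-cost warping path as (ref_index, played_index) pairs."""
    n, m = len(ref), len(played)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = local_cost(ref[i - 1], played[j - 1]) + step
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    path.reverse()
    return path


def label_events(ref: list[Event], played: list[Event], tolerance: float = 0.08):
    """Label each reference note and return the played notes left unmatched."""
    path = dtw_path(ref, played)
    # Accept the cheapest pairs first so each note is matched at most once.
    path.sort(key=lambda pair: local_cost(ref[pair[0]], played[pair[1]]))
    match: dict[int, int] = {}
    used_played: set[int] = set()
    for i, j in path:
        if i not in match and j not in used_played:
            match[i] = j
            used_played.add(j)
    labels = []
    for i, note in enumerate(ref):
        if i not in match:
            labels.append((note, "missed"))
            continue
        hit = played[match[i]]
        delta = hit.onset - note.onset
        if hit.pitch != note.pitch:
            labels.append((note, "wrong pitch"))
        elif delta < -tolerance:
            labels.append((note, "early"))
        elif delta > tolerance:
            labels.append((note, "late"))
        else:
            labels.append((note, "on-time"))
    extras = [played[j] for j in range(len(played)) if j not in used_played]
    return labels, extras
```

Because a DTW path can pair one played note with several reference notes, the sketch reduces the path to a one-to-one matching before labeling. Reference notes left over become "missed" and played notes left over become "extra".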

Accomplishments that we're proud of

We are proud of getting all of the moving parts to work together, and especially of turning the audio from a performance video into a usable MIDI file, which was extremely difficult.

What we learned

  • Audio transcription needs cleanup and alignment to be useful.
  • Reference-aware heuristics matter when performances drift from the score.
  • Audio and video analysis solve different parts of the problem.
  • Specific feedback beats a raw score.
  • Structured inputs make AI feedback much better.

What's next for Doceo

  • improve transcription accuracy on noisier recordings
  • support longer performances and bigger excerpts
  • separate left-hand and right-hand analysis
  • improve posture scoring
  • support multi-song sessions
  • track progress across practice attempts
  • deploy the pipeline for reliable scale
  • personalize feedback from recurring mistakes
