Doceo

Inspiration

As a piano tutor, often times for higher level students, what they require is not a teacher but rather a guide. High level piano students aren't learning symbols or notes, they are learning where they are playing wrong. So why not create a software that does that? I wanted to build something closer to a real piano teacher than a generic score checker.

Most practice tools only tell you whether a note was right or wrong. I wanted a system that could also answer:

Did I play the right notes, at the right time?
Was my tempo stable?
Did my posture look healthy while I was playing?
What should I fix next, in plain language?

That idea became Doceo: a performance tutor that combines audio analysis, video analysis, and an AI-generated coaching layer.

What it does

Privtor AI takes in:

a reference MIDI file
a performance video

Then it:

extracts audio from the video
transcribes the performance into note events
aligns the played notes against the reference score
analyzes posture from sampled video frames
generates feedback and drills
renders the result in a web interface

Output:

score annotations
piano-roll fallback
video playback with pose overlay
reference audio playback
AI tutor feedback

How we built it

Frontend

upload reference MIDI
upload performance video
run analysis
view results

The results page shows:

the annotated score
a piano-roll fallback
A/B playback with the user video and synthesized reference audio
posture overlay on top of the video
focus areas and drills
AI tutor feedback

Backend

The backend lives in api/ and is built with FastAPI.

Pipeline:

/midi parses the reference MIDI, exports MusicXML, and synthesizes a reference audio track
/video stores the performance video and extracts a mono WAV file with ffmpeg
/analyze transcribes the extracted audio with Basic Pitch
/align matches the played notes to the reference score with DTW
/pose samples video frames with OpenCV + MediaPipe Pose/Hands
/tutor feeds the analysis into Gemini and ElevenLabs for spoken coaching

Core Libraries

FastAPI for the API layer
Next.js and React for the frontend
ffmpeg for video-to-audio extraction
basic-pitch for audio-to-note transcription
pretty_midi and music21 for MIDI/score handling
fastdtw for note alignment
OpenCV for frame sampling
MediaPipe for pose and hand landmark detection
Gemini for tutor script generation
ElevenLabs for voice output

Challenges we ran into

Converting Video to MIDI With High Accuracy

This was the hardest part of the project.

The raw audio coming from a performance video is messy:

room noise leaks into the signal
pedal resonance blurs note boundaries
timing is not perfectly quantized
transcribers can miss short notes or create duplicates
performance tempo can drift compared with the reference

To make the transcription usable, I had to add multiple cleanup layers:

extract a clean mono track from the video
run Basic Pitch to get candidate note events
smooth and merge near-duplicate transcribed notes
estimate timing offset against synthesized reference audio
reference-guide the note cleanup using the score itself
write both raw and cleaned MIDI outputs for debugging

Even after transcription, note alignment still needed a second pass. I used DTW to compare the played note stream against the reference, then labeled each event as correct, wrong pitch, missed, extra, early, late, or on-time.

That combination of transcription plus alignment is what makes the output feel coherent instead of noisy.]

Accomplishments that we're proud of

Being able to put all of the moving parts together and able to scrape the audio from video into a actual usable MIDI file as this was extremely difficult.