Inspiration
When we read the problem statement, the first ideas were the obvious ones: a Socratic RAG tutor, or an AI that spits out explainer videos. But let's be honest: nobody wants to read another wall of text, and nobody wants to sit through AI-generated slop. What students actually do is learn by doing.
The everyday challenge is not a lack of information; it is being stuck. You hit a step in a problem set you cannot get past. Or you open software nobody taught you, like Fusion 360 for CAD, and the only help on offer is a one-hour YouTube tutorial you are never going to watch. Google gives you ten tabs. ChatGPT gives you a paragraph you then have to map back onto your own screen.
The thing that actually gets you unstuck is small and physical. It is someone leaning over, looking at what is in front of you, and pointing at the exact spot: "you missed a step right here." That act of pointing at the thing on your screen is what every AI tool is missing. So we built an agent that does it.
What it does
Buddy is an AI study buddy that lives next to your cursor. You hold a hotkey and just talk to it, like a friend who happens to know the subject.
You ask "why is my answer for question 3 wrong?" or "how do I turn this sketch into a solid?" and Buddy looks at your screen, finds the exact spot you are stuck on, and draws right on top of your work. It circles the mistake, draws an arrow to where you go next, underlines the rule you forgot, and writes a quick note. At the same time it explains it out loud in a sentence or two, not a wall of text.
The drawings fade once you have read them, or the moment you ask your next question. It works on top of any app: a problem set, a PDF, a CAD window, a website. Learn by doing, with someone pointing at the thing.
How we built it
The brain is Agnes (agnes-2.0-flash). When you hold the hotkey:
- You talk, and your speech is transcribed (Agnes does not do audio yet, so speech in and out are chained around it, the way the workshop suggested).
- When you let go, we take a screenshot.
- Agnes reads the screenshot, figures out what you are stuck on, and tells us exactly where to draw and what to say. The part that actually teaches is all Agnes: the seeing, the reasoning, and the pointing.
- We draw the circles and arrows on a see-through layer that floats over everything and never blocks your clicks.
- A voice provider reads the explanation out loud.
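The steps above can be sketched as one small loop. This is a minimal illustration, not the project's actual code: the JSON reply shape (`say`, `draw`), the `Annotation` fields, and the `ask_model`, `draw`, and `speak` callables are all hypothetical stand-ins for the transcription, Agnes, overlay, and voice pieces.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Annotation:
    kind: str        # assumed kinds: "circle", "arrow", "underline", "note"
    x: int           # screenshot pixel coordinates
    y: int
    text: str = ""   # only used for notes


@dataclass
class TutorReply:
    speech: str                    # what the voice reads out loud
    annotations: List[Annotation]  # what the overlay draws


def parse_reply(raw: str) -> TutorReply:
    """Parse a model reply, assuming it is JSON with 'say' and 'draw' keys."""
    data = json.loads(raw)
    annotations = [
        Annotation(a["kind"], int(a["x"]), int(a["y"]), a.get("text", ""))
        for a in data.get("draw", [])
    ]
    return TutorReply(speech=data["say"], annotations=annotations)


def handle_hotkey_release(
    transcript: str,
    screenshot: bytes,
    ask_model: Callable[[str, bytes], str],
    draw: Callable[[List[Annotation]], None],
    speak: Callable[[str], None],
) -> TutorReply:
    """One turn: question plus screenshot in, drawings plus speech out."""
    reply = parse_reply(ask_model(transcript, screenshot))
    draw(reply.annotations)   # overlay layer
    speak(reply.speech)       # voice provider
    return reply
```

Keeping the spoken sentence and the drawing instructions in separate fields is one way to make sure each output channel only ever receives what it should.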
It is bring-your-own-key. Agnes is the default brain. You can swap in Claude, GPT, or a free local model from settings, but Agnes is what runs unless you change it.
Challenges we ran into
The whole thing only works if it circles the right line. A tutor that circles the wrong step is worse than no tutor at all.
So the real question was: can Agnes point at a pixel, not just describe it? We fed it real screenshots and asked it to point. It came back close, about 30 pixels off on clean targets: good, but not tight enough to trust on a dense worksheet. The easy move would have been to throw the pointing at a different model, but then Agnes would not really be the brain anymore.
Instead we built a harness around Agnes. When its first answer is not pixel-tight, we draw a numbered grid on the screenshot and ask Agnes "which cell?", then zoom into that cell and ask again. Two passes and the circle lands on the right line. One model, made exact by the harness, not by outsourcing the vision.
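The coordinate math behind a grid-then-zoom pass can be sketched like this. It is an illustration under assumptions, not the harness itself: the 4x4 grid, the 1-based row-major cell numbering, and the function names are all hypothetical, and the list of picked cells stands in for the answers Agnes would give on each pass.

```python
from typing import Iterable, Tuple

# (left, top, width, height) in screenshot pixels
Box = Tuple[float, float, float, float]


def cell_box(region: Box, rows: int, cols: int, cell: int) -> Box:
    """Return the sub-box for a numbered cell (1-based, row-major) of a region."""
    if not 1 <= cell <= rows * cols:
        raise ValueError(f"cell {cell} is outside a {rows}x{cols} grid")
    left, top, width, height = region
    row, col = divmod(cell - 1, cols)
    cell_w, cell_h = width / cols, height / rows
    return (left + col * cell_w, top + row * cell_h, cell_w, cell_h)


def locate(screen: Box, picks: Iterable[int],
           rows: int = 4, cols: int = 4) -> Tuple[float, float]:
    """Narrow the region once per picked cell, then return its center point."""
    region = screen
    for cell in picks:
        region = cell_box(region, rows, cols, cell)
    left, top, width, height = region
    return (left + width / 2, top + height / 2)
```

For a 1600x800 screenshot, `locate((0, 0, 1600, 800), [6, 11])` returns `(650.0, 325.0)`: after two passes the remaining region is only 100x50 pixels, so the center is at most 50 pixels horizontally and 25 pixels vertically from any target inside it. Each extra pass shrinks that bound by the grid size again.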
The other hard parts were making sure the voice never reads coordinates out loud (nobody wants to hear "point, 340, comma, 210") and making the drawings land in the right place on laptops with odd screen scaling. We rewrote those until they were solid.
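Both fixes come down to small, easy-to-get-wrong helpers. The sketch below is an assumption about what such helpers could look like, not the code that shipped: the regular expression only covers two coordinate spellings chosen for illustration, and the scaling helper assumes the screenshot is in physical pixels while the overlay draws in logical units.

```python
import re
from typing import Tuple

# Hypothetical coordinate spellings: "(340, 210)" and "point 340, 210".
_COORDS = re.compile(r"\(\s*\d+\s*,\s*\d+\s*\)|\bpoint\s+\d+\s*,\s*\d+\b",
                     re.IGNORECASE)


def strip_coordinates(text: str) -> str:
    """Remove coordinate pairs so the voice never reads them out loud."""
    cleaned = _COORDS.sub("", text)
    cleaned = re.sub(r"\s+([.,!?])", r"\1", cleaned)  # no space before punctuation
    return re.sub(r"\s{2,}", " ", cleaned).strip()    # collapse leftover gaps


def to_overlay(x: float, y: float, scale: float) -> Tuple[float, float]:
    """Convert screenshot pixels to overlay units for a display scale factor.

    Assumes scale is physical pixels per logical unit (1.5 for 150% scaling).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return (x / scale, y / scale)
```

With these, `strip_coordinates("You dropped the minus sign here (340, 210).")` gives `"You dropped the minus sign here."`, and `to_overlay(600, 300, 1.5)` gives `(400.0, 200.0)`. A pattern this narrow will miss other spellings, so a production filter would need a broader set of cases.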
What we learned
The model call is the easy 20 percent. The other 80 percent is everything around it that nobody sees: the latency, the see-through overlay that works on any app, keeping the voice clean, the grid harness that lands the pixel. That invisible work is the difference between something that feels like a person sitting next to you and a chatbot in a box.
And you do not need a pile of models. You need one good one and a good harness around it. Agnes does the seeing and the reasoning, the harness makes its pointing exact, and that is the whole agent.