Inspiration

Inspired by Kevin’s upbringing in a Montessori environment and insights from his mother, a kindergarten teacher, we identified a critical gap in early childhood education: the struggle to balance independent exploration with specialized instruction. Teachers are often forced into "one-size-fits-all" teaching, leaving them unable to provide the targeted attention required to master foundational skills like stroke order and left-to-right writing flow. We built our project to automatically track these nuanced learning signals, allowing students to practice and receive real-time feedback at their own pace. By bridging the gap between independent study and expert observation, the platform empowers students to develop essential skills while providing teachers with the data-driven insights they need to intervene where it matters most.

What it does

Phoneme is a hands-on literacy experience for early learners. It starts in a calm, Montessori-style object box: kids pick a 3D object, then spell its name by placing letter tiles in order. When the word is complete, they say the word out loud before moving on, so reading, writing, and speaking stay linked. From there, the app guides them to the drawing pad, where they practice writing the same word with on-screen guidance that shows stroke direction and order. Along the way, short prompts and gentle feedback keep the task feeling like play, not pressure. For teachers, Phoneme supports the classroom side of that flow: set up a class, start a focused session on a student’s device, and review each student’s progress so instruction stays organized.

How we built it

Phoneme is a TypeScript web app: the frontend is built with Vite and includes several pages (drawing pad, object box, and teacher dashboard), with Three.js powering the 3D object box. The backend runs on Cloudflare Workers and exposes REST-style /api routes that the frontend calls with normal fetch requests, often with cookies for signed-in teachers. The worker stores structured data in Cloudflare D1, a SQL database, and uses Cloudflare KV for quick pronunciation lookups. Optional integrations like Tripo (3D generation), ElevenLabs (speech), SpeechAce (pronunciation scoring), and Anthropic (AI helpers) are called from the server so API keys stay off the browser. In local development the frontend dev server forwards API traffic to the worker. In production the same worker serves both the static site and the API so everything lines up as one cohesive app.
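To make that shape concrete, here’s a minimal sketch of the Worker’s routing pattern. The binding names, routes, and table schema below are illustrative assumptions, not our actual code:

```typescript
// Types like D1Database/KVNamespace come from @cloudflare/workers-types.
// Binding names, routes, and the schema here are illustrative only.
interface Env {
  DB: D1Database;              // Cloudflare D1 (SQL)
  PRONUNCIATIONS: KVNamespace; // Cloudflare KV (fast lookups)
  ASSETS: Fetcher;             // static site bundle
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // KV path: quick pronunciation lookups by word.
    if (url.pathname === "/api/pronunciation") {
      const word = url.searchParams.get("word") ?? "";
      const cached = await env.PRONUNCIATIONS.get(word, "json");
      if (cached) return Response.json(cached);
      return new Response("Not found", { status: 404 });
    }

    // D1 path: structured classroom data.
    if (url.pathname === "/api/students") {
      const { results } = await env.DB
        .prepare("SELECT id, name FROM students WHERE class_id = ?")
        .bind(url.searchParams.get("class") ?? "")
        .all();
      return Response.json(results);
    }

    // Everything else falls through to the static site, so in
    // production one Worker serves both the app and the API.
    return env.ASSETS.fetch(request);
  },
};
```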

Challenges we ran into

  1. The handwriting coach was technically right and completely useless. For handwriting, we compared a kid’s stroke to a template and turned the mismatch into spoken feedback. The bug: when a kid drew a vertical line instead of a lowercase c, the coach would say: “Start the c a little lower down.” That was geometrically true, but useless. The real issue was: “You drew a line. A c needs to curve.” The cause was normalization. We normalized the kid’s stroke to its own bounding box and the template to its own bounding box. So a line and a curve ended up in different coordinate frames, and the system overreacted to endpoint position instead of shape. The fix was to normalize both strokes against the template’s box, then prioritize shape errors over position nudges. Lesson: if you compare two normalized things, be very clear whose frame they live in. (See the first sketch after this list.)

  2. The LLM coach kept inverting its advice. For subtle shape errors, we sent the kid’s stroke and the correct stroke to a vision LLM and asked for a one-sentence coaching tip. About 10% of the time, it flipped them. The kid would draw a line where a curve belonged, and the model would say something like: “Make it straighter.” The images were in the right order. The model just sometimes lost track of which image was “yours” and which was “correct.” The fix was to stop sending two separate images. We rendered one side-by-side PNG with big labels baked into the pixels: YOUR STROKE on the left, CORRECT on the right. Then we told the model to trust the labels in the image. We also skipped the LLM entirely for obvious cases. If we already knew the stroke was backwards, perpendicular, or missing, we used a canned tip. Lesson: if a vision model needs to know which thing is which, put that information in the image itself. (See the second sketch after this list.)

  3. TTS could not say phonemes. We needed prompts like: “Find the letter that says /fff/.” But ElevenLabs, Google, and OpenAI TTS kept reading isolated f as “eff.” We tried fff, fuh, /f/, “f as in fish,” SSML, and prompt hacks. Nothing was reliable. The fix was to stop asking TTS to say phonemes. We used real phoneme recordings and stored them as static audio files. TTS says the carrier sentence — “Find the letter that says…” — then we splice in the real /F/ recording with Web Audio so it plays without a gap. Static recordings were less fancy, but they were correct every time. Lesson: for phonics, “close enough” audio is not close enough. (See the third sketch after this list.)
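A minimal sketch of the normalization fix from challenge 1, with a hypothetical Point type standing in for our real stroke data. The key move is mapping both strokes into the template’s frame, so a straight line no longer masquerades as a well-placed curve:

```typescript
interface Point { x: number; y: number; }

// Bounding box of a stroke (hypothetical helper).
function boundingBox(stroke: Point[]) {
  const xs = stroke.map((p) => p.x);
  const ys = stroke.map((p) => p.y);
  return {
    minX: Math.min(...xs),
    minY: Math.min(...ys),
    // Guard against zero-size boxes (e.g. a perfectly vertical line).
    width: Math.max(Math.max(...xs) - Math.min(...xs), 1e-6),
    height: Math.max(Math.max(...ys) - Math.min(...ys), 1e-6),
  };
}

// The bug: normalizing each stroke to its OWN box puts a line and a
// curve in different coordinate frames. The fix: normalize both the
// kid's stroke and the template against the TEMPLATE's box.
function normalizeToFrame(stroke: Point[], frame: Point[]): Point[] {
  const box = boundingBox(frame);
  return stroke.map((p) => ({
    x: (p.x - box.minX) / box.width,
    y: (p.y - box.minY) / box.height,
  }));
}

// Usage: one shared frame for both, then compare shape before position.
// const kid = normalizeToFrame(kidStroke, templateStroke);
// const ref = normalizeToFrame(templateStroke, templateStroke);
```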
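A sketch of the labeled-image fix from challenge 2, using the browser canvas API; the helper name and layout numbers are arbitrary. The point is that the labels live in the pixels the model actually sees:

```typescript
// Compose one side-by-side PNG so the vision model can't lose track of
// which stroke is which: the labels are baked into the image itself.
function renderComparison(
  yourStroke: HTMLCanvasElement,
  correctStroke: HTMLCanvasElement,
): string {
  const canvas = document.createElement("canvas");
  canvas.width = 520;
  canvas.height = 300;
  const ctx = canvas.getContext("2d")!;

  ctx.fillStyle = "white";
  ctx.fillRect(0, 0, canvas.width, canvas.height);

  // Big labels in the pixels, not as separate prompt metadata.
  ctx.fillStyle = "black";
  ctx.font = "bold 24px sans-serif";
  ctx.fillText("YOUR STROKE", 40, 34);
  ctx.fillText("CORRECT", 330, 34);

  ctx.drawImage(yourStroke, 20, 50, 220, 220);
  ctx.drawImage(correctStroke, 280, 50, 220, 220);

  // One data-URL PNG goes into the LLM prompt.
  return canvas.toDataURL("image/png");
}
```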
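And a sketch of the splicing from challenge 3, with hypothetical file paths: both clips are scheduled on the same AudioContext clock, so the static phoneme recording starts the instant the TTS carrier sentence ends:

```typescript
// Fetch and decode a clip into an AudioBuffer.
async function loadClip(ctx: AudioContext, url: string): Promise<AudioBuffer> {
  const res = await fetch(url);
  return ctx.decodeAudioData(await res.arrayBuffer());
}

// Play the TTS carrier ("Find the letter that says...") and splice in
// the pre-recorded phoneme with no audible gap. Paths are hypothetical.
async function playPhonemePrompt(ctx: AudioContext): Promise<void> {
  const [carrier, phoneme] = await Promise.all([
    loadClip(ctx, "/audio/carrier-find-the-letter.mp3"),
    loadClip(ctx, "/audio/phonemes/f.mp3"),
  ]);

  const play = (buffer: AudioBuffer, when: number) => {
    const src = ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(ctx.destination);
    src.start(when);
  };

  const startAt = ctx.currentTime + 0.05; // tiny scheduling cushion
  play(carrier, startAt);
  play(phoneme, startAt + carrier.duration); // same clock, so no gap
}
```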

Accomplishments that we're proud of

We turned a real classroom pain point into something you can actually use: a calm, kid-first flow that connects spelling, speaking, and handwriting instead of treating them like separate worksheets. We’re proud we shipped a working end-to-end experience, with teacher setup, student sessions, practice tools, and enough signal for teachers to see where kids are struggling, while keeping the UI gentle enough for kindergarteners. We’re especially proud it feels less like “another app” and more like a classroom material kids can actually play with while they learn.

What we learned

What stuck with us is that early literacy rarely comes down to flashy features. Instead, it’s built from small, teachable habits, like stroke order and moving across the page in a consistent direction. We also discovered that teachers aren’t really asking for more noise; they’re asking for a clearer signal about who actually needs help right now. On the technical side, we learned to stay pragmatic about tools. In our tests, a general-purpose model (Gemini) sometimes gave us a faster, more flexible phoneme-style signal than a specialized speech API (SpeechAce), which was a good reminder that “specialized” isn’t automatically “best,” and that you have to validate with real audio and real prompts. Finally, 3D in the browser has real constraints: remote models, textures, and mesh sizes mean you need loading states, sensible scaling, and fallbacks. While we were iterating, simple procedural placeholders were often better than a blank screen because they made debugging faster and kept the experience feeling alive even when generation was still catching up, as sketched below.
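The placeholder pattern is simple enough to sketch. This is an illustrative version, not our actual loader: show a cheap procedural mesh immediately, then swap in the generated model once it finishes downloading.

```typescript
import * as THREE from "three";
// Import path varies by three.js version ("three/addons/..." in newer ones).
import { GLTFLoader } from "three/examples/jsm/loaders/GLTFLoader.js";

// Add a plain box right away so the scene is never blank, then replace
// it with the remote model on load. On failure, the placeholder stays.
function addObjectWithPlaceholder(scene: THREE.Scene, url: string): void {
  const placeholder = new THREE.Mesh(
    new THREE.BoxGeometry(1, 1, 1),
    new THREE.MeshStandardMaterial({ color: 0xcccccc }),
  );
  scene.add(placeholder);

  new GLTFLoader().load(
    url,
    (gltf) => {
      // Sensible scaling: fit the remote model to roughly unit size.
      const size = new THREE.Box3()
        .setFromObject(gltf.scene)
        .getSize(new THREE.Vector3());
      gltf.scene.scale.multiplyScalar(
        1 / (Math.max(size.x, size.y, size.z) || 1),
      );
      scene.remove(placeholder);
      scene.add(gltf.scene);
    },
    undefined, // progress callback unused here
    () => { /* keep the placeholder visible on failure */ },
  );
}
```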

What's next for Phoneme

Looking ahead, we’d like to move from our current slightly cursive handwriting style toward a more standardized alphabet and stroke dataset, so guidance feels clearer and closer to what many kindergarten classrooms already teach. We also want smarter personalization that doesn’t burn through credits—things like caching repetitive but successful outputs and giving teachers a small, curated library of approved objects kids can revisit. On the learning side, we’d deepen the feedback so it speaks more directly to everyday handwriting habits (spacing, direction, stroke order), and we’d run pilots in real classrooms to tune pacing, language, and accessibility until the experience feels natural for every learner, not just the “easy” cases.

Built With

anthropic, cloudflare-workers, d1, elevenlabs, kv, speechace, three.js, tripo, typescript, vite
