Inspiration
Writing feedback in schools is broken by time, not talent. A teacher with 30 students can spend an entire weekend marking essays and still return work with comments like "good structure, develop your ideas" — feedback so generic it teaches nothing. By the time it lands on a student's desk, the moment to improve has passed.
We wanted to build the thing that should already exist: a system that reads a student's essay the way a trained marker does — criterion by criterion, with evidence from the actual text — and returns that feedback in seconds, not days. Not a grammar checker. Not a readability score. Real rubric-based assessment.
The harder version of that problem is what kept us up at night: how do you make AI feedback trustworthy enough that a teacher would actually show it to a student and stand behind it?
What it does
Markly is a full writing practice platform built around AI marking.
Students submit a piece of writing — narrative, persuasive, or expository — and get back:
- A score on every rubric criterion, with evidence quoted directly from their text
- What they did well, what to fix, and a specific micro-goal for next time
- A corrected version of their essay with errors annotated, alongside a strength-highlighted clean draft
- Sentence-level coaching hints on specific paragraphs
- An integrity signal — a heuristic estimate of plagiarism risk and AI authorship likelihood
The practice system keeps students improving between submissions. Daily writing prompts are generated across four frameworks (NAPLAN, IB, Common Core US, AP Language/Literature) for every year level and genre. An adaptive planner analyses a student's last ten submissions, identifies their weakest criteria, and generates a personalised two-week activity schedule — each exercise paired with a model answer and annotated highlights explaining exactly why it scores full marks.
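As a rough illustration of the "weakest criteria" step, here is a minimal sketch. The `CriterionScore` shape, the `weakestCriteria` helper, and the choice to average percentage scores over the last ten submissions are assumptions made for this example, not a description of Markly's actual code.

```typescript
// Hypothetical shape: one scored criterion from one marked submission.
interface CriterionScore {
  criterion: string;
  score: number;
  maxScore: number;
}

// A submission is a list of criterion scores; the newest submission comes last.
type Submission = CriterionScore[];

// Average each criterion's percentage over the most recent `window` submissions
// and return the `count` lowest-scoring criteria, weakest first.
function weakestCriteria(
  history: Submission[],
  count = 3,
  window = 10,
): { criterion: string; percent: number }[] {
  const totals: Record<string, { sum: number; n: number }> = {};
  for (const submission of history.slice(-window)) {
    for (const { criterion, score, maxScore } of submission) {
      if (maxScore <= 0) continue; // skip malformed rubric entries
      const entry = totals[criterion] ?? { sum: 0, n: 0 };
      entry.sum += score / maxScore;
      entry.n += 1;
      totals[criterion] = entry;
    }
  }
  return Object.keys(totals)
    .map((criterion) => {
      const { sum, n } = totals[criterion];
      return { criterion, percent: Math.round((sum / n) * 100) };
    })
    .sort((a, b) => a.percent - b.percent)
    .slice(0, count);
}
```

The returned list could then be passed into the plan-generation prompt, so the schedule targets the criteria with the lowest averages.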
Teachers and parents get a separate management layer:
- Assign tasks with deadlines and timer presets
- Set score and streak goals
- Build custom rubrics with AI-generated model answers
- Create reward catalogs students can redeem with their credits
- Receive weekly digest emails on student activity
When a student completes an assigned task, credits come from the teacher's pool — not the student's.
How we built it
The stack is Next.js 15 with React 19 on the frontend, deployed to Vercel. The backend is NestJS running on Fastify, hosted on a Google Cloud VM. PostgreSQL on Neon handles persistence through a 26-model Prisma schema. Firebase handles authentication. LLM calls go through an OpenAI-compatible SDK pointed at an Ollama endpoint running gemma4:31b-cloud.
The marking pipeline is the core of the system. When a submission comes in, we select the right rubric for the student's framework, year level, and genre — pulling from seeded NAPLAN scoring folder data, or from one of twelve built-in framework rubric variants. We compact that into a dense prompt snippet and run three parallel marking passes at different calibration levels:
- Normal — balanced, follows descriptors exactly
- Generous — awards higher adjacent bands when evidence is plausible
- Picky — requires explicit sustained evidence before awarding higher bands
This gives teachers a score range rather than a falsely precise single number. The display score is the Normal pass; the range is [Picky, Generous].
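A minimal sketch of how the three passes could be combined. The `MarkFn` callback stands in for the real LLM call and is hypothetical, as are the type names. Widening the bounds so the range always contains the Normal score is an extra guard assumed for this example.

```typescript
type Calibration = "normal" | "generous" | "picky";
type PassScores = Record<Calibration, number>;

interface ScoreSummary {
  display: number; // the Normal pass
  low: number; // lower bound, normally the Picky pass
  high: number; // upper bound, normally the Generous pass
}

// Combine the three totals into a display score plus a range.
// Assumption: if a pass disagrees with its label (for example Picky scoring
// above Normal), the range is widened so it still contains the display score.
function summarisePasses(scores: PassScores): ScoreSummary {
  return {
    display: scores.normal,
    low: Math.min(scores.picky, scores.normal),
    high: Math.max(scores.generous, scores.normal),
  };
}

// Hypothetical stand-in for the function that runs one marking pass.
type MarkFn = (essay: string, calibration: Calibration) => Promise<number>;

// Run the three passes in parallel and summarise them.
async function markWithRange(essay: string, mark: MarkFn): Promise<ScoreSummary> {
  const [normal, generous, picky] = await Promise.all([
    mark(essay, "normal"),
    mark(essay, "generous"),
    mark(essay, "picky"),
  ]);
  return summarisePasses({ normal, generous, picky });
}
```

Because the passes are independent, running them with `Promise.all` keeps total latency close to that of a single pass.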
The adaptive plan uses Jaccard similarity on tokenised submission text to build a criterion performance map across the student's history, then feeds their weakest areas into a prompt that generates a structured two-week schedule with model answers for each activity.
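For reference, Jaccard similarity between two tokenised texts is the size of their shared vocabulary divided by the size of their combined vocabulary. The sketch below covers only that calculation; the tokeniser is a simple assumed one, and how the similarity values feed the criterion performance map is not shown.

```typescript
// Lowercase the text and collect its unique word tokens.
// Assumption: tokens are runs of ASCII letters, digits, and apostrophes.
function tokenise(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  return new Set<string>(words);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // treat two empty texts as identical
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared += 1;
  });
  // |A ∪ B| = |A| + |B| - |A ∩ B|
  return shared / (a.size + b.size - shared);
}
```

For example, "the cat sat" and "the cat ran" share two of four distinct tokens, giving a similarity of 0.5.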
The credit system prices marking dynamically by word count, scaling from 90 to 150 credits with tier-specific peaks at 500, 750, or 1000 words, so longer submissions cost proportionally more on every subscription tier.
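One way to picture the pricing curve is as a linear ramp from 90 credits up to 150 credits at the tier's peak word count, flat after that. The linear shape, the rounding, and the `markingCost` name are assumptions for this sketch; only the 90 and 150 credit bounds and the 500, 750, and 1000 word peaks come from the description above.

```typescript
const MIN_CREDITS = 90;
const MAX_CREDITS = 150;

// The word count at which a tier reaches the maximum price.
type PeakWords = 500 | 750 | 1000;

// Assumed pricing curve: linear from MIN_CREDITS at 0 words to MAX_CREDITS
// at `peakWords`, then capped at MAX_CREDITS for longer submissions.
function markingCost(wordCount: number, peakWords: PeakWords): number {
  const clamped = Math.min(Math.max(wordCount, 0), peakWords);
  const fraction = clamped / peakWords;
  return Math.round(MIN_CREDITS + fraction * (MAX_CREDITS - MIN_CREDITS));
}
```

Under these assumptions, a 500-word essay costs 150 credits on a tier that peaks at 500 words and 120 credits on a tier that peaks at 1000 words.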
Challenges we ran into
Making the AI mark strictly
LLMs are optimistic by default. Getting consistent, rubric-anchored scores without inflating marks required significant prompt engineering. The system prompt has to be extremely explicit: keep evidence quotes under 20 words, award full marks only for flawless work, never invent criteria. Even then, calibration drift between model versions meant we needed the three-pass scoring system as a structural correction rather than a prompt-level fix.
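Prompt rules like the quote limit can also be checked after the model responds. The sketch below is an assumption about how such a check might look, not something described above as part of the pipeline: the `checkEvidence` helper and its normalisation rules are hypothetical.

```typescript
interface EvidenceCheck {
  ok: boolean;
  reason?: string;
}

// Quotes must be under this many words, matching the prompt rule.
const MAX_QUOTE_WORDS = 20;

// Straighten curly quotes, collapse whitespace, and lowercase, so that
// trivial formatting differences do not cause a false mismatch.
function normalise(text: string): string {
  return text
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u201C\u201D]/g, '"')
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

// Accept a quote only if it is non-empty, short enough, and actually
// appears in the student's essay.
function checkEvidence(quote: string, essay: string): EvidenceCheck {
  const cleaned = normalise(quote);
  if (cleaned.length === 0) return { ok: false, reason: "empty quote" };
  if (cleaned.split(" ").length >= MAX_QUOTE_WORDS) {
    return { ok: false, reason: "quote too long" };
  }
  if (normalise(essay).indexOf(cleaned) === -1) {
    return { ok: false, reason: "quote not found in essay" };
  }
  return { ok: true };
}
```

A failed check could trigger a retry or drop the quote, so a student never sees evidence that is not in their own text.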
Rubric fidelity across frameworks
NAPLAN narrative has "Character and Setting." NAPLAN persuasive has "Persuasive Devices." IB has four criteria scored out of 8. AP Literature has four different criteria scored out of 4. Each combination of framework, genre, and year level needs the right rubric slice, and getting that selection logic right, with sane fallbacks when data is missing, surfaced more edge cases than expected.
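A minimal sketch of rubric selection with a fallback. The `Rubric` shape and the fallback rule (nearest year level within the same framework and genre) are assumptions chosen for illustration; the real selection logic may use a different order.

```typescript
// Hypothetical rubric record.
interface Rubric {
  framework: string;
  genre: string;
  yearLevel: number;
  criteria: string[];
}

// Pick the rubric for a framework and genre. An exact year-level match wins
// (distance 0); otherwise fall back to the nearest available year level.
// Returns undefined when the framework and genre have no rubric at all,
// so the caller can decide how to handle missing data.
function selectRubric(
  rubrics: Rubric[],
  framework: string,
  genre: string,
  yearLevel: number,
): Rubric | undefined {
  const candidates = rubrics.filter(
    (r) => r.framework === framework && r.genre === genre,
  );
  if (candidates.length === 0) return undefined;
  return candidates.reduce((best, r) =>
    Math.abs(r.yearLevel - yearLevel) < Math.abs(best.yearLevel - yearLevel)
      ? r
      : best,
  );
}
```

Returning `undefined` instead of a guess keeps the failure visible, which matters when the wrong rubric would silently produce unfair scores.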
Trust
The hardest non-technical problem. A teacher won't show AI feedback to a student unless they believe it's fair. Every design decision — showing evidence quotes, displaying calibration ranges, flagging inapplicable criteria as N/A rather than penalising the student, surfacing the band descriptor the AI used — exists to make the reasoning auditable.
Auth domain fragmentation
Deploying across Vercel and a custom domain surfaced a Firebase auth iframe 404 that silently broke all sign-in methods. The fix was one environment variable, but diagnosing it required tracing through Firebase's auth handshake to find which domain was being used and why.
Accomplishments that we're proud of
- A marking pipeline that returns criterion-level feedback with evidence quotes in under 30 seconds, across NAPLAN, IB, Common Core US, and AP Language/Literature
- Three-pass calibration that gives a principled score range instead of a single AI guess
- An adaptive plan that doesn't say "practice more" — it says "your Cohesion is at 58% across your last ten submissions; here are five specific activities this week, each with a model answer showing you exactly what full marks looks like"
- A credit system with dynamic word-count pricing, teacher-pool billing for assigned tasks, and daily tier-based refresh — all handled in atomic database transactions
- Custom rubrics where teachers define their own criteria, the AI generates a model answer across all of them, and students can be marked against it
- Inapplicable criteria handled correctly: if a narrative-only criterion appears on a persuasive submission, it gets full marks and is flagged as N/A — the student is never penalised for the rubric not fitting
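The N/A rule in the last item can be sketched as follows. The `RubricCriterion` shape (with a list of genres each criterion applies to) and the `resolveScores` helper are assumptions for this example.

```typescript
// Hypothetical rubric entry: a criterion and the genres it applies to.
interface RubricCriterion {
  name: string;
  maxScore: number;
  genres: string[];
}

interface MarkedCriterion {
  name: string;
  score: number;
  maxScore: number;
  notApplicable: boolean;
}

// Turn raw per-criterion scores into final scores. A criterion that does not
// apply to the submission's genre gets full marks and an N/A flag, so the
// student is never penalised for the rubric not fitting.
function resolveScores(
  rubric: RubricCriterion[],
  rawScores: Record<string, number>,
  genre: string,
): MarkedCriterion[] {
  return rubric.map((criterion) => {
    const { name, maxScore } = criterion;
    if (criterion.genres.indexOf(genre) === -1) {
      return { name, score: maxScore, maxScore, notApplicable: true };
    }
    // Missing scores count as 0; out-of-range scores are clamped.
    const raw = rawScores[name] ?? 0;
    const score = Math.min(Math.max(raw, 0), maxScore);
    return { name, score, maxScore, notApplicable: false };
  });
}
```

Keeping the flag separate from the score lets the report display "N/A" while totals remain comparable across submissions.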
What we learned
Prompt engineering is product design. A single sentence in a system prompt changes scores by entire bands. We learned to treat the AI's instructions with the same rigour as a UI specification — iterated, tested against edge cases, and version-controlled.
We also learned that the ceiling on EdTech AI tools isn't model quality; it's teacher trust. The question isn't "can the AI mark accurately?" It's "would a teacher stake their professional reputation on showing this to a student?" That reframe changed almost every UI and feedback design decision we made.
Finally: dynamic pricing is deceptively hard to get right. Word-count-based credit scaling sounds simple until you're handling tier-specific peaks, atomic daily refresh, teacher-pool deductions, and refund logic for failed marking jobs simultaneously.
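To make two of those moving parts concrete, here is an in-memory sketch of teacher-pool deduction and refunds for failed marking jobs. It is illustrative only: the `Wallets` map and both helpers are hypothetical, and a production version would run each step inside a database transaction rather than mutating an object.

```typescript
// Hypothetical in-memory balances, keyed by user id.
type Wallets = Record<string, number>;

// Record of who paid and how much, kept so a failed job can be refunded.
interface Charge {
  payerId: string;
  amount: number;
}

// Deduct the marking cost. Assigned tasks are billed to the teacher's pool;
// everything else is billed to the student.
function chargeForMarking(
  wallets: Wallets,
  studentId: string,
  cost: number,
  assignedByTeacherId?: string,
): Charge {
  const payerId = assignedByTeacherId ?? studentId;
  const balance = wallets[payerId] ?? 0;
  if (balance < cost) throw new Error("insufficient credits");
  wallets[payerId] = balance - cost;
  return { payerId, amount: cost };
}

// Return the credits to whoever was originally charged.
function refundCharge(wallets: Wallets, charge: Charge): void {
  wallets[charge.payerId] = (wallets[charge.payerId] ?? 0) + charge.amount;
}
```

Storing the payer on the charge means a refund goes back to the right pool even if the task's assignment changes after submission.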
What's next for Markly Writing
- Live classroom mode — teacher broadcasts a prompt, students write simultaneously, results appear on a live class dashboard as they come in
- Voice feedback — text-to-speech narration of the marking report for younger students and accessibility
- Exam simulation — timed, distraction-free writing mode that mirrors real test conditions, locks the interface, and auto-submits at the deadline
- Richer LMS integrations — Google Classroom and Canvas sync so teachers can assign, collect, and return marked work without leaving their existing tools
- Multi-language support — the framework architecture already handles different rubric systems; extending to non-English assessment frameworks is the natural next step
Built With
- fastify
- firebase
- gemma4
- google-cloud
- gsap
- nestjs
- next.js
- ollama
- postgresql
- prisma
- radix-ui
- react
- recharts
- stripe
- tailwindcss
- tanstack-query
- typescript
- vercel
- zustand