Inspiration
Most AI question-generation tools focus on producing outputs, but provide limited visibility into whether those outputs are actually comparable in difficulty, structure, or reasoning depth.
While working with inference-style reading questions, I kept seeing the same issue: AI could generate many variations, but there was no reliable way to evaluate, compare, or control them beyond manual review.
This project started from one question:
What if generation were not the main problem, but evaluation and control were?
What it does
Stable Difficulty Generation Engine is a demo system that evaluates and controls AI-generated inference questions instead of treating generation as a black box.
Given one reference reading question, it:
- Generates candidates with an LLM
- Audits each candidate across multiple difficulty dimensions
- Lets users steer outputs via difficulty and inference controls
- Classifies outputs as accepted, soft-accepted, or rejected
The goal is not maximum creativity; it is observable, reproducible, and adjustable generation.
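To make the classification concrete, here is a minimal sketch of what an audit verdict could look like; the type names and thresholds are illustrative, not the engine's exact values:

```typescript
// Illustrative shape of an audit verdict; names and thresholds are assumptions.
type Verdict = "accepted" | "soft-accepted" | "rejected";

interface AuditResult {
  candidateId: string;
  score: number;     // aggregate difficulty-match score in [0, 1]
  verdict: Verdict;
  reasons: string[]; // human-readable notes, e.g. "reasoning depth too low"
}

// Example thresholds: a hard floor for rejection and a band for soft-accepts.
function classify(score: number): Verdict {
  if (score >= 0.85) return "accepted";
  if (score >= 0.65) return "soft-accepted";
  return "rejected";
}
```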
How I built it
The system uses an evaluation-first pipeline:
- Frontend: Next.js UI with sliders (difficulty) and inference-style controls
- Backend: Node.js (TypeScript) scoring and validation engine
- Generation: DigitalOcean Gradient™ AI Serverless Inference
- Difficulty axes (see the sketch after this list):
  - L: Lexical difficulty
  - S: Structural complexity
  - A: Choice ambiguity
  - R: Reasoning depth
- Candidates are audited before being shown to users
Deployment is split on DigitalOcean App Platform:
- Static Site: frontend
- Web Service: backend
This keeps the demo architecture simple and reproducible.
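The split also keeps the API surface small: the Static Site calls the Web Service over a plain REST endpoint. The path and payload below are illustrative, not the actual contract:

```typescript
// Hypothetical request from the Next.js frontend to the Node.js backend.
// The endpoint path and payload fields are assumptions for this sketch.
async function requestCandidates(referenceQuestion: string, difficulty: number) {
  const res = await fetch("/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ referenceQuestion, difficulty }),
  });
  if (!res.ok) throw new Error(`Backend returned ${res.status}`);
  return res.json(); // audited candidates with verdicts attached
}
```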
Challenges I ran into
- Deployment routing: Separating frontend and backend paths cleanly on App Platform.
- Embedding strategy: Current Gradient embedding workflows are optimized for KB/RAG + OpenSearch, while this project needs real-time, on-the-fly similarity scoring for evaluation loops. So generation runs on Gradient, while embeddings remain on an external provider for now (see the sketch after this list).
- Soft-accept design: I intentionally omit similarity breakdown values in soft-accept cases to avoid presenting unstable diagnostics as definitive metrics.
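Here is a minimal sketch of the kind of on-the-fly similarity check this implies; `embed` stands in for the external provider's call, and its name and signature are my placeholders:

```typescript
// Placeholder for the external embedding provider's API call.
declare function embed(text: string): Promise<number[]>;

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score a candidate against the reference question in real time.
async function similarityToReference(candidate: string, reference: string): Promise<number> {
  const [c, r] = await Promise.all([embed(candidate), embed(reference)]);
  return cosineSimilarity(c, r);
}
```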
Accomplishments that I'm proud of
- Made evaluation the primary control mechanism, not an afterthought
- Exposed difficulty and reasoning dimensions explicitly
- Demonstrated parameter steering with measurable effects
- Built a minimal but explainable demo architecture
I am also proud of making failure modes visible.
When constraints are mutually incompatible (e.g., strict similarity vs. high reasoning shift), the system treats generation as a constrained optimization problem, surfacing retries, soft-accept states, and rejections instead of masking instability.
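In practice that looks like a bounded retry loop; the sketch below uses thresholds and stub functions I made up to show the shape of it:

```typescript
type Verdict = "accepted" | "soft-accepted" | "rejected";
interface AuditOutcome { verdict: Verdict; score: number; }

// Stubs for the generation and audit steps described above.
declare function generateCandidate(): Promise<string>;
declare function audit(candidate: string): Promise<AuditOutcome>;

// Regenerate until a candidate clears the accept threshold; otherwise surface
// the best soft-accept (or rejection) instead of hiding the failure.
async function generateWithRetries(maxRetries = 3): Promise<AuditOutcome> {
  let best: AuditOutcome | null = null;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const result = await audit(await generateCandidate());
    if (result.verdict === "accepted") return result;
    if (!best || result.score > best.score) best = result;
  }
  return best!; // shown to the user as a soft-accept or rejection, not masked
}
```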
What I learned
Many "AI generation problems" are actually evaluation problems.
Separating generation from evaluation made it possible to:
- Detect weak inference items that look plausible on the surface
- Explain why outputs are accepted, warned, or rejected
- Build systems that are easier to trust and reason about
I also learned that clear, observable metrics often do more for reliability than added generation complexity.
What's next for Stable Difficulty Generation Engine
- Expand inference types and domains
- Improve embedding options for real-time similarity scoring
- Refine soft-accept feedback in the UI
- Explore assessment design and learner workflows where comparable difficulty matters
The core idea stays the same:
AI generation should be controllable because it is evaluated first.
Built With
- digitalocean
- gradient
- next.js
- node.js
- restapis
- typescript