Inspiration
Most AI question-generation tools focus on producing outputs, but provide limited visibility into whether those outputs are actually comparable in difficulty, structure, or reasoning depth.
While working with inference-style reading questions, I kept seeing the same issue: AI could generate many variations, but there was no reliable way to evaluate, compare, or control them beyond manual review.
This project started from one question:
What if generation were not the main problem, but evaluation and control were?
What it does
Stable Difficulty Generation Engine is a demo system that evaluates and controls AI-generated inference questions instead of treating generation as a black box.
Given one reference reading question, it:
- Generates candidates with an LLM
- Audits each candidate across multiple difficulty dimensions
- Lets users steer outputs via difficulty and inference controls
- Classifies outputs as accepted, soft-accepted, or rejected
The goal is not maximum creativity; it is observable, reproducible, and adjustable generation.
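To make the classification concrete, here is a minimal sketch of what an audit verdict could look like; the type names and thresholds are illustrative, not the engine's exact values:

```typescript
// Illustrative shape of an audit verdict; names and thresholds are assumptions.
type Verdict = "accepted" | "soft-accepted" | "rejected";

interface AuditResult {
  candidateId: string;
  score: number;     // aggregate difficulty-match score in [0, 1]
  verdict: Verdict;
  reasons: string[]; // human-readable notes, e.g. "reasoning depth too low"
}

// Example thresholds: a hard floor for rejection and a band for soft-accepts.
function classify(score: number): Verdict {
  if (score >= 0.85) return "accepted";
  if (score >= 0.65) return "soft-accepted";
  return "rejected";
}
```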
How I built it
The system uses an evaluation-first pipeline:
- Frontend: Next.js UI with sliders (difficulty) and inference-style controls
- Backend: Node.js (TypeScript) scoring and validation engine
- Generation: DigitalOcean Gradient™ AI Serverless Inference
- Difficulty axes (see the sketch after this list):
  - L: Lexical difficulty
  - S: Structural complexity
  - A: Choice ambiguity
  - R: Reasoning depth
- Candidates are audited before being shown to users
Deployment is split on DigitalOcean App Platform:
- Static Site: frontend
- Web Service: backend
This keeps the demo architecture simple and reproducible.
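The split also keeps the API surface small: the Static Site calls the Web Service over a plain REST endpoint. The path and payload below are illustrative, not the actual contract:

```typescript
// Hypothetical request from the Next.js frontend to the Node.js backend.
// The endpoint path and payload fields are assumptions for this sketch.
async function requestCandidates(referenceQuestion: string, difficulty: number) {
  const res = await fetch("/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ referenceQuestion, difficulty }),
  });
  if (!res.ok) throw new Error(`Backend returned ${res.status}`);
  return res.json(); // audited candidates with verdicts attached
}
```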
Challenges I ran into
- Deployment routing: Separating frontend and backend paths cleanly on App Platform.
- Embedding strategy: Current Gradient embedding workflows are optimized for KB/RAG + OpenSearch, while this project needs real-time, on-the-fly similarity scoring for evaluation loops. So generation runs on Gradient, while embeddings remain on an external provider for now (see the sketch after this list).
- Soft-accept design: I intentionally omit similarity breakdown values in soft-accept cases to avoid presenting unstable diagnostics as definitive metrics.
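Here is a minimal sketch of the kind of on-the-fly similarity check this implies; `embed` stands in for the external provider's call, and its name and signature are my placeholders:

```typescript
// Placeholder for the external embedding provider's API call.
declare function embed(text: string): Promise<number[]>;

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score a candidate against the reference question in real time.
async function similarityToReference(candidate: string, reference: string): Promise<number> {
  const [c, r] = await Promise.all([embed(candidate), embed(reference)]);
  return cosineSimilarity(c, r);
}
```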
Accomplishments that I'm proud of
- Made evaluation the primary control mechanism, not an afterthought
- Exposed difficulty and reasoning dimensions explicitly
- Demonstrated parameter steering with measurable effects
- Built a minimal but explainable demo architecture
I am also proud of making failure modes visible.
When constraints are mutually incompatible (e.g., strict similarity vs. high reasoning shift), the system treats generation as a constrained optimization problem, surfacing retries, soft-accept states, and rejections instead of masking instability.
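In practice that looks like a bounded retry loop; the sketch below uses thresholds and stub functions I made up to show the shape of it:

```typescript
type Verdict = "accepted" | "soft-accepted" | "rejected";
interface AuditOutcome { verdict: Verdict; score: number; }

// Stubs for the generation and audit steps described above.
declare function generateCandidate(): Promise<string>;
declare function audit(candidate: string): Promise<AuditOutcome>;

// Regenerate until a candidate clears the accept threshold; otherwise surface
// the best soft-accept (or rejection) instead of hiding the failure.
async function generateWithRetries(maxRetries = 3): Promise<AuditOutcome> {
  let best: AuditOutcome | null = null;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const result = await audit(await generateCandidate());
    if (result.verdict === "accepted") return result;
    if (!best || result.score > best.score) best = result;
  }
  return best!; // shown to the user as a soft-accept or rejection, not masked
}
```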
What I learned
Many "AI generation problems" are actually evaluation problems.
Separating generation from evaluation made it possible to:
- Detect weak inference items that look plausible on the surface
- Explain why outputs are accepted, warned, or rejected
- Build systems that are easier to trust and reason about
I also learned that clear, observable metrics often do more for reliability than added generation complexity.
What's next for Stable Difficulty Generation Engine
- Expand inference types and domains
- Improve embedding options for real-time similarity scoring
- Refine soft-accept feedback in the UI
- Explore assessment design and learner workflows where comparable difficulty matters
The core idea stays the same:
AI generation should be controllable because it is evaluated first.
Built With
- digitalocean
- gradient
- next.js
- node.js
- restapis
- typescript