Inspiration

Most AI question-generation tools focus on producing outputs, but provide limited visibility into whether those outputs are actually comparable in difficulty, structure, or reasoning depth.

While working with inference-style reading questions, I kept seeing the same issue: AI could generate many variations, but there was no reliable way to evaluate, compare, or control them beyond manual review.

This project started from one question:
What if generation was not the main problem, but evaluation and control were?


What it does

Stable Difficulty Generation Engine is a demo system that evaluates and controls AI-generated inference questions instead of treating generation as a black box.

Given one reference reading question, it:

  • Generates candidates with an LLM
  • Audits each candidate across multiple difficulty dimensions
  • Lets users steer outputs via difficulty and inference controls
  • Classifies outputs as accepted, soft-accepted, or rejected

The goal is not maximum creativity; it is observable, reproducible, and adjustable generation.
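
As a rough sketch of that accept / soft-accept / reject contract (the names and thresholds below are illustrative assumptions, not the project's actual API):

```typescript
// Illustrative sketch only: type names and thresholds are hypothetical.

type Verdict = "accepted" | "soft-accepted" | "rejected";

interface AuditResult {
  verdict: Verdict;
  reasons: string[]; // human-readable explanations shown to the user
}

// Map an overall audit score in [0, 1] to a three-way verdict.
// Soft-accept covers the band where an item is usable but flagged.
function classify(score: number): AuditResult {
  if (score >= 0.8) return { verdict: "accepted", reasons: [] };
  if (score >= 0.6) {
    return {
      verdict: "soft-accepted",
      reasons: ["score in warning band; review before use"],
    };
  }
  return { verdict: "rejected", reasons: ["score below acceptance floor"] };
}
```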


How I built it

The system uses an evaluation-first pipeline:

  • Frontend: Next.js UI with sliders (difficulty) and inference-style controls
  • Backend: Node.js (TypeScript) scoring and validation engine
  • Generation: DigitalOcean Gradient™ AI Serverless Inference
  • Difficulty axes (see the sketch after this list):
    • L: Lexical difficulty
    • S: Structural complexity
    • A: Choice ambiguity
    • R: Reasoning depth
  • Candidates are audited before being shown to users
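
A minimal sketch of how the four axes can be compared against the reference question. The axis names come from the project; the types, normalization, and distance metric are illustrative assumptions:

```typescript
// Hypothetical sketch of the L/S/A/R audit: the scoring scale and
// distance metric here are assumptions, not the project's exact method.

interface DifficultyVector {
  L: number; // lexical difficulty, normalized to [0, 1]
  S: number; // structural complexity
  A: number; // choice ambiguity
  R: number; // reasoning depth
}

// Distance between a candidate and the reference on the four axes;
// a smaller distance means the candidate is closer in difficulty.
function difficultyDistance(
  ref: DifficultyVector,
  cand: DifficultyVector
): number {
  const axes: (keyof DifficultyVector)[] = ["L", "S", "A", "R"];
  const sq = axes.reduce((sum, k) => sum + (ref[k] - cand[k]) ** 2, 0);
  return Math.sqrt(sq); // Euclidean distance in LSAR space
}
```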

Deployment is split on DigitalOcean App Platform:

  • Static Site: frontend
  • Web Service: backend

This keeps the demo architecture simple and reproducible.


Challenges I ran into

  • Deployment routing: Separating frontend and backend paths cleanly on App Platform.
  • Embedding strategy: Current Gradient embedding workflows are optimized for KB/RAG + OpenSearch, while this project needs real-time, on-the-fly similarity scoring inside the evaluation loop.
    For now, generation runs on Gradient while embeddings come from an external provider (see the sketch after this list).
  • Soft-accept design: I intentionally omit similarity breakdown values in soft-accept cases to avoid presenting unstable diagnostics as definitive metrics.
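
A sketch of the real-time similarity path, assuming an external embedding API; the endpoint URL, payload shape, and response field are placeholders, and only the cosine computation reflects the general approach:

```typescript
// Placeholder embedding call: swap in whichever external provider is
// in use. The URL and response shape here are hypothetical.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("https://api.example.com/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: text }),
  });
  const data = await res.json();
  return data.embedding as number[];
}

// Cosine similarity between two equal-length vectors, used to score
// how close a candidate question is to the reference in real time.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```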

Accomplishments that I'm proud of

  • Made evaluation the primary control mechanism, not an afterthought
  • Exposed difficulty and reasoning dimensions explicitly
  • Demonstrated parameter steering with measurable effects
  • Built a minimal but explainable demo architecture

I am also proud of making failure modes visible.
When constraints are mutually incompatible (e.g., strict similarity vs. high reasoning shift), the system treats generation as a constrained optimization problem, surfacing retries, soft-accept states, and rejections instead of masking instability.
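
A minimal sketch of that retry loop; the thresholds mirror the classify() sketch above, and the retry limit is an illustrative assumption:

```typescript
// Minimal sketch of constrained generation with visible failure modes.
// Thresholds and the retry limit are illustrative assumptions.

type Attempt = { candidate: string; verdict: string; attempts: number; s: number };

async function generateWithRetries(
  generate: () => Promise<string>,      // LLM call, supplied by the caller
  score: (candidate: string) => number, // audit score in [0, 1]
  maxRetries = 3
): Promise<Attempt> {
  let best: Attempt = { candidate: "", verdict: "rejected", attempts: 0, s: -1 };
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const candidate = await generate();
    const s = score(candidate);
    const verdict =
      s >= 0.8 ? "accepted" : s >= 0.6 ? "soft-accepted" : "rejected";
    if (s > best.s) best = { candidate, verdict, attempts: attempt, s };
    if (verdict === "accepted") break; // constraints satisfied; stop early
  }
  // Surface the best attempt and its verdict instead of hiding instability.
  return best;
}
```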


What I learned

Many "AI generation problems" are actually evaluation problems.

Separating generation from evaluation made it possible to:

  • Detect weak inference items that look plausible on the surface
  • Explain why outputs are accepted, warned, or rejected
  • Build systems that are easier to trust and reason about

I also learned that clear, observable metrics often create more reliability than adding generation complexity.


What's next for Stable Difficulty Generation Engine

  • Expand inference types and domains
  • Improve embedding options for real-time similarity scoring
  • Refine soft-accept feedback in the UI
  • Explore assessment design and learner workflows where comparable difficulty matters

The core idea stays the same:
AI generation should be controllable because it is evaluated first.

Built With

  • Next.js
  • Node.js (TypeScript)
  • DigitalOcean App Platform
  • DigitalOcean Gradient™ AI Serverless Inference