AI agents are being deployed into real workflows - coding, research, analytics - yet most are frozen in time. They ship once and slowly go stale.

The Ruya AI challenge asked a deeper question:

What happens when an AI agent stops waiting for instructions and starts improving itself — in a measurable, statistically validated way?

We built NOVA to answer that.

NOVA is a self-improving multi-agent system where agents:

• Evaluate their own performance
• Diagnose specific weaknesses
• Generate structured improvement hypotheses
• Compete against alternative configurations
• Prove improvement on unseen validation data

Every iteration is scored. Every change is tracked. Every improvement must generalise.

Agents don’t just run - they evolve under statistical verification.

How It Works

1. Structured Benchmark Course

Each course is split into:

• Train set (for improvement discovery)
• Validation set (for gating promotion)
• Hidden test set (for final certification)
• Cost & latency constraints
• Defined evaluation rubric

We also perform Monte Carlo multi-run evaluation to measure variance and stability.
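A Monte Carlo multi-run evaluation like this can be sketched in a few lines of Python. The `run_agent` callable is hypothetical (one stochastic execution of the agent on a task, returning a score in [0, 1]); the statistics it feeds are the ones described above.

```python
import statistics

def monte_carlo_eval(run_agent, task, n_runs=20):
    """Score the same task repeatedly to estimate mean performance and
    run-to-run stability. `run_agent` is a hypothetical callable that
    executes the agent once on `task` and returns a score in [0, 1]."""
    scores = [run_agent(task) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # variance proxy: lower is more stable
        "min": min(scores),
        "max": max(scores),
    }
```

The standard deviation across runs is what later gates use as the "variance must not increase" signal.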

Example course: Tool-Use Reasoning – Weather Agent.

This split prevents overfitting and ensures generalisation beyond the improvement set.

2. Multi-Agent Evaluation Engine

Each submission is executed in a controlled environment:

• AWS Bedrock for model execution
• Tool calls sandboxed
• Structured outputs enforced
• Traces logged via Langfuse

We evaluate:

• Schema correctness
• Task accuracy
• Reliability (variance across repeated runs)
• Latency
• Token cost
• Safety compliance
• Generalisation gap

Composite score:

$$ \text{Score} = w_1\,\text{Accuracy} + w_2\,\text{Reliability} - w_3\,\text{Cost} - w_4\,\text{Latency} $$

We also compute:

$$ \text{Generalisation Gap} = \text{Train Score} - \text{Validation Score} $$

Improvement is only accepted if validation and test performance increase.
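Concretely, the scoring and the acceptance check above might look like the sketch below. The weights and metric field names are illustrative assumptions, not NOVA's actual values.

```python
def composite_score(accuracy, reliability, cost, latency,
                    w=(1.0, 0.5, 0.1, 0.1)):
    """Score = w1*Accuracy + w2*Reliability - w3*Cost - w4*Latency.
    The weight tuple here is a placeholder, not NOVA's real weighting."""
    w1, w2, w3, w4 = w
    return w1 * accuracy + w2 * reliability - w3 * cost - w4 * latency

def generalisation_gap(train_score, validation_score):
    """A large positive gap signals overfitting to the train split."""
    return train_score - validation_score

def accept_improvement(old, new):
    """Accept only if validation AND test performance both increase."""
    return (new["validation"] > old["validation"]
            and new["test"] > old["test"])
```

Note that the gate is strict: a candidate that improves validation but regresses on the hidden test is rejected outright.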

3. Controlled Self-Improvement Engine

After evaluation, NOVA does not blindly rewrite prompts.

It:

• Clusters failure cases
• Detects tool misuse patterns
• Identifies prompt instruction weaknesses
• Generates structured hypotheses
• Samples configuration candidates (Monte Carlo parameter exploration)
• Runs Champion vs Challenger comparison
• Applies statistical promotion gating
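The candidate sampling and Champion-vs-Challenger steps can be sketched as follows. The configuration parameters (`temperature`, `max_tool_calls`), the promotion margin, and the `evaluate` callable are all illustrative assumptions.

```python
import random

def sample_candidates(base_config, n=8, rng=None):
    """Monte Carlo parameter exploration: jitter configuration values
    around the current champion. Parameter names are illustrative."""
    rng = rng or random.Random(0)
    candidates = []
    for _ in range(n):
        cand = dict(base_config)
        cand["temperature"] = round(
            min(1.0, max(0.0, base_config["temperature"] + rng.uniform(-0.2, 0.2))), 2)
        cand["max_tool_calls"] = max(1, base_config["max_tool_calls"] + rng.choice([-1, 0, 1]))
        candidates.append(cand)
    return candidates

def champion_vs_challenger(champion, candidates, evaluate, margin=0.02):
    """Promote the best challenger only if it beats the champion by a
    margin on validation: a simple gate against promoting noise."""
    best = max(candidates, key=evaluate)
    return best if evaluate(best) > evaluate(champion) + margin else champion
```

Requiring a margin rather than a bare improvement is one simple way to keep run-to-run noise from triggering spurious promotions.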

Each iteration must:

• Improve validation score
• Reduce variance
• Avoid regression
• Maintain cost/latency constraints

The agent progresses:

v0 → v1 → v2 → vN

If an iteration fails validation, it is rejected.

Rollback is automatic.

This prevents trial-and-error drift.
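The promotion loop with automatic rollback could be sketched like this. The gate criteria mirror the list above; the threshold values and metric names are assumptions.

```python
def passes_gates(old, new, max_latency=2.0, max_cost=0.05):
    """A candidate iteration is kept only if every gate passes."""
    return (new["validation"] > old["validation"]   # improve validation score
            and new["variance"] <= old["variance"]  # reduce (or hold) variance
            and new["train"] >= old["train"]        # avoid regression
            and new["latency"] <= max_latency       # maintain latency constraint
            and new["cost"] <= max_cost)            # maintain cost constraint

def evolve(versions):
    """Walk v0 -> v1 -> ... -> vN, keeping the last version that passed
    all gates. A failing candidate is rejected and the previous champion
    stays in place, so rollback is automatic."""
    champion = versions[0]
    for candidate in versions[1:]:
        if passes_gates(champion, candidate):
            champion = candidate  # promote
        # else: rejected; champion unchanged
    return champion
```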

4. Regression Gate & Certification

Before certification:

• Full benchmark suite is re-run
• Monte Carlo repeated executions measure stability
• Variance must remain below threshold
• Validation + hidden test must both improve
• No regression allowed

Only then does the agent earn certification.

Improvement must generalise — not just fit the training set.
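As a single predicate, the certification gate described above might look like this. The variance threshold and metric names are illustrative, not NOVA's actual configuration.

```python
def certify(baseline, final, variance_threshold=0.02):
    """Certification gate: the final version must beat the baseline on both
    validation and the hidden test, keep variance under the threshold, and
    show no regression on any tracked split."""
    improved = (final["validation"] > baseline["validation"]
                and final["test"] > baseline["test"])
    stable = final["variance"] <= variance_threshold
    no_regression = all(final[m] >= baseline[m]
                        for m in ("train", "validation", "test"))
    return improved and stable and no_regression
```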

Demonstrated Measurable Improvement (Demo)

In our live demo:

| Version | Train | Validation | Test |
|---------|-------|------------|------|
| v0 | 68% | 60% | 58% |
| v1 | 75% | 72% | 70% |
| v2 | 84% | 82% | 81% |

Across iterations:

• Accuracy increased
• Variance decreased
• Latency reduced
• Cost reduced
• Generalisation gap narrowed

NOVA proves improvement on unseen data — not just across runs.
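Applying the generalisation-gap formula to the demo numbers shows the narrowing directly:

```python
# Train / validation / test scores from the demo, per version.
demo = {"v0": (0.68, 0.60, 0.58),
        "v1": (0.75, 0.72, 0.70),
        "v2": (0.84, 0.82, 0.81)}

# Train-validation gap per version, rounded to avoid float noise.
gaps = {v: round(train - val, 2) for v, (train, val, _) in demo.items()}
# The gap shrinks from 0.08 (v0) to 0.03 (v1) to 0.02 (v2): the
# improvement is carrying over to data the agent never trained on.
```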

What Inspired Us

Today, AI agents are deployed based on demos — not credentials.

There is no accreditation system.

No structured validation.

No statistical confidence.

We wanted to build a framework where agents can:

• Learn
• Be evaluated rigorously
• Compete
• Improve under constraints
• Earn certification

Improvement should not be claimed.

It should be statistically validated.

How We Built It

• AWS Bedrock for multi-model execution
• Strands for multi-agent orchestration
• Langfuse for trace logging, evaluation tracking, and prompt versioning
• ClickHouse Cloud for benchmark analytics and version comparison
• PostgreSQL for the version registry
• Dockerized benchmark runner for deterministic execution

Challenges We Faced

• Preventing LLM-as-judge hallucination
• Designing statistically meaningful scoring
• Avoiding benchmark overfitting
• Measuring true generalisation
• Handling unstable improvements

We addressed these with:

• Deterministic schema validation
• Multi-run stability checks (Monte Carlo execution)
• Train / validation / hidden test splits
• Champion–Challenger gating
• Strict regression rejection

NOVA transforms static agents into statistically validated, self-improving systems.

Improvement is not assumed. It is measured, validated, and certified.
