AI agents are being deployed into real workflows - coding, research, analytics - yet most are frozen in time. They ship once and slowly go stale.
The Ruya AI challenge asked a deeper question:
What happens when an AI agent stops waiting for instructions and starts improving itself — in a measurable, statistically validated way?
We built NOVA to answer that.
NOVA is a self-improving multi-agent system where agents:
• Evaluate their own performance
• Diagnose specific weaknesses
• Generate structured improvement hypotheses
• Compete against alternative configurations
• Prove improvement on unseen validation data
Every iteration is scored. Every change is tracked. Every improvement must generalise.
Agents don’t just run - they evolve under statistical verification.
⸻
How It Works

1. Structured Benchmark Course
Each course is split into:
• Train set (for improvement discovery)
• Validation set (for gating promotion)
• Hidden test set (for final certification)
• Cost & latency constraints
• Defined evaluation rubric
We also perform Monte Carlo multi-run evaluation to measure variance and stability.
Example course: Tool-Use Reasoning – Weather Agent.
This split prevents overfitting and ensures generalisation beyond the improvement set.
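As a sketch, a course and its split might look like the following. The class fields, budget values, and split fractions are illustrative assumptions, not NOVA's actual API:

```python
import random
from dataclasses import dataclass

@dataclass
class BenchmarkCourse:
    name: str
    train: list        # tasks for improvement discovery
    validation: list   # tasks for gating promotion
    test: list         # hidden tasks for final certification
    max_cost_usd: float = 0.05   # hypothetical per-task cost budget
    max_latency_s: float = 5.0   # hypothetical per-task latency budget

def split_course(tasks, seed=0, train_frac=0.6, val_frac=0.2):
    # Shuffle deterministically, then split into train / validation / hidden test.
    rng = random.Random(seed)
    tasks = tasks[:]
    rng.shuffle(tasks)
    n = len(tasks)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (tasks[:n_train],
            tasks[n_train:n_train + n_val],
            tasks[n_train + n_val:])
```

A fixed seed keeps the hidden test set stable across iterations, so later versions are always certified against the same unseen tasks.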
⸻
2. Multi-Agent Evaluation Engine
Each submission is executed in a controlled environment:
• AWS Bedrock for model execution
• Tool calls sandboxed
• Structured outputs enforced
• Traces logged via Langfuse
We evaluate:
• Schema correctness
• Task accuracy
• Reliability (variance across repeated runs)
• Latency
• Token cost
• Safety compliance
• Generalisation gap
Composite score:
$$ \text{Score} = w_1\,\text{Accuracy} + w_2\,\text{Reliability} - w_3\,\text{Cost} - w_4\,\text{Latency} $$
We also compute:
$$ \text{Generalisation Gap} = \text{Train Score} - \text{Validation Score} $$
Improvement is only accepted if validation and test performance increase.
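A minimal Python sketch of the scoring above; the weight values are hypothetical placeholders, and all metrics are assumed to be normalised to [0, 1]:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    accuracy: float     # task accuracy in [0, 1]
    reliability: float  # 1 - variance across repeated runs, in [0, 1]
    cost: float         # normalised token cost in [0, 1]
    latency: float      # normalised latency in [0, 1]

def composite_score(m: RunMetrics,
                    w1: float = 1.0, w2: float = 0.5,
                    w3: float = 0.2, w4: float = 0.2) -> float:
    # Score = w1*Accuracy + w2*Reliability - w3*Cost - w4*Latency
    return w1 * m.accuracy + w2 * m.reliability - w3 * m.cost - w4 * m.latency

def generalisation_gap(train_score: float, validation_score: float) -> float:
    # A large positive gap signals overfitting to the improvement (train) set.
    return train_score - validation_score
```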
⸻
3. Controlled Self-Improvement Engine
After evaluation, NOVA does not blindly rewrite prompts.
It:
• Clusters failure cases
• Detects tool misuse patterns
• Identifies prompt instruction weaknesses
• Generates structured hypotheses
• Samples configuration candidates (Monte Carlo parameter exploration)
• Runs Champion vs Challenger comparison
• Applies statistical promotion gating
Each iteration must:
• Improve validation score
• Reduce variance
• Avoid regression
• Maintain cost/latency constraints
The agent progresses:
v0 → v1 → v2 → vN
If an iteration fails validation, it is rejected.
Rollback is automatic.
This prevents trial-and-error drift.
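The Champion vs Challenger gate can be sketched as below. The margin and variance-ratio checks here are a simplified stand-in for a formal significance test, and the parameter values are assumptions:

```python
import statistics

def should_promote(champion_scores, challenger_scores,
                   min_margin=0.02, max_variance_ratio=1.0):
    """Champion vs Challenger promotion gate (simplified sketch).

    Both arguments are validation scores from repeated (Monte Carlo)
    runs of each configuration.
    """
    champ_mean = statistics.mean(champion_scores)
    chall_mean = statistics.mean(challenger_scores)
    champ_var = statistics.pvariance(champion_scores)
    chall_var = statistics.pvariance(challenger_scores)

    # Challenger must improve the mean validation score by a clear margin...
    improved = chall_mean - champ_mean >= min_margin
    # ...without increasing run-to-run variance (small tolerance for float noise).
    stable = chall_var <= champ_var * max_variance_ratio + 1e-9
    return improved and stable
```

If `should_promote` returns False, the champion configuration is kept and the challenger is discarded, which is exactly the automatic-rollback behaviour described above.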
⸻
4. Regression Gate & Certification
Before certification:
• Full benchmark suite is re-run
• Monte Carlo repeated executions measure stability
• Variance must remain below threshold
• Validation + hidden test must both improve
• No regression allowed
Only then does the agent earn certification.
Improvement must generalise — not just fit the training set.
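A hedged sketch of such a certification check, assuming each version's repeated-run scores are grouped by split name (the data shape and the stability threshold are illustrative):

```python
import statistics

def certify(prev, new, max_std=0.03):
    """Regression gate sketch.

    `prev` and `new` map each split name ("validation", "test") to a
    list of scores from repeated Monte Carlo runs of that version.
    """
    for split in ("validation", "test"):
        # Both validation and the hidden test set must strictly improve.
        if statistics.mean(new[split]) <= statistics.mean(prev[split]):
            return False
        # Run-to-run variance must stay below the stability threshold.
        if statistics.pstdev(new[split]) > max_std:
            return False
    return True
```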
⸻
Demonstrated Measurable Improvement (Demo)
In our live demo:
v0: Train 68% | Validation 60% | Test 58%
v1: Train 75% | Validation 72% | Test 70%
v2: Train 84% | Validation 82% | Test 81%
Across iterations:
• Accuracy increased
• Variance decreased
• Latency reduced
• Cost reduced
• Generalisation gap narrowed
NOVA proves improvement on unseen data — not just across runs.
⸻
What Inspired Us
Today, AI agents are deployed based on demos — not credentials.
There is no accreditation system.
No structured validation.
No statistical confidence.
We wanted to build a framework where agents can:
• Learn
• Be evaluated rigorously
• Compete
• Improve under constraints
• Earn certification
Improvement should not be claimed.
It should be statistically validated.
⸻
How We Built It
• AWS Bedrock for multi-model execution
• Strands for multi-agent orchestration
• Langfuse for trace logging, evaluation tracking, prompt versioning
• ClickHouse Cloud for benchmark analytics and version comparison
• PostgreSQL for version registry
• Dockerized benchmark runner for deterministic execution
⸻
Challenges We Faced
• Preventing LLM-as-judge hallucination
• Designing statistically meaningful scoring
• Avoiding benchmark overfitting
• Measuring true generalisation
• Handling unstable improvements
We addressed these with:
• Deterministic schema validation
• Multi-run stability checks (Monte Carlo execution)
• Train / validation / hidden test splits
• Champion–Challenger gating
• Strict regression rejection
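Deterministic schema validation, for example, is a pure rule-based check with no LLM judge involved. A minimal stdlib-only sketch for a weather-agent output; the field names and allowed values are hypothetical:

```python
import json

# Hypothetical expected schema for a weather-agent tool call.
REQUIRED_FIELDS = {"tool": str, "location": str, "unit": str}
ALLOWED_UNITS = {"celsius", "fahrenheit"}

def validate_output(raw: str):
    """Deterministically validate a structured agent output.

    Returns (ok, errors). Failures are hard rule-based checks, so the
    result is reproducible across runs.
    """
    errors = []
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, [f"invalid JSON: {e}"]
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif not isinstance(obj[field], ftype):
            errors.append(f"wrong type for {field}")
    if isinstance(obj.get("unit"), str) and obj["unit"] not in ALLOWED_UNITS:
        errors.append("unit must be celsius or fahrenheit")
    return not errors, errors
```

A production system would likely use a schema library such as pydantic or jsonschema, but the principle is the same: schema correctness is scored by code, not by a model.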
⸻
NOVA transforms static agents into statistically validated, self-improving systems.
Improvement is not assumed. It is measured, validated, and certified.
Built With
- amazon-web-services
- bedrock
- clickhouse
- cloud
- docker
- fastapi
- langfuse
- python
- strands