AutoEval Agent: A Self-Improving Research System

Inspiration

Large language models are powerful, but most AI agents today generate answers once and stop. They do not critically evaluate their own outputs or improve their strategy over time.

We were inspired by the idea that truly autonomous systems should not only generate responses — they should reflect, evaluate, and optimize.

The challenge of building a self-improving agent within a single-day hackathon motivated us to design a lightweight but meaningful feedback loop that demonstrates measurable improvement.

What It Does

AutoEval Agent is a self-improving research system that:

Retrieves relevant documents from a knowledge base

Generates a structured research answer

Evaluates its own response using an LLM-based scoring system

Adjusts its retrieval strategy if the score is below a threshold

Stores high-performing strategies for future use

The system iteratively improves answer quality using an evaluation loop.

How We Built It

The system is structured into four core components:

Retriever Module – Fetches relevant documents from a vector store

Generator Module – Produces structured research responses

Evaluator Module – Scores outputs on relevance, factual grounding, and coverage

Memory Layer – Stores successful retrieval patterns and prompt structures
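As a rough illustration of how the Evaluator Module's three criteria could be folded into one number, here is a minimal sketch. The RubricScores class, the overall_score function, the equal default weights, and the 0 to 10 scale per criterion are all assumptions made for this example, not the project's actual scoring code. In the real system the per-criterion values would come from the LLM judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScores:
    """Per-criterion scores from the judge, each assumed to lie in [0, 10]."""
    relevance: float
    factual_grounding: float
    coverage: float


def overall_score(
    scores: RubricScores,
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed equal weights
) -> float:
    """Return the weighted mean of the three criteria, clamped to [0, 10]."""
    values = (scores.relevance, scores.factual_grounding, scores.coverage)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(v * w for v, w in zip(values, weights)) / total_weight
    # Clamp so a misbehaving judge cannot push the score outside the range.
    return max(0.0, min(10.0, weighted))
```

Keeping the aggregation separate from the judge call makes it easy to reweight criteria, for example to favor factual grounding, without touching the prompt.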

The self-improvement loop works as follows:

Initial query → Retrieve → Generate answer

Evaluate answer → Assign score S ∈ [0, 10]

If S < T (threshold), modify retrieval strategy and retry

Store best-performing configuration

Summarized as a pipeline:

Improvement Loop: Query → Generate → Evaluate → Adjust Strategy

Across iterations, the system moves toward higher-scoring outputs, as judged by the evaluator.
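The loop described in this section can be sketched in a few lines. This is a minimal sketch under stated assumptions: the improve function, the LoopResult class, the dictionary-shaped strategies, and the default threshold of 7.0 are hypothetical names and values chosen for illustration, and the retrieve, generate, and evaluate callables stand in for the real modules. Iterating over a finite list of candidate strategies is one simple way to guarantee the retry loop terminates.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Strategy = dict[str, Any]  # hypothetical shape, e.g. {"top_k": 5}


@dataclass
class LoopResult:
    score: float
    strategy: Strategy
    answer: str
    attempts: int


def improve(
    query: str,
    strategies: Sequence[Strategy],
    retrieve: Callable[[str, Strategy], list[str]],
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str, str, list[str]], float],
    threshold: float = 7.0,  # assumed value for T
) -> LoopResult:
    """Try each strategy in order; stop early once the score S reaches T."""
    if not strategies:
        raise ValueError("at least one retrieval strategy is required")
    best = None
    for attempt, strategy in enumerate(strategies, start=1):
        docs = retrieve(query, strategy)
        answer = generate(query, docs)
        score = evaluate(query, answer, docs)
        # Track the best configuration seen so far, even if it misses T.
        if best is None or score > best.score:
            best = LoopResult(score, strategy, answer, attempt)
        if score >= threshold:
            break
    # Report how many attempts were actually made, not where the best occurred.
    best.attempts = attempt
    return best
```

If no strategy reaches the threshold, the function still returns the best attempt, so the caller always gets an answer along with the score that explains how much to trust it.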

What We Learned

Designing evaluation metrics is as important as generation

Feedback loops significantly improve output reliability

Memory-based optimization enables lightweight adaptation without retraining

Clear architecture matters more than feature count in constrained hackathon settings
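To make the point about memory-based adaptation concrete, here is a minimal sketch of a local JSON file that records strategies whose score cleared a cutoff. The StrategyMemory class, its method names, the record fields, and the min_score default are assumptions for illustration only; the project's real Memory Layer may be organized differently.

```python
import json
from pathlib import Path
from typing import Any, Optional


class StrategyMemory:
    """Keeps high-scoring retrieval strategies in a local JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Load earlier records if the file exists, otherwise start empty.
        if self.path.exists():
            self.records: list[dict[str, Any]] = json.loads(self.path.read_text())
        else:
            self.records = []

    def remember(self, query: str, strategy: dict[str, Any],
                 score: float, min_score: float = 7.0) -> bool:
        """Persist the strategy only if its score meets the assumed cutoff."""
        if score < min_score:
            return False
        self.records.append({"query": query, "strategy": strategy, "score": score})
        self.path.write_text(json.dumps(self.records, indent=2))
        return True

    def best(self) -> Optional[dict[str, Any]]:
        """Return the highest-scoring stored strategy, or None if empty."""
        if not self.records:
            return None
        return max(self.records, key=lambda r: r["score"])["strategy"]
```

Because the stored strategies are plain data, a later run can start from best() instead of a default configuration, which is how adaptation can happen without any retraining.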

Challenges We Faced

Designing a scoring system that reflects real answer quality

Preventing infinite retry loops

Balancing improvement depth with 6-hour build constraints

Keeping architecture clean and demo-ready

We focused on clarity, measurable improvement, and architectural simplicity.

Built With

agent
ai
amazon
api
claude
cloud
databases
evaluation
faiss
fastapi
frameworks:
infrastructure:
json-based
llm-as-judge
local
memory
memory:
method
openai
python
services
store
tools:
web

Updates

Ahmad Ali started this project — Feb 14, 2026 06:30 AM EST
