Inspiration
Large language models are powerful, but most AI agents today generate answers once and stop. They do not critically evaluate their own outputs or improve their strategy over time.
We were inspired by the idea that truly autonomous systems should not only generate responses — they should reflect, evaluate, and optimize.
The challenge of building a self-improving agent within a single-day hackathon motivated us to design a lightweight but meaningful feedback loop that demonstrates measurable improvement.
What It Does
AutoEval Agent is a self-improving research system that:
Retrieves relevant documents from a knowledge base
Generates a structured research answer
Evaluates its own response using an LLM-based scoring system
Adjusts its retrieval strategy if the score is below a threshold
Stores high-performing strategies for future use
The system iteratively improves answer quality using an evaluation loop.
How We Built It
The system is structured into four core components:
Retriever Module – Fetches relevant documents from a vector store
Generator Module – Produces structured research responses
Evaluator Module – Scores outputs on relevance, factual grounding, and coverage
Memory Layer – Stores successful retrieval patterns and prompt structures
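As an illustration of how the Evaluator Module's three criteria might be folded into one number, here is a minimal sketch. It assumes each criterion is scored on the same 0 to 10 scale and combined with a weighted mean; the names `RubricScores` and `overall_score` and the default equal weights are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RubricScores:
    """Per-criterion scores, each assumed to lie on a 0-10 scale (hypothetical layout)."""
    relevance: float
    factual_grounding: float
    coverage: float


def overall_score(
    scores: RubricScores,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Combine the three criteria into one score via a weighted mean, clamped to [0, 10]."""
    values = (scores.relevance, scores.factual_grounding, scores.coverage)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(v * w for v, w in zip(values, weights)) / total_weight
    # Clamp so a mis-scaled criterion cannot push the result outside the range.
    return max(0.0, min(10.0, mean))
```

Raising the weight on `factual_grounding` would be one way to penalize fluent but unsupported answers more heavily; whether that matches real answer quality would need to be checked against human judgments.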
The self-improvement loop works as follows:
Initial query → Retrieve → Generate answer
Evaluate answer → Assign score $S \in [0, 10]$
If $S < T$ (threshold), modify retrieval strategy and retry
Store best-performing configuration
Displayed mathematically:
$$\text{Improvement Loop: } \text{Query} \rightarrow \text{Generate} \rightarrow \text{Evaluate} \rightarrow \text{Adjust Strategy}$$
Over successive iterations, the system tends toward higher-scoring outputs, as judged by the evaluator.
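The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual code: the "retrieval strategy" is reduced to a single `top_k` parameter that is widened on each retry, `max_iterations` is one possible way to bound retries, and `run_loop`, `Attempt`, and the `fake_*` stand-ins for the retriever, generator, and LLM evaluator are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    """One pass through retrieve -> generate -> evaluate."""
    top_k: int
    answer: str
    score: float


def run_loop(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str, List[str]], float],
    threshold: float = 7.0,
    max_iterations: int = 3,
    initial_top_k: int = 3,
    top_k_step: int = 2,
) -> Attempt:
    """Retry with a wider retrieval until the score S reaches threshold T or retries run out."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    best: Optional[Attempt] = None
    top_k = initial_top_k
    for _ in range(max_iterations):  # hard cap prevents an infinite retry loop
        docs = retrieve(query, top_k)
        answer = generate(query, docs)
        score = evaluate(query, answer, docs)
        if best is None or score > best.score:
            best = Attempt(top_k, answer, score)  # keep the best-performing configuration
        if score >= threshold:
            break
        top_k += top_k_step  # adjust strategy: fetch more documents next time
    return best


# Stand-in components so the sketch runs without a vector store or an LLM.
def fake_retrieve(query: str, top_k: int) -> List[str]:
    return [f"doc{i}" for i in range(top_k)]


def fake_generate(query: str, docs: List[str]) -> str:
    return f"Answer to {query!r} using {len(docs)} documents"


def fake_evaluate(query: str, answer: str, docs: List[str]) -> float:
    # Toy scorer: more documents means a higher score, capped at 10.
    return float(min(10, len(docs)))


result = run_loop("example query", fake_retrieve, fake_generate, fake_evaluate)
```

Because the loop always returns the highest-scoring attempt seen so far, a failed retry cannot make the final answer worse than an earlier one, even when the threshold is never reached.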
What We Learned
Designing evaluation metrics is as important as generation
Feedback loops significantly improve output reliability
Memory-based optimization enables lightweight adaptation without retraining
Clear architecture matters more than feature count in constrained hackathon settings
Challenges We Faced
Designing a scoring system that reflects real answer quality
Preventing infinite retry loops
Balancing improvement depth with 6-hour build constraints
Keeping architecture clean and demo-ready
We focused on clarity, measurable improvement, and architectural simplicity.