Pro-Bias: Self-Improving Human-Aligned Subjective LLM Evals
Inspiration
Inconsistent and misaligned subjective evaluations plague LLM development. We set out to create a framework that brings human-like judgment to AI evaluations.
What it does
- ComparisonGEval: Enhanced framework for consistent, human-aligned subjective LLM evaluations
- Synthetic datasets: Demonstrate ComparisonGEval's effectiveness across diverse tasks
- Example Evals: Showcase versatility in text casualization, conversation naming, and more
- Automated Essay Grading Agent: Iteratively improves rubrics to align with human raters
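As a rough illustration of the rubric-improvement idea in the last bullet, here is a minimal sketch of a keep-if-better loop. It is not the project's agent: `grade` and `revise` are hypothetical callables standing in for LLM calls, and mean absolute error against human scores is an assumed alignment measure.

```python
from typing import Callable, Sequence, Tuple


def mean_abs_error(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Average absolute gap between model scores and human scores."""
    return sum(abs(p - h) for p, h in zip(predicted, human)) / len(human)


def optimize_rubric(
    rubric: str,
    essays: Sequence[str],
    human_scores: Sequence[float],
    grade: Callable[[str, str], float],   # hypothetical: (rubric, essay) -> score
    revise: Callable[[str, float], str],  # hypothetical: (rubric, error) -> new rubric
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep a revised rubric only when it moves scores closer to the human raters."""
    best_rubric = rubric
    best_error = mean_abs_error([grade(best_rubric, e) for e in essays], human_scores)
    for _ in range(rounds):
        candidate = revise(best_rubric, best_error)
        error = mean_abs_error([grade(candidate, e) for e in essays], human_scores)
        if error < best_error:
            best_rubric, best_error = candidate, error
    return best_rubric, best_error
```

In practice `grade` would prompt a model with the rubric and essay, and `revise` would ask a model to rewrite the rubric given the disagreements; both are left abstract here.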
How we built it
- Enhanced GEval with structured prompts and choice-based scoring
- Generated synthetic datasets using Claude 3.5 Sonnet
- Developed example evals for various subjective tasks
- Created an agent that iteratively refines grading rubrics
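To make "structured prompts and choice-based scoring" concrete, here is a minimal sketch under stated assumptions. The choice labels, their scores, and the prompt wording are all hypothetical and not taken from ComparisonGEval; `judge` stands in for any function that sends a prompt to a model and returns its reply.

```python
from typing import Callable

# Hypothetical choice scale; the real framework may use different labels and scores.
CHOICES = {
    "A": ("Fully meets the criterion", 1.0),
    "B": ("Partially meets the criterion", 0.5),
    "C": ("Does not meet the criterion", 0.0),
}


def build_prompt(criterion: str, source: str, output: str) -> str:
    """Assemble a fixed-layout prompt that ends with lettered options."""
    options = "\n".join(f"{label}. {desc}" for label, (desc, _) in CHOICES.items())
    return (
        f"Criterion: {criterion}\n\n"
        f"Input:\n{source}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        "Choose exactly one option and answer with its letter only.\n"
        f"{options}\n"
    )


def parse_choice(reply: str) -> float:
    """Map the first character of the reply to its score, rejecting anything else."""
    label = reply.strip()[:1].upper()
    if label not in CHOICES:
        raise ValueError(f"Unrecognized choice in reply: {reply!r}")
    return CHOICES[label][1]


def score(judge: Callable[[str], str], criterion: str, source: str, output: str) -> float:
    """Run one choice-based evaluation with the supplied judge function."""
    return parse_choice(judge(build_prompt(criterion, source, output)))
```

Restricting the judge to a small set of discrete options is one way to reduce run-to-run drift compared with asking for a free-form numeric score.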
Challenges we ran into
- Ensuring consistency in subjective evaluations
- Generating diverse, representative synthetic data
- Aligning AI judgments with human raters
Accomplishments that we're proud of
- Achieved substantial agreement with human raters on representative samples
- Developed a versatile framework applicable to various subjective tasks
- Created an iterative system that improves itself to match human judgment (!! HOLY GRAIL ALERT !!)
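The write-up does not say which agreement statistic was used. One common choice for comparing two raters is Cohen's kappa, where values of 0.61 to 0.80 are conventionally labeled "substantial" on the Landis and Koch scale. The sketch below assumes that metric purely for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Here `rater_a` could hold the human labels and `rater_b` the model's choices on the same sample.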
What we learned
- The importance of structured prompts in subjective evaluations
- Techniques for generating effective synthetic datasets
- Strategies for aligning AI systems with human judgment
What's next
- Expand to more complex subjective tasks
- Integrate with popular LLM development workflows
- Explore applications in educational technology and content moderation
Try it out
- Clone the repo
- Set up environment: Python 3.11, virtualenv, requirements
- Configure API keys: OpenAI/Sambanova, Anthropic
- Run evals:
NUM_EXAMPLES=10 ./run_python.sh python evals/eval_make_text_more_casual.py
- Optimize essay rubrics with AI agent:
./run_python.sh python src/agents/essay_rubric_optimizer.py
Built With
- anthropic
- cursor
- deepeval
- intellij-idea
- markdown
- openai
- python
- sambanova
- weightsandbiases
