VibeCheck

Inspiration

Developer tools often assume the AI knows what it's doing, but what if the human doesn't? We built VibeCheck to ensure that AI-assisted code changes come with genuine understanding and not just autocomplete.

What it does

VibeCheck is a knowledge gate and QA loop scaffold for Claude Code that:

Intercepts code mutations before they're applied
Evaluates complexity against the user's demonstrated competence
Uses LLMs to generate targeted questions that probe why the change works, not just what it does
Adapts difficulty across 3 scaffolding levels (conceptual → guided → hinting)
Tracks competence over time in a persistent YAML model

Small, safe changes pass automatically. Complex changes (concurrency, async patterns, multiprocessing) trigger an interactive QA session that either validates understanding or applies a competence penalty.
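The gate decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the concept weights, the 0-to-1 score scale, and the names `COMPLEX_CONCEPTS`, `GateDecision`, and `gate` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical complexity weights per concept (assumed 0.0 to 1.0 scale).
COMPLEX_CONCEPTS = {"concurrency": 0.8, "async": 0.7, "multiprocessing": 0.9}


@dataclass
class GateDecision:
    action: str             # "pass" or "qa"
    concept: Optional[str]  # concept that triggered the QA session, if any


def gate(change_concepts: list, competence: dict) -> GateDecision:
    """Pass a change unless a concept's complexity exceeds the user's score."""
    for concept in change_concepts:
        # Concepts without a weight are treated as simple (assumption).
        complexity = COMPLEX_CONCEPTS.get(concept, 0.0)
        # A missing score means no demonstrated competence yet (assumption).
        if complexity > competence.get(concept, 0.0):
            return GateDecision("qa", concept)
    return GateDecision("pass", None)
```

With these assumed weights, `gate(["async"], {"async": 0.5})` would route to a QA session, while a change touching only unlisted concepts would pass automatically.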

How we built it

Built with a Python-first architecture, using LangChain for structured outputs and OpenRouter for model access.
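Structured outputs may come back as a typed model object or as a plain dict, depending on how the schema is declared, so a normalization step helps. Below is a minimal sketch of such a step in plain Python. The `Question` fields and the `to_question` helper are hypothetical names chosen for illustration, and the sketch deliberately avoids calling LangChain itself.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Question:
    text: str
    kind: str   # e.g. "true_false", "plain_english", "faded_example" (assumed labels)
    level: int  # scaffolding level, 1 to 3


def to_question(raw: Any) -> Question:
    """Coerce a structured-output result into a Question, or raise ValueError."""
    if isinstance(raw, Question):
        return raw
    if hasattr(raw, "model_dump"):  # Pydantic v2 models expose model_dump()
        raw = raw.model_dump()
    if not isinstance(raw, dict):
        raise ValueError(f"unexpected structured output type: {type(raw).__name__}")
    try:
        text = str(raw["text"]).strip()
        kind = str(raw["kind"])
        level = int(raw["level"])  # tolerate "2" as well as 2
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"malformed question payload: {raw!r}") from exc
    if not text or not 1 <= level <= 3:
        raise ValueError(f"question failed validation: {raw!r}")
    return Question(text, kind, level)
```

Failing loudly with a single exception type keeps the retry or fallback logic in the caller simple.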

Challenges we ran into

LLM consistency: Structured outputs from LLMs required careful prompt engineering and output parsing
Question scaffolding: Balancing difficulty across 3 attempt levels without making questions trivial
Evaluation strictness: Finding the right criteria for different question types (true/false vs. plain English vs. faded examples)
Test isolation: Mocking the OpenRouter client while maintaining realistic test coverage
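The test-isolation point can be illustrated with the standard library's `unittest.mock`. This is a sketch only: `ask_question` and the client's `generate` method are hypothetical stand-ins, not the project's real OpenRouter client interface.

```python
from unittest.mock import MagicMock


def ask_question(client, concept: str, level: int) -> str:
    """Hypothetical wrapper: request one question and return its cleaned text."""
    response = client.generate(prompt=f"Ask a level-{level} question about {concept}.")
    return response["text"].strip()


def test_ask_question_uses_client():
    # The mock replaces the network-backed client, so no API call is made.
    client = MagicMock()
    client.generate.return_value = {"text": "  Why does the lock prevent the race?  "}

    assert ask_question(client, "concurrency", 1) == "Why does the lock prevent the race?"

    # Checking the prompt keeps the test meaningful despite the mock.
    client.generate.assert_called_once()
    assert "concurrency" in client.generate.call_args.kwargs["prompt"]
```

Asserting on the arguments passed to the mock, not only on the return value, is one way to keep coverage realistic when the model itself is stubbed out.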

Accomplishments that we're proud of

Spec-first development: Started with finalized_MVP_spec.md and built the entire system from it
Clean architecture: Separate concerns between gate logic, Q&A orchestration, and persistence
Interactive demo: test_script.py simulates realistic VibeCheck runs with complex concurrency patterns
Comprehensive testing: Tests cover the gate, normalization, aggregation, and the QA loop

What we learned

LLM integration patterns: Structured outputs via LangChain's with_structured_output() are powerful but require defensive type handling
Competence tracking: Simple delta-based score adjustments with evidence logging create an auditable learning record
Python packaging: uv as a unified tool for dependency management, testing, and linting
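The delta-based competence tracking mentioned above might look something like this sketch. The class name, the clamping of scores to the range 0.0 to 1.0, and the evidence fields are assumptions for illustration; the actual YAML schema is not shown in this writeup.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CompetenceModel:
    scores: dict = field(default_factory=dict)    # concept -> score
    evidence: list = field(default_factory=list)  # append-only audit log

    def apply(self, concept: str, delta: float, reason: str) -> float:
        """Adjust a concept's score by delta and record why it changed."""
        old = self.scores.get(concept, 0.0)
        new = min(1.0, max(0.0, old + delta))  # clamp to [0, 1] (assumption)
        self.scores[concept] = new
        self.evidence.append({
            "concept": concept,
            "delta": delta,
            "old": old,
            "new": new,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return new
```

Because both fields are plain dicts and lists, they could be written to a YAML file with any YAML library, and the evidence list gives a per-change record of how each score was reached.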

What's next for VibeCheck

Vector-based concept hierarchy for nested competence tracking (e.g., 'python.async' under 'python.concurrency')
Batch mode for validating multiple related changes at once
Analytics dashboard showing competence trends over time