Inspiration

Developer tools often assume the AI knows what it's doing, but what if the human doesn't? We built VibeCheck so that AI-assisted code changes come with genuine understanding, not just accepted autocomplete suggestions.

What it does

VibeCheck is a knowledge gate and QA loop scaffold for Claude Code that:

  1. Intercepts code mutations before they're applied
  2. Evaluates complexity against the user's demonstrated competence
  3. Uses LLMs to generate targeted questions that probe understanding of why the change works, not just what it does
  4. Adapts difficulty across 3 scaffolding levels (conceptual → guided → hinting)
  5. Tracks competence over time in a persistent YAML model

Small, safe changes pass automatically. Complex changes (concurrency, async patterns, multiprocessing) trigger an interactive QA session that either validates understanding or applies a competence penalty.
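As a rough illustration of the gate decision described above, here is a minimal sketch. It is not the project's actual code: the `CONCEPT_COMPLEXITY` weights, the `GateDecision` type, and the `gate` function are all hypothetical names, and the real complexity evaluation may work quite differently.

```python
from dataclasses import dataclass

# Hypothetical complexity weights per concept (assumed values, not from the project).
CONCEPT_COMPLEXITY = {
    "python.concurrency": 0.8,
    "python.async": 0.7,
    "python.multiprocessing": 0.9,
}


@dataclass
class GateDecision:
    allow: bool          # True means the change passes without a QA session
    concepts: list[str]  # concepts the user would be questioned on


def gate(change_concepts: list[str], competence: dict[str, float]) -> GateDecision:
    """Flag every concept whose assumed complexity exceeds the user's score."""
    to_quiz = [
        concept
        for concept in change_concepts
        # Unknown concepts default to complexity 0.0, so they pass automatically.
        if CONCEPT_COMPLEXITY.get(concept, 0.0) > competence.get(concept, 0.0)
    ]
    return GateDecision(allow=not to_quiz, concepts=to_quiz)
```

Under these assumptions, `gate(["python.async"], {"python.async": 0.2})` would block the change and request a QA session on `python.async`, while a change touching no tracked concept passes straight through.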

How we built it

Built with a Python-first architecture, using LangChain for structured outputs and OpenRouter for model access.

Challenges we ran into

  • LLM consistency: Structured outputs from LLMs required careful prompt engineering and output parsing
  • Question scaffolding: Balancing difficulty across 3 attempt levels without making questions trivial
  • Evaluation strictness: Finding the right criteria for different question types (true/false vs. plain English vs. faded examples)
  • Test isolation: Mocking the OpenRouter client while maintaining realistic test coverage
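One way to picture the scaffolding and test-isolation challenges together is a loop that walks the three levels in order and takes the question and evaluation steps as plain callables. The sketch below is an assumption about the shape of such a loop, not the project's implementation: `run_qa`, `ask`, and `evaluate` are hypothetical names, and in a real run `evaluate` would presumably wrap an LLM call.

```python
from typing import Callable

# Level order taken from the write-up: conceptual, then guided, then hinting.
LEVELS = ["conceptual", "guided", "hinting"]


def run_qa(
    ask: Callable[[str], str],
    evaluate: Callable[[str, str], bool],
) -> tuple[bool, str]:
    """Try each scaffolding level in turn and stop at the first accepted answer.

    Returns (passed, level) where level is the level that was passed,
    or the last level attempted if every attempt failed.
    """
    for level in LEVELS:
        answer = ask(level)
        if evaluate(level, answer):
            return True, level
    return False, LEVELS[-1]
```

Because both steps are injected, a test can pass simple stubs instead of a model client, for example `run_qa(lambda level: "?", lambda level, answer: False)`, which exhausts all three levels and reports a failure at `hinting`.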

Accomplishments that we're proud of

  • Spec-first development: Started with finalized_MVP_spec.md and built the entire system from it
  • Clean architecture: Separate concerns between gate logic, Q&A orchestration, and persistence
  • Interactive demo: test_script.py simulates realistic VibeCheck runs with complex concurrency patterns
  • Comprehensive testing: A test suite covering the gate, normalization, aggregation, and the QA loop

What we learned

  • LLM integration patterns: Structured outputs via LangChain's with_structured_output() are powerful but require defensive type handling
  • Competence tracking: Simple delta-based score adjustments with evidence logging create an auditable learning record
  • Python packaging: uv as a unified tool for dependency management, testing, and linting
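The delta-based competence update with evidence logging might look roughly like the sketch below. The model layout (a `score` plus an `evidence` list per concept), the clamping to the range 0 to 1, and the `apply_delta` name are all assumptions made for illustration; the project's YAML schema is not shown in this write-up.

```python
from datetime import datetime, timezone


def apply_delta(model: dict, concept: str, delta: float, reason: str) -> dict:
    """Adjust one concept's score by delta and append an evidence entry.

    The score is clamped to [0.0, 1.0] (an assumed range) and the model
    is updated in place and returned.
    """
    entry = model.setdefault(concept, {"score": 0.0, "evidence": []})
    entry["score"] = round(min(1.0, max(0.0, entry["score"] + delta)), 3)
    entry["evidence"].append(
        {
            "delta": delta,
            "reason": reason,
            # ISO 8601 timestamp so each adjustment can be audited later.
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return model
```

The model here is built only from dicts, lists, floats, and strings, so it could be written out with PyYAML's `yaml.safe_dump` without custom serializers. Keeping every adjustment in the evidence list is what makes the record auditable: the current score can always be explained by replaying the logged deltas.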

What's next for VibeCheck

  • Vector-based concept hierarchy for nested competence tracking (e.g., 'python.async' under 'python.concurrency')
  • Batch mode for validating multiple related changes at once
  • Analytics dashboard showing competence trends over time

Built With

  • gpt-4o
  • langchain
  • openrouter
  • pyright
  • pytest
  • python
  • pyyaml
  • ruff
  • uv