VibeCheckAI - Runtime Validation for AI-Generated Code

Inspiration

We've all pushed code that crashes in production, leaks API keys, or fails silently. Traditional static analysis misses runtime issues, and manual reviews are slow.

The problem: We trust AI-generated code based on intent, not behavior.

Our solution: Execute code in isolation, observe real behavior, and issue deployment certificates. We replace trust in AI intent with trust in observed behavior.

What it does

VibeCheckAI is a local-first security agent that validates code through runtime behavioral analysis:

  • 🛡️ Sandboxed Execution: Runs code in isolated Daytona workspaces (zero risk to your system)
  • 🔬 3-Sensor Safety Model:
    • Crash Sensor: Detects runtime errors, division by zero, exceptions
    • Pulse Sensor: Validates HTTP responsiveness and health checks
    • Leak Sensor: Scans for hardcoded secrets (API keys, tokens, passwords)
  • ⚖️ Intelligent Verdict Engine: Weighted risk scoring (0-100) with context-aware recommendations
  • 🏆 Execution Certificates: Cryptographically-signed certificates for validated code
  • 🤖 AI Integration: CodeRabbit analyzes incidents and provides fix suggestions via GitHub PRs
  • 📡 Real-time Monitoring: Sentry alerts for security incidents

Output: Clear verdicts (SAFE ✅ / CAUTION ⚠️ / BLOCKED 🚫) with actionable recommendations.
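The Leak Sensor's pattern matching can be sketched as below. These patterns are illustrative only; the actual `scan_secrets.py` reportedly covers 20+ patterns, and the names here are hypothetical.

```python
import re

# Illustrative subset of secret patterns (the real scanner covers 20+).
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"),
    "bearer_token": re.compile(r"(?i)bearer\s+[A-Za-z0-9\-._~+/]{20,}"),
}

def scan_for_secrets(source):
    """Return one finding per pattern match, with the offending line number."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append({"type": name, "line": lineno})
    return findings
```

Because the scan runs on the cloned source inside the sandbox, a hardcoded key is caught before the code is ever trusted.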

How we built it

Architecture (3-person team, 4-hour build):

Streamlit UI → Orchestrator → [Daytona Runner + Sensor Suite] → Verdict Engine → Certificate

Tech Stack:

  • Python 3.8+ with subprocess orchestration
  • Daytona for isolated workspace execution
  • Streamlit with custom mission-control theme
  • Sentry SDK for real-time incident monitoring
  • PyGithub for CodeRabbit PR automation
  • Regex-based secret scanning (20+ patterns)

Key Components:

  1. Orchestrator (orchestrator.py): Coordinates workspace creation, execution, and sensor collection
  2. Runner (internal_runner/runner.py): Executes code, tests routes, captures stdout/stderr
  3. Sensors (signals/):
    • scan_secrets.py: Pattern matching for API keys, tokens
    • sentry_reporter.py: Real-time alerting
    • coderabbit_trigger.py: Automated PR creation with fix suggestions
  4. Verdict Engine (verdict_engine.py): Weighted scoring algorithm (Leak: -100, Crash: -40/-60, Pulse: -20)
  5. UI (app.py): Mission-control themed dashboard with real-time progress
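The Verdict Engine's weighted scoring can be sketched from the weights listed above (Leak: -100, Crash: -40/-60, Pulse: -20). The thresholds and event names below are assumptions for illustration, not the shipped values; the leak-means-instant-fail rule is from the project's own design.

```python
# Penalties mirror the stated weights: Leak -100, Crash -40/-60, Pulse -20.
PENALTIES = {
    "leak": 100,
    "crash_fatal": 60,
    "crash_minor": 40,
    "pulse_failed": 20,
}

def compute_verdict(signals):
    """signals: list of sensor event names, e.g. ['crash_minor', 'pulse_failed'].

    Returns a 0-100 score plus a SAFE / CAUTION / BLOCKED verdict.
    Thresholds (80 / 50) are illustrative assumptions.
    """
    score = max(100 - sum(PENALTIES.get(s, 0) for s in signals), 0)
    if "leak" in signals:  # security leaks are an instant fail
        return {"score": score, "verdict": "BLOCKED"}
    if score >= 80:
        return {"score": score, "verdict": "SAFE"}
    if score >= 50:
        return {"score": score, "verdict": "CAUTION"}
    return {"score": score, "verdict": "BLOCKED"}
```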

Development Workflow:

  • Defined execution_report.json schema as team contract
  • Parallel development with mock data
  • Hour 3 integration checkpoint
  • Continuous testing throughout
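The team contract idea can be illustrated with a minimal report shape and check. The field names below are hypothetical, not the actual `execution_report.json` schema; the point is that any component can validate its mock data against the same contract.

```python
# Hypothetical shape of execution_report.json; field names are illustrative.
example_report = {
    "repo": "https://github.com/example/app",
    "exit_code": 0,
    "stdout": "Server listening on :3000",
    "stderr": "",
    "routes_tested": [{"path": "/health", "status": 200, "body": "ok"}],
    "secrets_found": [],
    "duration_ms": 1840,
}

def validate_report(report):
    """Minimal contract check each component can run against its mock data."""
    required = {"repo", "exit_code", "stdout", "stderr", "routes_tested", "secrets_found"}
    missing = required - report.keys()
    if missing:
        raise ValueError(f"execution_report missing fields: {sorted(missing)}")
    return True
```

With a check like this in each component's test suite, the UI, runner, and sensors could evolve in parallel and still meet cleanly at the integration checkpoint.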

Challenges we ran into

  1. Sandbox Security vs. Functionality

    • Challenge: Running untrusted code safely while detecting real issues
    • Solution: Daytona isolation + intelligent process monitoring
    • Learning: Security isolation doesn't mean blind execution
  2. JavaScript Division by Zero Detection

    • Challenge: 1/0 returns Infinity (doesn't crash), but it's still a logic error
    • Solution: Route testing with response content analysis, detect Infinity/NaN values
    • Result: Catches critical logic errors that static analysis misses
  3. CodeRabbit Path Filters

    • Challenge: CodeRabbit skips .log files by default
    • Solution: Generate structured .md reports with code snippets and fix suggestions
    • Result: CodeRabbit now analyzes incidents and provides actionable recommendations
  4. Real-time UI Performance

    • Challenge: Streamlit can lag with complex layouts
    • Solution: Efficient state management, cached components, 4-tab architecture
    • Trade-off: Clarity over complexity
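The Infinity/NaN detection from challenge 2 can be sketched as a response-body check. In JavaScript, `1/0` evaluates to `Infinity` and `0/0` to `NaN` without throwing, so the Crash Sensor never fires; inspecting the rendered route response catches the logic error anyway. The function name is hypothetical.

```python
import re

# Tokens a JS runtime emits when division "succeeds" but the math is wrong.
NUMERIC_ANOMALIES = re.compile(r"\b(Infinity|NaN)\b")

def route_response_is_suspicious(body):
    """Return True if a route's response body leaked Infinity or NaN."""
    return bool(NUMERIC_ANOMALIES.search(body))
```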

Accomplishments that we're proud of

End-to-end MVP in 4 hours: Complete workflow from repo clone to certificate generation

Multi-modal detection: Catches security leaks, runtime crashes, AND logic errors (division by zero)

Intelligent risk scoring: Research-based weighting prioritizes security (leak = instant fail) while providing nuanced assessment

Production-ready integrations:

  • Daytona workspace automation
  • Sentry real-time alerting
  • CodeRabbit AI analysis with fix suggestions

Mission-control UI: Professional dashboard with real-time execution logs, sensor status, and certificate generation

Comprehensive testing: 4 test scenarios covering all edge cases, automated test suite
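The execution certificates mentioned above could look like the following minimal sketch, assuming an HMAC-SHA256 tag over the verdict with a locally held key (the project lists `hashlib` in its stack; the exact signing scheme isn't specified here, so this is an illustration, not the shipped implementation).

```python
import hashlib
import hmac
import json
import time

def issue_certificate(report, secret_key):
    """Bind the verdict to a signed payload so it can be re-verified later."""
    payload = {
        "verdict": report["verdict"],
        "score": report["score"],
        "issued_at": int(time.time()),
    }
    body = json.dumps(payload, sort_keys=True).encode()
    payload["signature"] = hmac.new(secret_key, body, hashlib.sha256).hexdigest()
    return payload

def verify_certificate(cert, secret_key):
    """Recompute the tag over everything except the signature and compare."""
    unsigned = {k: v for k, v in cert.items() if k != "signature"}
    body = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(secret_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["signature"])
```

Any tampering with the score or verdict after issuance invalidates the signature on re-verification.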

What we learned

Technical:

  • Runtime validation > static analysis: Executing code reveals issues that pattern matching misses (e.g., division by zero returning Infinity)
  • Weighted scoring needs domain knowledge: Generic algorithms miss nuance; security leaks must be instant-fail
  • JSON schemas enable parallel development: Clear contracts = independent work streams
  • Mock data accelerates development: 70% of UI work completed before integration

Process:

  • Define interfaces first: execution_report.json schema was our team contract
  • Hour-based milestones: Clear checkpoints kept 3-person team aligned
  • Independent testing: Each component had its own test suite before integration

Design:

  • Verdicts need a binary bottom line: users want "safe to deploy" or "not safe," not probabilities
  • Actionable recommendations: "Remove line 42 from config.js" > "Fix the security leak"
  • Visual hierarchy matters: Mission-control theme with color-coded status badges

What's next for VibeCheckAI

Immediate (Next Sprint):

  • Enhanced detection: SQL injection, XSS scanning, code coverage analysis
  • CI/CD integration: GitHub Actions, pre-commit hooks, PR status checks
  • Performance profiling: Memory leaks, CPU usage, response time analysis

Short-term (Q1 2026):

  • AI-powered fix suggestions: Auto-generate patches for detected issues
  • Team dashboards: Multi-repo monitoring, trend analysis
  • Language expansion: Python, Java, Go support beyond Node.js

Long-term Vision:

  • Open-source sensor marketplace: Community-contributed detection patterns
  • Enterprise features: Self-hosted deployment, SSO, compliance reports
  • Monetization: Free tier (10 scans/month), Pro ($29/mo), Team ($99/mo), Enterprise (custom)

Impact Goal: Make runtime validation as standard as linting - every repo gets a "vibe check" before deployment.


Built with: Daytona, Sentry, CodeRabbit

Repository: https://github.com/harshapps/VibeCheckAI

Demo: Run streamlit run app.py and validate any repository

Built With

  • coderabbit
  • css
  • daytona
  • elevenlabs
  • git/github
  • hashlib
  • json
  • python
  • streamlit