CodeTribunal -- The AI Courtroom for Code Quality & Risk

What It Does

CodeTribunal is an AI-powered system that puts freelance code on trial.

Instead of generating a generic code review, it simulates a structured courtroom process:

  1. Evidence is collected using deterministic AST analysis (GritQL)
  2. Multiple AI agents investigate the codebase using real tools (file reads, pattern search, call tracing)
  3. A Prosecutor and Defense argue over the findings
  4. A Judge delivers a final verdict with a risk score
  5. An Expert Witness answers follow-up questions with full trial context + live tool access

The result is not just analysis -- it's a decision system:

Is this code safe, professional, and worth paying for?


The Problem

Clients regularly receive code they cannot evaluate:

  • It passes linters but contains security vulnerabilities
  • It works short-term but is architecturally fragile
  • It looks clean but hides technical debt

Traditional tools fail because:

  • Linters only catch syntax issues
  • Static analyzers lack context
  • Code reviews are subjective and inconsistent

There is no system that combines evidence, reasoning, and judgment.


The Solution

CodeTribunal introduces a new paradigm: code review as a legal process.

It transforms raw code into a structured, adversarial evaluation pipeline:

  1. Forensic Evidence (deterministic) -- AST-based scanning using GritQL with 18 forensic patterns detects real vulnerabilities (secrets, injections, unsafe functions, weak hashing)
  2. Code Dependency Graph -- Builds call chains to trace real impact (e.g. eval() is reachable from an API route that handles user input)
  3. AI Investigation (ReACT agents) -- Agents don't just "analyze" -- they read files, trace execution paths, and verify findings with tools
  4. Courtroom Trial -- Prosecutor builds the case, Defense challenges assumptions, Rebuttal strengthens arguments
  5. Final Verdict -- GUILTY / MIXED / NOT GUILTY + Risk score (0-100) + Severity-ranked findings
  6. Professional Report -- Actionable fixes, impact analysis, and estimated effort. Export to Markdown or PDF
  7. Expert Witness Q&A -- After the verdict, ask follow-up questions. An Expert Witness agent has full trial context plus tools (FileReader, CodeGraphQuery, FindingContext) to look up specific files, trace call chains, and explain findings in detail.

Ask things like:

  • "Why was eval() marked as critical?"
  • "Show me the call chain for the SQL injection finding"
  • "What's the actual risk of the hardcoded password?"
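To illustrate the deterministic side of step 1: the real pipeline uses 18 GritQL patterns, but the same kind of evidence can be sketched with a Python `ast` walk that flags unsafe builtin calls with exact locations. The pattern set and output shape here are illustrative, not the production scanner:

```python
import ast

# Hypothetical mini-detector in the spirit of the forensic scan.
# The real pipeline uses GritQL patterns; this checks just one class:
# calls to eval()/exec(), which execute arbitrary strings as code.
UNSAFE_CALLS = {"eval", "exec"}

def find_unsafe_calls(source: str, filename: str = "<uploaded>"):
    """Return (filename, line, name) evidence tuples for unsafe builtin calls."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in UNSAFE_CALLS):
            findings.append((filename, node.lineno, node.func.id))
    return findings

snippet = "def run(expr):\n    return eval(expr)\n"
print(find_unsafe_calls(snippet))  # one finding: eval on line 2
```

Because findings like this come from the AST rather than an LLM, the agents argue over ground truth instead of guesses.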

How We Built It

CodeTribunal is a hybrid deterministic + agentic system.

Core Stack

  Component          Technology                    Purpose
  LLM                GLM-5.1 via Z.ai (LiteLLM)    Agent reasoning and debate
  Code Scanning      GritQL                        Deterministic AST-level pattern matching
  Multi-Agent        CrewAI                        Agent orchestration, task chaining, context handoffs
  Function Calling   LiteLLM                       Direct ReACT loop with GLM-5.1 tool calling
  Code Graph         Python ast + regex            Dependency graph (Python + JS)
  UI                 Gradio                        Streaming chatbot, file upload, export
  Export             markdown-pdf (PyMuPDF)        PDF report generation

Key Technical Innovations

1. Real ReACT agents (not fake reasoning)

We bypassed unreliable agent tool routing by implementing:

  • Direct function calling via LiteLLM with GLM-5.1
  • Iterative Reason, Act, Observe loops
  • Agents that actually execute tools on code
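The Reason-Act-Observe loop can be sketched as below. In the real system the model step wraps litellm.completion with GLM-5.1 function calling; here `call_model` is injected so the control flow is visible offline, and the tool name is illustrative:

```python
import json

def react_loop(question, tools, call_model, max_steps=8):
    """Reason-Act-Observe: ask the model, execute any tool it requests,
    feed the observation back, repeat until it answers in plain text.

    call_model(messages) -> {"tool": name, "args": {...}} to act,
    or {"answer": text} to stop.  In production this wraps
    litellm.completion with native function calling.
    """
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if "answer" in decision:                      # model is done reasoning
            return decision["answer"]
        name, args = decision["tool"], decision.get("args", {})
        observation = tools[name](**args)             # Act: run the real tool
        messages.append({"role": "tool",              # Observe: feed result back
                         "content": json.dumps({name: observation})})
    return "max steps reached"

# Illustrative tool plus a scripted model, just to show the loop shape.
tools = {"read_file": lambda path: f"contents of {path}"}
script = iter([{"tool": "read_file", "args": {"path": "app.py"}},
               {"answer": "app.py calls eval() on user input"}])
print(react_loop("Is app.py safe?", tools, lambda msgs: next(script)))
```

The key property is that every tool call actually executes and its result re-enters the conversation, so the next reasoning step is grounded in real output.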

2. Deterministic + AI hybrid model

  • GritQL provides ground-truth evidence
  • Agents provide interpretation, context, and argumentation
  • This avoids hallucinations while preserving reasoning power

3. Code graph + call-chain tracing

We built a lightweight dependency graph to:

  • Track function calls and imports
  • Trace vulnerability impact across the system
  • Answer questions like: "Is this hardcoded secret reachable from a public API endpoint?"

4. Stateful pipeline engine

  • Execution persisted to JSON
  • Supports resume after failure
  • Handles rate limits with exponential backoff
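A minimal sketch of the resume mechanism, assuming a linear stage list and a JSON state file; the stage and field names are hypothetical:

```python
import json, os, tempfile

STAGES = ["evidence", "investigation", "trial", "verdict", "report"]

def run_pipeline(state_path, run_stage):
    """Execute stages in order, persisting results to JSON after each one
    so a crashed or interrupted run resumes where it left off."""
    state = {"completed": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)                  # resume after failure
    for stage in STAGES:
        if stage in state["completed"]:
            continue                              # already done in a prior run
        state["completed"][stage] = run_stage(stage)
        with open(state_path, "w") as f:
            json.dump(state, f)                   # checkpoint immediately
    return state

path = os.path.join(tempfile.mkdtemp(), "pipeline_state.json")
first = run_pipeline(path, lambda s: f"{s} ok")
second = run_pipeline(path, lambda s: 1 / 0)      # never called: all stages cached
print(second["completed"]["verdict"])
```

The second run never touches the (deliberately broken) stage function, which is exactly the property that makes rate-limit interruptions survivable.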

5. Structured multi-agent debate

Using CrewAI context chaining:

  • Prosecutor, Defense, Rebuttal
  • Each agent builds on previous outputs
  • This creates adversarial reasoning, not one-sided analysis

Challenges We Ran Into

1. Tool reliability in agents

CrewAI's default tool routing was inconsistent with GLM-5.1: tools would be ignored or called with the wrong arguments.

Solution: Built a custom ReACT loop using LiteLLM's direct function calling, bypassing CrewAI's unreliable tool routing entirely.

2. Balancing determinism vs. reasoning

Pure LLM analysis produces hallucinations. Pure static analysis lacks context.

Solution: Hybrid pipeline -- deterministic GritQL evidence grounds the agents, while agents provide interpretation, argument, and debate.

3. Scaling to large codebases

Feeding an entire repository into an LLM's context window is not feasible.

Solution: Agents retrieve context on demand via tools. No full codebase dumping. They read specific files, search patterns, and trace calls.
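The retrieval tools behind this can be sketched as a bounded, line-numbered file reader (the real FileReader tool's interface may differ; the window size is an assumption):

```python
import os, tempfile

def read_file_tool(path, start=1, max_lines=40):
    """Return a bounded, line-numbered slice of one file, so an agent pulls
    only the context it needs instead of the whole repository."""
    with open(path) as f:
        lines = f.read().splitlines()
    window = lines[start - 1:start - 1 + max_lines]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(window))

# Illustrative usage on a throwaway file.
tmp = os.path.join(tempfile.mkdtemp(), "db.py")
with open(tmp, "w") as f:
    f.write("import sqlite3\n\nPASSWORD = 'hunter2'\n")
print(read_file_tool(tmp, start=3, max_lines=1))  # prints "3: PASSWORD = 'hunter2'"
```

Bounding each read keeps any single tool observation small, so even large repositories never blow up the prompt.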

4. API rate limits

Multi-agent systems with 8 agents making tool calls quickly hit Z.ai rate limits.

Solution: Exponential backoff (4s to 64s, max 5 retries) on both CrewAI kickoff and LiteLLM completion calls. Pipeline state persisted so runs survive interruptions.
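The retry policy above can be sketched as a wrapper: a 4s base wait, doubling and capped at 64s, over at most 5 attempts. The sleep function is injectable here so the sketch runs instantly; in production it wraps the CrewAI kickoff and LiteLLM completion calls:

```python
import time

def with_backoff(call, max_retries=5, base=4, cap=64, sleep=time.sleep):
    """Retry `call` on failure, doubling the wait from 4s, capped at 64s."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise                             # out of retries: surface error
            sleep(min(base * 2 ** attempt, cap))  # 4 -> 8 -> 16 -> 32 (capped)

# Simulate a call that is rate-limited twice, then succeeds.
waits = []
flaky = iter([RuntimeError, RuntimeError, "verdict"])
def call():
    step = next(flaky)
    if isinstance(step, type):
        raise step("rate limited")
    return step
print(with_backoff(call, sleep=waits.append))     # prints "verdict"
print(waits)                                      # prints [4, 8]
```

Combined with the JSON checkpointing, even a run that exhausts its retries only loses the stage in flight, not the whole trial.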


Accomplishments We're Proud Of

  • Built a fully working multi-agent system with real tool execution (not simulated)
  • Combined AST-level static analysis with agent reasoning in a hybrid pipeline
  • Created a new UX paradigm: courtroom simulation for code review
  • Implemented end-to-end pipeline from raw code upload to structured verdict + PDF report
  • Achieved resilient execution with retry logic + pipeline persistence + resume support
  • 8 specialized agents, each with defined roles, tools, and domain expertise
  • Interactive Expert Witness Q&A -- ask follow-up questions after the verdict with an agent that retrieves specific evidence on demand

What We Learned

  • Agent systems only work when tools are reliable -- we had to build our own ReACT loop
  • Deterministic signals are critical to ground LLM reasoning and prevent hallucinations
  • Adversarial setups (prosecutor vs. defense debate) produce higher-quality outputs than single-agent analysis
  • Context retrieval (tools) beats large context windows (dumping everything into the prompt)

What's Next

  • Support for more languages (Java, Go, Rust)
  • GitHub integration (analyze PRs automatically)
  • Team dashboards for agencies and clients
  • Fine-tuned risk scoring based on real-world incidents
  • SaaS platform for freelancer-client trust

Impact

CodeTribunal reduces one of the biggest trust gaps in tech: clients don't know if code is good.

This system enables:

  • Better hiring decisions
  • Safer software delivery
  • Accountability in freelance work


Built With

  • crewai
  • glm-5.1
  • gradio
  • gritql
  • python