CodeTribunal -- The AI Courtroom for Code Quality & Risk

What It Does

CodeTribunal is an AI-powered system that puts freelance code on trial.

Instead of generating a generic code review, it simulates a structured courtroom process:

  1. Evidence is collected using deterministic AST analysis (GritQL)
  2. Multiple AI agents investigate the codebase using real tools (file reads, pattern search, call tracing)
  3. A Prosecutor and Defense argue over the findings
  4. A Judge delivers a final verdict with a risk score
  5. An Expert Witness answers follow-up questions with full trial context + live tool access

The result is not just analysis -- it's a decision system:

Is this code safe, professional, and worth paying for?


The Problem

Clients regularly receive code they cannot evaluate:

  • It passes linters but contains security vulnerabilities
  • It works short-term but is architecturally fragile
  • It looks clean but hides technical debt

Traditional tools fail because:

  • Linters only catch syntax issues
  • Static analyzers lack context
  • Code reviews are subjective and inconsistent

There is no system that combines evidence, reasoning, and judgment.


The Solution

CodeTribunal introduces a new paradigm: code review as a legal process.

It transforms raw code into a structured, adversarial evaluation pipeline:

  1. Forensic Evidence (deterministic) -- AST-based scanning using GritQL with 18 forensic patterns detects real vulnerabilities (secrets, injections, unsafe functions, weak hashing)
  2. Code Dependency Graph -- Builds call chains to trace real impact (e.g. eval() is reachable from an API route that handles user input)
  3. AI Investigation (ReACT agents) -- Agents don't just "analyze" -- they read files, trace execution paths, and verify findings with tools
  4. Courtroom Trial -- Prosecutor builds the case, Defense challenges assumptions, Rebuttal strengthens arguments
  5. Final Verdict -- GUILTY / MIXED / NOT GUILTY + Risk score (0-100) + Severity-ranked findings
  6. Professional Report -- Actionable fixes, impact analysis, and estimated effort. Export to Markdown or PDF
  7. Expert Witness Q&A -- After the verdict, ask follow-up questions. An Expert Witness agent has full trial context plus tools (FileReader, CodeGraphQuery, FindingContext) to look up specific files, trace call chains, and explain findings in detail.

Ask things like:

  • "Why was eval() marked as critical?"
  • "Show me the call chain for the SQL injection finding"
  • "What's the actual risk of the hardcoded password?"
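To illustrate the deterministic side of step 1: the real pipeline uses 18 GritQL patterns, but the same kind of evidence can be sketched with a Python `ast` walk that flags unsafe builtin calls with exact locations. The pattern set and output shape here are illustrative, not the production scanner:

```python
import ast

# Hypothetical mini-detector in the spirit of the forensic scan.
# The real pipeline uses GritQL patterns; this checks just one class:
# calls to eval()/exec(), which execute arbitrary strings as code.
UNSAFE_CALLS = {"eval", "exec"}

def find_unsafe_calls(source: str, filename: str = "<uploaded>"):
    """Return (filename, line, name) evidence tuples for unsafe builtin calls."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in UNSAFE_CALLS):
            findings.append((filename, node.lineno, node.func.id))
    return findings

snippet = "def run(expr):\n    return eval(expr)\n"
print(find_unsafe_calls(snippet))  # one finding: eval on line 2
```

Because findings like this come from the AST rather than an LLM, the agents argue over ground truth instead of guesses.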

How We Built It

CodeTribunal is a hybrid deterministic + agentic system.

Core Stack

  Component          Technology                    Purpose
  LLM                GLM-5.1 via Z.ai (LiteLLM)    Agent reasoning and debate
  Code Scanning      GritQL                        Deterministic AST-level pattern matching
  Multi-Agent        CrewAI                        Agent orchestration, task chaining, context handoffs
  Function Calling   LiteLLM                       Direct ReACT loop with GLM-5.1 tool calling
  Code Graph         Python ast + regex            Dependency graph (Python + JS)
  UI                 Gradio                        Streaming chatbot, file upload, export
  Export             markdown-pdf (PyMuPDF)        PDF report generation

Key Technical Innovations

1. Real ReACT agents (not fake reasoning)

We bypassed unreliable agent tool routing by implementing:

  • Direct function calling via LiteLLM with GLM-5.1
  • Iterative Reason, Act, Observe loops
  • Agents that actually execute tools on code
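The Reason-Act-Observe loop can be sketched as below. In the real system the model step wraps litellm.completion with GLM-5.1 function calling; here `call_model` is injected so the control flow is visible offline, and the tool name is illustrative:

```python
import json

def react_loop(question, tools, call_model, max_steps=8):
    """Reason-Act-Observe: ask the model, execute any tool it requests,
    feed the observation back, repeat until it answers in plain text.

    call_model(messages) -> {"tool": name, "args": {...}} to act,
    or {"answer": text} to stop.  In production this wraps
    litellm.completion with native function calling.
    """
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if "answer" in decision:                      # model is done reasoning
            return decision["answer"]
        name, args = decision["tool"], decision.get("args", {})
        observation = tools[name](**args)             # Act: run the real tool
        messages.append({"role": "tool",              # Observe: feed result back
                         "content": json.dumps({name: observation})})
    return "max steps reached"

# Illustrative tool plus a scripted model, just to show the loop shape.
tools = {"read_file": lambda path: f"contents of {path}"}
script = iter([{"tool": "read_file", "args": {"path": "app.py"}},
               {"answer": "app.py calls eval() on user input"}])
print(react_loop("Is app.py safe?", tools, lambda msgs: next(script)))
```

The key property is that every tool call actually executes and its result re-enters the conversation, so the next reasoning step is grounded in real output.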

2. Deterministic + AI hybrid model

  • GritQL provides ground-truth evidence
  • Agents provide interpretation, context, and argumentation
  • This avoids hallucinations while preserving reasoning power

3. Code graph + call-chain tracing

We built a lightweight dependency graph to:

  • Track function calls and imports
  • Trace vulnerability impact across the system
  • Answer questions like: "Is this hardcoded secret reachable from a public API endpoint?"

4. Stateful pipeline engine

  • Execution persisted to JSON
  • Supports resume after failure
  • Handles rate limits with exponential backoff
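A minimal sketch of the resume mechanism, assuming a linear stage list and a JSON state file; the stage and field names are hypothetical:

```python
import json, os, tempfile

STAGES = ["evidence", "investigation", "trial", "verdict", "report"]

def run_pipeline(state_path, run_stage):
    """Execute stages in order, persisting results to JSON after each one
    so a crashed or interrupted run resumes where it left off."""
    state = {"completed": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)                  # resume after failure
    for stage in STAGES:
        if stage in state["completed"]:
            continue                              # already done in a prior run
        state["completed"][stage] = run_stage(stage)
        with open(state_path, "w") as f:
            json.dump(state, f)                   # checkpoint immediately
    return state

path = os.path.join(tempfile.mkdtemp(), "pipeline_state.json")
first = run_pipeline(path, lambda s: f"{s} ok")
second = run_pipeline(path, lambda s: 1 / 0)      # never called: all stages cached
print(second["completed"]["verdict"])
```

The second run never touches the (deliberately broken) stage function, which is exactly the property that makes rate-limit interruptions survivable.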

5. Structured multi-agent debate

Using CrewAI context chaining:

  • Prosecutor, Defense, Rebuttal
  • Each agent builds on previous outputs
  • This creates adversarial reasoning, not one-sided analysis

Challenges We Ran Into

1. Tool reliability in agents

CrewAI's default tool routing was inconsistent with GLM-5.1: tools would be ignored or called with the wrong arguments.

Solution: Built a custom ReACT loop using LiteLLM's direct function calling, bypassing CrewAI's unreliable tool routing entirely.

2. Balancing determinism vs. reasoning

Pure LLM analysis produces hallucinations. Pure static analysis lacks context.

Solution: Hybrid pipeline -- deterministic GritQL evidence grounds the agents, while agents provide interpretation, argument, and debate.

3. Scaling to large codebases

Feeding an entire repository into an LLM's context window is not feasible.

Solution: Agents retrieve context on demand via tools. No full codebase dumping. They read specific files, search patterns, and trace calls.
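The retrieval tools behind this can be sketched as a bounded, line-numbered file reader (the real FileReader tool's interface may differ; the window size is an assumption):

```python
import os, tempfile

def read_file_tool(path, start=1, max_lines=40):
    """Return a bounded, line-numbered slice of one file, so an agent pulls
    only the context it needs instead of the whole repository."""
    with open(path) as f:
        lines = f.read().splitlines()
    window = lines[start - 1:start - 1 + max_lines]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(window))

# Illustrative usage on a throwaway file.
tmp = os.path.join(tempfile.mkdtemp(), "db.py")
with open(tmp, "w") as f:
    f.write("import sqlite3\n\nPASSWORD = 'hunter2'\n")
print(read_file_tool(tmp, start=3, max_lines=1))  # prints "3: PASSWORD = 'hunter2'"
```

Bounding each read keeps any single tool observation small, so even large repositories never blow up the prompt.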

4. API rate limits

Multi-agent systems with 8 agents making tool calls quickly hit Z.ai rate limits.

Solution: Exponential backoff (4s to 64s, max 5 retries) on both CrewAI kickoff and LiteLLM completion calls. Pipeline state persisted so runs survive interruptions.
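The retry policy above can be sketched as a wrapper: a 4s base wait, doubling and capped at 64s, over at most 5 attempts. The sleep function is injectable here so the sketch runs instantly; in production it wraps the CrewAI kickoff and LiteLLM completion calls:

```python
import time

def with_backoff(call, max_retries=5, base=4, cap=64, sleep=time.sleep):
    """Retry `call` on failure, doubling the wait from 4s, capped at 64s."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise                             # out of retries: surface error
            sleep(min(base * 2 ** attempt, cap))  # 4 -> 8 -> 16 -> 32 (capped)

# Simulate a call that is rate-limited twice, then succeeds.
waits = []
flaky = iter([RuntimeError, RuntimeError, "verdict"])
def call():
    step = next(flaky)
    if isinstance(step, type):
        raise step("rate limited")
    return step
print(with_backoff(call, sleep=waits.append))     # prints "verdict"
print(waits)                                      # prints [4, 8]
```

Combined with the JSON checkpointing, even a run that exhausts its retries only loses the stage in flight, not the whole trial.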


Accomplishments We're Proud Of

  • Built a fully working multi-agent system with real tool execution (not simulated)
  • Combined AST-level static analysis with agent reasoning in a hybrid pipeline
  • Created a new UX paradigm: courtroom simulation for code review
  • Implemented end-to-end pipeline from raw code upload to structured verdict + PDF report
  • Achieved resilient execution with retry logic + pipeline persistence + resume support
  • 8 specialized agents, each with defined roles, tools, and domain expertise
  • Interactive Expert Witness Q&A -- ask follow-up questions after the verdict with an agent that retrieves specific evidence on demand

What We Learned

  • Agent systems only work when tools are reliable -- we had to build our own ReACT loop
  • Deterministic signals are critical to ground LLM reasoning and prevent hallucinations
  • Adversarial setups (prosecutor vs. defense debate) produce higher-quality outputs than single-agent analysis
  • Context retrieval (tools) beats large context windows (dumping everything into the prompt)

What's Next

  • Support for more languages (Java, Go, Rust)
  • GitHub integration (analyze PRs automatically)
  • Team dashboards for agencies and clients
  • Fine-tuned risk scoring based on real-world incidents
  • SaaS platform for freelancer-client trust

Impact

CodeTribunal reduces one of the biggest trust gaps in tech: clients don't know if code is good.

This system enables:

  • Better hiring decisions
  • Safer software delivery
  • Accountability in freelance work


Built With

  • crewai
  • glm-5.1
  • gradio
  • gritql
  • python