CodeTribunal -- The AI Courtroom for Code Quality & Risk
What It Does
CodeTribunal is an AI-powered system that puts freelance code on trial.
Instead of generating a generic code review, it simulates a structured courtroom process:
- Evidence is collected using deterministic AST analysis (GritQL)
- Multiple AI agents investigate the codebase using real tools (file reads, pattern search, call tracing)
- A Prosecutor and Defense argue over the findings
- A Judge delivers a final verdict with a risk score
- An Expert Witness answers follow-up questions with full trial context + live tool access
The result is not just analysis -- it's a decision system:
Is this code safe, professional, and worth paying for?
The Problem
Clients regularly receive code they cannot evaluate:
- It passes linters but contains security vulnerabilities
- It works short-term but is architecturally fragile
- It looks clean but hides technical debt
Traditional tools fail because:
- Linters only catch surface-level style and syntax issues
- Static analyzers lack context
- Code reviews are subjective and inconsistent
There is no system that combines evidence, reasoning, and judgment.
The Solution
CodeTribunal introduces a new paradigm: code review as a legal process.
It transforms raw code into a structured, adversarial evaluation pipeline:
- Forensic Evidence (deterministic) -- AST-based scanning using GritQL with 18 forensic patterns detects real vulnerabilities (secrets, injections, unsafe functions, weak hashing)
- Code Dependency Graph -- Builds call chains to trace real impact (e.g. eval() reaches an API route, which reaches user input)
- AI Investigation (ReACT agents) -- Agents don't just "analyze" -- they read files, trace execution paths, and verify findings with tools
- Courtroom Trial -- Prosecutor builds the case, Defense challenges assumptions, Rebuttal strengthens arguments
- Final Verdict -- GUILTY / MIXED / NOT GUILTY + Risk score (0-100) + Severity-ranked findings
- Professional Report -- Actionable fixes, impact analysis, and estimated effort. Export to Markdown or PDF
- Expert Witness Q&A -- After the verdict, ask follow-up questions. An Expert Witness agent has full trial context plus tools (FileReader, CodeGraphQuery, FindingContext) to look up specific files, trace call chains, and explain findings in detail.
Ask things like:
- "Why was eval() marked as critical?"
- "Show me the call chain for the SQL injection finding"
- "What's the actual risk of the hardcoded password?"
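The deterministic evidence layer can be sketched with Python's own ast module. The real system expresses its 18 forensic rules in GritQL; the two rules below (unsafe eval/exec calls and hardcoded secret-looking assignments) are illustrative assumptions that only mirror the idea of AST-level, deterministic evidence collection:

```python
import ast

# Names that suggest a hardcoded credential (illustrative list, not the
# actual GritQL pattern set).
SECRET_NAMES = {"password", "secret", "api_key", "token"}

def scan_source(source: str, filename: str = "<upload>") -> list[dict]:
    """Collect deterministic findings from one Python source file."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        # Rule 1: any eval()/exec() call is flagged as critical.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings.append({"rule": "unsafe-call", "severity": "critical",
                             "symbol": node.func.id, "line": node.lineno})
        # Rule 2: a string literal assigned to a secret-looking name.
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (isinstance(target, ast.Name)
                        and target.id.lower() in SECRET_NAMES
                        and isinstance(node.value, ast.Constant)
                        and isinstance(node.value.value, str)):
                    findings.append({"rule": "hardcoded-secret",
                                     "severity": "high",
                                     "symbol": target.id,
                                     "line": node.lineno})
    return findings
```

Because the rules run on the parsed AST rather than on LLM output, every finding they produce is reproducible ground truth for the agents to argue over.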
How We Built It
CodeTribunal is a hybrid deterministic + agentic system.
Core Stack
| Component | Technology | Purpose |
|---|---|---|
| LLM | GLM 5.1 via Z.ai (LiteLLM) | Agent reasoning and debate |
| Code Scanning | GritQL | Deterministic AST-level pattern matching |
| Multi-Agent | CrewAI | Agent orchestration, task chaining, context handoffs |
| Function Calling | LiteLLM | Direct ReACT loop with GLM-5.1 tool calling |
| Code Graph | Python ast + regex | Dependency graph (Python + JS) |
| UI | Gradio | Streaming chatbot, file upload, export |
| Export | markdown-pdf (PyMuPDF) | PDF report generation |
Key Technical Innovations
1. Real ReACT agents (not fake reasoning)
We bypassed unreliable agent tool routing by implementing:
- Direct function calling via LiteLLM with GLM-5.1
- Iterative Reason, Act, Observe loops
- Agents that actually execute tools on code
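The loop itself can be sketched in a model-agnostic way. In CodeTribunal the model call is litellm.completion with GLM-5.1 tool calling; here `call_model` is a stand-in callable (an assumption for illustration) so only the Reason-Act-Observe pattern is shown:

```python
import json

def react_loop(call_model, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal Reason-Act-Observe loop.

    `call_model` stands in for the LLM call; it returns either
    {"tool": name, "args": {...}} to act, or {"answer": text} to stop.
    `tools` maps tool names to plain Python callables.
    """
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = call_model(transcript)          # Reason
        if "answer" in decision:
            return decision["answer"]              # final answer reached
        tool = tools[decision["tool"]]             # Act: execute the tool
        observation = tool(**decision.get("args", {}))
        transcript.append({"role": "tool",         # Observe: feed result back
                           "content": json.dumps({"tool": decision["tool"],
                                                  "result": observation})})
    return "max steps reached without a final answer"
```

The key property is that every observation comes from actually executing a tool on the code, never from the model imagining a file's contents.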
2. Deterministic + AI hybrid model
- GritQL provides ground-truth evidence
- Agents provide interpretation, context, and argumentation
- This avoids hallucinations while preserving reasoning power
3. Code graph + call-chain tracing
We built a lightweight dependency graph to:
- Track function calls and imports
- Trace vulnerability impact across the system
- Answer questions like: "Is this hardcoded secret reachable from a public API endpoint?"
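A lightweight version of that graph can be built with the ast module alone. This sketch covers only top-level Python functions (the real graph also tracks imports and handles JavaScript via regex), and the depth-first `trace_chain` helper is an illustrative name:

```python
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict:
    """Map each top-level function to the set of names it calls."""
    graph = defaultdict(set)
    tree = ast.parse(source)
    for fn in [n for n in tree.body if isinstance(n, ast.FunctionDef)]:
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                graph[fn.name].add(node.func.id)
    return dict(graph)

def trace_chain(graph: dict, start: str, target: str, path=None):
    """Depth-first search for a call chain from `start` to `target`."""
    path = (path or []) + [start]
    if target in graph.get(start, set()):
        return path + [target]
    for callee in graph.get(start, set()):
        if callee not in path:  # avoid cycles
            found = trace_chain(graph, callee, target, path)
            if found:
                return found
    return None
```

Given a route handler that indirectly reaches eval(), the trace makes the impact concrete: the chain from entry point to dangerous sink is the evidence the Prosecutor cites.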
4. Stateful pipeline engine
- Execution persisted to JSON
- Supports resume after failure
- Handles rate limits with exponential backoff
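The persistence idea is simple enough to show directly. This is a minimal sketch, assuming a list of named stages and a flat JSON state file; the stage names and file layout are illustrative, not CodeTribunal's exact schema:

```python
import json
import pathlib

def run_pipeline(stages, state_path="trial_state.json"):
    """Run named stages in order, saving each result to JSON so a
    crashed or rate-limited run can resume where it stopped.

    `stages` is a list of (name, fn) pairs; each fn takes the
    accumulated state dict and returns that stage's output.
    """
    path = pathlib.Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    for name, fn in stages:
        if name in state:          # already completed in a previous run
            continue
        state[name] = fn(state)    # may raise; earlier stages stay saved
        path.write_text(json.dumps(state, indent=2))
    return state
```

Because the state file is written after every stage, an interruption during the debate never forces the (expensive) investigation stages to re-run.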
5. Structured multi-agent debate
Using CrewAI context chaining:
- Prosecutor, Defense, Rebuttal
- Each agent builds on previous outputs
- This creates adversarial reasoning, not one-sided analysis
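Stripped of the CrewAI framework, the handoff pattern looks like this. Each agent here is a plain callable receiving the evidence plus every prior output, which is an illustrative simplification of CrewAI task-context chaining:

```python
def run_debate(agents, evidence):
    """Chain adversarial agents so each sees all previous outputs.

    `agents` is an ordered list of (name, fn) pairs, where fn takes
    (evidence, prior_outputs) and returns that agent's argument.
    """
    outputs = {}
    for name, agent in agents:
        # Pass a copy so each agent reads, but cannot mutate, earlier work.
        outputs[name] = agent(evidence, dict(outputs))
    return outputs
```

The ordering matters: the Defense argues against a concrete prosecution case rather than against the raw evidence, which is what makes the reasoning adversarial instead of parallel.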
Challenges We Ran Into
1. Tool reliability in agents
CrewAI's default tool routing was inconsistent with GLM-5.1: tools would be ignored or called with the wrong arguments.
Solution: Built a custom ReACT loop using LiteLLM's direct function calling, bypassing CrewAI's unreliable tool routing entirely.
2. Balancing determinism vs. reasoning
Pure LLM analysis produces hallucinations. Pure static analysis lacks context.
Solution: Hybrid pipeline -- deterministic GritQL evidence grounds the agents, while agents provide interpretation, argument, and debate.
3. Scaling to large codebases
Feeding entire repos into an LLM is not feasible.
Solution: Agents retrieve context on demand via tools. No full codebase dumping. They read specific files, search patterns, and trace calls.
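A file-reading tool with a bounded window is the core of this. The slice interface and the 80-line cap below are illustrative choices, not the exact FileReader API:

```python
import pathlib

def read_file_tool(path: str, start: int = 1, max_lines: int = 80) -> str:
    """On-demand context retrieval: return a bounded slice of one file
    instead of dumping the repository into the prompt.

    The header tells the agent where the window sits, so it can ask
    for the next slice if it needs more context.
    """
    lines = pathlib.Path(path).read_text().splitlines()
    window = lines[start - 1:start - 1 + max_lines]
    header = f"{path} lines {start}-{start - 1 + len(window)} of {len(lines)}"
    return header + "\n" + "\n".join(window)
```

With every tool result capped, even an 8-agent trial over a large repository stays within a normal context window.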
4. API rate limits
Multi-agent systems with 8 agents making tool calls quickly hit Z.ai rate limits.
Solution: Exponential backoff (4s to 64s, max 5 retries) on both CrewAI kickoff and LiteLLM completion calls. Pipeline state persisted so runs survive interruptions.
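The retry policy above (4s doubling to 64s, at most 5 retries) fits in a small decorator. The injectable `sleep` and the added jitter are assumptions for testability, not necessarily what the real pipeline does:

```python
import functools
import random
import time

def with_backoff(fn=None, *, base=4.0, cap=64.0, max_retries=5,
                 retry_on=(Exception,), sleep=time.sleep):
    """Retry a flaky call with capped exponential backoff.

    Delays grow base * 2**attempt up to `cap`; after `max_retries`
    failed retries the last exception propagates.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise
                    delay = min(cap, base * (2 ** attempt))
                    sleep(delay + random.uniform(0, 1))  # small jitter
        return wrapper
    return decorate(fn) if fn else decorate
```

Wrapping both the CrewAI kickoff and the LiteLLM completion calls in the same decorator keeps the retry behavior consistent across the two call paths.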
Accomplishments We're Proud Of
- Built a fully working multi-agent system with real tool execution (not simulated)
- Combined AST-level static analysis with agent reasoning in a hybrid pipeline
- Created a new UX paradigm: courtroom simulation for code review
- Implemented end-to-end pipeline from raw code upload to structured verdict + PDF report
- Achieved resilient execution with retry logic + pipeline persistence + resume support
- 8 specialized agents, each with defined roles, tools, and domain expertise
- Interactive Expert Witness Q&A -- ask follow-up questions after the verdict with an agent that retrieves specific evidence on demand
What We Learned
- Agent systems only work when tools are reliable -- we had to build our own ReACT loop
- Deterministic signals are critical to ground LLM reasoning and prevent hallucinations
- Adversarial setups (prosecutor vs. defense debate) produce higher-quality outputs than single-agent analysis
- Context retrieval (tools) beats large context windows (dumping everything into the prompt)
What's Next
- Support for more languages (Java, Go, Rust)
- GitHub integration (analyze PRs automatically)
- Team dashboards for agencies and clients
- Fine-tuned risk scoring based on real-world incidents
- SaaS platform for freelancer-client trust
Impact
CodeTribunal reduces one of the biggest trust gaps in tech: clients don't know if code is good.
This system enables:
- Better hiring decisions
- Safer software delivery
- Accountability in freelance work
Links
- Live Demo: https://huggingface.co/spaces/amine-yagoub/CodeTribunal
- Source Code: https://github.com/amineyagoub/CodeTribunal
- Built with GLM 5.1 for the Build with GLM 5.1 Hackathon
Built With
- crewai
- glm-5.1
- gradio
- gritql
- python