All Rise — About the Project
Inspiration
Most AI demos are cooperative. You ask, it helps. The interaction is flat — the AI defers to you, agrees with you, and exists to make your life easier.
We wanted to build something different: an AI that argues back.
The idea came from a simple frustration. We've all been in arguments we lost not because we were wrong, but because we weren't prepared — we didn't see the attack coming, we repeated ourselves, we used weak evidence. What if you could practice adversarial thinking against an opponent that never gets tired, never lets a logical gap slide, and remembers every mistake you've made?
The courtroom format was an obvious fit. It has structure (phases, turns, a verdict), clear stakes (you can lose), and enough absurdity in the premise to make it genuinely fun. Nobody wants to practice arguing under pressure. But everyone will defend themselves against the charge of "Unlicensed Philosophy — deploying 'but what even is reality?' at a neighborhood barbecue without a PhD."
What We Built
All Rise is a fully realized AI courtroom simulation. You are the defendant. The charge is absurd. You must defend yourself across five structured phases while:
- Reginald P. Harrington III (the Prosecutor) invents evidence, exploits your logical gaps, and escalates pressure each round — adapting his strategy based on everything you've said
- Judge Constance Virtue watches in silence, accumulates scores round by round, then delivers a structured verdict at the end
- The Strategist offers tactical hints on demand — without playing the trial for you
The trial takes ~4 minutes. There is no guaranteed outcome. You can lose.
How We Built It
Three Real Agents, Not Three Named API Calls
The first architectural decision — and the one everything else depends on — was making the agents genuinely agentic rather than just prompts with personality names.
A naive implementation sends the same system prompt every round. The prosecutor has no memory of what it said in round 1 when it's in round 3. It repeats evidence. It re-opens closed arguments. The simulation falls apart.
The solution is per-trial agent memory stored in the LiveKit agent process:
$$\text{TrialMemory} = \{\text{prosecutorMemory}, \text{judgeMemory}, \text{defenseMemory}, \text{fullTranscript}\}$$
Each trialId (a UUID generated at startTrial()) maps to its own memory object in a Map<string, TrialMemory> keyed by trialId. Memory is written after every round and read before every prompt is assembled.
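A minimal sketch of what such a per-trial store could look like. The four top-level field names come from the definition above; the `AgentMemory` shape and the helper names (`createTrialMemory`, `getTrialMemory`, `recordRound`) are assumptions made for illustration, not the project's actual code.

```typescript
// Hypothetical shape: the write-up names the four fields but not their contents.
interface AgentMemory {
  notes: string[];
}

interface TrialMemory {
  prosecutorMemory: AgentMemory;
  judgeMemory: AgentMemory;
  defenseMemory: AgentMemory;
  fullTranscript: string[];
}

// One entry per trialId. The map lives only as long as the agent process does.
const trials = new Map<string, TrialMemory>();

// Called once when a trial starts (hypothetical helper).
function createTrialMemory(trialId: string): TrialMemory {
  const memory: TrialMemory = {
    prosecutorMemory: { notes: [] },
    judgeMemory: { notes: [] },
    defenseMemory: { notes: [] },
    fullTranscript: [],
  };
  trials.set(trialId, memory);
  return memory;
}

// Read before a prompt is assembled. Throws instead of silently creating state.
function getTrialMemory(trialId: string): TrialMemory {
  const memory = trials.get(trialId);
  if (memory === undefined) {
    throw new Error("Unknown trialId: " + trialId);
  }
  return memory;
}

// Written after every round.
function recordRound(trialId: string, speaker: string, text: string): void {
  getTrialMemory(trialId).fullTranscript.push(speaker + ": " + text);
}
```

Keeping the map in process memory avoids a database, at the cost of losing every trial when the worker restarts.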
The Prosecutor — ReAct Loop
Every round, before generating a single word, the Prosecutor runs a fixed sequence of four tool calls. The first two are Groq calls; the last two are plain reads from memory:
recallWeaknesses(defenseText) → Groq: "what logical gaps does this expose?"
detectFallacy(defenseText) → Groq: "did they commit a named logical fallacy?"
recallAttackStrategy(memory) → pure read: current strategy string
getUnusedEvidence(memory) → pure read: evidence types not yet cited
The results are injected into the prompt as structured context. The Prosecutor then generates its cross-examination using that analysis — not a blank slate. After responding, it writes back: what evidence it used, what weakness it exploited, a one-sentence round summary for the Judge's memory, and an updated attack strategy.
This means by round 3, the Prosecutor knows:
- What you argued in rounds 1 and 2
- Which of its attacks landed and which you deflected
- What evidence it has already deployed (and therefore cannot repeat)
- What your rhetorical tendencies are
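The two pure-read tools and the write-back step can be sketched as follows. The `ProsecutorState` shape, `buildRoundContext`, and `writeBackRound` are hypothetical names chosen for this example, and the two LLM-backed tools are represented only by the strings they would return.

```typescript
// Hypothetical memory shape; field names are illustrative, not from the project.
interface ProsecutorState {
  attackStrategy: string;
  evidencePool: string[];        // every evidence type the Prosecutor may draw on
  usedEvidence: string[];        // evidence types already cited
  exploitedWeaknesses: string[]; // weaknesses attacked in earlier rounds
}

// Pure read: the current strategy string.
function recallAttackStrategy(state: ProsecutorState): string {
  return state.attackStrategy;
}

// Pure read: evidence types not yet cited.
function getUnusedEvidence(state: ProsecutorState): string[] {
  return state.evidencePool.filter((item) => state.usedEvidence.indexOf(item) === -1);
}

// Assemble the structured context that is injected into the prompt.
// `weaknesses` and `fallacy` stand in for the results of the two Groq-backed tools.
function buildRoundContext(state: ProsecutorState, weaknesses: string, fallacy: string): string {
  return [
    "Weaknesses found in the latest defense: " + weaknesses,
    "Fallacy detected: " + fallacy,
    "Current strategy: " + recallAttackStrategy(state),
    "Unused evidence: " + getUnusedEvidence(state).join(", "),
  ].join("\n");
}

// Write-back after the Prosecutor responds. Returns a new state object
// so the previous round's memory is never mutated in place.
function writeBackRound(
  state: ProsecutorState,
  evidenceUsed: string,
  weaknessExploited: string,
  nextStrategy: string
): ProsecutorState {
  return {
    attackStrategy: nextStrategy,
    evidencePool: state.evidencePool,
    usedEvidence: state.usedEvidence.concat([evidenceUsed]),
    exploitedWeaknesses: state.exploitedWeaknesses.concat([weaknessExploited]),
  };
}
```

Because `getUnusedEvidence` filters against what was written back, evidence deployed in one round cannot be offered to the model again in the next.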
The Judge — Chain of Thought with Structured Output
The Judge runs exactly once — after your closing argument — with access to the full trial memory and transcript. We used GPT-4o with strict JSON schema (structured outputs mode) rather than Groq here because the verdict is the one moment where JSON correctness matters more than speed.
Before the LLM call, three tools run:
tallyFallacies(memory) → all logged fallacies from every round
computeScores(memory) → weighted average of per-round scores
checkVerdictConsistency(scores) → pre-check: what do the numbers imply?
Score weighting across rounds:
$$\text{finalScore}_d = 0.15 \cdot r_1 + 0.20 \cdot r_2 + 0.25 \cdot r_3 + 0.40 \cdot r_{\text{closing}}$$
where $d \in \{\text{strength}, \text{evidence}, \text{logic}, \text{persuasion}\}$ and $r_i$ is the round $i$ score for that dimension.
The closing argument counts for 40% of the final score. First impressions matter less than last words.
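The weighting can be sketched as a small function. The weights and the four dimension names come from the formula; the type names and `computeFinalScores` are assumptions for this example, and the scale of the per-round scores is not stated in this write-up.

```typescript
type ScoreDimension = "strength" | "evidence" | "logic" | "persuasion";
type DimensionScores = { [K in ScoreDimension]: number };

const SCORE_DIMENSIONS: ScoreDimension[] = ["strength", "evidence", "logic", "persuasion"];

// Weights for rounds 1, 2, 3 and the closing argument, in that order.
const ROUND_WEIGHTS: number[] = [0.15, 0.2, 0.25, 0.4];

// Weighted sum per dimension across the four scored rounds.
function computeFinalScores(rounds: DimensionScores[]): DimensionScores {
  if (rounds.length !== ROUND_WEIGHTS.length) {
    throw new Error("Expected " + ROUND_WEIGHTS.length + " rounds, got " + rounds.length);
  }
  const finalScores: DimensionScores = { strength: 0, evidence: 0, logic: 0, persuasion: 0 };
  for (let d = 0; d < SCORE_DIMENSIONS.length; d++) {
    const dimension = SCORE_DIMENSIONS[d];
    for (let i = 0; i < rounds.length; i++) {
      finalScores[dimension] += ROUND_WEIGHTS[i] * rounds[i][dimension];
    }
  }
  return finalScores;
}
```

As a worked case, a defendant who scores 5 on strength in each of the first three rounds and 10 in the closing ends at about 7 on that dimension, since 0.15·5 + 0.20·5 + 0.25·5 + 0.40·10 = 7.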
Verdict threshold:
$$\text{verdict} = \begin{cases} \text{Not Guilty} & \text{if } \sum_d \text{finalScore}_d \geq 24 \\ \text{Guilty} & \text{otherwise} \end{cases}$$
The Judge can override this threshold if the argument was genuinely exceptional in either direction.
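A sketch of the threshold rule plus a consistency pre-check that records when the Judge departs from the numbers. The threshold of 24 is from the text; `FinalScoreCard`, `impliedVerdict`, and `describeVerdict` are hypothetical names, and how the real Judge decides to override is not specified here.

```typescript
interface FinalScoreCard {
  strength: number;
  evidence: number;
  logic: number;
  persuasion: number;
}

type VerdictLabel = "Not Guilty" | "Guilty";

const NOT_GUILTY_THRESHOLD = 24;

// What the numbers alone imply: sum the four dimensions and compare to the threshold.
function impliedVerdict(card: FinalScoreCard): VerdictLabel {
  const total = card.strength + card.evidence + card.logic + card.persuasion;
  return total >= NOT_GUILTY_THRESHOLD ? "Not Guilty" : "Guilty";
}

// Hypothetical helper: pairs the Judge's verdict with the implied one
// so an override is explicit rather than silent.
function describeVerdict(
  card: FinalScoreCard,
  judgeVerdict: VerdictLabel
): { implied: VerdictLabel; final: VerdictLabel; overridden: boolean } {
  const implied = impliedVerdict(card);
  return { implied: implied, final: judgeVerdict, overridden: implied !== judgeVerdict };
}
```

Surfacing the `overridden` flag makes it easy to show the player when the verdict went against the score.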
Voice — Three Modes
We built three interaction modes so the app works for any environment:
- Text mode: type your defense, with an optional mic button that fills the textarea using Web Speech API STT as you speak (it does not auto-submit, so you can edit before sending).
- Hybrid auto-voice: a while loop that runs until the verdict. It speaks the prosecution's message via OpenAI TTS, auto-activates the mic, captures speech with a live interim transcript display, and submits on silence. Everything runs in the browser, with no WebRTC and no server audio.
- Live voice: full WebRTC via LiveKit. Your mic streams to the server-side agent. STT, LLM, and TTS all run server-side. The browser is a thin client showing a transcript and the session state.
Model Selection
| Agent | Model | Reason |
|---|---|---|
| Prosecutor | Groq llama-3.3-70b-versatile | ~200ms response time; cross-examination needs to feel immediate |
| Judge | OpenAI gpt-4o | Structured JSON output reliability; runs once, so latency doesn't matter |
| Defense Assistant | Groq llama-3.3-70b-versatile | Same speed requirement as Prosecutor |
| TTS | OpenAI tts-1-hd | Three distinct voices: onyx (Prosecutor), shimmer (Judge), alloy (Defense) |
Infrastructure
Browser (React 18 + Vite)
│
│ HTTP /api/* LiveKit data channel
▼
API Layer (Vercel serverless)
│
│ LiveKit room
▼
Agent Process (@livekit/agents)
├── prosecutorAgent (ReAct, Groq)
├── judgeAgent (chain-of-thought, GPT-4o)
└── defenseAssistant (adaptive hints, Groq)
The API layer serves as a graceful fallback — if the LiveKit agent isn't running, every call routes directly to the serverless functions. You lose per-trial memory and the ReAct loop, but the trial still works.
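One way to express that fallback, as a sketch under stated assumptions: `submitDefense`, `RoundHandler`, and `RoundReply` are hypothetical names, and the two handlers are injected so the example does not depend on any real LiveKit or HTTP call.

```typescript
// A handler takes the defendant's text and resolves to the prosecution's reply.
type RoundHandler = (defenseText: string) => Promise<string>;

interface RoundReply {
  text: string;
  source: "agent" | "serverless"; // which path actually produced the reply
}

// Try the agent first when one is connected; otherwise, or on failure,
// route the same call to the stateless serverless handler.
async function submitDefense(
  defenseText: string,
  viaAgent: RoundHandler | null,
  viaServerless: RoundHandler
): Promise<RoundReply> {
  if (viaAgent !== null) {
    try {
      return { text: await viaAgent(defenseText), source: "agent" };
    } catch (err) {
      // Agent path failed: fall through to the serverless path below.
    }
  }
  return { text: await viaServerless(defenseText), source: "serverless" };
}
```

Returning `source` alongside the reply lets the caller log or display which path ran, so a fallback is never mistaken for the agent.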
Challenges
Making the Prosecutor Feel Like It's Actually Tracking You
The hardest part wasn't the AI — it was making the AI feel present. An LLM with no memory produces a prosecutor that could be playing a completely different trial. It might cite the same piece of evidence twice. It might attack an argument the user never made.
The memory system fixed the first two problems. The third — the prosecutor attacking a straw man — required the recallWeaknesses tool to explicitly ground the attack in the actual defense text from that round, not a hallucinated version of it. Injecting the tool result directly into the prompt ("these specific weaknesses were found in what they just said") gave the Prosecutor the anchor it needed.
The Hybrid Voice Loop and Stale React State
The hybrid auto-voice mode runs a long-lived while loop, and a long-lived loop only closes over the React state from the render in which it started. After the next render, everything it captured is stale. The solution was refs updated by useEffect:
const messagesRef = useRef(messages)
useEffect(() => { messagesRef.current = messages }, [messages])
The loop always reads from refs. This pattern — obvious in hindsight, invisible until it breaks — cost us most of an afternoon.
The waitForNextMessage Promise parking pattern was the other subtle piece: rather than polling or busy-waiting, the loop parks by storing a Promise resolver that a useEffect fires whenever the messages array changes. Clean async coordination without any timers.
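The parking idea can be shown without React. In this sketch, `MessageParker` and its method names other than `waitForNextMessage` are hypothetical; in the app, the resolver would presumably live in a ref and `notifyMessagesChanged` would be called from a useEffect that depends on the messages array.

```typescript
class MessageParker {
  // Resolver of the Promise the loop is currently awaiting, if any.
  private resolver: (() => void) | null = null;

  // Called by the loop: returns a Promise that stays pending until the next change.
  waitForNextMessage(): Promise<void> {
    return new Promise<void>((resolve) => {
      this.resolver = resolve;
    });
  }

  // Called whenever the messages array changes: wakes the parked loop, if there is one.
  notifyMessagesChanged(): void {
    const resolve = this.resolver;
    this.resolver = null;
    if (resolve !== null) {
      resolve();
    }
  }

  // True while a loop is waiting on a Promise that has not been resolved yet.
  isParked(): boolean {
    return this.resolver !== null;
  }
}
```

Inside the loop, the usage is a single `await parker.waitForNextMessage()` after submitting; no polling interval or timeout is involved.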
Agent vs. Serverless — Being Honest About What's Running
Early versions had the agent wired up but never actually running in the default dev command. The trial worked because the HTTP fallback was always there. We were getting the output of a stateless API call and calling it an "agent."
The fix was simple but required honesty first: npm run dev now runs three processes concurrently — Vite, the Express API server, and the LiveKit agent worker. If the agent isn't running, the fallback chain makes it clear in the console rather than silently succeeding. The per-trial memory and ReAct loop only activate when the agent is actually connected.
Verdict JSON Reliability
Groq's Llama models occasionally produce verdict JSON with markdown code fences, trailing commas, or missing fields — all of which break JSON.parse(). The solution was a two-layer approach: strip fences and retry parsing in the API handler, and use GPT-4o's native structured output mode (which enforces the schema at the model level) for the agent path. The Groq fallback still has a try/catch that returns a graceful error state rather than a blank screen.
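A sketch of the strip-and-retry layer with a graceful error state. The `verdict` and `reasoning` field names are assumptions (the real schema is not shown here), as are `stripCodeFences` and `parseVerdict`. Trailing commas are not repaired in this sketch; they fall through to the error state.

```typescript
// Three backticks, split so this sketch can sit inside a fenced block.
const FENCE = "``" + "`";

interface VerdictPayload {
  verdict: string;   // hypothetical field
  reasoning: string; // hypothetical field
}

interface VerdictParse {
  ok: boolean;
  value: VerdictPayload | null;
  error: string | null;
}

// Remove a leading fence line (with optional language tag) and a trailing fence.
function stripCodeFences(raw: string): string {
  let text = raw.trim();
  if (text.indexOf(FENCE) === 0) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  const closing = text.lastIndexOf(FENCE);
  if (closing !== -1 && text.slice(closing).trim() === FENCE) {
    text = text.slice(0, closing);
  }
  return text.trim();
}

// Try the raw text first, then the fence-stripped text. Never throws.
function parseVerdict(raw: string): VerdictParse {
  const attempts = [raw, stripCodeFences(raw)];
  for (let i = 0; i < attempts.length; i++) {
    try {
      const parsed = JSON.parse(attempts[i]);
      const complete =
        parsed !== null &&
        typeof parsed === "object" &&
        typeof parsed.verdict === "string" &&
        typeof parsed.reasoning === "string";
      if (!complete) {
        return { ok: false, value: null, error: "Verdict JSON is missing required fields" };
      }
      return { ok: true, value: { verdict: parsed.verdict, reasoning: parsed.reasoning }, error: null };
    } catch (err) {
      // Not valid JSON in this form: try the next attempt.
    }
  }
  return { ok: false, value: null, error: "Verdict was not valid JSON" };
}
```

The caller can render `error` in the verdict panel instead of a blank screen whenever `ok` is false.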
What We Learned
Genuine adversarial AI is a different design problem from cooperative AI. Most prompt engineering literature is about making models helpful, agreeable, and safe. Making a model that argues against you (one that doesn't let logical gaps slide, that escalates rather than de-escalates, that invents plausible counter-evidence) required unlearning a lot of those instincts. The Prosecutor's system prompt was rewritten seven times before it stopped being subtly helpful.
Memory architecture is the hardest part of multi-agent systems. The models were the easy part. Deciding what each agent should remember, what it should write back, what it should share with other agents (Prosecutor's fallacy detections feed the Judge's memory), and how to scope it to a trial without a database — that took longer than any single LLM integration.
In voice applications, latency is UX. A 2-second pause in a text chat is fine. A 2-second pause after you finish speaking — when you're waiting for the Prosecutor to respond — feels like the system is broken. Groq's ~200ms inference was not a nice-to-have for this project. It was the difference between the trial feeling alive and feeling like a chatbot.
The while loop is underrated. React's useEffect-driven approach maps naturally to most UI patterns. But the hybrid voice mode is fundamentally sequential: speak, listen, submit, wait, repeat. A while loop with Promise parking maps to that flow in a way that a chain of effects never quite did. Sometimes the right abstraction isn't the idiomatic one.
Built With
React, Vite, Groq, OpenAI GPT-4o, OpenAI TTS, LiveKit, @livekit/agents, Vercel, Vitest, Tailwind CSS