All Rise — About the Project

Inspiration

Most AI demos are cooperative. You ask, it helps. The interaction is flat — the AI defers to you, agrees with you, and exists to make your life easier.

We wanted to build something different: an AI that argues back.

The idea came from a simple frustration. We've all been in arguments we lost not because we were wrong, but because we weren't prepared — we didn't see the attack coming, we repeated ourselves, we used weak evidence. What if you could practice adversarial thinking against an opponent that never gets tired, never lets a logical gap slide, and remembers every mistake you've made?

The courtroom format was an obvious fit. It has structure (phases, turns, a verdict), clear stakes (you can lose), and enough absurdity in the premise to make it genuinely fun. Nobody wants to practice arguing under pressure. But everyone will defend themselves against the charge of "Unlicensed Philosophy — deploying 'but what even is reality?' at a neighborhood barbecue without a PhD."

What We Built

All Rise is a fully realized AI courtroom simulation. You are the defendant. The charge is absurd. You must defend yourself across five structured phases while:

Reginald P. Harrington III (the Prosecutor) invents evidence, exploits your logical gaps, and escalates pressure each round — adapting his strategy based on everything you've said
Judge Constance Virtue watches in silence, accumulates scores round by round, then delivers a structured verdict at the end
The Strategist offers tactical hints on demand — without playing the trial for you

The trial takes ~4 minutes. There is no guaranteed outcome. You can lose.

How We Built It

Three Real Agents, Not Three Named API Calls

The first architectural decision — and the one everything else depends on — was making the agents genuinely agentic rather than just prompts with personality names.

A naive implementation sends the same system prompt every round. The prosecutor has no memory of what it said in round 1 when it's in round 3. It repeats evidence. It re-opens closed arguments. The simulation falls apart.

The solution is per-trial agent memory stored in the LiveKit agent process:

$$\text{TrialMemory} = \{\text{prosecutorMemory},\ \text{judgeMemory},\ \text{defenseMemory},\ \text{fullTranscript}\}$$

Each trialId (a UUID generated at startTrial()) maps to its own memory object in a Map<trialId, TrialMemory>. Memory is written after every round and read before every prompt is assembled.
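A minimal sketch of what that per-trial store could look like. Only the four top-level field names come from the description above; the inner fields, the starting strategy string, and the helper names `createTrialMemory` and `getTrialMemory` are assumptions made for illustration.

```typescript
// Hypothetical inner shapes: the write-up names only the four top-level fields.
interface ProsecutorMemory {
  usedEvidence: string[];
  attackStrategy: string;
}
interface JudgeMemory {
  roundSummaries: string[];
  fallacies: string[];
}
interface DefenseMemory {
  hintsGiven: string[];
}
interface TrialMemory {
  prosecutorMemory: ProsecutorMemory;
  judgeMemory: JudgeMemory;
  defenseMemory: DefenseMemory;
  fullTranscript: string[];
}

// One entry per trial, held in process memory (no database).
const trials = new Map<string, TrialMemory>();

// Assumed helper: creates and registers an empty memory object for a new trial.
function createTrialMemory(trialId: string): TrialMemory {
  const memory: TrialMemory = {
    prosecutorMemory: { usedEvidence: [], attackStrategy: "probe for contradictions" },
    judgeMemory: { roundSummaries: [], fallacies: [] },
    defenseMemory: { hintsGiven: [] },
    fullTranscript: [],
  };
  trials.set(trialId, memory);
  return memory;
}

// Assumed helper: read before every prompt is assembled; fails loudly on an unknown trial.
function getTrialMemory(trialId: string): TrialMemory {
  const memory = trials.get(trialId);
  if (memory === undefined) {
    throw new Error(`No memory for trial ${trialId}`);
  }
  return memory;
}
```

Because the map lives in the agent process, memory disappears when the process restarts, which fits a trial that lasts a few minutes.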

The Prosecutor — ReAct Loop

Every round, before generating a single word, the Prosecutor runs a fixed sequence of four tool calls (two Groq analyses, then two pure memory reads):

recallWeaknesses(defenseText)   → Groq: "what logical gaps does this expose?"
detectFallacy(defenseText)      → Groq: "did they commit a named logical fallacy?"
recallAttackStrategy(memory)    → pure read: current strategy string
getUnusedEvidence(memory)       → pure read: evidence types not yet cited

The results are injected into the prompt as structured context. The Prosecutor then generates its cross-examination using that analysis — not a blank slate. After responding, it writes back: what evidence it used, what weakness it exploited, a one-sentence round summary for the Judge's memory, and an updated attack strategy.
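The two pure-read tools, the context injection, and the write-back could be sketched roughly as follows. The state shape, the evidence catalogue, and the names `buildProsecutorContext` and `recordRound` are assumptions; the two Groq-backed tools are represented only by their results so the sketch needs no network, and the round summary written for the Judge is left out.

```typescript
// Hypothetical memory shape; the field names are assumptions.
interface ProsecutorState {
  attackStrategy: string;
  usedEvidence: string[];
  exploitedWeaknesses: string[];
}

// Assumed catalogue of evidence types the Prosecutor can draw on.
const EVIDENCE_TYPES = ["witness statement", "security footage", "receipt", "expert testimony"];

// Pure read: the current strategy string.
function recallAttackStrategy(state: ProsecutorState): string {
  return state.attackStrategy;
}

// Pure read: evidence types not yet cited in this trial.
function getUnusedEvidence(state: ProsecutorState): string[] {
  return EVIDENCE_TYPES.filter((type) => state.usedEvidence.indexOf(type) === -1);
}

// Results of the two LLM-backed tools, passed in as plain data.
interface Analysis {
  weaknesses: string[];
  fallacy: string | null;
}

// Assemble the structured context that gets injected into the prompt.
function buildProsecutorContext(state: ProsecutorState, analysis: Analysis): string {
  return [
    `Weaknesses found in the latest defense: ${analysis.weaknesses.join("; ") || "none"}`,
    `Named fallacy: ${analysis.fallacy ?? "none"}`,
    `Current strategy: ${recallAttackStrategy(state)}`,
    `Unused evidence types: ${getUnusedEvidence(state).join(", ")}`,
  ].join("\n");
}

// Write-back after the response: returns a new state instead of mutating the old one.
function recordRound(
  state: ProsecutorState,
  evidenceUsed: string,
  weaknessExploited: string,
  nextStrategy: string
): ProsecutorState {
  return {
    attackStrategy: nextStrategy,
    usedEvidence: state.usedEvidence.concat(evidenceUsed),
    exploitedWeaknesses: state.exploitedWeaknesses.concat(weaknessExploited),
  };
}
```

Keeping the reads pure means the only nondeterminism in a round comes from the two model calls and the generation itself.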

This means by round 3, the Prosecutor knows:

What you argued in rounds 1 and 2
Which of its attacks landed and which you deflected
What evidence it has already deployed (and therefore cannot repeat)
What your rhetorical tendencies are

The Judge — Chain of Thought with Structured Output

The Judge runs exactly once — after your closing argument — with access to the full trial memory and transcript. We used GPT-4o with strict JSON schema (structured outputs mode) rather than Groq here because the verdict is the one moment where JSON correctness matters more than speed.

Before the LLM call, three tools run:

tallyFallacies(memory)           → all logged fallacies from every round
computeScores(memory)            → weighted average of per-round scores
checkVerdictConsistency(scores)  → pre-check: what do the numbers imply?

Score weighting across rounds:

$$\text{finalScore}_d = 0.15 \cdot r_1 + 0.20 \cdot r_2 + 0.25 \cdot r_3 + 0.40 \cdot r_{\text{closing}}$$

where $d \in \{\text{strength, evidence, logic, persuasion}\}$ and $r_i$ is the round $i$ score for that dimension.

The closing argument counts for 40% of the final score. First impressions matter less than last words.
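The weighting can be written out directly in code. This is a sketch that assumes each round stores one number per dimension; the type names and the `TrialRounds` shape are illustrative, and only the weights and the name `computeScores` come from the text.

```typescript
type Dimension = "strength" | "evidence" | "logic" | "persuasion";
type RoundScores = Record<Dimension, number>;

// Assumed container for the four scored rounds.
interface TrialRounds {
  round1: RoundScores;
  round2: RoundScores;
  round3: RoundScores;
  closing: RoundScores;
}

// Applies the 0.15 / 0.20 / 0.25 / 0.40 weights to each dimension independently.
function computeScores(rounds: TrialRounds): RoundScores {
  const weigh = (d: Dimension): number =>
    0.15 * rounds.round1[d] +
    0.2 * rounds.round2[d] +
    0.25 * rounds.round3[d] +
    0.4 * rounds.closing[d];
  return {
    strength: weigh("strength"),
    evidence: weigh("evidence"),
    logic: weigh("logic"),
    persuasion: weigh("persuasion"),
  };
}
```

Since the weights sum to 1, a defendant who scores the same in every round ends with that same score, and any difference between rounds is pulled toward the closing argument.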

Verdict threshold:

$$\text{verdict} = \begin{cases} \text{Not Guilty} & \text{if } \sum_d \text{finalScore}_d \geq 24 \\ \text{Guilty} & \text{otherwise} \end{cases}$$

The Judge can override this threshold if the argument was genuinely exceptional in either direction.
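One way to express the threshold and make an override visible is sketched below. The threshold of 24 and the name `checkVerdictConsistency` come from the text; the `flagOverride` helper and the result shape are hypothetical, added only to show how an implied verdict and the Judge's actual verdict could be compared.

```typescript
type Verdict = "Not Guilty" | "Guilty";

interface FinalScores {
  strength: number;
  evidence: number;
  logic: number;
  persuasion: number;
}

const NOT_GUILTY_THRESHOLD = 24;

// Pre-check: what do the numbers alone imply?
function checkVerdictConsistency(scores: FinalScores): Verdict {
  const total = scores.strength + scores.evidence + scores.logic + scores.persuasion;
  return total >= NOT_GUILTY_THRESHOLD ? "Not Guilty" : "Guilty";
}

// Hypothetical helper: records whether the Judge departed from the implied verdict.
interface VerdictCheck {
  implied: Verdict;
  actual: Verdict;
  overridden: boolean;
}

function flagOverride(scores: FinalScores, judgeVerdict: Verdict): VerdictCheck {
  const implied = checkVerdictConsistency(scores);
  return { implied, actual: judgeVerdict, overridden: implied !== judgeVerdict };
}
```

Running the pre-check before the model call lets the prompt state what the scores imply, so an override is a deliberate decision rather than an arithmetic slip.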

Voice — Three Modes

We built three interaction modes so the app works for any environment:

Text mode — type your defense, with an optional mic button that fills the textarea using Web Speech API STT as you speak (non-auto-submit, so you can edit before sending).

Hybrid auto-voice — a while loop that runs until verdict: speak the prosecution's message via OpenAI TTS, auto-activate mic, capture speech with live interim transcript display, submit on silence. Everything runs in the browser — no WebRTC, no server audio.
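The shape of that loop can be sketched with the browser pieces hidden behind an interface. Everything named here (`VoiceIO`, `runHybridLoop`, the `maxTurns` guard) is an assumption for illustration; in the real app the three methods would wrap TTS playback, Web Speech API recognition, and the API call.

```typescript
// Hypothetical I/O boundary so the loop can be exercised without a browser.
interface VoiceIO {
  speak(text: string): Promise<void>;
  listenUntilSilence(): Promise<string>;
  submitDefense(text: string): Promise<{ reply: string; verdictReached: boolean }>;
}

// Speak, listen, submit, repeat until a verdict arrives (or the turn guard trips).
async function runHybridLoop(
  io: VoiceIO,
  openingStatement: string,
  maxTurns: number = 10
): Promise<string[]> {
  const defenseTranscript: string[] = [];
  let prosecution = openingStatement;
  let done = false;
  let turns = 0;

  while (!done && turns < maxTurns) {
    await io.speak(prosecution);
    const defense = await io.listenUntilSilence();
    defenseTranscript.push(defense);
    const result = await io.submitDefense(defense);
    prosecution = result.reply;
    done = result.verdictReached;
    turns += 1;
  }

  // The last reply is the one that ended the trial, so read it aloud too.
  if (done) {
    await io.speak(prosecution);
  }
  return defenseTranscript;
}
```

Injecting the I/O also makes the sequential flow easy to test with fakes, which is hard to do against real microphone and audio APIs.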

Live voice — full WebRTC via LiveKit. Your mic streams to the server-side agent. STT, LLM, and TTS all run server-side. The browser is a thin client showing a transcript and the session state.

Model Selection

| Agent | Model | Reason |
| --- | --- | --- |
| Prosecutor | Groq `llama-3.3-70b-versatile` | ~200ms response time; cross-examination needs to feel immediate |
| Judge | OpenAI `gpt-4o` | Structured JSON output reliability; runs once so latency doesn't matter |
| Defense Assistant | Groq `llama-3.3-70b-versatile` | Same speed requirement as Prosecutor |
| TTS | OpenAI `tts-1-hd` | Three distinct voices: `onyx` (Prosecutor), `shimmer` (Judge), `alloy` (Defense) |

Infrastructure

Browser (React 18 + Vite)
    │
    │  HTTP /api/*          LiveKit data channel
    ▼
API Layer (Vercel serverless)
    │
    │  LiveKit room
    ▼
Agent Process (@livekit/agents)
    ├── prosecutorAgent  (ReAct, Groq)
    ├── judgeAgent       (chain-of-thought, GPT-4o)
    └── defenseAssistant (adaptive hints, Groq)

The API layer serves as a graceful fallback — if the LiveKit agent isn't running, every call routes directly to the serverless functions. You lose per-trial memory and the ReAct loop, but the trial still works.
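A rough sketch of that routing decision, with both paths passed in as functions. The names `respondWithFallback`, `Responder`, and `RoutedReply` are hypothetical; the returned `path` field is one possible way to let the caller log which route actually served the reply.

```typescript
// Both routes take the defense text and resolve to the prosecution's reply.
type Responder = (defenseText: string) => Promise<string>;

interface RoutedReply {
  reply: string;
  path: "agent" | "serverless";
}

async function respondWithFallback(
  isAgentConnected: () => boolean,
  viaAgent: Responder,
  viaServerless: Responder,
  defenseText: string
): Promise<RoutedReply> {
  if (isAgentConnected()) {
    try {
      const reply = await viaAgent(defenseText);
      return { reply, path: "agent" };
    } catch (err) {
      // The agent path failed mid-call: fall through to the stateless route.
    }
  }
  // Stateless route: no per-trial memory and no ReAct loop, but the trial continues.
  const reply = await viaServerless(defenseText);
  return { reply, path: "serverless" };
}
```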

Challenges

Making the Prosecutor Feel Like It's Actually Tracking You

The hardest part wasn't the AI; it was making the AI feel present. An LLM with no memory produces a prosecutor that could be playing a completely different trial: it forgets what was argued in earlier rounds, it cites the same piece of evidence twice, and it attacks an argument the user never made.

The memory system fixed the first two problems. The third, the prosecutor attacking a straw man, required the recallWeaknesses tool to explicitly ground the attack in the actual defense text from that round, not a hallucinated version of it. Injecting the tool result directly into the prompt ("these specific weaknesses were found in what they just said") gave the Prosecutor the anchor it needed.

The Hybrid Voice Loop and Stale React State

The hybrid auto-voice mode runs a long-lived while loop, and any React state the loop closes over is frozen at the render that started it, so every later read is stale. The solution was refs updated by useEffect:

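import { useEffect, useRef } from 'react'
// Inside the component: mirror the latest messages into a ref the loop can read
const messagesRef = useRef(messages)
useEffect(() => { messagesRef.current = messages }, [messages])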

The loop always reads from refs. This pattern — obvious in hindsight, invisible until it breaks — cost us most of an afternoon.

The waitForNextMessage Promise parking pattern was the other subtle piece: rather than polling or busy-waiting, the loop parks by storing a Promise resolver that a useEffect fires whenever the messages array changes. Clean async coordination without any timers.
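Stripped of React, the parking idea can be sketched as a small class. The class name `MessageGate` and the `notify` method are assumptions; in the app, the equivalent of `notify()` would be called from the useEffect that watches the messages array. This sketch supports a single waiter at a time, which is enough for one sequential loop.

```typescript
class MessageGate {
  // Holds the resolver of the currently parked Promise, if any.
  private resolver: (() => void) | null = null;

  // Called by the loop: the returned Promise stays pending until notify() runs.
  waitForNextMessage(): Promise<void> {
    return new Promise<void>((resolve) => {
      this.resolver = () => resolve();
    });
  }

  // Called when a new message arrives: releases the parked loop, if one is waiting.
  notify(): void {
    const release = this.resolver;
    this.resolver = null;
    if (release !== null) {
      release();
    }
  }
}
```

The loop would then `await gate.waitForNextMessage()` after submitting a defense, and resume only once the next message has landed.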

Agent vs. Serverless — Being Honest About What's Running

Early versions had the agent wired up but never actually running in the default dev command. The trial worked because the HTTP fallback was always there. We were getting the output of a stateless API call and calling it an "agent."

The fix was simple but required honesty first: npm run dev now runs three processes concurrently — Vite, the Express API server, and the LiveKit agent worker. If the agent isn't running, the fallback chain makes it clear in the console rather than silently succeeding. The per-trial memory and ReAct loop only activate when the agent is actually connected.

Verdict JSON Reliability

Groq's Llama models occasionally produce verdict JSON with markdown code fences, trailing commas, or missing fields. The first two break JSON.parse() outright, and the third produces an object the verdict screen can't render. The solution was a two-layer approach: strip fences and retry parsing in the API handler, and use GPT-4o's native structured output mode (which enforces the schema at the model level) for the agent path. The Groq fallback still has a try/catch that returns a graceful error state rather than a blank screen.
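The cleanup layer in the API handler might look something like this sketch. The function names, the result shape, and the `REQUIRED_FIELDS` list are assumptions, since the write-up does not give the verdict schema; the trailing-comma pattern is deliberately naive and would also alter a string value that happens to contain `,}` or `,]`.

```typescript
// Three backticks, written as escapes so this example does not contain a literal fence.
const FENCE = "\u0060\u0060\u0060";

// Hypothetical required fields; the real verdict schema is not given in the write-up.
const REQUIRED_FIELDS = ["verdict", "reasoning"];

function stripCodeFences(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence line, including an optional language tag such as "json".
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) {
    text = text.slice(0, text.length - FENCE.length);
  }
  return text.trim();
}

// Remove commas that directly precede a closing brace or bracket.
function stripTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

interface VerdictParse {
  value: Record<string, unknown> | null;
  error: string | null;
}

// Try the raw text first, then retry on a cleaned copy; never throws.
function parseVerdict(raw: string): VerdictParse {
  const attempts = [raw, stripTrailingCommas(stripCodeFences(raw))];
  for (const attempt of attempts) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(attempt);
    } catch (err) {
      continue; // this attempt was not valid JSON, so try the next one
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return { value: null, error: "Verdict was not a JSON object" };
    }
    const candidate = parsed as Record<string, unknown>;
    for (const field of REQUIRED_FIELDS) {
      if (!(field in candidate)) {
        return { value: null, error: `Missing field: ${field}` };
      }
    }
    return { value: candidate, error: null };
  }
  return { value: null, error: "Verdict was not valid JSON" };
}
```

Returning an error value instead of throwing is what lets the UI show a graceful error state.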

What We Learned

Genuine adversarial AI is a different design problem than cooperative AI. Most prompt engineering literature is about making models helpful, agreeable, and safe. Making a model that argues against you (one that doesn't let logical gaps slide, that escalates rather than de-escalates, that invents plausible counter-evidence) required unlearning a lot of those instincts. The Prosecutor's system prompt was rewritten seven times before it stopped being subtly helpful.

Memory architecture is the hardest part of multi-agent systems. The models were the easy part. Deciding what each agent should remember, what it should write back, what it should share with other agents (Prosecutor's fallacy detections feed the Judge's memory), and how to scope it to a trial without a database — that took longer than any single LLM integration.

In voice applications, latency is UX. A 2-second pause in a text chat is fine. A 2-second pause after you finish speaking — when you're waiting for the Prosecutor to respond — feels like the system is broken. Groq's ~200ms inference was not a nice-to-have for this project. It was the difference between the trial feeling alive and feeling like a chatbot.

The while loop is underrated. React's useEffect-driven approach maps naturally to most UI patterns. But the hybrid voice mode is fundamentally sequential: speak, listen, submit, wait, repeat. A while loop with Promise parking maps to that flow in a way that a chain of effects never quite did. Sometimes the right abstraction isn't the idiomatic one.

Built With

React, Vite, Groq, OpenAI GPT-4o, OpenAI TTS, LiveKit, @livekit/agents, Vercel, Vitest, Tailwind CSS


Updates

Srikrishna Venkatesh started this project — May 02, 2026 12:35 PM EDT
