Inspiration
Most online debate fails at the same moment: when one side is finally forced to engage with the opponent's strongest argument. It never happens. You get two people talking past each other, each citing the sources that confirm what they already believe. We wanted to build something that forces the confrontation. Counterpoint generates a live, structured debate on any topic — where each agent has to actually respond to what the other just said, using evidence it pulled from the web that day. Not a summary. A fight.
What it does
Counterpoint spins up two browser agents that independently research live web sources for opposite sides of any topic — then stages a structured Oxford-style debate between them, streamed token by token, grounded in evidence retrieved minutes ago. Users enter a topic. Two research agents hit the web simultaneously. Three AI agents — a Moderator, Debater A, and Debater B — then argue through ten structured turns: openings, rebuttals, rapid-fire, and a neutral closing summary. Every claim is tied to a source. Every rebuttal has to reckon with what was just said.
Unlike traditional research tools, Counterpoint doesn't just present viewpoints — it shows how they actively respond, challenge, and refine each other, grounded in sources from today.
How we built it
Frontend
- Next.js 16 + React 19
- Tailwind CSS v4 + Framer Motion for animations
- TypeScript
- Server-Sent Events (SSE) for token-by-token streaming
- Supabase JS client for auth and debate history
Backend
- FastAPI + Uvicorn (async throughout)
- OpenAI gpt-4o for debate reasoning; gpt-4o-mini for summarization, fact-checking, and evidence expansion
- browser-use Cloud SDK v3 — managed stealth browser that accepts natural-language tasks and returns structured output; handles JS rendering, CAPTCHAs, and stealth automatically
- LlamaIndex — full RAG pipeline with per-agent, per-debate scoped vector retrieval
- Supabase (PostgreSQL + pgvector) — vector store for embeddings, persistent debate history, and auth
- Pydantic v2 for structured browser-use outputs
- PyJWT for authentication
AI + Agent System
The orchestrator runs a hardcoded Oxford-style STAGE_PLAN — ten turns from moderator opening through rapid-fire to closing summary. Each stage is a StageSpec that declares which agent speaks, what instruction key to inject, and whether retrieval is required. A _select_next_stage() function advances through the plan; the frontend always knows the exact phase via SSE event boundaries.
Agents receive carefully engineered system prompts structured around claim → warrant → evidence → impact/weighing. The prompts include a full logical fallacies toolkit (slippery slope, straw man, ad hominem, etc.) as deliberate rhetorical tactics, layered with Ethos/Pathos/Logos framing — deployed naturally, never labeled.
Challenges we ran into
1. Research-to-debate pipeline
The biggest challenge was getting the research-to-debate pipeline to run reliably end-to-end.
On the browser side, each agent needed live web sources before it could begin arguing. Initially, this process was sequential and took nearly three minutes before any debate content was generated. To address this, we parallelized the pro and con research using asyncio.gather, which cut the wait time roughly in half. We also introduced a 135-second hard timeout to prevent the system from stalling on slower topics. To improve user experience during this delay, we streamed a live moderator brief via server-sent events (SSE), so users could see content immediately while research was still in progress.
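The parallelization step can be sketched as follows — `run_browser_research` is a stand-in for the real browser-use cloud call, but the `asyncio.gather` + `asyncio.wait_for` shape matches what the text describes:

```python
import asyncio

RESEARCH_TIMEOUT_S = 135  # hard cap so a slow topic can't stall the pipeline

async def run_browser_research(topic: str, stance: str) -> list[str]:
    # Stand-in for the browser-use cloud task; returns evidence snippets.
    await asyncio.sleep(0.01)
    return [f"{stance} evidence about {topic}"]

async def research_both_sides(topic: str) -> tuple[list[str], list[str]]:
    """Launch pro and con research concurrently; run sequentially, the same
    two tasks take roughly twice as long."""
    pro, con = await asyncio.wait_for(
        asyncio.gather(
            run_browser_research(topic, "pro"),
            run_browser_research(topic, "con"),
        ),
        timeout=RESEARCH_TIMEOUT_S,
    )
    return pro, con
```

The moderator brief streams over SSE during this window, so the user sees content while both research tasks are still in flight.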
However, the more difficult issue emerged downstream in the generation phase. When prompts were too open-ended, agents produced responses that were overly long or structurally ambiguous. This frequently pushed generation beyond LLM timeout limits. When a turn timed out, the retrieval-augmented generation (RAG) step that preceded it was effectively wasted, and agents would fall back to model memory instead of grounded evidence. This failure mode was silent but critical, as it undermined the entire purpose of the system.
The solution required changes at both levels of the pipeline. On the research side, we maintained parallelized browser retrieval. On the generation side, we introduced strict, role-specific prompting: separate, tightly constrained prompts for openings, rebuttals, and rapid-fire rounds. These prompts enforced clear structure and length limits, ensuring that responses stayed within timeout bounds and that RAG outputs were consistently utilized.
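A sketch of what tightly constrained, role-specific prompting could look like — the instruction text and word limits below are invented for illustration; only the pattern (separate instructions per turn type, explicit length caps) comes from the write-up:

```python
# Per-stage instructions with hard length limits; open-ended prompts produced
# overlong, ambiguous responses that blew past LLM timeouts and silently
# dropped the RAG evidence.
TURN_INSTRUCTIONS = {
    "opening": (
        "State your claim, warrant it, cite one retrieved source, and weigh "
        "the impact. Hard limit: 180 words."
    ),
    "rebuttal": (
        "Quote the strongest sentence from the previous turn, attack its "
        "warrant or evidence, then re-weigh. Hard limit: 150 words."
    ),
    "rapid_fire": (
        "One pointed question or answer. Hard limit: 40 words."
    ),
}

def build_prompt(stage_key: str, evidence: list[str]) -> str:
    """Compose one turn's system prompt from its stage instruction plus the
    retrieved evidence, so the RAG output is always in the context window."""
    sources = "\n".join(f"- {snippet}" for snippet in evidence)
    return f"{TURN_INSTRUCTIONS[stage_key]}\n\nRetrieved evidence:\n{sources}"
```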
The key takeaway is that in a multi-agent system, prompt design is not just about output quality—it directly impacts system reliability. Poorly constrained prompts can cascade into failures across downstream components, effectively disabling critical parts of the pipeline.
2. Streaming captions landing on the wrong speaker
With 10 sequential debate turns streaming token-by-token, we encountered a race condition where tokens arrived from the backend faster than the frontend could transition between speakers. As a result, Agent B’s words would sometimes appear in Agent A’s speech bubble, or entire turns would render out of order.
The root issue was that the frontend had no reliable way to associate incoming tokens with a specific turn. It only knew the agent and the text, but not which turn the token belonged to. We resolved this by implementing a proper turn buffer system. Incoming tokens are now queued and keyed by turn number. The frontend only advances to the next speaker once the current turn’s animation has fully completed. Additionally, every token now carries its turn index from the backend. This ensures that even if tokens arrive early or out of sync, they are always rendered in the correct order and assigned to the correct speaker.
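The fix reduces to two pieces: every token event carries its turn index, and tokens are buffered per turn rather than applied immediately. The buffer lives in the frontend; here it is sketched in Python alongside the backend event format, with all names hypothetical:

```python
import json
from collections import defaultdict

def token_event(turn_index: int, agent: str, token: str) -> str:
    """Format one SSE data frame; the turn index lets the client queue tokens
    by turn instead of guessing which speech bubble they belong to."""
    payload = {"turn": turn_index, "agent": agent, "token": token}
    return f"data: {json.dumps(payload)}\n\n"

class TurnBuffer:
    """Queue incoming tokens keyed by turn number and release one turn at a
    time, so early or out-of-order arrivals never leak into the wrong turn."""
    def __init__(self) -> None:
        self.buffers: dict[int, list[str]] = defaultdict(list)
        self.current_turn = 0

    def push(self, turn: int, token: str) -> None:
        self.buffers[turn].append(token)

    def flush_current(self) -> str:
        # Called only after the current turn's animation has fully completed.
        text = "".join(self.buffers.pop(self.current_turn, []))
        self.current_turn += 1
        return text
```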
Accomplishments that we're proud of
- Designed a complete adversarial RAG pipeline: browser-use → LlamaIndex → Supabase pgvector, with per-agent, per-debate scoped retrieval so each debater only draws from its own research
- Cut pipeline latency from ~3 minutes to under 60 seconds through parallelized browser research and SSE streaming — the user sees live moderator commentary while agents are still gathering sources
- Built a Moderator agent that actively improves debate quality mid-debate: it identifies unresolved clashes, asks testable follow-up questions, and closes with a sourced neutral summary
- Engineered prompt architecture around claim → warrant → evidence → impact/weighing per turn, with separate TURN_INSTRUCTIONS per stage, plus an AgentRetrievalMemory that tracks cited sources across turns so arguments build breadth instead of repeating
- Built three independent fallback layers so the system degrades gracefully rather than failing: URL sanitization, gpt-4o-mini snippet expansion for thin scrapes, and step URL harvesting from browser-use progress streams
What we learned
- System design matters more than model choice — parallelizing browser-use and adding streaming UI tricks had a bigger latency impact than any model upgrade
- Multi-agent systems require strict orchestration — a hardcoded state machine with explicit stage boundaries is more reliable than emergent turn-taking
- Real-time UX dramatically improves perceived intelligence — streaming the moderator brief during research masked the wait and made the system feel alive
- RAG is only as good as your ingestion pipeline — we spent more engineering time on URL sanitization, snippet expansion, and fallback harvesting than on the retrieval query itself
- Prompts need structure, not just instructions — claim → warrant → evidence → impact/weighing per turn, with different instructions per turn type, produced far better debates than a single general system prompt
What's next for Counterpoint
The most immediate use case we're building toward is a browser extension: highlight any claim in an article, and Counterpoint generates a debate around it in a sidebar — sourced, structured, and live. One click from confirmation bias to confrontation. Beyond that:
- Exportable debate summaries formatted for academic use — APUSH, philosophy, law, policy
- Debate modes: historical figures arguing across time, policy simulations with real legislative context, Socratic dialogue for single-concept deep dives
- Live Moderator fact-checking with inline citation badges, surfaced during the debate rather than after
- Side-by-side comparison of multiple debates on the same topic over time
The long-term goal is a shift in default behavior: when you encounter a controversial claim, your first instinct is to generate a debate — not scroll through comments.
Built With
- browser-use
- fastapi
- llama-index
- nextjs
- openai
- supabase