GOVWORLD

Inspiration

Every infrastructure decision a government makes starts as a row in a spreadsheet somewhere: a road gets widened, a budget gets approved, a contract gets signed. But the people who actually live and work on that street can become an afterthought.

We kept asking one question:

What if decision-makers could see how a policy affects real people before a single brick is laid?

That became GOVWORLD: a human-first infrastructure simulator.

Instead of treating roads, budgets, and contracts as the final output, we treat them as inputs and simulate their effects on people’s daily lives.

What it does

GOVWORLD is a SimCity-style policy simulator that lets users place an infrastructure proposal into a living city of AI citizens, observe its consequences over time, hear competing expert perspectives, and identify accountability risks before public money moves.

Our demo models San Francisco’s Van Ness Avenue Complete Streets project, a roughly $45M, 18-month infrastructure effort.

A living population

The city contains around 50 AI citizens. Each has a name, job, family situation, income, concerns, aspirations, and a daily route through the neighborhood.

Users can click on citizens such as:

Jasmine, an SFMTA bus driver
Tony, a restaurant owner
Earl, a retired resident with diabetes and no car

They can then speak with these citizens by voice to understand how construction affects their lives.

An adversarial policy council

A council of ten AI expert personas, including an economist, transit engineer, climate analyst, community advocate, lawyer, and corruption watchdog, debates the proposed policy.

Each expert researches relevant live information, forms a distinct perspective, and argues through synthesized speech.

The goal is not to create fake agreement. The goal is to surface tradeoffs decision-makers may otherwise miss.

A 12-month city simulation

A generative-agent director advances the city through a 12-month timeline. It introduces realistic disruptions such as:

Utility conflicts
Weather delays
Contractor fraud
Permit challenges
Legal injunctions

As events unfold, each citizen’s wellbeing indicator shifts from green to amber to red, showing how one infrastructure decision can ripple through jobs, mobility, health access, local business revenue, and trust in government.

An accountability ledger

GOVWORLD tracks contractors, budget lines, project milestones, delays, and cost overruns. It flags potential risks before they become invisible problems buried in project documentation.
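
To make the flagging idea concrete, here is a minimal sketch of how a ledger check could work. Everything in it is an assumption for illustration: the LedgerLine shape, the flagRisks function, and the 10% overrun tolerance are hypothetical and are not taken from the GOVWORLD codebase.

```typescript
// Hypothetical ledger row; field names are illustrative assumptions.
interface LedgerLine {
  contractor: string;
  item: string;
  budgetUsd: number;
  spentUsd: number;
  dueMonth: number;               // month the milestone was planned for
  completedMonth: number | null;  // null while the milestone is still open
}

type Flag = { item: string; contractor: string; kind: "cost_overrun" | "delay" };

// Flags lines whose spend exceeds budget by more than the tolerance,
// and milestones that finished late or are overdue at currentMonth.
function flagRisks(
  lines: LedgerLine[],
  currentMonth: number,
  overrunTolerance = 0.1 // assumed threshold, not a project setting
): Flag[] {
  const flags: Flag[] = [];
  for (const line of lines) {
    if (line.spentUsd > line.budgetUsd * (1 + overrunTolerance)) {
      flags.push({ item: line.item, contractor: line.contractor, kind: "cost_overrun" });
    }
    const finishedLate = line.completedMonth !== null && line.completedMonth > line.dueMonth;
    const overdue = line.completedMonth === null && currentMonth > line.dueMonth;
    if (finishedLate || overdue) {
      flags.push({ item: line.item, contractor: line.contractor, kind: "delay" });
    }
  }
  return flags;
}
```

A check like this can run after every simulated month, so a flag appears as soon as a line crosses a threshold rather than at project close-out.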

How we built it

GOVWORLD is a browser-first application built with:

React 18, Vite, and TypeScript for the application
Zustand for state management
Leaflet and OpenStreetMap for the interactive city map
react-three-fiber for the 3D debate arena
A centralized llm.ts routing layer so models can be swapped by task

We used different AI providers based on what each task required:

Task	Technology
Citizen profiles, reactions, and council arguments	Gemini 2.5 Flash
Real-time voice chat and debate rebuttals	Groq / Llama 3.3 70B
Expert debate speech	Deepgram Aura-2
Live expert web research	Browserbase
LLM-as-judge evaluations	Anthropic Claude
LLM observability	Arize Phoenix
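
As an illustration of task-based routing, the sketch below maps each task to a provider and lets a caller override a route without touching call sites. It is not the actual llm.ts: the Task names, the Route shape, and routeFor are hypothetical, and the model strings are display labels from the table above rather than real API model IDs.

```typescript
// Hypothetical task names; the real routing layer may slice tasks differently.
type Task =
  | "citizenProfile"
  | "councilArgument"
  | "voiceChat"
  | "debateRebuttal"
  | "judgeEval";

interface Route {
  provider: "gemini" | "groq" | "anthropic";
  model: string; // human-readable label only, not a real model identifier
}

// Default routes, mirroring the task/technology table.
const ROUTES: Record<Task, Route> = {
  citizenProfile: { provider: "gemini", model: "Gemini 2.5 Flash" },
  councilArgument: { provider: "gemini", model: "Gemini 2.5 Flash" },
  voiceChat: { provider: "groq", model: "Llama 3.3 70B" },
  debateRebuttal: { provider: "groq", model: "Llama 3.3 70B" },
  judgeEval: { provider: "anthropic", model: "Claude" },
};

// Resolve the route for a task, preferring an override when one is given.
function routeFor(task: Task, overrides: Partial<Record<Task, Route>> = {}): Route {
  return overrides[task] ?? ROUTES[task];
}
```

Keeping the table in one place means swapping a model for one task is a one-line change, which is the property the routing layer is meant to provide.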

Our simulation engine is inspired by Stanford’s Generative Agents research. Citizens retain memory streams, and the director retrieves relevant memories using a weighted combination of recency, importance, and relevance.

score(memory) =
  recency_weight × recency(memory)
  + importance_weight × importance(memory)
  + relevance_weight × relevance(memory)

This lets past events meaningfully shape future citizen reactions instead of making each simulation step feel disconnected.
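
The scoring formula can be sketched in a few lines. This is a simplified illustration, not our production code: the Memory shape, the function names, and the default decay of 0.995 per simulated hour are assumptions (the decay value follows the Generative Agents paper), and importance is scaled by a plain division by 10 instead of the min-max normalization the paper describes.

```typescript
// Hypothetical memory record; field names are illustrative.
interface Memory {
  text: string;
  lastAccessedHour: number; // simulation hour when the memory was last touched
  importance: number;       // 1..10 rating
  embedding: number[];      // vector used for relevance
}

interface Weights { recency: number; importance: number; relevance: number }

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Weighted sum of recency, importance, and relevance, each roughly in [0, 1].
function scoreMemory(m: Memory, query: number[], nowHour: number, w: Weights, decay = 0.995): number {
  const recency = Math.pow(decay, Math.max(0, nowHour - m.lastAccessedHour));
  const importance = m.importance / 10;
  const relevance = cosine(m.embedding, query);
  return w.recency * recency + w.importance * importance + w.relevance * relevance;
}

// Return the k highest-scoring memories without mutating the input array.
function retrieve(memories: Memory[], query: number[], nowHour: number, w: Weights, k = 3): Memory[] {
  return [...memories]
    .sort((a, b) => scoreMemory(b, query, nowHour, w) - scoreMemory(a, query, nowHour, w))
    .slice(0, k);
}
```

With equal weights, a recent, important, on-topic memory outranks an old, minor, off-topic one, which is what lets an earlier disruption resurface when a citizen reacts to a later event.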

We also built the project with Claude Code as our pair-programmer.

How we used Arize Phoenix

We traced every council argument in Arize Phoenix and evaluated responses with a Claude-based judge on two criteria:

Whether the expert cited named sources
Whether the argument was coherent, specific, and evidence-based

The evaluation exposed a real failure mode. Without an explicit instruction, our experts cited named sources 0% of the time and relied on vague claims.

We added a requirement that each expert cite at least two named sources, then re-ran the evaluation.

Citation rate: 0% → 100%

That loop of tracing, evaluating, fixing, and re-evaluating made the system measurably more accountable instead of relying on whether the outputs merely “felt good.”
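
Once the judge has labeled each argument, the metric itself is a simple aggregation. The sketch below is illustrative only: the JudgeVerdict shape and both function names are hypothetical, and in our setup the judging runs in Python through Phoenix, so this TypeScript version just shows the arithmetic and how a pass/fail gate could sit on top of it.

```typescript
// Hypothetical shape of one judged council argument.
interface JudgeVerdict {
  expert: string;
  citesNamedSources: boolean; // judge criterion 1
  evidenceBased: boolean;     // judge criterion 2
}

// Fraction of arguments that cited named sources (0 for an empty batch).
function citationRate(verdicts: JudgeVerdict[]): number {
  if (verdicts.length === 0) return 0;
  const cited = verdicts.filter((v) => v.citesNamedSources).length;
  return cited / verdicts.length;
}

// Gate a prompt change: pass only if the citation rate meets the minimum.
function passesCitationGate(verdicts: JudgeVerdict[], minRate = 1.0): boolean {
  return citationRate(verdicts) >= minRate;
}
```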

Challenges we ran into

Connecting browser-side AI calls with Python evaluation tooling

Arize Phoenix evaluations are Python-based, while our application runs in the browser.

We handled this through browser-side OpenTelemetry instrumentation and a CORS-enabled tracing workflow, without exposing API keys to users.

Provider reliability

During development, Groq was blocked on our network and Gemini’s free tier was heavily rate-limited.

We made our model routing provider-agnostic and moved the evaluation judge to Claude so our evaluation loop could continue.
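
One way to make routing tolerate a blocked or rate-limited provider is an ordered fallback chain. The sketch below assumes a hypothetical Provider interface with a single complete method; it is not our actual routing code, and real provider SDK calls would be wrapped behind that interface.

```typescript
// Hypothetical provider abstraction: any backend that turns a prompt into text.
type Provider = {
  name: string;
  complete: (prompt: string) => Promise<string>;
};

// Try providers in order; return the first success, or throw with all errors.
async function completeWithFallback(
  providers: Provider[],
  prompt: string
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      const text = await p.complete(prompt);
      return { provider: p.name, text };
    } catch (err) {
      // Record the failure (network block, rate limit, ...) and move on.
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

Because callers only see the returned provider name and text, the order of the chain can change per network without touching the rest of the app.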

Making each simulation rerun genuinely different

Our first rerolls changed the edge-case panel but did not meaningfully affect the main story.

We expanded the event pool from 10 to 28 scenarios, sampled 5 to 7 seeded events per run, and injected them directly into the simulation timeline.

Now each run produces a different but coherent city story.
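
Seeded sampling is what makes a run both different and reproducible. The sketch below is a minimal version under stated assumptions: mulberry32 is a well-known small PRNG that we use here for illustration, while sampleTimeline, the ScheduledEvent shape, and the uniform month assignment are hypothetical and not our exact implementation.

```typescript
// mulberry32: a tiny deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface ScheduledEvent<T> { month: number; event: T }

// Pick between min and max distinct events from the pool and place each
// on a month of the 12-month timeline. The same seed gives the same story.
function sampleTimeline<T>(pool: T[], seed: number, min = 5, max = 7): ScheduledEvent<T>[] {
  const rand = mulberry32(seed);
  const count = Math.min(pool.length, min + Math.floor(rand() * (max - min + 1)));
  const shuffled = [...pool];
  // Fisher-Yates shuffle driven by the seeded generator.
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled
    .slice(0, count)
    .map((event) => ({ month: 1 + Math.floor(rand() * 12), event }))
    .sort((a, b) => a.month - b.month);
}
```

Storing the seed alongside a run is enough to replay it exactly, which helps when comparing how the same events play out under a changed policy.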

Demo reliability

Hackathon judges should not have to watch loading spinners.

We precomputed and cached all AI outputs for the primary demo scenario, allowing the full experience to run with zero live API calls if the network fails.
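
The cache-first pattern behind this can be sketched as a thin wrapper around any generation call. The names here (Generate, makeCachedGenerator) and the string cache keys are hypothetical; the sketch only shows the lookup order, with live calls as an optional fallback that can be disabled entirely for an offline demo.

```typescript
// Any function that produces an AI output for a given key (hypothetical).
type Generate = (key: string) => Promise<string>;

// Wrap a live generator with a precomputed cache. Cached keys never trigger
// a live call; passing live = null runs fully offline.
function makeCachedGenerator(cache: Map<string, string>, live: Generate | null): Generate {
  return async (key) => {
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    if (live === null) {
      throw new Error(`No cached output for "${key}" and live calls are disabled`);
    }
    const fresh = await live(key);
    cache.set(key, fresh); // remember the result for later reruns
    return fresh;
  };
}
```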

What we learned

Building generative agents is not just about creating personalities. It is about modeling memory, reflection, cascading consequences, and compounding failures.
LLM observability turns “this response feels weak” into an issue that can be measured, fixed, and re-tested.
Multi-provider AI routing improves resilience, but every external dependency needs a fallback plan.
The hardest part of simulating a city is not rendering the map. It is making every citizen feel like a person whose life deserves consideration.

What’s next

Scale from approximately 50 citizens to 1,000 through our prototype social-opinion swarm pipeline.
Allow governments, researchers, and communities to upload a real proposed policy for any neighborhood.
Connect the Arize evaluation loop to CI so every prompt change is automatically tested for evidence quality, safety, and reasoning.
Expand the accountability ledger into a more robust early-warning system for delays, overruns, and procurement risks.