Inspiration
When the OSAP cuts were announced, the protests happened right outside our dorms. We saw thousands of students rally against a policy that had blindsided them. But it wasn't just something we observed from our windows; we have friends who rely on OSAP to stay in school, and overnight, their financial plans fell apart. It made us ask a simple question: did anyone model what would happen to students before this was announced? We built Civica because we believe policymakers should be able to stress-test a policy against the people it affects before it becomes a headline and a protest. Not to replace debate, but to inform it, so that the next time a decision like this is made, the risks are on the table from the start.
What it does
Civica takes any proposed Canadian policy and runs it through a two-round AI simulation. First, 8 domain specialists (a labor economist, an urban planner, a fiscal analyst, a housing market economist, and others) each analyze the policy through their own expertise to identify the risks it would create. Then, 50 demographic personas representing real Canadians across 20 cities (from young renters in Toronto to retirees in Kelowna to families in Nunavut) validate those risks against Statistics Canada data for their city. Each persona confirms which risks actually affect someone like them, rates the severity, and flags anything the specialists missed. The result is a ranked risk report that shows not just what could go wrong, but also how many different types of Canadians would feel it, where it hits hardest, and why, with every claim grounded in real government data rather than speculation.
How we built it
We started with a core question: how do you simulate the impact of a policy across an entire country without just asking one AI for its opinion? Our answer was a two-layer architecture inspired by the Delphi method from forecasting research.
The backend is Python with asyncio, letting us run dozens of AI agents in parallel. We use Backboard.ai to orchestrate calls across multiple LLM providers through a single API: GPT-4o for the specialists and coordinator, where deep reasoning matters, and Claude 3 Haiku for the 50 demographic validators, where we need breadth and speed.
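The two-round fan-out can be sketched roughly like this. This is a simplified, hypothetical version: the model names are tier labels, and `call_llm` is a stub standing in for the real provider-agnostic Backboard.ai client call.

```python
import asyncio

SPECIALIST_MODEL = "gpt-4o"         # deep reasoning: the 8 specialists
VALIDATOR_MODEL = "claude-3-haiku"  # breadth and speed: the 50 validators

async def call_llm(model: str, prompt: str) -> str:
    # Placeholder for the real Backboard.ai call to the chosen provider.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{model}] analysis of: {prompt[:40]}"

async def run_simulation(policy: str, specialists: list[str], personas: list[str]):
    # Round 1: every specialist identifies risks, all requests in flight at once.
    risks = await asyncio.gather(*[
        call_llm(SPECIALIST_MODEL, f"As a {role}, list risks of: {policy}")
        for role in specialists
    ])
    # Round 2: every demographic persona validates the pooled risks in parallel.
    confirmations = await asyncio.gather(*[
        call_llm(VALIDATOR_MODEL, f"As {p}, confirm which risks apply: {risks}")
        for p in personas
    ])
    return risks, confirmations
```

Because `asyncio.gather` keeps all calls in flight concurrently, total wall-clock time is bounded by the slowest single call per round rather than the sum of all 58 calls.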
For data grounding, we built a pipeline that pulls real numbers from the Statistics Canada API: vacancy rates, average rents, housing starts, unemployment rates, population estimates, and income breakdowns by age group across 20 Canadian cities. Every agent prompt includes the actual numbers for its city, so the analysis is anchored in reality rather than training data.
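The grounding step itself is simple once the numbers are fetched: format each city's figures directly into the agent's prompt. A minimal sketch, with the caveat that the Kelowna values and the field names here are illustrative placeholders, not the exact Statistics Canada series the pipeline pulls (the Toronto figures are the real ones cited later in this writeup):

```python
# Illustrative snapshot of the per-city stats cache built from the
# Statistics Canada API; Kelowna's values are placeholders.
CITY_STATS = {
    "Toronto": {"vacancy_rate": 3.0, "avg_rent": 1761, "unemployment": 7.9},
    "Kelowna": {"vacancy_rate": 1.3, "avg_rent": 1550, "unemployment": 5.0},
}

def ground_prompt(persona: str, city: str, policy: str) -> str:
    """Inject the city's actual figures into the agent prompt so the model
    reasons from real numbers rather than its training data."""
    stats = CITY_STATS[city]
    facts = ", ".join(f"{k.replace('_', ' ')}: {v}" for k, v in stats.items())
    return (
        f"You are {persona} in {city}. Current data for {city}: {facts}. "
        f"Assess how this policy would affect someone like you: {policy}"
    )
```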
The frontend is built with React and connects to a FastAPI backend using server-sent events, so users can watch the simulation progress in real time: specialists reporting in, validators confirming risks, and the final report assembling piece by piece.
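The streaming side boils down to emitting events in the Server-Sent Events wire format as each agent finishes. A stripped-down sketch (the stage names are hypothetical; with FastAPI, the generator would be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`):

```python
import asyncio
import json

def sse_frame(event: dict) -> str:
    """Serialize one event in SSE wire format: 'data: <json>' + blank line."""
    return f"data: {json.dumps(event)}\n\n"

async def simulation_events(policy: str):
    # Placeholder stages; the real pipeline yields an event the moment each
    # specialist or validator finishes, so the UI updates incrementally.
    for stage in ("specialist:fiscal_analyst", "validator:toronto_renter", "report:final"):
        yield sse_frame({"stage": stage, "policy": policy})
        await asyncio.sleep(0)
```

On the React side, the browser's built-in `EventSource` consumes this stream, so the UI needs no polling or websocket setup.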
Challenges we ran into
Calibrating the risk assessment was our biggest challenge. Early versions of Civica had every single specialist flagging every policy as high risk, which made the reports useless: when everything is catastrophic, the output says nothing about how a policy would actually affect the population.
The breakthrough was defining what LOW, MEDIUM, and HIGH actually mean in concrete terms. Instead of letting the models decide severity subjectively, we quantified it: LOW affects a small or narrow group, MEDIUM affects a meaningful share of a demographic group across multiple cities, and HIGH affects a large share of a vulnerable population nationally and is likely to compound with other risks. That single change gave the models a framework to reason about severity consistently. The reports went from "everything is catastrophic" to nuanced assessments where some risks genuinely are low, and that's a useful signal.
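In practice, pinning the models down is as simple as embedding the rubric verbatim in every prompt. A sketch of that idea (the wording below paraphrases our rubric; the helper name is made up for illustration):

```python
# Concrete severity definitions, so "high risk" is never a vibe.
SEVERITY_RUBRIC = {
    "LOW": "Affects a small or narrow group in limited areas.",
    "MEDIUM": "Affects a meaningful share of a demographic group across multiple cities.",
    "HIGH": ("Affects a large share of a vulnerable population nationally "
             "and is likely to compound with other risks."),
}

def rubric_prompt(risk: str) -> str:
    """Force the model to rate severity against fixed definitions
    instead of judging it subjectively."""
    rules = "\n".join(f"{level}: {desc}" for level, desc in SEVERITY_RUBRIC.items())
    return f"Rate this risk using ONLY these definitions:\n{rules}\nRisk: {risk}"
```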
Accomplishments that we're proud of
The risk reports actually hold up to scrutiny. When we ran our default test policy — "Canada builds 500,000 new homes over 3 years" — the system didn't just say "housing is expensive." It identified infrastructure strain from rapid construction, building risks from rushed timelines, and displacement of low-income residents from gentrification. Those are risks that urban planners and policy analysts would recognize — not generic AI filler.
The confirmation signal is something no single AI call can produce. When 48 out of 50 demographic personas independently confirm a risk against their own city's real data, that means something different than one model saying "this is a big deal." And when only 8/50 confirm a risk to Indigenous communities, that's not a failure — it's telling you the risk is real but narrow. That gradient from 48 to 8 is the whole point.
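Aggregating that signal is just counting votes, but the interpretation step is what makes it useful. A minimal sketch, where the breadth thresholds are illustrative rather than the project's exact cutoffs:

```python
def confirmation_rate(votes: list[bool]) -> float:
    """Fraction of the 50 persona validators who confirmed a risk."""
    return sum(votes) / len(votes)

def classify_breadth(rate: float) -> str:
    # Thresholds are illustrative, not Civica's exact cutoffs.
    if rate >= 0.8:
        return "broad"     # e.g. 48/50 confirm: hits most demographics
    if rate >= 0.3:
        return "moderate"
    return "narrow"        # e.g. 8/50 confirm: real but concentrated harm
```

The low end is not noise to be filtered out: a "narrow" risk confirmed only by, say, Indigenous-community personas is exactly the kind of concentrated impact a national average would hide.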
Every number is grounded in real government data. We made a deliberate choice early on to never let the models make up statistics. Every agent prompt includes real vacancy rates, rents, unemployment figures, and income data pulled directly from Statistics Canada.
What we learned
AI agents are only as good as their architecture. We learned that throwing 50 identical agents at a problem doesn't give you 50 perspectives — it gives you the same answer 50 times. The real power comes from designing a system where each agent has a distinct role. Once we separated domain specialists from demographic validators, the same cheap model that was useless at discovering risks became excellent at confirming them. The right job for the right agent matters more than the model you pick.
You can run dozens of AI agents in parallel for pennies. We assumed running 50+ LLM calls per simulation would be prohibitively expensive or slow. In practice, by using an expensive model for the 8 calls that need deep reasoning and a cheap model for the 50 calls that need breadth, a full simulation costs about $0.30 and runs in under 5 minutes. Concurrency does the heavy lifting: what would take 15 minutes sequentially finishes in about 3.
Real data is a cheat code for AI quality. The single biggest improvement to our output wasn't a better model or a smarter prompt — it was feeding real Statistics Canada numbers into every agent call. When a model has to reason about Toronto's 3.0% vacancy rate and $1,761 average rent instead of vibes from its training data, the analysis gets dramatically more grounded.
What's next for Civica
Stronger models. Our immediate next step is upgrading the specialist agents to more capable reasoning models. The architecture is model-agnostic, so swapping in a stronger model is a one-line change that instantly improves every risk assessment.
Richer data grounding. Right now, we pull from six Statistics Canada tables. We want to layer in CMHC housing data, CRA tax statistics, provincial budget figures, and, critically, legal and regulatory data. A policy that sounds good economically might conflict with existing provincial legislation or municipal zoning bylaws. Grounding the analysis in legal context would catch a whole category of implementation risks that pure economic analysis misses.
Policy comparison mode. We want policymakers to be able to run two competing proposals side by side — "build 500k homes" vs "implement a 2% vacancy tax" — and see which one carries more risk, for
whom, and where.
Built With
- anthropic-claude-3-haiku
- asyncio
- backboard.ai
- fastapi
- framer-motion
- openai-gpt-4o
- python
- react
- statistics-canada-api
- typescript
- vite