-
-
FairLane generates counterfactual variants — identical applicants in every financial respect, with only the protected attribute changed
-
The drift monitor tracks rolling judge scores across protected groups using accumulated Phoenix eval data. Shows the exact prompt fix needed
-
Through the Phoenix MCP server, the agent can query its own operational data at runtime
-
Arize Phoenix Trace
Inspiration
My background is in LLM bias evaluation in healthcare settings — specifically how language models make different decisions for identical patients when demographic attributes change. When I saw the Financial Services track, the parallel was immediate: the same problem exists in lending, it's federally regulated under ECOA, and lenders spend millions on compliance tooling that still operates after decisions are made.
The question that drove this project: what if the model audited itself before the decision was final?
What it does
FairLane is a fair-lending agent that, before committing any underwriting decision, generates counterfactual demographic variants of the applicant — identical financials, swapped protected attributes — and re-runs them through itself. Decision flips and reasoning inconsistencies are measured deterministically and scored by an LLM-as-judge. Arize Phoenix traces the full self-audit as a unified span tree. A human loan officer sees the complete evidence packet and makes the final call.
The self-improvement loop uses the Phoenix MCP server to monitor rolling consistency scores across protected groups over time. When drift is detected, the agent surfaces the exact prompt fix needed and runs evaluation experiments to verify recovery.
The problem it solves
Fair lending compliance under the Equal Credit Opportunity Act (ECOA) is not optional — it is a federal legal requirement. Lenders currently spend millions annually on post-hoc statistical disparity analyses filed months after decisions are made. By then, discriminatory patterns have already affected real applicants and generated regulatory exposure.
FairLane shifts the audit from after the fact to before the decision — catching potential disparate treatment at the point of underwriting, not in a quarterly compliance report. A complete, queryable audit trail is produced for every application automatically, reducing the manual compliance burden and creating defensible documentation for regulators.
The Consumer Financial Protection Bureau received over 17 million HMDA records in 2023 alone. Every one of those applications represents a decision where bias detection at the point of underwriting would have had real impact.
How I built it
- Google ADK orchestrates five sequential agents: underwriting reasoner, counterfactual generator, disparity analyzer, decision router, compliance logger
- Gemini handles all reasoning — drafting decisions, judging rationale consistency, writing human-facing audit packets
- Arize Phoenix via OpenInference instrumentation traces every pipeline run as a single unified trace with nested variant spans; the Phoenix MCP server lets the agent query its own operational data at runtime
- Public HMDA data (FFIEC Data Browser API) provides real mortgage applications with demographic features — no synthetic data
- FastAPI + React + Tailwind for the loan officer review console
- Cloud Run + Firebase Hosting for deployment
The counterfactual methodology is adapted from my prior research in healthcare LLM bias evaluation — controlled demographic pair generation with one attribute swept at a time for clean attribution.
Key finding
Running Phoenix evaluation experiments across three prompt versions revealed something counterintuitive:
"Positive instructions are insufficient to guarantee fairness. Without explicit blindness constraints — the 'do-nots' — the LLM is significantly more susceptible to latent bias and will allow demographic noise to influence its reasoning."
Simply telling the model what factors to use was not enough. Explicitly telling it what to ignore — including proxies like geography and zip code — was what drove consistency scores from 0.68 back to 1.0.
The LLM-as-judge caught subtler failures that binary flip-rate metrics missed: "same decision, different reason" behavior where the model approved the baseline for strong income but the demographic variant for low-risk geography. That's disparate treatment even when the approve/deny outcome looks identical.
Challenges
Counterfactual scope: HMDA's public data excludes credit score and AUS results for privacy. "Holding creditworthiness constant" is only approximate — held constant on available features. This is stated explicitly in the audit packet; overclaiming would undermine the project's credibility.
Phoenix tracing on Windows: The PHOENIX_COLLECTOR_ENDPOINT format
required /v1/traces appended, and the npx.cmd binary name for the Phoenix
MCP server in .gemini/settings.json. Small Windows-specific issues that
weren't in any documentation.
Prompt engineering for fairness: The drift experiments showed that fairness is a property of the instructions, not just the data. Getting the constraint language precise enough to eliminate proxy reasoning without breaking legitimate underwriting took several iterations.
ADC on Cloud Run: The gemini-3.5-flash model requires location=us
(not us-central1), and the default compute service account needed explicit
aiplatform.user IAM binding. Neither was documented clearly for this
specific model.
What's next
The pattern generalizes beyond lending — hiring, insurance underwriting, clinical triage, any high-stakes domain where demographic bias is both a legal and ethical risk. The self-auditing loop and Phoenix drift monitoring are domain-agnostic; the counterfactual methodology adapts to any protected-class definition.
Built With
- arize-phoenix
- fastapi
- firebase-hosting
- gemini
- google-adk
- google-cloud-run
- hmda-ffiec-api
- openinference
- opentelemetry
- python
- react
- sqlite
- tailwind-css
- typescript
- vite

Log in or sign up for Devpost to join the conversation.