What Inspired Us

Honestly, it started with an argument. The three of us were sitting together after a late-night study session, going through a case study from our AI course: a real company that had greenlit a terrible acquisition because their entire risk team came from the same background, used the same mental models, and nobody in the room genuinely pushed back. The deal collapsed 18 months later and cost them everything. And one of us just said it out loud: "what if the people in that room were actually forced to disagree?"

That stuck. We kept coming back to it over the next few days. What if you could build a system where disagreement wasn't optional, not cultural, not dependent on someone being brave enough to speak up in a meeting, but hardwired into the architecture itself? Where a CFO agent and a Marketing agent are literally designed to want different outcomes, and the only way forward is through a structured debate that neither can skip? That was the seed. Everything else grew from that one question.

What We Learned

We went in thinking this was primarily an ML project. We came out realizing it was mostly a systems design problem with ML inside it. The technical learnings were real: we got deep into XGBoost gradient boosting, credibility-weighted voting mathematics, genetic mutation algorithms, and how to run local LLMs via Ollama without everything catching fire. But the more surprising lessons were softer ones.

Disagreement is hard to engineer. Our first version had all five agents returning nearly identical risk scores on every single run. They shared the same base model, received the same features, and without deliberate noise injection the entire "debate" was five agents saying the same thing in different fonts. We had to intentionally break the determinism (adding calibrated Gaussian jitter, randomized LLM temperatures, and role-specific feature weighting) just to make the agents genuinely conflict with each other. That felt counterintuitive at first. You're building an AI system and your main job for two days is making it less consistent.

The credibility formula is deceptively political. The equation looks clean on paper:

$$C(t+1) = \alpha \cdot C(t) + \beta \cdot \text{Performance} + \gamma \cdot \text{Agreement} - \delta \cdot \text{HistoricalError}$$

But once we started running it we realized the $\gamma$ term, the agreement reward, was quietly creating groupthink. Agents that agreed with the majority got rewarded regardless of whether the majority was right. An agent that was consistently correct but contrarian would drift toward lower credibility over time. That's not a bug in our code. That's how real organizational politics works. Fixing it properly requires ground truth labels we don't always have, and that tension never fully resolved.

Local LLMs are humbling. Running deepseek-r1 and llama3 locally through Ollama means you're at the mercy of your hardware. Average LLM response time in our accuracy report was 62.75 seconds per agent. A full four-round boardroom debate with five agents can take north of 20 minutes on a mid-range machine (4 rounds × 5 agents × 62.75 s ≈ 21 minutes if the calls run one after another). You learn very quickly to appreciate what it costs to generate one paragraph of coherent financial reasoning.
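
Going back to the credibility update, the groupthink effect is easy to reproduce in a few lines. The sketch below is not our production code: the coefficient values, the clamp to [0, 1], and the choice of a neutral 0.5 for Performance when no outcome label exists are all assumptions made for illustration, and the names `Coefficients`, `update_credibility`, and `run` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coefficients:
    # Hypothetical values; the real alpha, beta, gamma, delta are not given here.
    alpha: float = 0.7  # weight on previous credibility C(t)
    beta: float = 0.2   # weight on Performance
    gamma: float = 0.1  # weight on Agreement
    delta: float = 0.1  # weight on HistoricalError


def update_credibility(c, performance, agreement, historical_error,
                       k=Coefficients()):
    """One step of C(t+1) = a*C(t) + b*Performance + g*Agreement - d*HistoricalError."""
    raw = (k.alpha * c + k.beta * performance
           + k.gamma * agreement - k.delta * historical_error)
    # Clamping to [0, 1] is an assumption made for this sketch.
    return min(1.0, max(0.0, raw))


def run(performance, agreement, historical_error, cycles=50, start=0.5):
    """Apply the same inputs for several cycles and return the final credibility."""
    c = start
    for _ in range(cycles):
        c = update_credibility(c, performance, agreement, historical_error)
    return c


# Without outcome labels: Performance is a neutral 0.5 for everyone (assumed),
# so the only thing separating the two agents is the agreement reward.
contrarian_blind = run(performance=0.5, agreement=0.0, historical_error=0.0)
conformist_blind = run(performance=0.5, agreement=1.0, historical_error=0.0)

# With outcome labels: the contrarian was right and the majority was wrong.
contrarian_scored = run(performance=1.0, agreement=0.0, historical_error=0.0)
conformist_scored = run(performance=0.0, agreement=1.0, historical_error=1.0)

print(f"no labels:   contrarian={contrarian_blind:.2f} conformist={conformist_blind:.2f}")
print(f"with labels: contrarian={contrarian_scored:.2f} conformist={conformist_scored:.2f}")
```

With these made-up coefficients the contrarian settles near 0.33 against the conformist's 0.67 when no labels arrive, and the ordering flips once the performance and error terms carry real signal. That is the tension in miniature: the formula only rewards being right when something tells it who was right.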

How We Built It

We broke the system into layers and built bottom-up, which in hindsight was the right call, though it didn't feel that way at 2am when the pipeline was broken and nobody could remember what they'd changed.

Week 1 was data and models. We generated synthetic financial datasets, trained the XGBoost base model, and built the feature engineering pipeline. The company financial health dataset came out well: 5,000 records, 8 sectors, 36 features. The credit risk dataset we underestimated. 200 records felt fine until we realized we only had 31 defaults to train on, which is nowhere near enough for a classifier to learn real decision boundaries. That came back to haunt us.

Week 2 was agents. We built BaseAgent first, got the three-step prediction pipeline working (ML risk → role risk → blend + jitter), then built all five role subclasses. The LocalLLMClient wrapper around Ollama was straightforward but brittle. We spent an embarrassing amount of time debugging why the LLM kept returning identical responses until we found that Ollama returns the same text for the same prompt when temperature and seed are fixed.

Week 3 was the debate engine and voting. The four-round boardroom structure in boardroom_debate.py was the most satisfying thing to build. Each round's prompt includes the full transcript of all prior rounds, so by Round 3 the agents are genuinely responding to arguments that were made two rounds ago. The weighted aggregator mathematics

$$\text{final\_risk} = \text{base\_risk} \times 0.5 + \text{confidence\_penalty} \times 0.3 + \text{dissent\_factor} \times 0.2$$

$$\text{council\_confidence} = \text{agreement\_ratio} \times 0.6 + \overline{\text{agent\_confidence}} \times 0.4$$

took several iterations to get right: the dissent factor was added late because early versions would return high confidence even when the vote was 3-2, which felt wrong.

Week 4 was evolution, the FastAPI layer, and writing the report. The genetic mutation system was genuinely fun. The idea that an underperforming agent gets replaced by a mutated offspring of the best-performing one, with slightly perturbed weights and threshold, felt elegant once it was working.
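
To see what the two aggregator formulas from Week 3 do with a split vote, here is a small worked sketch. The 0.5/0.3/0.2 and 0.6/0.4 weights come from the formulas; everything else is an assumption: the function names are hypothetical, `agreement_ratio` is taken to mean the share of agents on the majority side, and `confidence_penalty` and `dissent_factor` are passed in as plain numbers in [0, 1] because how they are derived is not spelled out here.

```python
from statistics import mean


def agreement_ratio(votes):
    """Share of agents on the majority side (assumed definition)."""
    majority = max(set(votes), key=votes.count)
    return votes.count(majority) / len(votes)


def final_risk(base_risk, confidence_penalty, dissent_factor):
    """Weighted sum with the 0.5 / 0.3 / 0.2 weights from the formula."""
    return base_risk * 0.5 + confidence_penalty * 0.3 + dissent_factor * 0.2


def council_confidence(ratio, agent_confidences):
    """Agreement ratio and mean agent confidence, weighted 0.6 / 0.4."""
    return ratio * 0.6 + mean(agent_confidences) * 0.4


# Five equally confident agents (illustrative numbers, not real run output).
confidences = [0.9, 0.9, 0.9, 0.9, 0.9]
unanimous = ["approve"] * 5
split = ["approve", "approve", "approve", "reject", "reject"]

conf_unanimous = council_confidence(agreement_ratio(unanimous), confidences)
conf_split = council_confidence(agreement_ratio(split), confidences)
risk = final_risk(base_risk=0.62, confidence_penalty=0.30, dissent_factor=0.40)

print(f"5-0 vote confidence: {conf_unanimous:.2f}")
print(f"3-2 vote confidence: {conf_split:.2f}")
print(f"final risk:          {risk:.2f}")
```

With identical individual confidences, the 3-2 vote comes out at roughly 0.72 against roughly 0.96 for the unanimous one, so the split shows up in the council's confidence even though no single agent is less sure of itself.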

The Challenges We Faced

The determinism problem was the first major wall we hit and the most conceptually interesting. Five agents, same model, same input → five identical outputs → no debate, no meaningful vote, no credibility differentiation. The fix (Gaussian jitter plus randomized temperature plus random seed per LLM call) worked, but it felt like cheating at first. We eventually made peace with it by reframing it: the jitter isn't noise, it's epistemic uncertainty. Different executives genuinely would assess the same balance sheet differently based on their mood, their prior experiences, what they had for breakfast. The randomness is realistic.

The ground truth problem never fully resolved. The credibility formula, the blend ratio adaptation, the evolution fitness scorer: all of them work best when you have outcome labels. Did the loan default? Did the acquisition succeed? In a real deployment you'd have this data flowing back 12 to 18 months later. In a research prototype you're mostly simulating it, which means some of the self-improvement mechanisms are running on noise rather than signal. We documented this honestly in the report rather than papering over it.

Hardware constraints shaped the architecture more than we expected. Every design decision that adds an LLM call (memory-aware system prompts, richer debate rounds, CEO supervision reasoning) costs real seconds on real hardware. We ended up with a strict rule: no new LLM calls inside hot loops, only at the debate/session level. That constraint actually produced cleaner architecture.

The CEO oversight gap was something we only fully articulated late in the project. The CEO agent delivers the final verdict in every debate, its word is the closing statement, and the existing evolution system essentially never removes it because it stays mid-to-high credibility by consistently agreeing with the group majority. We built the supervision and mutation layer for the CEO specifically because of this, and it became one of the more novel contributions of the whole system. An AI executive that can itself be replaced when its strategic judgment consistently diverges from the quantitative risk signals is a governance story that doesn't exist in most multi-agent systems.
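
The determinism fix from the start of this section (role-specific weighting, Gaussian jitter, and fresh sampling settings per LLM call) can be sketched roughly as follows. This is an illustration under stated assumptions, not the BaseAgent code: the feature names, role weights, blend ratio, jitter width, and temperature range are invented, and features are assumed to be scaled to [0, 1] with higher meaning riskier. `temperature` and `seed` are the Ollama option names; the helper functions themselves are hypothetical.

```python
import random


def predict_risk(ml_risk, features, role_weights, blend=0.5,
                 jitter_sd=0.03, rng=None):
    """ML risk -> role risk -> blend + jitter, clamped to [0, 1]."""
    rng = rng or random.Random()
    # Role risk: weighted average of the features this role cares about.
    total = sum(role_weights.values())
    role_risk = sum(features[name] * w for name, w in role_weights.items()) / total
    # Blend the shared model score with the role view, then add Gaussian jitter.
    blended = blend * ml_risk + (1.0 - blend) * role_risk
    noisy = blended + rng.gauss(0.0, jitter_sd)
    return min(1.0, max(0.0, noisy))


def llm_sampling_options(rng):
    """Fresh temperature and seed for each LLM call (ranges are assumptions)."""
    return {"temperature": rng.uniform(0.5, 0.9), "seed": rng.randrange(2**31)}


# Illustrative inputs only: every agent sees the same features and ML score.
features = {"debt_to_equity": 0.8, "brand_sentiment": 0.2, "cash_runway": 0.6}
CFO_WEIGHTS = {"debt_to_equity": 0.6, "cash_runway": 0.4}
MARKETING_WEIGHTS = {"brand_sentiment": 0.8, "cash_runway": 0.2}

rng = random.Random()
cfo = predict_risk(0.55, features, CFO_WEIGHTS, rng=rng)
marketing = predict_risk(0.55, features, MARKETING_WEIGHTS, rng=rng)
print(f"CFO risk: {cfo:.3f}  Marketing risk: {marketing:.3f}")
print("LLM options for this call:", llm_sampling_options(rng))
```

The role weighting does most of the work in pulling the agents apart, since the CFO and Marketing views start from different features. The jitter and per-call sampling options then keep repeated runs from collapsing back into identical outputs.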

The Part We're Most Proud Of

Not the XGBoost model. Not the FastAPI layer. Not even the debate engine, which is the most visible part of the system. It's the credibility formula and what it implies. The idea that an agent's influence over a decision should be proportional to how right it has been historically, and that this weight should update dynamically, cycle by cycle, so the council self-corrects toward its most accurate members over time: that's a small idea with large consequences. It means the system gets better at being right the more it runs. It means a contrarian agent that keeps being correct can eventually dominate the vote even if it spent its early cycles being ignored, as long as outcome labels arrive and the performance term outweighs the agreement reward. It means the council has memory. That felt, to all three of us, like the thing that made this more than a demo.
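
As a rough picture of what "influence proportional to credibility" means at vote time, here is a minimal sketch of a credibility-weighted average. It assumes the council combines per-agent risk scores by a normalized weighted mean, which is one plausible reading and not necessarily the exact aggregation the system uses. The function name, agent labels, and numbers are placeholders.

```python
def weighted_vote(scores, credibility):
    """Credibility-weighted mean of per-agent risk scores (assumed aggregation)."""
    total = sum(credibility[name] for name in scores)
    if total <= 0:
        raise ValueError("total credibility must be positive")
    return sum(scores[name] * credibility[name] for name in scores) / total


# Four agents see low risk; one contrarian sees high risk (made-up numbers).
scores = {"agent_a": 0.3, "agent_b": 0.3, "agent_c": 0.3,
          "agent_d": 0.3, "contrarian": 0.8}

# Early cycles: everyone carries the same weight.
equal = {name: 0.5 for name in scores}

# Later cycles: the contrarian has earned credibility by being right.
earned = {"agent_a": 0.2, "agent_b": 0.2, "agent_c": 0.2,
          "agent_d": 0.2, "contrarian": 0.9}

early_risk = weighted_vote(scores, equal)
later_risk = weighted_vote(scores, earned)
print(f"equal credibility:  {early_risk:.2f}")
print(f"earned credibility: {later_risk:.2f}")
```

With equal weights the lone dissenter barely moves the council (about 0.40). Once its credibility has grown, the same five scores produce a council risk of about 0.56, which is the memory effect described above.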

Built With

  • algorithm
  • architecture
  • boosting
  • chromadb
  • credibility-weighted
  • debate
  • deepseek-r1
  • fastapi
  • genetic
  • gradient
  • minmaxscaler
  • mistral
  • multi-agent
  • mutation
  • normalization
  • numpy
  • ollama
  • pandas
  • pr-agents)-deepseek-r1-(cfo
  • pydantic
  • pytest
  • rag
  • scikit-learn
  • via
  • voting
  • xgboost