Inspiration

Medical AI can report 94% accuracy while failing older adults, racial minorities, rural hospitals, and high-vulnerability communities. The Optum risk scorer, Epic Sepsis Model, and pulse-oximetry equity gaps are all documented cases where deployed AI harmed exactly the patients least represented in the validation study, and in each case, the warning signs were visible in the paper before deployment. Deployment decisions shouldn't depend only on headline metrics. Decision-makers need to understand who was excluded, where performance will deteriorate, and what happens to those communities if no one catches it. Our guiding question: Can AI transform buried academic limitations into transparent, population-specific deployment decisions before harm occurs?

What it does

PolicyGuard + CommunityGuard is a two-task multi-agent audit system for medical-AI papers, built around the principle "AI proposes evidence, humans decide."

PolicyGuard answers "should this AI be deployed AT ALL?" Eleven specialist agents (6 PolicyGuard + 5 Domain) independently examine cohort representation, fairness, statistical rigor, efficacy claims, deployment feasibility, citation fidelity, methodology, clinical translation, contamination, statistical power, and ecological fallacy. Each finding includes a paper-specific claim, a verbatim evidence span, severity, confidence, the affected subgroup, and a standardised disease code.

CommunityGuard answers "should we apply this evidence to THIS specific community?" It uses seven agents anchored in the RE-AIM implementation-science framework (Reach, Effectiveness, Adoption, Implementation, Maintenance) plus an explicit Equity cross-cut. The output is a six-dimensional scorecard plus an apply / do_not_apply / apply_with_conditions recommendation grounded in CDC PLACES, CDC SVI, Census ACS, CMS Hospital Compare, HRSA shortage areas, and Synthea synthetic cohorts.

Either task can run in four pipeline modes selectable from the sidebar: ChatGPT zero-shot (vanilla baseline), LLM Agents, +RAG (FAISS+BM25 hybrid over an 8K-paper PubMed Central OA subset), and +RAG+OpenAlex (live literature search). A side-by-side "Compare modes" view lets the user pick any two modes and watch them run on the same paper.

Layer-2 simulator: grounded findings enter a 200-iteration Monte Carlo that projects affected population, missed-diagnosis ranges with 95% CI, high-vulnerability-tract concentration, and hospital-supply risk.

Every limitation and recommendation is reviewable by a named human role (Reviewer, Editor, Clinician, or AI Governance Committee), with accept / dismiss / escalate decisions logged to an auditable trail.

How we built it

We designed PolicyGuard as a layered decision system rather than a single general-purpose prompt. It combines a stable demonstration backbone with an offline reinforcement learning and preference-alignment pipeline that also supports fully local deployment.

Why Multi-Agent Over a Single Domain Model

We chose a multi-agent architecture for five critical reasons:

  • Specialization: Reduces context confusion. Each agent stays in its lane via "don't double-flag adjacent territory" prompt clauses.
  • Uncorrelated Errors: Independent failure modes provide a true consensus signal. Cross-agent agreement is real evidence, not just a training-data echo.
  • Updateability: If the FDA releases new subgroup-reporting guidance, we simply edit one agent's prompt without a massive retraining cycle.
  • Interpretability: Reviewers can see exactly which agent surfaced what limitation. A fine-tuned monolith remains a black box.
  • Cost Efficiency: Running eleven GPT-4o-mini calls (or a sequential local Qwen-3 32B AWQ) is vastly more cost-effective than maintaining a monolithic model requiring quarterly re-alignment.

The Offline RL & Alignment Pipeline

Our local inference pathway utilizes Qwen2.5-3B-Instruct, enhanced by two specialized adapters: an SFT adapter for the Master/Leader roles (coordination and synthesis), and a DPO adapter for the Worker agents (cohort, fairness, statistical, and deployment analysis).

For worker alignment, we generated multiple candidate rollouts for each agent following a three-stage trajectory:

  • SEARCH: Form targeted queries and retrieve relevant scholarly or web passages.
  • READ: Extract source-faithful spans related to the agent’s specialty.
  • SYNTHESIZE: Produce structured, evidence-grounded Limitation Units.

Instead of outcome-only scoring, our step-wise process reward model evaluates every stage. SEARCH rewards query specificity; READ rewards verbatim fidelity; and SYNTHESIZE carries the highest weight, evaluating schema validity, evidence anchoring, citation consistency, subgroup identification, and severity. Additionally, a DeBERTa NLI scorer checks whether external evidence supports, contradicts, or is neutral toward each generated claim.
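A minimal sketch of how per-stage scores might be combined into one trajectory reward. The weights, the `StageScores` container, and the way the NLI signal is folded in are all illustrative assumptions; the write-up only states that SYNTHESIZE carries the highest weight.

```python
from dataclasses import dataclass

# Hypothetical stage weights: only "SYNTHESIZE weighs most" comes from the text.
STAGE_WEIGHTS = {"search": 0.2, "read": 0.3, "synthesize": 0.5}


@dataclass
class StageScores:
    search: float      # query specificity, assumed to lie in [0, 1]
    read: float        # verbatim fidelity of extracted spans, in [0, 1]
    synthesize: float  # schema validity, anchoring, citation consistency, in [0, 1]


def trajectory_reward(scores: StageScores,
                      nli_support: float = 0.0,
                      nli_weight: float = 0.1) -> float:
    """Weighted sum of stage scores plus an NLI adjustment.

    nli_support is assumed to be +1 (supports), 0 (neutral), or -1 (contradicts).
    """
    base = sum(weight * getattr(scores, stage)
               for stage, weight in STAGE_WEIGHTS.items())
    return base + nli_weight * nli_support
```

With weights like these, a rollout with a weak query but a well-anchored synthesis still outranks one with a sharp query and an unsupported conclusion, which is the behavior process supervision is meant to encourage.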

These accumulated trajectory rewards rank rollouts into preferred and rejected pairs to train the worker adapter using Direct Preference Optimization (DPO). Crucially, PolicyGuard does not perform live reinforcement learning on user data. Inference uses frozen adapters, ensuring the application remains predictable and safe during deployment.
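The ranking step can be sketched as follows, assuming each rollout already carries an accumulated reward. The `Rollout` type, the best-versus-worst pairing rule, and the `min_margin` threshold are hypothetical; the prompt/chosen/rejected layout is a common convention for DPO datasets rather than something the write-up specifies.

```python
from typing import Dict, List, NamedTuple


class Rollout(NamedTuple):
    prompt: str
    response: str
    reward: float  # accumulated trajectory reward


def build_preference_pairs(rollouts: List[Rollout],
                           min_margin: float = 0.1) -> List[Dict[str, str]]:
    """Pair the best and worst rollout per prompt when they differ enough."""
    by_prompt: Dict[str, List[Rollout]] = {}
    for rollout in rollouts:
        by_prompt.setdefault(rollout.prompt, []).append(rollout)

    pairs = []
    for prompt, group in by_prompt.items():
        ranked = sorted(group, key=lambda r: r.reward, reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Skip near-ties: they give the preference signal no clear direction.
        if best.reward - worst.reward >= min_margin:
            pairs.append({"prompt": prompt,
                          "chosen": best.response,
                          "rejected": worst.response})
    return pairs
```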

The Inference Pipeline

During inference, the inter-agent communication protocol ensures workers run in parallel with zero cross-talk, preserving the integrity of the consensus signal. Their outputs converge through a four-stage pipeline:

  • Parallel Worker Execution (11 Agents): The roster analyzes the manuscript through specific analytical lenses (e.g., demographic fairness, statistical rigor) to surface vulnerabilities.
  • The Evidence Anchoring Gate: A strict, mechanical verification layer that performs verbatim span retrieval to ensure the agent's claim exists in the original manuscript.
  • The Leader Cross-Check: Verified claims are semantically deduplicated. A "consensus boost" is applied, elevating the severity score if multiple independent workers flag the exact same vulnerability.
  • The Master Merger: Powered by our Master model, this step weaves the disparate technical and demographic critiques into a cohesive, actionable deployment report.
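A rough sketch of the consensus boost, with loud simplifications: findings are grouped by an exact (subgroup, issue) key as a stand-in for semantic deduplication, and the boost size and the dictionary field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def apply_consensus_boost(findings: List[Dict],
                          boost: float = 0.1,
                          max_severity: float = 1.0) -> List[Dict]:
    """Merge duplicate findings and raise severity when several agents agree."""
    groups = defaultdict(list)
    for finding in findings:
        # Exact-match key: a placeholder for real semantic deduplication.
        key = (finding["subgroup"].strip().lower(),
               finding["issue"].strip().lower())
        groups[key].append(finding)

    merged = []
    for group in groups.values():
        agents = sorted({f["agent"] for f in group})
        severity = max(f["severity"] for f in group)
        if len(agents) > 1:
            # Each additional independent agent adds one boost step, capped.
            severity = min(max_severity, severity + boost * (len(agents) - 1))
        merged.append({**group[0], "agents": agents, "severity": severity})
    return merged
```

A finding raised by a single agent passes through with its severity unchanged, so isolated claims are neither boosted nor dropped at this stage.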

The Five-Layer Hallucination Defense Strategy

To keep unsupported claims out of high-stakes healthcare decisions, the system employs a cascading hallucination filter:

  • Layer 1: Structural Validation (Pydantic & Instructor): Malformed JSON cannot reach the gate. Outputs are forced into strict schemas; failures are dropped immediately.
  • Layer 2: The Boilerplate Blocklist: A rapid heuristic filter eliminates low-value academic filler (e.g., "more research is needed").
  • Layer 3: The Evidence Anchoring Gate: A zero-LLM string-matching function cross-references the agent's verbatim text span against the original manuscript chunks.
  • Layer 4: Multi-Agent Consensus Mechanics: Agreement across isolated agents uplifts severity; isolated or anomalous claims do not receive this boost, suppressing outliers.
  • Layer 5: Reasoning-Before-Commitment (The LLM Judge): The Pydantic schema places the reasoning field before the final is_match boolean, so the judge must write out its justification before committing to a verdict.
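The two purely mechanical layers (blocklist and anchoring gate) can be sketched with plain string operations. The blocklist phrases beyond "more research is needed", the whitespace and case normalization, and the `claim` / `evidence_span` field names are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Only the first phrase appears in the write-up; the second is a hypothetical example.
BOILERPLATE_PHRASES = ("more research is needed", "further studies are warranted")


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so PDF line breaks do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_boilerplate(claim: str) -> bool:
    normalized = normalize(claim)
    return any(phrase in normalized for phrase in BOILERPLATE_PHRASES)


def is_anchored(span: str, chunks: List[str]) -> bool:
    """True only if the quoted span occurs verbatim in some manuscript chunk."""
    needle = normalize(span)
    return bool(needle) and any(needle in normalize(chunk) for chunk in chunks)


def mechanical_filter(candidates: List[Dict], chunks: List[str]) -> List[Dict]:
    """Keep candidates that are not boilerplate and whose evidence is anchored."""
    return [c for c in candidates
            if not is_boilerplate(c["claim"])
            and is_anchored(c["evidence_span"], chunks)]
```

Because no model is involved, the same candidate and manuscript always produce the same accept or reject decision.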

Dual-Tier Retrieval Strategy

When augmented generation is selected, the architecture utilizes a dual-tier strategy. The foundational tier is a hybrid RAG corpus (8K-paper PubMed Central OA subset) blending BM25 lexical scoring (30%) with dense embeddings (70%) via PubMedBERT within a FAISS index. The secondary tier introduces live search via the OpenAlex API (250M works) with cascading fallbacks and rate-limit retries. Agents are explicitly forbidden from defaulting to the manuscript when utilizing external context and must accurately tag their evidence_source.
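The 30/70 blend can be illustrated on precomputed scores. This sketch assumes both score sets are min-max normalized before mixing, which the write-up does not state; the function names are hypothetical, and the BM25 and embedding scoring themselves are out of scope here.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant or empty set maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def hybrid_rank(bm25: Dict[str, float],
                dense: Dict[str, float],
                w_bm25: float = 0.3,
                w_dense: float = 0.7) -> List[Tuple[str, float]]:
    """Blend lexical and dense scores per document, highest fused score first."""
    lexical, semantic = min_max(bm25), min_max(dense)
    docs = set(lexical) | set(semantic)
    fused = {doc: w_bm25 * lexical.get(doc, 0.0) + w_dense * semantic.get(doc, 0.0)
             for doc in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine-style similarities are not, so an unnormalized blend would let the lexical side dominate regardless of the weights.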

CommunityGuard Agent Infrastructure & Narrative Interpretation

CommunityGuard utilizes this identical infrastructure but deploys a public-policy-focused roster: Evidence Quality, Reach Equity, Access Adoption, Cultural Linguistic, Economic Sustainability, Implementation Capacity, and Decision Synthesis. A secondary LLM pass generates a qualitative narrative for each data source, translating raw numbers into regional comparisons and deployment considerations so users understand the human implications.

Deterministic Mathematical Simulation

The final layer removes the language model entirely in favor of deterministic Python. The Monte Carlo population-impact simulator runs 200 uncertainty draws over efficacy degradation, joined directly to real CDC, Census Bureau, and CMS data tables. Separating language reasoning from population mathematics guarantees that every resulting risk interval and projection remains fully auditable.
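A minimal sketch of a seeded 200-draw simulation over efficacy degradation. The uniform prior on the efficacy drop, the sensitivity-based harm model, and the percentile interval are assumptions; the inputs are placeholders that would come from the public data tables in the real system.

```python
import random
import statistics
from typing import Dict


def simulate_missed_diagnoses(population: int,
                              prevalence: float,
                              base_sensitivity: float,
                              drop_low: float,
                              drop_high: float,
                              draws: int = 200,
                              seed: int = 0) -> Dict:
    """Project missed diagnoses under an uncertain relative efficacy drop."""
    rng = random.Random(seed)  # fixed seed keeps every run reproducible
    cases = population * prevalence
    outcomes = []
    for _ in range(draws):
        drop = rng.uniform(drop_low, drop_high)  # assumed uniform uncertainty
        sensitivity = max(0.0, base_sensitivity * (1.0 - drop))
        outcomes.append(cases * (1.0 - sensitivity))
    outcomes.sort()
    # Empirical 2.5th and 97.5th percentiles of the simulated outcomes.
    lower = outcomes[int(0.025 * (draws - 1))]
    upper = outcomes[int(0.975 * (draws - 1))]
    return {"median": statistics.median(outcomes), "interval_95": (lower, upper)}
```

Because the seed and every input are explicit arguments, a reviewer can rerun a reported scenario and obtain exactly the same interval.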

Challenges we ran into

Building a deterministic policy simulator on top of stochastic language models presented severe structural, evaluative, and infrastructure hurdles.

The Qualitative-to-Quantitative Bridge

The hardest foundational challenge was connecting qualitative criticism to quantifiable impact. A generated sentence stating that older adults were underrepresented is useless to a mathematical simulator until it carries a specific subgroup, a disease code, an evidence source, a severity rating, and a defensible efficacy penalty. We solved this by forcing every LimitationCandidate through a strict Pydantic schema using the instructor library, treating those variables as first-class fields. Our simulator consumes structured mathematical objects, not prose.
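To make the idea concrete, here is a dependency-free sketch that uses a standard-library dataclass as a stand-in for the Pydantic model. The field names follow the description above, but the value ranges, the allowed `evidence_source` labels, and the `parse_candidate` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source labels; the write-up only says the source must be tagged.
ALLOWED_SOURCES = {"manuscript", "rag_corpus", "openalex"}


@dataclass(frozen=True)
class LimitationCandidate:
    claim: str
    evidence_span: str
    evidence_source: str
    subgroup: str
    disease_code: str
    severity: float          # assumed scale: 0.0 (minor) to 1.0 (critical)
    efficacy_penalty: float  # assumed relative efficacy drop, 0.0 to 1.0

    def __post_init__(self) -> None:
        if not self.claim.strip() or not self.subgroup.strip():
            raise ValueError("claim and subgroup must be non-empty")
        if len(self.evidence_span) > 200:
            raise ValueError("evidence quote exceeds 200 characters")
        if self.evidence_source not in ALLOWED_SOURCES:
            raise ValueError("unknown evidence_source")
        for name in ("severity", "efficacy_penalty"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")


def parse_candidate(raw: dict) -> Optional[LimitationCandidate]:
    """Return a validated candidate, or None so malformed output is dropped."""
    try:
        return LimitationCandidate(**raw)
    except (TypeError, ValueError):
        return None
```

Anything that fails validation never reaches the simulator, so downstream code can rely on every numeric field being present and in range.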

Hallucination Control Under Extreme Context Pressure

Medical critiques can sound incredibly plausible while being entirely unsupported. While our five-layer filter handles basic hallucinations, adding dual-tier RAG and live OpenAlex retrieval pushed agent prompts to the absolute edge of the model context window, causing catastrophic JSON truncation crashes.

We mitigated this by engineering specific limits. First, we implemented per-role input caps: the Equity Agent receives unified local data plus exactly 4,500 characters of the paper, while the other agents get 5,500. Second, we enforced strict output constraints, limiting each agent to four findings and each evidence quote to 200 characters. Third, we built cascading API fallbacks. OpenAlex rate limiting broke our early demos, so queries now fall back from 24 informative keywords, to a title-and-abstract filter, and finally to a shorter 10-keyword query, alongside an automatic JSON-mode fallback when tool calling fails.
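The caps and the query cascade can be sketched as below. The role names, the stage labels, and the injected `search_fn` callable are hypothetical; no real OpenAlex call is made here, and the caller is assumed to supply the actual client.

```python
from typing import Callable, List, Optional, Tuple

PAPER_CHAR_CAPS = {"equity": 4500}  # the Equity Agent gets a smaller slice
DEFAULT_PAPER_CHAR_CAP = 5500


def cap_paper_text(role: str, text: str) -> str:
    """Truncate manuscript text to the per-role character budget."""
    return text[:PAPER_CHAR_CAPS.get(role, DEFAULT_PAPER_CHAR_CAP)]


def search_with_fallbacks(keywords: List[str],
                          title: str,
                          search_fn: Callable[[str, str], List[str]]
                          ) -> Tuple[Optional[str], List[str]]:
    """Try progressively simpler queries until one returns results."""
    attempts = [
        ("keywords_24", " ".join(keywords[:24])),
        ("title_abstract", title),
        ("keywords_10", " ".join(keywords[:10])),
    ]
    for label, query in attempts:
        try:
            results = search_fn(label, query)
        except Exception:
            # Treat rate limits and transport errors alike: move to the next query.
            continue
        if results:
            return label, results
    return None, []
```

Returning the label of the stage that succeeded makes it possible to log how often the pipeline had to degrade to a simpler query.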

Aligning the RL Reward Signal

Designing a useful reward signal for our offline reinforcement learning pipeline was significantly harder than generating model outputs. A final answer can appear convincing even when its search query was weak, its evidence was misread, or its citation was fabricated. We had to move from outcome-only scoring to rigorous process supervision. Capturing SEARCH, READ, and SYNTHESIZE rollouts required consistent trajectory schemas. Furthermore, DPO introduced the challenge that a preferred response must be better for the correct reason. We had to actively avoid rewarding verbosity by prioritizing strict evidence fidelity, NLI support, and non-redundancy.

Infrastructure Isolation

To guarantee demo stability while pushing technical boundaries, we had to mechanically isolate the experimental Qwen offline inference service from the stable GPT demonstration. Both backbones required separate clients and caches to ensure that a local GPU timeout or adapter failure could never break the primary workflow presented to the judges.

Evaluation Under Epistemic Uncertainty

How do you evaluate an AI when it finds a genuine societal flaw that the original human authors failed to report? Standard metrics penalize this as a false positive. We refused to treat unmatched but accurate limitations as wrong, so we split our evaluation in two. Academic Recall measures how well the system extracts author-admitted flaws. Societal Discovery Rate is a custom metric assessing the validity, grounding, and severity of previously unreported systemic risks. This separation lets us measure model performance more faithfully and forms the defensible cornerstone of our Brief 6 pitch.
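One possible way to compute the two axes is sketched here. The exact formula for Societal Discovery Rate is not given in this write-up, so the definition below (the share of unmatched findings that a reviewer judged both valid and grounded) is an assumption, as are the field names.

```python
from typing import Dict, Iterable, List


def academic_recall(author_limitation_ids: Iterable[str],
                    matched_ids: Iterable[str]) -> float:
    """Fraction of author-stated limitations the system recovered."""
    stated = set(author_limitation_ids)
    if not stated:
        return 0.0
    return len(stated & set(matched_ids)) / len(stated)


def societal_discovery_rate(findings: List[Dict]) -> float:
    """Among findings with no author-stated match, the share judged credible.

    Each finding is assumed to carry 'matched_author_id' (None when unmatched)
    plus reviewer-assigned 'valid' and 'grounded' booleans.
    """
    unmatched = [f for f in findings if f["matched_author_id"] is None]
    if not unmatched:
        return 0.0
    credible = [f for f in unmatched if f["valid"] and f["grounded"]]
    return len(credible) / len(unmatched)
```

Keeping the two numbers separate means an unmatched but well-grounded finding raises the discovery rate instead of counting as a false positive against recall.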

Accomplishments that we're proud of

We are immensely proud of delivering a fully functional, end-to-end prototype rather than a simple chatbot wrapper. Our system handles everything from initial PDF ingestion to population-impact simulation, complete with a RE-AIM scorecard and a human-in-the-loop audit log. We engineered two distinct applications, PolicyGuard for regulatory deployment audits and CommunityGuard for local-fit assessments, both running on the exact same underlying agent infrastructure and strict Pydantic contracts.

We are particularly proud that our reinforcement learning strategy targets the reasoning process of our agents, not merely their writing style. We engineered an offline alignment pipeline featuring multi-hop rollout generation across our worker agents. Instead of relying on outcome-only scoring, we implemented step-wise reward modeling, assigning per-hop process rewards for search query quality, evidence extraction, and final Limitation Unit synthesis. By integrating NLI support signals and evidence fidelity scoring, we constructed high-quality preferred and rejected trajectories. This allowed us to train DPO-aligned worker adapters that are optimized to prioritize factual grounding over fluency, creating an inspectable connection between our model training objectives and responsible AI requirements.

Bridging the qualitative and quantitative divide stands as one of our most significant technical feats. Our deterministic Python simulator runs over five real public datasets: CDC PLACES, Census ACS, SVI, CMS Hospital Compare, and HRSA HPSA. It executes Monte Carlo simulations with 200 uncertainty draws to produce 95% confidence intervals. Furthermore, every local data block in CommunityGuard now carries a dedicated narrative interpretation, translating abstract demographic mathematics into concrete implications for community residents.

We made hallucination risk transparent. Every surfaced limitation displays its exact evidence span, source attribution, confidence score, and the specific agent that generated it, making the system's own epistemic uncertainty visible to the user. We even built a side-by-side comparison mode where judges can replay a standard zero-shot model against our eleven-agent pipeline on the same paper, visually demonstrating the evidence spans our agents recover that vanilla models miss.

To evaluate the system, we designed a custom two-axis evaluation framework specifically for Brief 6. By splitting our metrics into Academic Recall and Societal Discovery Rate, we quantify not just what the original authors admitted, but also the critical systemic risks they omitted.

Finally, we ensured our architecture remains entirely pluggable. The exact same infrastructure runs against cloud APIs or locally served open-weight models like Qwen, showing that our solution requires no paid-tool advantage to protect public health.

What we learned

Responsible AI is primarily a systems design problem, not a model problem. A more capable backbone alone does not produce safer or more reliable decisions. Reliability comes from decomposing responsibilities across independent entities, strictly constraining outputs with structural schemas, and preserving provenance through verbatim quotes. By validating evidence mechanically before seeking consensus, exposing epistemic uncertainty as a first-class field, and reserving final authority for named human auditors, we build trust that a monolithic black box cannot provide.

We learned that final-answer evaluation is fundamentally insufficient for safety-sensitive agents. A limitation can appear completely correct by pure accident despite poor retrieval logic or unsupported reasoning. Moving from outcome-only scoring to step-wise reward modeling revealed exactly where a multi-hop agent trajectory failed, making preference training significantly more meaningful. Furthermore, direct preference optimization quality depends entirely on the precision of the preference signal. If rewards prioritize fluency, the model simply becomes more persuasive at hallucinating. When we shifted our process rewards to prioritize evidence fidelity, contradiction control, and precise demographic subgroup specificity, the model became a far more useful tool for high-stakes deployment decisions.

Academic recall and societal validity are completely different problems. A system that perfectly reproduces author-stated limitations can still miss the specific communities and regional populations most likely to be harmed by an AI deployment. Human authors rarely admit or foresee the structural infrastructure constraints or demographic equity gaps their models will cause. Brief 6 asks a much harder, more systemic question, and our custom discovery rate metric was engineered precisely to answer it.

Specialization consistently outperforms generalization when consensus matters. Eleven small, focused worker agents with non-overlapping prompt boundaries produce uncorrelated errors. Because they are forbidden from cross-talk, their individual failures are isolated. When two or more of these distinct agents independently flag the exact same concern, it creates a genuine consensus signal, something a single fine-tuned model or a general-purpose prompt cannot replicate.

Simulation results are far more credible as transparent, conditional scenarios than as blind predictions. Reporting that a deployment will result in 12,400 missed diagnoses with a 95% confidence interval of 8,200 to 17,100, assuming a 20% efficacy drop in high-vulnerability subgroup X, is scientifically defensible. Simply stating that the model predicts a specific number of missed diagnoses, with no stated assumptions, is not. Deterministic population mathematics must always be decoupled from stochastic language reasoning.

Finally, we learned that mechanical filters should always run before large language model judges. Pydantic validation, boilerplate blocklists, and evidence anchoring gates are computationally cheap and entirely deterministic. Running them first guarantees that expensive language model judging is only utilized on candidate critiques that have already cleared the necessary structural, lexical, and textual bars.

What's next for A multi agent for Research Limitation Generation

We will focus immediately on expanding our evaluation dataset across a wider array of medical specialties and rigorously benchmarking our DPO-aligned worker agents against base models, SFT-only baselines, and GPT-4o-mini. This will establish clear, empirical proof of our system's edge in high-stakes public health domains.

We will calibrate our efficacy loss assumptions through systematic comparisons against external clinical validation studies. Instead of treating variables as static, we will implement full sensitivity reporting for every parameter inside the Monte Carlo engine, proving exactly how changes in baseline assumptions impact long-term population health projections.

We intend to initiate a blinded clinical evaluation protocol. Medical clinicians and public health reviewers will compare our agent-generated findings against their own independent audits. This will allow us to measure precise inter-judge and human-agent agreement, while enabling us to calibrate our step-wise process reward weights against expert human feedback.

We are committed to finalizing the private Qwen and reinforcement learning deployment pipeline for hospital networks and government institutions legally unable to send sensitive manuscripts to external cloud APIs. While our DPO-aligned worker adapters and SFT-trained coordinator adapters are fully designed, we will continue to keep them structurally isolated from the primary GPT pathway so that a local hardware or adapter timeout cannot cause a cascading failure in the deployment environment.

We plan to transition our simulator from modeling only the cost of inaction to modeling candidate mitigation-strategy scenarios. The simulator will project the cost, feasibility, and risk-reduction score of specific interventions, such as subgroup-targeted re-validation, localized bilingual outreach, or a phased rollout starting exclusively in high-SVI tracts.

We will introduce multi-hop, citation-level entailment checking over full referenced papers rather than restricting our anti-hallucination guardrails to abstracts. To make this defense robust, we will introduce hard-negative rollouts containing plausible but unsupported claims into the training loop, forcing the models to build an even stronger resistance to verbosity and persuasion bias.

Finally, we will build an automated system to export polished policy briefs and standardized model cards. This will allow a CommunityGuard analysis to instantly convert into a structured, one-page document for a hospital governance committee, complete with a visible evidence trail, uncertainty-aware reward aggregations, and a verifiable human reviewer audit log attached.

Built With

  • american-community-survey-2022
  • apache-parquet/pyarrow
  • bm25
  • cdc-places
  • cdc-social-vulnerability-index-2022
  • cms-hospital-data
  • csv-based-retrieval-corpus
  • deberta-nli
  • direct-preference-optimization-(dpo)
  • duckduckgo-search
  • faiss
  • hrsa
  • hugging-face-transformers
  • instructor
  • json
  • local-parquet-files
  • monte-carlo-simulation
  • numpy
  • nvidia-gpu-inference
  • openai-api-(gpt-4o-mini)
  • openai-python-sdk
  • openalex-api
  • pandas
  • pbs-gpu-scheduling
  • peft/lora
  • plotly
  • pubmed-central-open-access
  • pydantic
  • pymupdf
  • python
  • pytorch
  • qwen2.5-3b-instruct
  • shortage-area
  • ssh
  • streamlit
  • supervised-fine-tuning-(sft)
  • vllm