Inspiration

Every large pharma company sits on a graveyard of shelved drug programs: molecules that cleared Phase I safety, showed real biological activity, then got killed for reasons that had almost nothing to do with the chemistry. Portfolio pivots. A bad readout in the wrong indication. A merger. A patent cliff. A CFO.

We kept running into the same statistic: roughly 90% of drugs that enter clinical trials never reach patients, and the industry spends north of $200B/year developing new ones while the shelved assets, which already have human safety data, sit untouched. The scarce resource in drug repurposing isn't molecules or data; it's reasoning capacity. No human team can read every trial on ClinicalTrials.gov, cross-reference PubMed, weigh mechanistic plausibility, estimate trial cost, and do it across a whole portfolio.

That is an agent problem. So we built Lazarus, an autonomous clinical R&D swarm that turns that graveyard into a live, rank-ordered opportunity surface.

What it does

Lazarus is a full-stack control plane for drug rescue. Given a shelved asset, it:

  1. Discovers candidate assets from ClinicalTrials.gov, PubMed, and openFDA.
  2. Reasons over each one with a DAG of nine specialized agents (Advocate → Skeptic → Evidence Curator → Parallel Evidence Branches → Trial Strategist → Effort + Impact estimators → Judge → Follow-up Assistant), streamed live to the UI over WebSocket.
  3. Ranks the portfolio across a composite score:

$$ \text{Score} \;=\; w_c \cdot C \;+\; w_i \cdot I \;-\; w_e \cdot E \;-\; w_h \cdot H $$

where C is final confidence, I is commercial impact, E is execution effort, H is human-review drag, and w* are tunable weights.

  4. Generates an executive-grade PDF blueprint (Jinja + WeasyPrint) for the winning hypothesis.
  5. Notifies stakeholders over an optional iMessage/Spectrum bridge, and can be driven end-to-end from a conversational OpenClaw desktop agent.
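The composite score above can be sketched in a few lines. This is a minimal illustration, not the production ranker; the weight values and field names are hypothetical, and all inputs are assumed normalized to [0, 1].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Weights:
    # Hypothetical defaults; the real weights are operator-tunable.
    confidence: float = 0.4   # w_c
    impact: float = 0.3       # w_i
    effort: float = 0.2       # w_e
    human_drag: float = 0.1   # w_h

def composite_score(c: float, i: float, e: float, h: float,
                    w: Weights = Weights()) -> float:
    """Score = w_c*C + w_i*I - w_e*E - w_h*H."""
    return w.confidence * c + w.impact * i - w.effort * e - w.human_drag * h

# A high-confidence, high-impact, low-effort asset outranks a costly one.
strong = composite_score(c=0.9, i=0.8, e=0.2, h=0.1)  # 0.36 + 0.24 - 0.04 - 0.01
weak = composite_score(c=0.5, i=0.4, e=0.9, h=0.6)
```

Because effort and human-review drag enter with negative sign, a mediocre hypothesis that is cheap to execute can still beat a brilliant one that demands a new trial infrastructure.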

Operators get a dual UI: an operator dashboard for portfolio-level triage, and a dedicated reasoning lab (/agent-trace) where every agent's input, output, score, and citations are visible in real time.

How we built it

Lazarus is architected as a live control plane, not a notebook.

Backend. FastAPI + SQLAlchemy 2 + Pydantic v2. Every run is persisted as agent_runs + agent_steps rows and simultaneously pushed to the client over /runs/{id}/stream, so the UI renders what actually happened, not what we guess happened.

Reasoning engine. An explicit orchestration service wires agents as a typed DAG with Pydantic contracts instead of a free-form ReAct loop. Each role runs on the model best suited to it:

| Role | Model |
| --- | --- |
| Advocate / Judge | OpenAI gpt-4o |
| Skeptic | OpenAI gpt-4o-mini |
| Evidence Curator / Parallel Branches | Google Gemini 2.5-flash |
| Trial Strategist | MBZUAI K2-Think v2 |
| Effort / Impact | Deterministic + LLM assist |

Dual store. Postgres holds operational truth (runs, hypotheses, blueprints, reviews). Neo4j holds the biological knowledge graph (Drug ↔ Target ↔ Disease ↔ Evidence ↔ Hypothesis ↔ Strategy), rendered in the UI via Cytoscape.
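Keeping the graph projection idempotent is what makes the dual store safe under partial failure. A sketch of the approach, using Cypher `MERGE` so a retried write matches existing nodes instead of duplicating them; the labels, keys, and relationship names here are hypothetical, and the function only builds the query rather than executing it.

```python
def upsert_hypothesis_cypher(hypothesis_id: str, drug: str,
                             disease: str) -> tuple[str, dict]:
    """Build an idempotent graph upsert: MERGE matches-or-creates, so
    replaying this after a partial failure cannot create orphan nodes."""
    query = (
        "MERGE (d:Drug {name: $drug}) "
        "MERGE (x:Disease {name: $disease}) "
        "MERGE (h:Hypothesis {id: $hid}) "
        "MERGE (d)-[:SUBJECT_OF]->(h) "
        "MERGE (h)-[:TARGETS]->(x)"
    )
    return query, {"drug": drug, "disease": disease, "hid": hypothesis_id}

query, params = upsert_hypothesis_cypher("hyp-001", "metformin", "glioma")
```

Because every clause is a `MERGE` keyed on a stable identifier, the Postgres write and the Neo4j projection can be retried independently without drifting apart.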

Frontend. React 18 + Vite 7 + Framer Motion + Three.js/R3F globe + Cytoscape + D3, with a dedicated reasoning lab so judges can literally watch agents think.

Agentic surfaces. An optional Photon/Spectrum iMessage bridge and an OpenClaw skill pack let operators kick off runs and pull blueprints from chat or a desktop agent.

Infra. Docker Compose for Postgres + Neo4j + Redis locally; Render (backend) + Vercel (frontend) in production.

Challenges we ran into

  • Streaming reasoning over WebSocket while persisting to Postgres without blocking the orchestrator turned into a surprisingly subtle async problem. We ended up with a persist-then-broadcast pattern and a polling fallback (/runs/{id}/trace) for flaky networks.
  • Cross-store consistency. Writing a hypothesis to Postgres and projecting it into Neo4j without creating orphan nodes during partial failures required idempotent upserts on the graph side and runtime migrations (db.apply_runtime_migrations()) on the SQL side.
  • Hallucinated PubMed IDs were rampant in early Advocate outputs. We solved it by having the Skeptic fetch the cited PMIDs and fail the step if they didn't resolve.
  • Live-demo reliability. Venue Wi-Fi is a threat model. We added LAZARUS_DISCOVERY_DEMO_CACHE=true to serve a canned ClinicalTrials.gov-shaped payload, so the demo never depends on a network we don't control.
  • PDF generation under load. WeasyPrint is gorgeous but slow; we moved blueprint rendering behind /generate-blueprint/async with a Postgres-backed job record so the UI stays snappy.
  • Human-in-the-loop gating. Low-confidence or high-disagreement hypotheses needed to be blocked from blueprinting, not just flagged. Modeling this as a first-class HumanReview table with an escalation dashboard was the most boring and most important feature in the app.
  • Routing across three LLM providers (OpenAI, Gemini, K2-Think) meant three auth models, three rate-limit regimes, three failure modes. Deterministic fallbacks at the agent layer kept the pipeline coherent when any one of them flaked.
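The persist-then-broadcast pattern from the first bullet can be sketched with plain asyncio. The list and queue here stand in for the SQLAlchemy session and the WebSocket connections; names are hypothetical.

```python
import asyncio

async def persist(step: dict, db: list) -> None:
    # Stand-in for writing an agent_steps row to Postgres.
    db.append(step)

async def broadcast(step: dict, clients: list) -> None:
    # Stand-in for pushing over the /runs/{id}/stream WebSocket.
    for queue in clients:
        queue.put_nowait(step)

async def emit_step(step: dict, db: list, clients: list) -> None:
    # Persist FIRST, then broadcast: a client that drops and falls back
    # to polling the trace endpoint can never see a step the DB lacks.
    await persist(step, db)
    await broadcast(step, clients)

async def demo() -> tuple[list, int]:
    db: list = []
    client: asyncio.Queue = asyncio.Queue()
    await emit_step({"agent": "advocate", "output": "..."}, db, [client])
    return db, client.qsize()

db, pending = asyncio.run(demo())
```

Ordering the write before the push is what makes the polling fallback consistent: the REST trace is always a superset of what any live socket has seen.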

Accomplishments that we're proud of

  • A reasoning trace you can actually watch. Every agent step is persisted and pushed over WebSocket: the UI isn't a progress spinner pretending to be intelligence; it's a live, inspectable DAG.
  • Nine-agent orchestration that doesn't degenerate. Advocate, Skeptic, Curator, Parallel Evidence Branches, Trial Strategist, Effort, Impact, Judge, and Follow-up all produce structured JSON under Pydantic contracts and every one has a deterministic fallback, so the pipeline never dead-ends.
  • Real human-in-the-loop, not theater. Low-confidence hypotheses are gated behind a HumanReview queue before they can be blueprinted. Reviewers have a first-class dashboard.
  • A Postgres + Neo4j split that earns its complexity. Operational truth stays ACID; the knowledge graph stays traversal-native. We resisted the temptation to collapse them.
  • An executive-grade PDF blueprint rendered from Jinja + WeasyPrint with citations, mechanistic rationale, trial plan, and effort/impact economics: the artifact a BD team would actually take into a meeting.
  • Agentic surfaces beyond the browser. Lazarus can be driven from iMessage via the Photon/Spectrum bridge, or from a desktop OpenClaw agent, using the same token-gated API.
  • Offline demo mode. LAZARUS_DISCOVERY_DEMO_CACHE=true means the demo runs even if the venue Wi-Fi doesn't.
  • Shipped in 36 hours. A control plane, a graph, a reasoning lab, a PDF generator, an iMessage bridge, and an OpenClaw skill pack all integrated, all working.
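The deterministic-fallback guarantee mentioned above can be sketched as a wrapper around any agent's LLM call. The contract check and fallback heuristic shown here are hypothetical placeholders.

```python
from typing import Callable

def with_fallback(llm_call: Callable[[str], dict],
                  fallback: Callable[[str], dict]) -> Callable[[str], dict]:
    """Wrap an agent so provider failures or malformed outputs degrade
    to a deterministic result instead of dead-ending the DAG."""
    def run(prompt: str) -> dict:
        try:
            out = llm_call(prompt)
            if "score" not in out:          # stand-in for Pydantic validation
                raise ValueError("malformed agent output")
            return out
        except Exception:
            return fallback(prompt)
    return run

def flaky_provider(prompt: str) -> dict:
    raise TimeoutError("provider rate limit")

def deterministic_effort(prompt: str) -> dict:
    # Hypothetical heuristic standing in for the deterministic estimator.
    return {"score": 0.5, "source": "fallback"}

effort_agent = with_fallback(flaky_provider, deterministic_effort)
result = effort_agent("estimate effort for a Phase II oncology trial")
```

With every role wrapped this way, a single provider outage degrades one agent's output quality rather than killing the run.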

What we learned

  • Explicit orchestration beats ReAct for auditable science. A typed agent DAG cost us some flexibility but bought reliable, citable traces, which is the product in a regulated domain.
  • LLM pluralism is a real architecture decision. Routing by role (GPT-4o for reasoning, Gemini Flash for cheap-and-fast evidence work, K2-Think for long-horizon trial planning) produced dramatically better outputs than a monoculture.
  • Hallucinated citations are the single biggest failure mode in scientific agents, and they have to be caught before the Judge, not after.
  • Streaming is a first-class feature, not a UX flourish. Turning "trust me, the model is thinking" into a live, inspectable trace is what actually convinces a domain expert.
  • Determinism is a demo survival skill. Fallbacks at the agent layer are the difference between a confident demo and a cold sweat.
  • Human-in-the-loop is an architectural choice, not a disclaimer. If you want your agents taken seriously by scientists, you have to build the review workflow into the data model on day one.
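The citation-grounding check that catches hallucinated PMIDs before the Judge can be sketched with an injected resolver. In Lazarus the Skeptic fetches each cited PMID from PubMed; here a stub set stands in for the live lookup, and the example PMIDs are arbitrary.

```python
import re
from typing import Callable

PMID_RE = re.compile(r"^\d{1,8}$")  # PubMed IDs are short numeric strings

def grounded_pmids(cited: list[str],
                   resolver: Callable[[str], bool]) -> tuple[bool, list[str]]:
    """Return (step_ok, bad_pmids). The step is failed (step_ok=False)
    if any cited PMID is malformed or does not resolve."""
    bad = [p for p in cited if not PMID_RE.match(p) or not resolver(p)]
    return (len(bad) == 0, bad)

# Stub resolver standing in for a live PubMed fetch.
known = {"31978945", "32203977"}
ok, bad = grounded_pmids(["31978945", "99999999"], lambda p: p in known)
```

Failing the step, rather than merely flagging it, is the important design choice: a hypothesis with a fabricated citation never reaches the Judge at all.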

What's next for Lazarus

  • Redis-backed Celery / RQ workers for long-running async runs, replacing the current threaded executor so the pipeline can scale to portfolio-wide scans.
  • Vector memory over AssetMemory and RunMemory so agents can retrieve across prior runs, building institutional knowledge over time.
  • Graph-native ranking using Neo4j GDS: portfolio-scale similarity search over the Drug / Target / Disease graph to surface non-obvious rescue candidates.
  • Formal evals per agent (hallucination rate, citation grounding, skeptic precision, judge calibration) so we can tune each role independently rather than eyeballing outputs.
  • First-class auth and multi-tenancy with row-level security, so multiple pharma teams can run isolated portfolios on the same deployment.
  • Wet-lab integration. Pushing the top-ranked hypotheses into assay planning tools to close the loop between in silico reasoning and bench validation.
  • Regulatory-aware blueprints. Extending the PDF generator to produce FDA 505(b)(2) and orphan-designation-ready packets directly from the hypothesis and evidence set.
  • Productionizing the agent surface. Hardening the Photon/Spectrum and OpenClaw integrations with proper audit logging so a BD lead can legitimately run Lazarus from their phone.
