Inspiration
Every clinical AI team I've spoken to during my European HealthTech job search described the same blind spot: they had no idea what their agents were actually doing in production. Standard observability tools (Datadog, Jaeger, LangSmith) capture everything or nothing. For healthcare, "everything" means PHI leaking into trace payloads, and "nothing" means flying blind on a system that makes clinical decisions.

The EU AI Act (Article 13) and FDA SaMD guidance both require audit trails for high-risk AI. Yet no open-source tool existed that understood the difference between a diagnostic_agent span and a generic function call. MedTrace-SDK was built to fix that.
What it does
MedTrace-SDK is a drop-in observability SDK for healthcare AI agent pipelines. In two lines of code, it instruments any LangGraph or LangChain agent with:

- HIPAA-aware PHI scrubbing: all 18 Safe Harbor identifiers are automatically redacted from trace payloads before export, using Microsoft Presidio with a medical NER extension
- Clinical metadata schema: spans are enriched with medtrace.agent.type, medtrace.clinical.domain, medtrace.risk.tier, and medtrace.safety.gate_triggered attributes
- Multi-agent trace correlation: W3C TraceContext propagation across agent boundaries, so the full reasoning chain is captured as a single trace tree
- Trace replay engine: any stored trace can be re-executed deterministically (medtrace replay) for debugging and regression testing
- Audit export: one CLI command generates a tamper-evident NDJSON archive with SHA-256 integrity hashes per span, formatted for regulatory submission

The companion Next.js dashboard provides real-time pipeline health metrics, a trace explorer with domain/risk-tier filtering, and an interactive replay interface, all backed by a FastAPI + PostgreSQL server.
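The idea behind per-span integrity hashes in an NDJSON archive can be illustrated with a short sketch. This is not the MedTrace-SDK implementation: the function names (span_digest, export_ndjson, verify_ndjson), the "sha256" field name, and the choice of canonical JSON (sorted keys, compact separators) are all assumptions made for illustration.

```python
import hashlib
import json


def span_digest(span: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of the span."""
    # Sorted keys and fixed separators make the encoding deterministic.
    canonical = json.dumps(span, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def export_ndjson(spans: list[dict]) -> str:
    """Serialise spans as NDJSON, one record per line, each with its digest."""
    lines = []
    for span in spans:
        record = {"span": span, "sha256": span_digest(span)}
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines) + "\n"


def verify_ndjson(archive: str) -> list[int]:
    """Return the zero-based line numbers whose digest no longer matches."""
    bad = []
    for i, line in enumerate(archive.splitlines()):
        record = json.loads(line)
        if span_digest(record["span"]) != record["sha256"]:
            bad.append(i)
    return bad
```

A digest stored next to each record detects edits to that record, but not the removal of whole lines; whether the real export also chains or signs the hashes is not stated here.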
How we built it
The project is a monorepo with three packages:

medtrace-sdk/
├── packages/sdk/        ← Python SDK (pip install medtrace-sdk)
├── packages/server/     ← FastAPI trace ingestion + audit server
└── packages/dashboard/  ← Next.js 14 live monitoring dashboard

Python SDK (medtrace/) wraps OpenTelemetry's Python SDK with a healthcare-aware instrumentation layer. MedTracer.instrument_graph() patches LangGraph's CompiledGraph.invoke to emit spans automatically. The @tracer.trace_agent() decorator handles async agent nodes. The PHI scrubber runs synchronously in the export pipeline; if scrubbing fails, the span is dropped rather than exported raw.

FastAPI server (packages/server/) exposes /traces/ingest, /traces/{id}/replay, and /audit/export endpoints backed by async SQLAlchemy + asyncpg on PostgreSQL. The audit export streams NDJSON via StreamingResponse to handle large time windows without memory issues.

Next.js dashboard (packages/dashboard/) uses App Router with "use client" components for all real-time data. The animated background uses a custom WebGL2 GLSL shader (80-iteration raymarching loop) rendering a phosphorescent plasma effect, with a CSS glass3d backdrop-filter overlay creating the frosted glass UI treatment. Spline 3D scenes are loaded via @splinetool/runtime using useEffect + useRef to avoid SSR crashes.

The entire stack is Dockerised with a single docker-compose.yml (PostgreSQL + FastAPI server + Grafana).
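The fail-closed rule described for the SDK's export pipeline (drop the span if scrubbing fails, never export it raw) can be sketched in plain Python. This is a sketch under assumptions: scrub_or_drop, export_batch, the scrubber callable, and the dict-based span shape are hypothetical stand-ins, not MedTrace-SDK or OpenTelemetry APIs.

```python
from typing import Callable, Iterable, Optional


def scrub_or_drop(span: dict, scrubber: Callable[[str], str]) -> Optional[dict]:
    """Return a scrubbed copy of the span, or None if scrubbing fails."""
    try:
        clean_attrs = {
            key: scrubber(value) if isinstance(value, str) else value
            for key, value in span["attributes"].items()
        }
    except Exception:
        # Fail closed: a scrubber error must never result in raw export.
        return None
    return {**span, "attributes": clean_attrs}


def export_batch(
    spans: Iterable[dict], scrubber: Callable[[str], str], sink: list
) -> int:
    """Append scrubbed spans to sink; return how many spans were dropped."""
    dropped = 0
    for span in spans:
        clean = scrub_or_drop(span, scrubber)
        if clean is None:
            dropped += 1
        else:
            sink.append(clean)
    return dropped
```

Returning the drop count keeps the failure visible to the caller, so a broken scrubber shows up as a metric rather than as missing traces.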
Challenges we ran into
PHI scrubbing in the hot path was the hardest problem. Presidio's NER models add ~80-150ms per call. Running them synchronously would make the SDK unusable in production. The solution was a defence-in-depth approach: Presidio runs as the primary detector, a regex fallback catches structured identifiers (SSNs, MRNs, phone numbers), and the whole scrub pipeline is configurable as async-off-path for non-safety-critical spans.

LangGraph API churn: LangGraph's internal CompiledGraph structure changed three times during development. The instrumentation layer now uses a compatibility shim that inspects the graph object at runtime rather than importing internal classes directly.

Vercel monorepo deployment consumed significant debugging time. The dashboard lives at packages/dashboard/ but Vercel's build pipeline kept resolving the wrong root, triggering CLI deploys that ignored the UI-configured Root Directory setting. The fix was removing vercel.json from the repo root entirely and letting the Vercel dashboard setting take sole authority.

WebGL SSR: the phosphor shader component uses requestAnimationFrame and WebGL2RenderingContext, both browser-only APIs. Next.js's SSR would crash on import. Wrapping with dynamic(() => import(...), { ssr: false }) solved it, but required careful useEffect cleanup to prevent canvas context leaks on hot reload.
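A regex fallback for structured identifiers, as mentioned for the PHI scrubbing pipeline, might look roughly like the following. The patterns are illustrative assumptions (US-style SSNs and phone numbers, and an MRN prefix followed by 6 to 10 digits); real MRN formats vary by institution, and this is not the pattern set MedTrace-SDK ships. FALLBACK_PATTERNS and regex_scrub are hypothetical names.

```python
import re

# Illustrative patterns only; they are applied in order, one after another.
FALLBACK_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE)),
    (
        "PHONE",
        re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
    ),
]


def regex_scrub(text: str) -> str:
    """Replace each structured identifier with a typed placeholder."""
    for label, pattern in FALLBACK_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

In a combined pipeline, a function like this would typically run after the NER-based detector, so that rigidly formatted identifiers are still caught when the model misses them.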
Accomplishments that we're proud of
- PHI scrubbing recall of ~97% on the i2b2 de-identification benchmark with the combined Presidio + regex pipeline
- <5ms p99 instrumentation overhead per agent call, verified with the included benchmark suite
- A fully functional live dashboard deployed on Vercel with WebGL shader backgrounds and real-time polling
- A CLI (medtrace version, medtrace export, medtrace replay, medtrace status) that works end-to-end against a local server
- Regulatory alignment mapped explicitly to EU AI Act Article 13, FDA SaMD, and HIPAA Safe Harbor, documented in the PRD
What we learned
- backdrop-filter: blur() is visually inert unless the element behind it has high luminance contrast; the glass3d effect only becomes visible when backed by a scene with vivid colour variance (learned the hard way after three shader iterations)
- OpenTelemetry's BatchSpanProcessor silently drops spans when the export queue fills; in a clinical context this is a patient safety issue, so we added queue depth monitoring as a default metric
- The EU AI Act's definition of "high-risk AI" in Annex III explicitly includes AI used in healthcare for clinical decision support; every clinical AI team in Europe needs an audit trail solution, and none of the existing OSS tools provide one
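The drop-on-full behaviour of a bounded export queue, and the kind of depth and drop metrics that make it visible, can be shown without OpenTelemetry at all. The MonitoredQueue class below is a hypothetical stand-in for illustration, not the MedTrace-SDK metric or an OpenTelemetry API, and it is not thread-safe as written.

```python
from collections import deque


class MonitoredQueue:
    """Bounded span queue that counts drops instead of losing them silently."""

    def __init__(self, max_size: int) -> None:
        self._items: deque = deque()
        self.max_size = max_size
        self.dropped = 0

    def offer(self, span: dict) -> bool:
        """Enqueue the span; return False and count a drop if the queue is full."""
        if len(self._items) >= self.max_size:
            self.dropped += 1
            return False
        self._items.append(span)
        return True

    def drain(self, batch_size: int) -> list:
        """Remove and return up to batch_size spans for export."""
        count = min(batch_size, len(self._items))
        return [self._items.popleft() for _ in range(count)]

    def metrics(self) -> dict:
        """Snapshot suitable for publishing as gauges and counters."""
        return {
            "queue_depth": len(self._items),
            "queue_utilisation": len(self._items) / self.max_size,
            "spans_dropped_total": self.dropped,
        }
```

Alerting on utilisation before it reaches 1.0 gives operators a chance to react before any span is lost, while the drop counter records that loss has already happened.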
What's next for MedTrace-SDK
- MedTrace × MedRedTeam-SDK integration: adversarial test runs from MedRedTeam-SDK automatically generate traces that MedTrace captures, creating a closed safety loop: attack → observe → fix
- Public clinical AI safety leaderboard (in collaboration with Urban Liebel): models submit to MedEval-Bench, results are traced by MedTrace, safety rankings are published openly
- Grafana dashboard template: pre-built panels for clinical AI pipeline health (v0.2.0 milestone)
- LlamaIndex and CrewAI adapters: expanding beyond LangGraph/LangChain
- PyPI stable release: currently at 0.1.0-alpha; targeting 1.0.0 after external validation by 3 HealthTech engineering teams
Built With
- asyncpg
- css
- docker
- fastapi
- github-actions
- glsl
- langchain
- langgraph
- microsoft
- next.js
- opentelemetry
- postgresql
- presidio
- pydantic
- python
- recharts
- sqlalchemy
- tailwind
- typer
- typescript
- vercel
- webgl2