Specimen Integrity Black Box (SIBB)

A journey-trust layer for the clinical lab, built on Splunk with a human-gated AI agent.

Splunk Agentic Ops Hackathon submission.

1. The problem

Hospitals are excellent at asking "is the test result correct?" and almost blind to the question that comes before it: "can this sample even be trusted?"

A blood tube travels through a pneumatic transport network, fridges, queues, and manual handoffs before an analyzer ever touches it. Along the way it can be shaken past tolerance, stalled in a blower fault, rerouted, left in a deep receiving queue, or handed off without a custody scan. The analyzer still returns a clean, in-range number. Quality control still says PASS. The result flows to the doctor - and nobody ever checked the journey.

This pre-analytic blind spot is where the majority of avoidable lab errors are born. The telemetry that would expose it already exists, scattered across transport logs, station sensors, queue states, and operator scans. It is simply never correlated into a single verdict before the result is released.

SIBB closes that gap. It watches the journey, not the blood, and produces one decision: can this specimen's result still be trusted, or should a human intervene before it goes downstream?

2. What the product does

For every specimen journey, SIBB assembles the full cross-index story, scores it, and assigns a trust band:

Band	Meaning	Default action
`NORMAL`	Journey integrity verified	RELEASE (auto)
`CONCERNING`	Custody/queue anomaly needs a human check	HOLD - CONFIRM CUSTODY
`CRITICAL`	Physical integrity likely compromised	HOLD + REDRAW

Clean journeys release on their own. Only suspect ones surface to a human, in an operations console, with every claim backed by a real Splunk event. A reviewer approves exactly one bounded, reversible action, and that decision is written back into Splunk so the platform owns the audit trail.
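The band table reduces to a small lookup plus one rule about who gets to release. A minimal sketch, assuming the names `TrustBand`, `DEFAULT_ACTION`, and `needs_human_review` (all hypothetical, not taken from the SIBB code):

```python
from enum import Enum


class TrustBand(Enum):
    NORMAL = "NORMAL"
    CONCERNING = "CONCERNING"
    CRITICAL = "CRITICAL"


# Default action per band, copied from the band table.
DEFAULT_ACTION = {
    TrustBand.NORMAL: "RELEASE",
    TrustBand.CONCERNING: "HOLD - CONFIRM CUSTODY",
    TrustBand.CRITICAL: "HOLD + REDRAW",
}


def needs_human_review(band: TrustBand) -> bool:
    """Only NORMAL journeys release automatically; other bands go to a reviewer."""
    return band is not TrustBand.NORMAL
```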

Two design rules define the product:

It is never a diagnosis. SIBB makes no clinical claim about the patient or the result value. It only judges whether the journey can be trusted.
It does not cry wolf. A six-minute delay that matches a known 07:00 shift-change pattern is released, not escalated. Suppressing the false alarm is treated as half the job.

3. How SIBB handles a journey (end to end)

[Figure: Specimen Integrity Black Box architecture]

Discover the Splunk environment (indexes + metadata).
Run a bounded SPL query for this one specimen across all journey indexes - never an unbounded scan.
Pull the route baseline from a lookup (expected p50/p95 transit, scan window, known-benign context).
Forecast expected transit with a dual path: a saved-search quantile forecaster, with a deterministic fallback so it never depends on a preview model.
Reason over the cited evidence with a hosted model, which classifies the band and writes a rationale where every claim resolves to a real event ID.
Human gate + writeback: nothing executes on its own; an approved decision is appended to journey_decisions.
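The steps above can be sketched as a single function in which the writeback is the only side effect and sits behind the approval check. This is an illustration, not the project's `run_investigation.py`: `run_journey`, the `reason` and `approve` callables, and the tool argument names are assumptions; only the tool names come from this document.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Any]


def run_journey(
    specimen_id: str,
    call_tool: ToolCaller,
    reason: Callable[[Dict[str, Any]], Dict[str, Any]],
    approve: Callable[[Dict[str, Any]], bool],
) -> Dict[str, Any]:
    """Gather evidence read-only, classify it, and write back only on approval."""
    # Read-only steps: discovery, bounded search, baseline, forecast.
    evidence = {
        "indexes": call_tool("list_indexes", {}),
        "events": call_tool("run_search", {"specimen_id": specimen_id}),
        "baseline": call_tool("get_lookup", {"name": "route_baselines.csv"}),
        "forecast": call_tool("forecast_transit", {"specimen_id": specimen_id}),
    }
    # The model (or a rule fallback) proposes a decision from the evidence.
    decision = dict(reason(evidence))
    # Human gate: nothing is written unless the reviewer approves.
    decision["approved"] = bool(approve(decision))
    if decision["approved"]:
        call_tool("write_decision", decision)
    return decision
```

Passing the tools, the reasoner, and the reviewer in as callables keeps the gate testable without a live Splunk or model.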

4. Where we use Splunk

Splunk is the system of record, the query engine, the forecaster, the knowledge base, and the audit log - not just a dashboard. Every number SIBB reasons over comes out of Splunk.

4.1 Indexes (the journey, decomposed)

Index	What it holds
`specimen_journey`	Milestones per specimen: drawn, dispatched, received, etc., with `route_id`, `carrier_id`, `dest_node`
`tube_telemetry`	Pneumatic carrier physics: `transit_sec`, `shock_index`, blower faults, reroutes, arrive/depart events
`station_state`	Per-station environmental state (e.g. fridge door, temperature)
`queue_state`	Receiving-queue depth and oldest-wait times
`operator_actions`	Manual scans and handoffs (and their absence)
`journey_decisions`	Writeback target - every agent verdict, cited, with reviewer and reasoning source

4.2 HEC ingest + typed parsing

Synthetic-but-physical telemetry is ingested over HTTP Event Collector with token auth. props.conf defines the sibb:* sourcetypes so that JSON fields and timestamps are parsed correctly at index time:

INDEXED_EXTRACTIONS = json
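KV_MODE = none (search-time JSON extraction stays off so fields already indexed by INDEXED_EXTRACTIONS are not extracted twice)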
TIME_PREFIX = "ts":" with a matching TIME_FORMAT
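For illustration, one HEC event can be assembled with the standard library alone. This sketch only builds the request and does not send it. The host, the sourcetype `sibb:tube_telemetry`, and the event field names other than `ts` are assumptions; the `Authorization: Splunk <token>` header and the `services/collector/event` path are standard HEC usage, and 8088 is Splunk's default HEC port.

```python
import json
import urllib.request


def build_hec_request(base_url: str, token: str, event: dict,
                      index: str, sourcetype: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the HEC event endpoint."""
    payload = {"event": event, "index": index, "sourcetype": sourcetype}
    return urllib.request.Request(
        url=f"{base_url}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",  # HEC token scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical event; "ts" is the timestamp key that TIME_PREFIX points at.
request = build_hec_request(
    "https://splunk.example:8088",
    "HEC-TOKEN",
    {"ts": "2025-01-01T07:00:00Z", "specimen_id": "SPX-7F3A-9C", "shock_index": 0.81},
    index="tube_telemetry",
    sourcetype="sibb:tube_telemetry",
)
```

Sending is then a matter of `urllib.request.urlopen(request)` against a reachable collector.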

4.3 Knowledge objects

Object	Type	Role
`route_baselines.csv`	Lookup (`transforms.conf`)	Per-route p50/p95 transit, scan window, and `known_benign_context` (e.g. `shift_change_0700`) that drives false-alarm suppression
`baseline_recent_by_route`	Saved search (cron `*/10`)	Recent transit quantiles per route for fallback forecasting (`stats median()/perc95() ... eval forecast_source="stl_quantile_fallback"`)
`candidate_integrity_reviews`	Saved search (cron `*/5`)	Surfaces compound-risk journeys (`join` telemetry + `lookup` baseline; flags `time_risk`/`shock_risk`) for the agent to pick up
`journey(1)`	Macro	Cross-index assembly of a single specimen's full journey
`journey_decision_writeback_fields`	Macro	Canonical field set for the writeback contract

4.4 SPL we actually run

Bounded, single-specimen queries - stats median()/perc95() for baselines, join to correlate carrier telemetry with journey milestones, lookup for route baselines, eval/where for risk flags, table/sort/dedup for shaping. No unbounded scans.
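What "bounded" might mean in practice: one specimen ID, an explicit time window, and a row cap. The sketch below is an assumption about shape, not the project's actual SPL; `bounded_journey_spl`, the default window, and the row limit are hypothetical, and only the two indexes this document describes as correlated by `specimen_id` are included.

```python
import re

# Indexes the document describes as correlated by specimen_id.
SPECIMEN_INDEXES = ("specimen_journey", "tube_telemetry")

# Reject anything that could break out of the quoted specimen_id term.
_SPECIMEN_ID = re.compile(r"[A-Za-z0-9-]+")


def bounded_journey_spl(specimen_id: str, earliest: str = "-24h",
                        latest: str = "now", limit: int = 500) -> str:
    """Return an SPL string scoped to one specimen, one window, and a row cap."""
    if not _SPECIMEN_ID.fullmatch(specimen_id):
        raise ValueError(f"unexpected specimen id: {specimen_id!r}")
    index_clause = " OR ".join(f"index={name}" for name in SPECIMEN_INDEXES)
    return (
        f'search ({index_clause}) specimen_id="{specimen_id}" '
        f"earliest={earliest} latest={latest} "
        f"| sort 0 _time | head {int(limit)}"
    )
```

Validating the ID before interpolation matters because the string is sent to Splunk verbatim.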

4.5 REST management API + least-privilege tokens

The agent talks to Splunk over the REST management API on :8089 with a scoped bearer token (minted via authorization/tokens, short TTL, single audience). It uses:

Endpoint	Use
`search/jobs/export`	Run bounded SPL, stream results
`data/indexes`	Discover environment
`saved/searches`	Read the forecaster + candidate reviews
`authorization/tokens`	Mint least-privilege agent tokens
HEC `services/collector/event` (served by the HEC listener, not `:8089`)	Writeback to `journey_decisions`
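A sketch of how a `search/jobs/export` call could be built and its streamed output read, using only the standard library. `build_export_request` and `parse_export_lines` are hypothetical names, and the host is a placeholder. The parser assumes that with `output_mode=json` each streamed line is a JSON object carrying a `result` key, and it skips lines without one.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable, Iterator


def build_export_request(base_url: str, token: str, spl: str,
                         earliest: str = "-24h", latest: str = "now") -> urllib.request.Request:
    """Build (but do not send) a POST to the streaming export endpoint."""
    form = urllib.parse.urlencode({
        "search": spl,
        "output_mode": "json",
        "earliest_time": earliest,
        "latest_time": latest,
    })
    return urllib.request.Request(
        url=f"{base_url}/services/search/jobs/export",
        data=form.encode("utf-8"),
        headers={"Authorization": f"Bearer {token}"},  # scoped agent token
        method="POST",
    )


def parse_export_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the `result` object from each non-empty streamed line."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        obj = json.loads(raw)
        if "result" in obj:
            yield obj["result"]
```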

4.6 Resilient dual-path forecasting

Transit forecasting never hard-depends on any one engine. The primary path uses a Splunk saved-search quantile forecaster; if that is unavailable, a deterministic Splunk-native fallback (stl_quantile_fallback) computes expected/p95 from recent history. The forecast source is recorded on every decision, so the audit trail always shows how the number was produced.
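The dual-path idea can be sketched as: use the primary forecast when one is available, otherwise compute median and p95 from recent transit times, and always stamp the source. The function name, the dictionary keys, and the label `saved_search_quantile` for the primary path are assumptions; `stl_quantile_fallback` is the label this document uses.

```python
import statistics
from typing import Optional, Sequence


def forecast_transit(primary: Optional[dict],
                     recent_transit_sec: Sequence[float]) -> dict:
    """Return an expected/p95 transit forecast tagged with its source."""
    if primary is not None:
        # Primary path: pass the forecaster's numbers through unchanged.
        return {**primary, "forecast_source": "saved_search_quantile"}
    if len(recent_transit_sec) < 2:
        raise ValueError("need at least two recent transits for the fallback")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(recent_transit_sec, n=100, method="inclusive")
    return {
        "expected_sec": statistics.median(recent_transit_sec),
        "p95_sec": cuts[94],
        "forecast_source": "stl_quantile_fallback",
    }
```

Because the fallback is a pure function of recent history, the same inputs always give the same forecast, which is what makes it auditable.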

4.7 Writeback - closing the loop in Splunk

When a human approves, the decision (band, score, recommended action, rationale, evidence event IDs, reasoning source, reviewer hash) is written back via HEC into journey_decisions. The console reads it straight out of Splunk - the platform, not the app, owns the record.
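One possible shape for the writeback event, shown as a sketch. The field names, the sourcetype `sibb:decision`, and the choice of SHA-256 for the reviewer hash are assumptions; the canonical field set lives in the `journey_decision_writeback_fields` macro, which is not reproduced in this document.

```python
import hashlib
import json
from typing import Sequence


def build_decision_event(band: str, score: int, action: str, rationale: str,
                         evidence_ids: Sequence[str], reasoning_source: str,
                         forecast_source: str, reviewer: str) -> dict:
    """Assemble the HEC payload for an approved decision."""
    if not evidence_ids:
        raise ValueError("a decision must cite at least one event")
    return {
        "index": "journey_decisions",
        "sourcetype": "sibb:decision",  # hypothetical sourcetype name
        "event": {
            "band": band,
            "score": score,
            "recommended_action": action,
            "rationale": rationale,
            "evidence_event_ids": list(evidence_ids),
            "reasoning_source": reasoning_source,
            "forecast_source": forecast_source,
            # Store a hash, never the reviewer's identity itself.
            "reviewer_hash": hashlib.sha256(reviewer.encode("utf-8")).hexdigest(),
        },
    }


def to_hec_body(decision_event: dict) -> bytes:
    """Serialize the payload for a POST to services/collector/event."""
    return json.dumps(decision_event).encode("utf-8")
```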

5. AI features

The hosted model is the reasoning brain, but hard guardrails fence it in so that an invented fact cannot reach a verdict or the audit trail.

5.1 Real multi-step agent over MCP

SIBB is not a single prompt. It is a multi-step agent whose only path to Splunk is a token-authenticated MCP server exposing a least-privilege tool surface:

MCP tool	What it does
`list_indexes`	Discover available indexes
`get_index_metadata`	Index field/structure metadata
`run_search`	Execute a bounded SPL query
`get_lookup`	Read a knowledge-object lookup (e.g. route baselines)
`get_saved_search_result`	Read the forecaster / candidate-review output
`forecast_transit`	Dual-path expected-transit forecast
`write_decision`	Human-gated writeback to `journey_decisions`
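A least-privilege tool surface can be enforced with an allowlist at the dispatch point. The sketch below handles a single MCP `tools/call` JSON-RPC request; it is not the project's server. `handle_request` and the `tools` registry are hypothetical, and transport (stdio framing, bearer auth) is left out.

```python
import json
from typing import Any, Callable, Dict

# The seven tools listed in the table; anything else is refused.
ALLOWED_TOOLS = frozenset({
    "list_indexes", "get_index_metadata", "run_search", "get_lookup",
    "get_saved_search_result", "forecast_transit", "write_decision",
})


def _error(req_id: Any, code: int, message: str) -> dict:
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def handle_request(request: dict, tools: Dict[str, Callable[..., Any]]) -> dict:
    """Dispatch one JSON-RPC 2.0 `tools/call` request to an allowlisted tool."""
    req_id = request.get("id")
    if request.get("method") != "tools/call":
        return _error(req_id, -32601, "Method not found")
    params = request.get("params", {})
    name = params.get("name")
    if name not in ALLOWED_TOOLS or name not in tools:
        return _error(req_id, -32602, f"Unknown tool: {name}")
    result = tools[name](**params.get("arguments", {}))
    # MCP tool results are returned as a list of content items.
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "result": {"content": [{"type": "text", "text": json.dumps(result)}]},
    }
```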

5.2 Runtime hosted-model reasoning - Gemini 3.5 Flash on Vertex AI

The classification step is a live Gemini 3.5 Flash call on Vertex AI at runtime (reasoning_source = hosted_model:vertex:gemini-3.5-flash), authenticated with a service-account token. The model receives only Splunk-retrieved evidence and returns a trust band, recommended action, and rationale.

5.3 The citation contract (the core safety feature)

Numbers come from Splunk, not the model. Every quantitative claim (transit seconds, queue depth, shock index) is an SPL result.
The model classifies and cites. Each evidence line carries a real event_id.
Uncited claims are rejected. Any assertion the model cannot pin to an event is dropped, and a rule_fallback guardrail produces a deterministic verdict if the model output fails validation.
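The three rules above can be sketched as a validation pass over the model output. The output shape (`band`, `evidence` items with `claim` and `event_id`) and the function name `enforce_citations` are assumptions; the behaviour follows the contract as described: uncited lines are dropped, and invalid output falls back to a deterministic rule.

```python
from typing import Callable, Set

VALID_BANDS = {"NORMAL", "CONCERNING", "CRITICAL"}


def enforce_citations(model_output: dict, known_event_ids: Set[str],
                      rule_fallback: Callable[[], dict]) -> dict:
    """Keep only evidence that resolves to a retrieved event; else fall back."""
    cited = [
        line for line in model_output.get("evidence", [])
        if line.get("event_id") in known_event_ids
    ]
    # Fail validation on an unknown band or when nothing survives the filter.
    if model_output.get("band") not in VALID_BANDS or not cited:
        return {**rule_fallback(), "reasoning_source": "rule_fallback"}
    return {**model_output, "evidence": cited}
```

`known_event_ids` must be built from the Splunk results for this specimen, not from anything the model returned, or the check proves nothing.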

This is what makes the AI trustworthy in a clinical context: it reasons, but it cannot hallucinate a fact into the audit trail.

5.4 Gemini-written forecast analysis

Beyond per-specimen judgment, the model writes a live forecast analysis over the route baselines vs. observed transit (correlated across tube_telemetry and specimen_journey by specimen_id). In testing it correctly reported the network as stable on average while flagging the severe tail-risk outlier on Line B hiding under a healthy median - analysis, not a static threshold.

5.5 Pluggable and resilient

The provider layer is swappable and every model path has a deterministic fallback, so a preview-model outage degrades gracefully to a rule-based verdict rather than failing the journey.

6. Technical implementation

6.1 Architecture at a glance

Layer	Implementation
Telemetry generator	Dependency-free Python emitting physical anomalies (no labels the model can read)
Ingest	HEC (token auth) into 6 indexes; `props.conf`-typed `sibb:*` sourcetypes
Knowledge	Lookup + 2 saved searches + 2 macros (`transforms`/`savedsearches`/`props`/`macros`.conf)
Access	REST management API on `:8089` with scoped bearer tokens
Tool layer	MCP server (stdio JSON-RPC, bearer auth, 7 least-privilege tools)
Agent	Multi-step MCP loop (`run_investigation.py`)
Reasoning	Gemini 3.5 Flash on Vertex AI + citation contract + rule fallback
Live bridge	stdlib `http.server` serving the frontend same-origin + JSON API
Frontend	Vanilla JS / CSS console (TRUST PULSE, queue, journey detail, agent console, ledger, forecasts)

6.2 Live bridge API

The console reads live Splunk through a dependency-free bridge (bridge/serve.py):

Method + path	Purpose
`GET /api/data`	Aggregate: pulse, review queue, contrast case, decision ledger, forecast - built from Splunk reads
`GET /api/forecast-report`	Gemini forecast analysis grounded on baseline vs. observed transit (a billed model call; the response is cached)
`POST /api/investigate`	Run the real MCP + Gemini loop for one specimen (no writeback)
`POST /api/writeback`	Same loop, then append the approved decision to `journey_decisions`

With no Splunk configured the bridge returns 503 and the frontend falls back to built-in mock data, so the UI never breaks on stage.
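The 503 behaviour can be sketched as a pure function that an `http.server` handler would call. The environment variable names `SPLUNK_URL` and `SPLUNK_TOKEN`, the error body, and the function name `api_data` are assumptions, not the actual `bridge/serve.py` contract.

```python
import json
from typing import Callable, Mapping, Tuple


def api_data(env: Mapping[str, str],
             read_splunk: Callable[[], dict]) -> Tuple[int, str]:
    """Return (status, JSON body) for GET /api/data."""
    if not env.get("SPLUNK_URL") or not env.get("SPLUNK_TOKEN"):
        # No Splunk configured: signal it so the frontend can use mock data.
        return 503, json.dumps({"error": "splunk_not_configured"})
    return 200, json.dumps(read_splunk())
```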

6.3 The agent trace (one investigation)

Step	Tool	Output (example: `SPX-7F3A-9C`)
1	`list_indexes` / `get_index_metadata`	6 indexes discovered
2	`run_search`	bounded SPL for the specimen
3	`get_lookup`	route baseline `LINE_B_OR3_REC2` (p95 = 310s)
4	`forecast_transit` / `get_saved_search_result`	source = `stl_quantile_fallback`
5	`reason_over_cited_evidence`	band = CRITICAL, action = HOLD + REDRAW, `gemini-3.5-flash`
6	`write_decision`	HEC 200 -> `journey_decisions`

6.4 Worked example - the verdict that matters

SPX-7F3A-9C, route LINE_B_OR3_REC2:

Transit 2284s vs. a route p95 of 310s - 7.4x the ceiling [tube_telemetry/0x00009]
Blower fault -> reroute -> blower fault -> reroute during the carrier journey [tube_telemetry/0x00007]
Shock index 0.81, above the handling-risk threshold [tube_telemetry/0x00007]
Receiving queue depth 12, oldest wait 380s [queue_state/0x0000D]

Verdict: CRITICAL (21/100) -> HOLD + REDRAW. Downstream QC said PASS. The data looked fine. The journey did not.

And the counter-example, SPX-5D20-1A (route LINE_A_WARD7_REC1): a queue delay that matched the known shift_change_0700 benign context -> NORMAL (92/100) -> RELEASE. The false alarm a naive threshold would have fired.
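The 7.4x figure in the worked example is observed transit divided by the route p95. A small check of that arithmetic, with `time_risk` mirroring the flag name used by the `candidate_integrity_reviews` saved search (the comparison rule itself is an assumption):

```python
def transit_ratio(transit_sec: float, p95_sec: float) -> float:
    """How many times over the route p95 the observed transit ran."""
    return transit_sec / p95_sec


def time_risk(transit_sec: float, p95_sec: float) -> bool:
    """Assumed rule: flag any transit that exceeds the route p95."""
    return transit_sec > p95_sec


ratio = transit_ratio(2284, 310)
print(f"{ratio:.1f}x the p95")  # prints "7.4x the p95"
```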

7. Scope and guardrails

Synthetic data, real architecture. All telemetry is generated; anomalies are injected as physics with no label the model can read. Swapping in real hospital feeds is a source change, not an architecture change.
Never a diagnosis - only a journey-trust signal.
Human-in-the-loop - no action executes without an approval, and every action is bounded and reversible.
Cited or rejected - the model can reason but cannot introduce an unverifiable fact.

8. Tech stack

Concern	Choice
Platform of record	Splunk (indexes, HEC, REST, SPL, saved searches, lookups, macros)
Tool protocol	Model Context Protocol (MCP), stdio JSON-RPC, bearer auth
Runtime reasoning	Gemini 3.5 Flash on Vertex AI
Agent + bridge	Python standard library only (zero runtime dependencies)
Frontend	Vanilla HTML / CSS / JS, served same-origin by the bridge
Auth	Splunk scoped tokens; Vertex service-account credentials
