Specimen Integrity Black Box (SIBB)

A journey-trust layer for the clinical lab, built on Splunk with a human-gated AI agent.

Splunk Agentic Ops Hackathon submission.


1. The problem

Hospitals are excellent at asking "is the test result correct?" and almost blind to the question that comes before it: "can this sample even be trusted?"

A blood tube travels through a pneumatic transport network, fridges, queues, and manual handoffs before an analyzer ever touches it. Along the way it can be shaken past tolerance, stalled in a blower fault, rerouted, left in a deep receiving queue, or handed off without a custody scan. The analyzer still returns a clean, in-range number. Quality control still says PASS. The result flows to the doctor - and nobody ever checked the journey.

This pre-analytic blind spot is where the majority of avoidable lab errors are born. The telemetry that would expose it already exists, scattered across transport logs, station sensors, queue states, and operator scans. It is simply never correlated into a single verdict before the result is released.

SIBB closes that gap. It watches the journey, not the blood, and produces one decision: can this specimen's result still be trusted, or should a human intervene before it goes downstream?


2. What the product does

For every specimen journey, SIBB assembles the full cross-index story, scores it, and assigns a trust band:

| Band | Meaning | Default action |
|---|---|---|
| NORMAL | Journey integrity verified | RELEASE (auto) |
| CONCERNING | Custody/queue anomaly needs a human check | HOLD - CONFIRM CUSTODY |
| CRITICAL | Physical integrity likely compromised | HOLD + REDRAW |

Clean journeys release on their own. Only suspect ones surface to a human, in an operations console, with every claim backed by a real Splunk event. A reviewer approves exactly one bounded, reversible action, and that decision is written back into Splunk so the platform owns the audit trail.
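
The band-to-action rule can be sketched in a few lines of Python. This is an illustration, not the project's code: `Band`, `DEFAULT_ACTION`, and `route_journey` are hypothetical names, and only the three bands and their default actions are taken from the table.

```python
from enum import Enum


class Band(Enum):
    NORMAL = "NORMAL"
    CONCERNING = "CONCERNING"
    CRITICAL = "CRITICAL"


# Default action per band, copied from the trust-band table.
DEFAULT_ACTION = {
    Band.NORMAL: "RELEASE",
    Band.CONCERNING: "HOLD - CONFIRM CUSTODY",
    Band.CRITICAL: "HOLD + REDRAW",
}


def route_journey(band):
    """Return (default_action, needs_human_review) for a trust band."""
    # Only NORMAL journeys release automatically; everything else waits
    # for a reviewer to approve the proposed action.
    return DEFAULT_ACTION[band], band is not Band.NORMAL
```

Keeping the mapping in one table-like structure means the console and the writeback can share the same source of truth for what each band proposes.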

Two design rules define the product:

  • It is never a diagnosis. SIBB makes no clinical claim about the patient or the result value. It only judges whether the journey can be trusted.
  • It does not cry wolf. A six-minute delay that matches a known 07:00 shift-change pattern is released, not escalated. Suppressing the false alarm is treated as half the job.
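
One way to picture the false-alarm rule is a lookup of known benign contexts. The sketch below is an assumption throughout: the window bounds, the tolerated delay, and the names `BENIGN_CONTEXTS` and `is_known_benign` are invented for illustration. Only the context name shift_change_0700 and the six-minute delay come from this write-up.

```python
from datetime import time

# Hypothetical benign-context table:
# name -> (window start, window end, max tolerated delay in seconds).
BENIGN_CONTEXTS = {
    "shift_change_0700": (time(6, 45), time(7, 30), 600),
}


def is_known_benign(context, observed_at, delay_sec):
    """True when a delay falls inside a known benign window and stays small."""
    window = BENIGN_CONTEXTS.get(context)
    if window is None:
        return False
    start, end, max_delay = window
    return start <= observed_at <= end and delay_sec <= max_delay


# A six-minute (360 s) delay at 07:05 matches the shift-change pattern.
suppressed = is_known_benign("shift_change_0700", time(7, 5), 360)
```

The same delay outside the window, or a much longer delay inside it, would not be suppressed.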

3. How SIBB handles a journey (end to end)

[Figure: Specimen Integrity Black Box architecture]

  1. Discover the Splunk environment (indexes + metadata).
  2. Run a bounded SPL query for this one specimen across all journey indexes - never an unbounded scan.
  3. Pull the route baseline from a lookup (expected p50/p95 transit, scan window, known-benign context).
  4. Forecast expected transit with a dual path: a saved-search quantile forecaster, with a deterministic fallback so it never depends on a preview model.
  5. Reason over the cited evidence with a hosted model, which classifies the band and writes a rationale where every claim resolves to a real event ID.
  6. Human gate + writeback: nothing executes on its own; an approved decision is appended to journey_decisions.

4. Where we use Splunk

Splunk is the system of record, the query engine, the forecaster, the knowledge base, and the audit log - not just a dashboard. Every number SIBB reasons over comes out of Splunk.

4.1 Indexes (the journey, decomposed)

| Index | What it holds |
|---|---|
| specimen_journey | Milestones per specimen: drawn, dispatched, received, etc., with route_id, carrier_id, dest_node |
| tube_telemetry | Pneumatic carrier physics: transit_sec, shock_index, blower faults, reroutes, arrive/depart events |
| station_state | Per-station environmental state (e.g. fridge door, temperature) |
| queue_state | Receiving-queue depth and oldest-wait times |
| operator_actions | Manual scans and handoffs (and their absence) |
| journey_decisions | Writeback target - every agent verdict, cited, with reviewer and reasoning source |

4.2 HEC ingest + typed parsing

Synthetic-but-physical telemetry is ingested over HTTP Event Collector with token auth. props.conf types the sibb:* sourcetypes so that events are parsed as JSON and timestamped from the event's own ts field:

  • INDEXED_EXTRACTIONS = json
  • KV_MODE = json
  • TIME_PREFIX = "ts":" with a matching TIME_FORMAT
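
For reference, a HEC ingest call with the standard library might look like the sketch below. The `/services/collector/event` path and the `Authorization: Splunk <token>` header are standard HEC conventions; the helper name `build_hec_request`, the host, the token, the sourcetype name `sibb:tube_telemetry`, and the event values are placeholders, not taken from the project.

```python
import json
import urllib.request


def build_hec_request(base_url, hec_token, index, sourcetype, event):
    """Build (but do not send) a POST to the HTTP Event Collector."""
    envelope = {"index": index, "sourcetype": sourcetype, "event": event}
    return urllib.request.Request(
        url=f"{base_url}/services/collector/event",
        data=json.dumps(envelope).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {hec_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host, token, and field values for illustration only.
request = build_hec_request(
    "https://splunk.example:8088",
    "00000000-0000-0000-0000-000000000000",
    "tube_telemetry",
    "sibb:tube_telemetry",
    {"ts": "2025-01-01T07:05:00Z", "specimen_id": "SPX-7F3A-9C", "shock_index": 0.81},
)
# urllib.request.urlopen(request) would send it to a reachable Splunk instance.
```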

4.3 Knowledge objects

| Object | Type | Role |
|---|---|---|
| route_baselines.csv | Lookup (transforms.conf) | Per-route p50/p95 transit, scan window, and known_benign_context (e.g. shift_change_0700) that drives false-alarm suppression |
| baseline_recent_by_route | Saved search (cron */10) | Recent transit quantiles per line for fallback forecasting (stats median()/perc95() ... eval forecast_source="stl_quantile_fallback") |
| candidate_integrity_reviews | Saved search (cron */5) | Surfaces compound-risk journeys (join telemetry + lookup baseline; flags time_risk/shock_risk) for the agent to pick up |
| journey(1) | Macro | Cross-index assembly of a single specimen's full journey |
| journey_decision_writeback_fields | Macro | Canonical field set for the writeback contract |

4.4 SPL we actually run

Bounded, single-specimen queries - stats median()/perc95() for baselines, join to correlate carrier telemetry with journey milestones, lookup for route baselines, eval/where for risk flags, table/sort/dedup for shaping. No unbounded scans.
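
A bounded query can be enforced in code rather than by convention. The following is a sketch under assumptions: the helper name, the ID pattern, the 24-hour window, the event cap, and the idea that every listed index carries a specimen_id field are illustrative choices, not the project's actual query.

```python
import re

# Hypothetical ID shape; rejects quotes, pipes, and wildcards outright.
SPECIMEN_ID = re.compile(r"[A-Z0-9-]{1,32}")

JOURNEY_INDEXES = (
    "specimen_journey",
    "tube_telemetry",
    "station_state",
    "queue_state",
    "operator_actions",
)


def bounded_journey_spl(specimen_id, max_events=500):
    """Build SPL limited to one specimen, a fixed time window, and a row cap."""
    if not SPECIMEN_ID.fullmatch(specimen_id):
        raise ValueError(f"refusing to query with specimen id {specimen_id!r}")
    index_clause = " OR ".join(f"index={name}" for name in JOURNEY_INDEXES)
    return (
        f'search ({index_clause}) specimen_id="{specimen_id}" '
        f"earliest=-24h latest=now | sort 0 _time | head {int(max_events)}"
    )


spl = bounded_journey_spl("SPX-7F3A-9C")
```

Validating the identifier before interpolation keeps a malformed or hostile value from widening the search.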

4.5 REST management API + least-privilege tokens

The agent talks to Splunk over the REST management API on :8089 with a scoped bearer token (minted via authorization/tokens, short TTL, single audience). It uses:

| Endpoint | Use |
|---|---|
| search/jobs/export | Run bounded SPL, stream results |
| data/indexes | Discover environment |
| saved/searches | Read the forecaster + candidate reviews |
| authorization/tokens | Mint least-privilege agent tokens |
| HEC services/collector/event | Writeback to journey_decisions |
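
As an illustration of the search call, the sketch below builds a request against search/jobs/export with a bearer token. The endpoint and its `search`, `output_mode`, `earliest_time`, and `latest_time` parameters are part of the Splunk REST API; the helper name, host, token, and query text are placeholders.

```python
import urllib.parse
import urllib.request


def build_export_request(base_url, bearer_token, spl, earliest="-24h", latest="now"):
    """Build (but do not send) a streaming search request on the management port."""
    form = urllib.parse.urlencode(
        {
            "search": spl,
            "output_mode": "json",
            "earliest_time": earliest,
            "latest_time": latest,
        }
    )
    return urllib.request.Request(
        url=f"{base_url}/services/search/jobs/export",
        data=form.encode("utf-8"),
        headers={"Authorization": f"Bearer {bearer_token}"},
        method="POST",
    )


# Placeholder host, token, and query for illustration only.
export_request = build_export_request(
    "https://splunk.example:8089",
    "scoped-agent-token",
    'search index=tube_telemetry specimen_id="SPX-7F3A-9C" | head 100',
)
```

Passing the time bounds as request parameters, in addition to any bounds in the SPL itself, gives a second guard against an accidental unbounded scan.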

4.6 Resilient dual-path forecasting

Transit forecasting never hard-depends on any one engine. The primary path uses a Splunk saved-search quantile forecaster; if that is unavailable, a deterministic Splunk-native fallback (stl_quantile_fallback) computes expected/p95 from recent history. The forecast source is recorded on every decision, so the audit trail always shows how the number was produced.
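
The dual-path idea can be sketched as a function that prefers the primary result and otherwise computes quantiles from recent history. Assumptions here: the function signature, the field names `expected_sec` and `p95_sec`, the primary source label `saved_search_quantile`, and the nearest-rank p95 method are illustrative. Only the fallback label stl_quantile_fallback comes from the project.

```python
import statistics


def forecast_transit(primary_result, recent_transit_secs):
    """Return expected/p95 transit and record which path produced it."""
    if primary_result is not None:
        # Primary path: trust the saved-search forecaster's output as-is.
        return {**primary_result, "forecast_source": "saved_search_quantile"}

    if not recent_transit_secs:
        raise ValueError("no recent history available for the fallback forecast")

    ordered = sorted(recent_transit_secs)
    # Nearest-rank 95th percentile using integer arithmetic (deterministic).
    rank = (95 * len(ordered) + 99) // 100
    return {
        "expected_sec": statistics.median(ordered),
        "p95_sec": ordered[rank - 1],
        "forecast_source": "stl_quantile_fallback",
    }


fallback = forecast_transit(None, [100, 200, 300, 400, 500])
```

Because the source label travels with the numbers, a reviewer can later tell which path produced any given forecast.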

4.7 Writeback - closing the loop in Splunk

When a human approves, the decision (band, score, recommended action, rationale, evidence event IDs, reasoning source, reviewer hash) is written back via HEC into journey_decisions. The console reads it straight out of Splunk - the platform, not the app, owns the record.
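
A decision record along these lines might be assembled as follows. This is a sketch: the field names, the sourcetype `sibb:journey_decision`, the helper name, and the use of SHA-256 for the reviewer hash are assumptions. The write-up only lists which pieces of information are written back.

```python
import hashlib


def build_decision_event(specimen_id, band, score, action, rationale,
                         evidence_event_ids, reasoning_source,
                         forecast_source, reviewer_id):
    """Wrap an approved decision in a HEC envelope for journey_decisions."""
    # Store a hash rather than the reviewer's identity (hash scheme assumed).
    reviewer_hash = hashlib.sha256(reviewer_id.encode("utf-8")).hexdigest()
    return {
        "index": "journey_decisions",
        "sourcetype": "sibb:journey_decision",  # hypothetical sourcetype name
        "event": {
            "specimen_id": specimen_id,
            "band": band,
            "score": score,
            "recommended_action": action,
            "rationale": rationale,
            "evidence_event_ids": list(evidence_event_ids),
            "reasoning_source": reasoning_source,
            "forecast_source": forecast_source,
            "reviewer_hash": reviewer_hash,
        },
    }


decision = build_decision_event(
    "SPX-7F3A-9C", "CRITICAL", 21, "HOLD + REDRAW",
    "Transit far above route p95 with repeated blower faults.",
    ["tube_telemetry/0x00009", "tube_telemetry/0x00007"],
    "hosted_model:vertex:gemini-3.5-flash", "stl_quantile_fallback",
    "reviewer@example.org",
)
```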


5. AI features

The hosted model is the reasoning brain, but it is fenced in by hard guardrails so it can never invent facts.

5.1 Real multi-step agent over MCP

SIBB is not a single prompt. It is a multi-step agent whose only path to Splunk is a token-authenticated MCP server exposing a least-privilege tool surface:

| MCP tool | What it does |
|---|---|
| list_indexes | Discover available indexes |
| get_index_metadata | Index field/structure metadata |
| run_search | Execute a bounded SPL query |
| get_lookup | Read a knowledge-object lookup (e.g. route baselines) |
| get_saved_search_result | Read the forecaster / candidate-review output |
| forecast_transit | Dual-path expected-transit forecast |
| write_decision | Human-gated writeback to journey_decisions |
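
On the wire, an MCP tool invocation is a JSON-RPC 2.0 `tools/call` request, sent as one line of JSON over stdio. The sketch below builds such a message and refuses tools outside the seven listed. The helper name and the argument key `spl` are hypothetical, since the tools' argument schemas are not described here.

```python
import itertools
import json

ALLOWED_TOOLS = frozenset({
    "list_indexes", "get_index_metadata", "run_search", "get_lookup",
    "get_saved_search_result", "forecast_transit", "write_decision",
})

_request_ids = itertools.count(1)


def tool_call_message(name, arguments):
    """Serialize one JSON-RPC tools/call request as a single line."""
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"tool {name!r} is not on the allowed surface")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# The "spl" argument name is an assumption for illustration.
message = tool_call_message(
    "run_search", {"spl": 'search index=tube_telemetry specimen_id="SPX-7F3A-9C"'}
)
```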

5.2 Runtime hosted-model reasoning - Gemini 3.5 Flash on Vertex AI

The classification step is a live Gemini 3.5 Flash call on Vertex AI at runtime (reasoning_source = hosted_model:vertex:gemini-3.5-flash), authenticated with a service-account token. The model receives only Splunk-retrieved evidence and returns a trust band, recommended action, and rationale.

5.3 The citation contract (the core safety feature)

  • Numbers come from Splunk, not the model. Every quantitative claim (transit seconds, queue depth, shock index) is an SPL result.
  • The model classifies and cites. Each evidence line carries a real event_id.
  • Uncited claims are rejected. Any assertion the model cannot pin to an event is dropped, and a rule_fallback guardrail produces a deterministic verdict if the model output fails validation.

This is what makes the AI trustworthy in a clinical context: it reasons, but it cannot hallucinate a fact into the audit trail.
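
The contract can be expressed as a small validation step. This is a sketch, not the project's validator: the function name, the shape of the model output (a `band` plus `evidence` lines with `event_id`), and the fallback callable are assumptions. The behavior it illustrates, dropping uncited lines and falling back to a deterministic rule verdict, is the one described in this section.

```python
VALID_BANDS = {"NORMAL", "CONCERNING", "CRITICAL"}


def enforce_citation_contract(model_output, retrieved_event_ids, rule_fallback):
    """Keep only evidence tied to retrieved events; fall back if validation fails."""
    band = model_output.get("band")
    evidence = model_output.get("evidence") or []
    # Drop any line whose event_id was not actually returned by Splunk.
    cited = [line for line in evidence
             if line.get("event_id") in retrieved_event_ids]

    if band not in VALID_BANDS or not cited:
        verdict = dict(rule_fallback())
        verdict["reasoning_source"] = "rule_fallback"
        return verdict

    return {
        "band": band,
        "evidence": cited,
        "reasoning_source": model_output.get("reasoning_source", "hosted_model"),
    }
```

For example, a model response that cites one retrieved event and one invented event keeps only the first, while a response with no valid citations is replaced entirely by the rule-based verdict.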

5.4 Gemini-written forecast analysis

Beyond per-specimen judgment, the model writes a live forecast analysis over the route baselines vs. observed transit (correlated across tube_telemetry and specimen_journey by specimen_id). In testing it correctly reported the network as stable on average while flagging the severe tail-risk outlier on Line B hiding under a healthy median - analysis, not a static threshold.

5.5 Pluggable and resilient

The provider layer is swappable and every model path has a deterministic fallback, so a preview-model outage degrades gracefully to a rule-based verdict rather than failing the journey.


6. Technical implementation

6.1 Architecture at a glance

| Layer | Implementation |
|---|---|
| Telemetry generator | Dependency-free Python emitting physical anomalies (no labels the model can read) |
| Ingest | HEC (token auth) into 6 indexes; props.conf-typed sibb:* sourcetypes |
| Knowledge | Lookup + 2 saved searches + 2 macros (transforms/savedsearches/props/macros.conf) |
| Access | REST management API on :8089 with scoped bearer tokens |
| Tool layer | MCP server (stdio JSON-RPC, bearer auth, 7 least-privilege tools) |
| Agent | Multi-step MCP loop (run_investigation.py) |
| Reasoning | Gemini 3.5 Flash on Vertex AI + citation contract + rule fallback |
| Live bridge | stdlib http.server serving the frontend same-origin + JSON API |
| Frontend | Vanilla JS / CSS console (TRUST PULSE, queue, journey detail, agent console, ledger, forecasts) |

6.2 Live bridge API

The console reads live Splunk through a dependency-free bridge (bridge/serve.py):

| Method + path | Purpose |
|---|---|
| GET /api/data | Aggregate: pulse, review queue, contrast case, decision ledger, forecast - built from Splunk reads |
| GET /api/forecast-report | Gemini forecast analysis grounded on baseline vs. observed transit (cached, billed call) |
| POST /api/investigate | Run the real MCP + Gemini loop for one specimen (no writeback) |
| POST /api/writeback | Same loop, then append the approved decision to journey_decisions |

With no Splunk configured the bridge returns 503 and the frontend falls back to built-in mock data, so the UI never breaks on stage.
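
The 503 behavior can be pictured as a small decision in front of the Splunk read. In this sketch the function name, the error body, and the `read_aggregate` callable are hypothetical. Only the status code and the "no Splunk configured" condition come from the description above.

```python
import json


def api_data_response(splunk_url, read_aggregate):
    """Return (status, body_bytes) for GET /api/data."""
    if not splunk_url:
        # No Splunk configured: signal the frontend to use its mock data.
        body = {"error": "splunk_not_configured"}
        return 503, json.dumps(body).encode("utf-8")
    return 200, json.dumps(read_aggregate(splunk_url)).encode("utf-8")


status, body = api_data_response("", lambda url: {})
```

A handler built on http.server would pass the status to `send_response` and write the body bytes to the response stream.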

6.3 The agent trace (one investigation)

| Step | Tool | Output (example: SPX-7F3A-9C) |
|---|---|---|
| 1 | list_indexes / get_index_metadata | 6 indexes discovered |
| 2 | run_search | bounded SPL for the specimen |
| 3 | get_lookup | route baseline LINE_B_OR3_REC2 (p95 = 310s) |
| 4 | forecast_transit / get_saved_search_result | source = stl_quantile_fallback |
| 5 | reason_over_cited_evidence | band = CRITICAL, action = HOLD + REDRAW, gemini-3.5-flash |
| 6 | write_decision | HEC 200 -> journey_decisions |

6.4 Worked example - the verdict that matters

SPX-7F3A-9C, route LINE_B_OR3_REC2:

  • Transit 2284s vs. a route p95 of 310s - 7.4x the ceiling [tube_telemetry/0x00009]
  • Blower fault -> reroute -> blower fault -> reroute during the carrier journey [tube_telemetry/0x00007]
  • Shock index 0.81, above the handling-risk threshold [tube_telemetry/0x00007]
  • Receiving queue depth 12, oldest wait 380s [queue_state/0x0000D]

Verdict: CRITICAL (21/100) -> HOLD + REDRAW. Downstream QC said PASS. The data looked fine. The journey did not.

And the counter-example, SPX-5D20-1A (route LINE_A_WARD7_REC1): a queue delay that matched the known shift_change_0700 benign context -> NORMAL (92/100) -> RELEASE. This is the false alarm a naive threshold would have fired.
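
The arithmetic behind the CRITICAL case is easy to check. In the sketch below, the transit, p95, and shock-index values are the ones cited above, and the flag names time_risk and shock_risk mirror the candidate_integrity_reviews saved search. The function name and the shock threshold of 0.5 are placeholders, since the real threshold is not stated.

```python
def risk_flags(transit_sec, route_p95_sec, shock_index, shock_threshold):
    """Compare observed values against the route baseline and a shock limit."""
    return {
        "p95_ratio": round(transit_sec / route_p95_sec, 1),
        "time_risk": transit_sec > route_p95_sec,
        "shock_risk": shock_index > shock_threshold,
    }


# SPX-7F3A-9C on LINE_B_OR3_REC2; shock_threshold=0.5 is an assumed value.
flags = risk_flags(2284, 310, 0.81, shock_threshold=0.5)
print(flags)  # {'p95_ratio': 7.4, 'time_risk': True, 'shock_risk': True}
```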


7. Scope and guardrails

  • Synthetic data, real architecture. All telemetry is generated; anomalies are injected as physics with no label the model can read. Swapping in real hospital feeds is a source change, not an architecture change.
  • Never a diagnosis - only a journey-trust signal.
  • Human-in-the-loop - no action executes without an approval, and every action is bounded and reversible.
  • Cited or rejected - the model can reason but cannot introduce an unverifiable fact.

8. Tech stack

| Concern | Choice |
|---|---|
| Platform of record | Splunk (indexes, HEC, REST, SPL, saved searches, lookups, macros) |
| Tool protocol | Model Context Protocol (MCP), stdio JSON-RPC, bearer auth |
| Runtime reasoning | Gemini 3.5 Flash on Vertex AI |
| Agent + bridge | Python standard library only (zero runtime dependencies) |
| Frontend | Vanilla HTML / CSS / JS, served same-origin by the bridge |
| Auth | Splunk scoped tokens; Vertex service-account credentials |
