Inspiration
It's 3am, and your iPhone goes off. A saved search someone wrote two years ago says payment auth is failing, and you spend the next twenty minutes proving it lied. An upstream team renamed a field, your detection started reading a dead one, and payment auth kept working the whole time.
That's schema drift: the slow, silent way a field gets renamed, a sourcetype's volume halves, a forwarder stutters, and your brittle detection starts measuring the wrong thing. Drift causes more than half the pages that wake an on-call SRE. No tool answers the question that decides whether you escalate or go back to sleep: did my measurement instrument break, or the thing it measures?
Every other tool summarizes the alert. None of them ask that question. How do you tell a change in the world apart from a change in your view of the world, fast enough to matter at 3am?
What it does
Driftwood watches the data feeding your detections.
- Profiles the SHAPE of every detection's data. For each saved search it stores a shape-fingerprint: the field set, each field's cardinality (dc()) and null-rate, and the forecasted event-volume per sourcetype. The fingerprint is the detection's instrument calibration.
- Diffs live-vs-baseline the instant an alert fires. A Gemini agent re-profiles the alert's own window over the Splunk 443 REST proxy and diffs it against the stored fingerprint and a volume forecast, using tstats, fieldsummary, and stats dc(), live against the trailing baseline.
- Classifies the page: noise or news. NOISE = a depended-on field disappeared and its cardinality migrated to a new field, but volume sits inside its forecast band (your instrument drifted). NEWS = the field set is intact but volume fell far below its band (a real incident).
- Names the exact drifted field. It reports the specifics: "status disappeared, and new field http_status (dc()=14) carries its exact cardinality, so your detection is reading a renamed field."
- Writes the verdict back into Splunk. Verdict, drifted field, and forecast residual go into index=driftwood_verdicts as a structured event, with the re-runnable SPL attached, so the call stays auditable.
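To make the fingerprint diff concrete, here is a minimal sketch of how a rename could be spotted from two fingerprints. It is not Driftwood's actual code: the FieldProfile, ShapeFingerprint, and RenameCandidate types and the findRename function are hypothetical names, and the match score (smaller dc() divided by larger dc()) is an assumed way to express "carries its exact cardinality". The 0.98 default mirrors the match threshold mentioned under Challenges.

```typescript
// Hypothetical types: the write-up does not publish Driftwood's real ones.
interface FieldProfile {
  dc: number; // distinct-value count, e.g. from stats dc()
  nullRate: number; // fraction of events missing the field, 0..1
}

type ShapeFingerprint = Record<string, FieldProfile>;

interface RenameCandidate {
  goneField: string;
  newField: string;
  match: number; // min(dc) / max(dc); 1 means identical cardinality
}

const hasField = (fp: ShapeFingerprint, name: string): boolean =>
  Object.prototype.hasOwnProperty.call(fp, name);

// Returns the best (gone field, new field) pair whose cardinalities agree
// within the threshold, or null when nothing looks like a rename.
function findRename(
  baseline: ShapeFingerprint,
  live: ShapeFingerprint,
  threshold = 0.98, // assumed to play the role of the configurable match threshold
): RenameCandidate | null {
  const gone = Object.keys(baseline).filter((f) => !hasField(live, f));
  const fresh = Object.keys(live).filter((f) => !hasField(baseline, f));
  let best: RenameCandidate | null = null;
  for (const goneField of gone) {
    for (const newField of fresh) {
      const a = baseline[goneField].dc;
      const b = live[newField].dc;
      if (a <= 0 || b <= 0) continue; // an empty field cannot carry a cardinality
      const match = Math.min(a, b) / Math.max(a, b);
      if (match >= threshold && (best === null || match > best.match)) {
        best = { goneField, newField, match };
      }
    }
  }
  return best;
}
```

With a baseline containing status (dc 14) and a live window containing http_status (dc 14) instead, findRename returns that pair with a match of 1; with identical field sets it returns null.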
The money shot: same alert, opposite verdict, your choice. In the bundled Drift Lab you control which break happened. Rename status → http_status, the brittle search fires, and Driftwood returns NOISE (cardinality match, volume in-band). Undo it and instead kill 80% of the traffic; the same search fires the same page, and Driftwood returns NEWS (field set intact, volume ~80% below band). Only the live field-set/cardinality diff against the live volume residual separates the two, so a scripted demo can't reproduce it. Only the break is seeded; the verdict is always computed live.
How I built it
The console: Next.js (App Router, React 19, TypeScript) on Vercel. Six calm screens (The Shore, The Shape Diff, the Verdict Card, the Fingerprint Library, the Drift Lab, and Settings) in cream-and-ink, no red anywhere, with monospace numerals for every live figure. The server routes run on the Node runtime; the engine is server-only.
The Splunk engine, over the 443 web REST proxy. Splunk Cloud's management port and ACS weren't reachable on the trial, so the whole client talks to the 443 web REST proxy (session login, then /services/search/jobs, /services/receivers/simple, /services/data/indexes). SPL computes every number on real data:
- Field shape: fieldsummary + stats dc() for the field set, cardinality and null-rate, over the alert window vs the trailing baseline.
- Volume: | tstats count by _time span=1m per sourcetype.
- Forecast band: | predict over the per-minute baseline series, real Splunk ML.
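As an illustration of the searches listed above, the sketch below assembles SPL strings and the form body for a oneshot job against /services/search/jobs. The function names (safeToken, volumeSpl, shapeSpl, oneshotBody) and the index and sourcetype values are placeholders I made up, and the exact SPL Driftwood runs is not shown in this write-up, so treat the query text as an approximation. The form parameters (search, exec_mode, output_mode, earliest_time, latest_time) are standard for Splunk's search jobs endpoint.

```typescript
// Reject anything that is not a plain identifier before splicing it into SPL.
function safeToken(value: string): string {
  if (!/^[A-Za-z0-9_:\-]+$/.test(value)) {
    throw new Error(`unsafe SPL token: ${value}`);
  }
  return value;
}

// Per-minute event volume, split by sourcetype (illustrative SPL).
function volumeSpl(index: string): string {
  return `| tstats count where index=${safeToken(index)} by sourcetype, _time span=1m`;
}

// Field set with per-field event count and distinct count (illustrative SPL).
function shapeSpl(index: string, sourcetype: string): string {
  return (
    `search index=${safeToken(index)} sourcetype=${safeToken(sourcetype)} ` +
    `| fieldsummary | fields field, count, distinct_count`
  );
}

// Form body for POST /services/search/jobs in blocking "oneshot" mode.
function oneshotBody(spl: string, earliest: string, latest: string): URLSearchParams {
  return new URLSearchParams({
    search: spl,
    exec_mode: "oneshot",
    output_mode: "json",
    earliest_time: earliest,
    latest_time: latest,
  });
}
```

The same builder can be called twice, once with the alert window and once with the trailing baseline window, so both sides of the diff come from identical SPL.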
The agent's brain: Google Gemini via Vertex AI (gemini-flash-latest). Gemini's job is narrow and load-bearing: take the SPL-computed numbers and name the drifted field + write the one-sentence verdict. It mints its own OAuth token from a service-account key with google-auth-library, with no gcloud at runtime; on Vercel the key rides in as an env var. The NOISE/NEWS label comes from a deterministic gate (classify()) over two real measurements, the field-set/cardinality diff and the volume forecast residual. Gemini reasons over the numbers and can never flip the label. If Gemini is unreachable, a deterministic templated sentence runs and the verdict is identical.
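A deterministic gate of this kind might look like the sketch below. It is a stand-in, not the real classify(): the Measurements shape and the classifyPage name are hypothetical, and the UNCLEAR label for cases that fit neither definition is my assumption, since the write-up only describes NOISE and NEWS.

```typescript
type Label = "NOISE" | "NEWS" | "UNCLEAR"; // UNCLEAR is an assumed catch-all

// Hypothetical input: the two measurements the gate is described as using.
interface Measurements {
  renameMatched: boolean; // a gone field's dc() reappeared on a new field
  fieldSetIntact: boolean; // no depended-on field is missing
  observed: number; // event count in the alert window
  bandLow: number; // lower edge of the forecast band
  bandHigh: number; // upper edge of the forecast band
}

// Pure function of the numbers: no model output is consulted here.
function classifyPage(m: Measurements): Label {
  const inBand = m.observed >= m.bandLow && m.observed <= m.bandHigh;
  const belowBand = m.observed < m.bandLow;
  if (m.renameMatched && inBand) return "NOISE"; // the instrument drifted
  if (m.fieldSetIntact && belowBand) return "NEWS"; // the world changed
  return "UNCLEAR";
}
```

Because the label depends only on its inputs, the model's sentence can be generated afterwards and swapped for a template without affecting the verdict.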
The three Splunk AI primitives, wired with honest seams. The trial didn't have the optional AI apps installed and I couldn't self-service them, so each one sits behind an env/feature seam that activates when present and falls back honestly. The mechanic, the money shot, and the verdict stay unchanged; Driftwood just earns one fewer bonus:
- Splunk MCP Server (app 7931): if /services/mcp answers, the agent routes its Splunk reads through it; otherwise it uses the 443 proxy directly. I also shipped a tiny in-repo MCP server (mcp/server.mjs, JSON-RPC over stdio) that wraps the client as splunk_search / splunk_fieldsummary / splunk_volume tools, so "an agent over MCP" is real either way.
- Cisco Deep Time Series Hosted Model: if HOSTED_MODEL_URL is set it serves as the volume oracle; otherwise the SPL-native | predict band runs as genuinely computed Splunk ML on your real series.
- AI Assistant for SPL: Gemini drafts SPL, then I execute it and prove it returned rows, so unverified SPL never ships.
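The seams above reduce to a couple of small decisions, sketched here under assumptions. HOSTED_MODEL_URL comes from the text; everything else (the resolveSeams and shipOnlyVerified names, the Seams shape, passing the /services/mcp probe result in as a boolean) is hypothetical.

```typescript
type VolumeOracle = "hosted-model" | "spl-predict";
type SplunkRoute = "mcp" | "rest-proxy";

interface Seams {
  volumeOracle: VolumeOracle;
  splunkRoute: SplunkRoute;
}

// Decide which path each optional primitive takes.
// mcpAnswers is assumed to be the outcome of probing /services/mcp beforehand.
function resolveSeams(
  env: Record<string, string | undefined>,
  mcpAnswers: boolean,
): Seams {
  const hostedUrl = (env.HOSTED_MODEL_URL ?? "").trim();
  return {
    volumeOracle: hostedUrl.length > 0 ? "hosted-model" : "spl-predict",
    splunkRoute: mcpAnswers ? "mcp" : "rest-proxy",
  };
}

// Only pass along model-drafted SPL that has actually returned rows.
function shipOnlyVerified(draftSpl: string, rows: unknown[]): string | null {
  return rows.length > 0 ? draftSpl : null;
}
```

Keeping the decision in one pure function makes the fallback easy to test: the same inputs always select the same path, whether or not the optional apps are installed.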
Deployed and verified live. The whole thing runs at driftwood-splunk.vercel.app, reaching Splunk over 443 and Vertex over HTTPS, with secrets server-side only. I drove both money-shot arms on the deployed URL as a stranger would: rename gave NOISE (status → http_status, dc 14 ≡ 14, volume in-band), drop gave NEWS (set intact, ~−81% below band), both written to index=driftwood_verdicts.
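For reference, writing a verdict as a structured event could be sketched as below. The index name and the /services/receivers/simple endpoint come from the text; the Verdict field names, the driftwood_verdict sourcetype, and the verdictRequest helper are assumptions for illustration, and the real event schema is not shown here.

```typescript
// Hypothetical event shape: field names are illustrative.
interface Verdict {
  label: "NOISE" | "NEWS";
  driftedField: string | null; // null when no field drifted
  forecastResidual: number; // e.g. -0.81 for 81% below the expected volume
  spl: string; // the re-runnable search behind the verdict
}

// Build the path and raw body for a POST to the simple receiver.
// The receiver takes index and sourcetype as query parameters.
function verdictRequest(v: Verdict, at: Date): { path: string; body: string } {
  const query = new URLSearchParams({
    index: "driftwood_verdicts",
    sourcetype: "driftwood_verdict", // assumed sourcetype name
  });
  return {
    path: `/services/receivers/simple?${query.toString()}`,
    body: JSON.stringify({ time: at.toISOString(), ...v }),
  };
}
```

Storing the SPL inside the event is what lets a reader of index=driftwood_verdicts paste the search back in and check the numbers.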
Challenges I ran into
- Telling a rename from a coincidence. A field disappearing is too weak a signal, since anything can vanish for a minute. The real signal is a cardinality match: the gone field's dc() migrating onto a newly-appeared field. I gate it on a configurable match threshold (default 0.98) so a near-exact dc() match counts as a rename, and I only let established baseline fields (present in a majority of baseline events) "go gone," which stops an earlier experiment's residue from masking the diff.
- Forecasting a band when the alert window is empty by design. The Drift Lab leaves the alert window empty in the baseline so the live diff stays honest, which means | predict can't forecast straight across it. I run the prediction over the dense recent per-minute series up to the window edge, derive the expected rate + spread from that, and project the band across the window length. Volume-in-band vs volume-below-band then falls out cleanly.
- Serverless has no key file and no shared memory. Vercel can't mount a service-account JSON, so the auth layer reads the raw SA JSON from an env var (VERTEX_SA_JSON) and mints tokens from it. Because the in-process fingerprint store doesn't survive across serverless isolates, the loop transparently re-captures the baseline fingerprint from the calm seeded feed when it isn't in memory, and the verdict stays real.
- Keeping each demo run from bleeding into the next. Every break arm tags its events with a unique per-run source, so the live alert window scopes to exactly the break you just triggered and prior experiments in the same index never dilute the diff. It stays 100% real Splunk data and only isolates the run.
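The band projection in the second bullet can be approximated as follows. This sketch swaps in a plain mean and standard deviation for the output of | predict, so it is a simplification and not the Splunk ML path the project uses; projectBand, residual, the k multiplier, and the assumption that per-minute counts are independent (so spread grows with the square root of the window length) are all mine.

```typescript
interface Band {
  expected: number; // expected event count over the whole window
  low: number;
  high: number;
}

// Project a volume band across an alert window from the per-minute series
// that precedes it. k is how many standard deviations wide the band is.
function projectBand(perMinute: number[], windowMinutes: number, k = 3): Band {
  if (perMinute.length === 0 || windowMinutes <= 0) {
    throw new Error("need a non-empty series and a positive window");
  }
  const n = perMinute.length;
  const mean = perMinute.reduce((sum, x) => sum + x, 0) / n;
  const variance = perMinute.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  const sd = Math.sqrt(variance);
  const expected = mean * windowMinutes;
  // Assumes independent minutes: spread of a sum scales with sqrt(length).
  const spread = k * sd * Math.sqrt(windowMinutes);
  return { expected, low: Math.max(0, expected - spread), high: expected + spread };
}

// Relative forecast residual: -0.8 means 80% below the expected volume.
function residual(observed: number, band: Band): number {
  return (observed - band.expected) / band.expected;
}
```

For a steady series of 100 events per minute and a 5-minute window, the expected volume is 500; an observed count of 100 gives a residual of -0.8, which lands far below the band.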
Accomplishments that I'm proud of
- A verdict you can check. Every call ships the exact re-runnable SPL; paste it into your own search bar and watch it return the same rows. No verdict ships without a re-runnable search behind it.
- The decision stays computed. NOISE/NEWS is a deterministic gate over SPL-computed numbers, and the LLM only narrates. The verdict is pinned to the stranger's own data shape.
- Alert-to-verdict in ~2 seconds end to end on the live deployment, with drifted-field localization down to the exact renamed field.
- It works on the core Splunk every team has, with no ITSI, no Enterprise Security, no premium tier. Just saved searches, the 443 proxy, and the data.
What I learned
- An alert that fires is evidence the data feeding the detection changed. Reframing the whole problem around shape rather than symptoms made the noise/news split tractable.
- Let SPL compute, let the model narrate. An LLM earns its place in an ops loop by reasoning over the numbers while the gate decides them. A deterministic gate plus a model that names things outperforms a model that judges.
- Honest seams beat fake demos. Building the real path behind a feature flag, with a genuine fallback that produces the identical verdict, meant a missing trial app cost me a bonus and never the money shot.
What's next for Driftwood
- Auto-heal the detection: when Driftwood proves a rename, propose the one-line SPL patch (rename http_status as status) via AI Assistant for SPL so the brittle search self-repairs.
- Drift radar before the page: run the shape-fingerprint diff on a schedule, not just on alert, and kill the 3am page the hour the upstream rename lands.
- Confidence-scored verdicts: publish a calibrated noise/news confidence from the cardinality-match strength and the forecast residual.
- A security sibling: run the same shape-diff on threat detections ("your rule went silent because the sourcetype dropped, not because the attacker stopped"), paired with the Foundation-Sec hosted model.
The bigger picture
Alert fatigue is the number-one driver of on-call burnout, and it's why real incidents get missed inside the noise. Driftwood attacks a root cause nobody else names, schema drift breaking detections and masquerading as outages, and it hands the half-asleep on-call the one fact that ends the worst twenty minutes of the night.
Noise is when your view of the world changed. News is when the world did. Tell noise from news. Driftwood.
Built With
- cisco-deep-time-series
- css
- fieldsummary
- gemini
- gemini-flash-latest
- google-auth-library
- javascript
- mcp
- model-context-protocol
- next.js
- node.js
- react
- rest-api
- spl
- splunk
- splunk-ai-assistant-for-spl
- splunk-cloud
- splunk-hosted-models
- splunk-mcp-server
- splunk-ml-predict
- tstats
- typescript
- vercel
- vertex-ai