Project Story
Inspiration
Every SRE has lived this moment: it's 2 a.m., an alert fires, and the first 20 minutes of the incident are spent re-discovering what "normal" looks like. You squint at last week's dashboards. You grep for templates you swear you've seen before. You ask the on-call channel "wait, has PaymentGatewayTimeout always been in our logs?" and three engineers give three different answers.
The frustrating part isn't the incident — it's that the answer existed three weeks ago and nobody saved it. Splunk had the raw events, but raw events age out of retention. The institutional memory ("oh, that error always spikes after a payment-svc deploy") lived in one engineer's head, and that engineer was on PTO.
I wanted a tool that treats "what healthy looks like" as a named, versioned, queryable artifact — not a vibe. And once you have that artifact, the next obvious move is to let the agent remember every incident it's seen, so by the third recurrence it's already saying "this looks like drift #a1b2c3 — payment-svc deploy v2.14.1, rolled back. Want to check the deploy log?"
That's Anchor. The pitch in one line: turn observability state into memory the agent can build on.
What I learned
A few things stuck:
- LLMs are great narrators but terrible witnesses. Asking GPT/Qwen "compare last week to this week" without giving it structured data is a hallucination factory. The right division of labor is: deterministic SPL + math computes the diff, and the LLM only translates the ranked diff into prose. Every number in Anchor's report can be traced back to a query — the LLM never invents a metric.
- Persistence is the whole game for "agentic memory." A chatbot that forgets every session is just autocomplete. The moment you write structured artifacts (fingerprints, drift records, signal weights) into a store the agent can query next time, behavior changes qualitatively: it stops re-discovering and starts accumulating.
- Decay matters as much as recall. My first weight-update logic only went up. After ~50 simulated incidents, every signal had weight ≈ 3.0 and the ranking collapsed. Adding exponential decay back toward 1.0 (with a 30-day half-life) was the difference between a useful prior and a useless one. The agent has to forget on a schedule, the same way humans do.
- The CLI is the API. Building this as a click tool first, instead of a web UI, forced every feature to be composable and scriptable. anchor capture → anchor compare → anchor feedback → anchor learned is the entire mental model, and it fits in a tweet.
- Splunk KV Store is a genuinely nice agent backing store. Schema-less, REST-accessible, replicates with your search head, and survives long after raw events age out. For a hackathon project that wanted "memory" without bolting on Postgres, it was a perfect fit.
How I built it
Anchor is ~1,500 lines of Python organized around a six-stage pipeline:
flowchart LR
A[anchor capture] --> B[Splunk SPL]
B --> C[Fingerprint<br/>volume · templates ·<br/>errors · metrics]
C --> D[(KV Store<br/>anchors)]
E[anchor compare] --> B
E --> F[diff vs anchor<br/>+ learned weights]
F --> G[Qwen / Gemini<br/>narrator]
G --> H[Report:<br/>summary · hypothesis<br/>· top diffs · drill-in SPL]
H --> I[(KV Store<br/>drift_history)]
J[anchor feedback] --> K[update<br/>signal_weights] --> D
The pieces:
- fingerprint.py: runs four SPL queries over a time window (per-source volume, templated log shape via cluster, error breakdown, metric percentiles) and packs the results into a Pydantic model. The same code path runs for the anchor and the compared window, so apples-to-apples is enforced by construction.
- diff.py: pure functions. Diffs two fingerprints, computes (absolute_delta, percent_change) per signal, classifies severity, applies the learned weight, and returns a ranked list. Zero side effects, fully unit-tested.
- memory.py: the heart of the MemoryAgent loop. recall_similar_drifts uses Jaccard overlap on signal sets (threshold 0.15, capped at top-3) to surface past confirmed incidents during a new compare. bump_appearance / apply_feedback / decay_weights maintain the weight table. Decay is throttled to once per hour to keep compare fast.
- narrator.py: builds a strict prompt template ($V_2$) with the structured diff and recalled past incidents as context. The LLM only sees pre-ranked, pre-quantified evidence and is explicitly told "do not invent metrics."
- cli.py: Click commands: capture, compare, feedback, learned, blind-spots, history, delete-drift, purge-drifts.
- splunk_client.py: thin wrapper over splunklib, with a module-level connection cache and the six KV primitives (kv_all, kv_get, kv_insert, kv_update, kv_query, kv_delete).
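The recall step in memory.py can be illustrated with a minimal sketch. This is not the project's code: the function name recall_similar_sketch, the record shape (a dict with "id" and "signals"), and the assumption that the input list already contains only confirmed drifts are all hypothetical. Only the Jaccard measure, the 0.15 threshold, and the top-3 cap come from the description above.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def recall_similar_sketch(current_signals, past_drifts, threshold=0.15, top_k=3):
    """Return up to top_k past drifts whose signal sets overlap the current one.

    past_drifts is assumed (hypothetically) to be a list of dicts shaped like
    {"id": str, "signals": list[str]} that already holds only confirmed drifts.
    """
    scored = []
    for drift in past_drifts:
        similarity = jaccard(current_signals, drift["signals"])
        if similarity >= threshold:
            scored.append((similarity, drift))
    # Highest overlap first; sorted() is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [drift for _, drift in scored[:top_k]]


current = {"error:PaymentGatewayTimeout", "volume:payment-svc", "metric:latency_p95"}
history = [
    {"id": "a", "signals": ["error:PaymentGatewayTimeout", "volume:payment-svc"]},
    {"id": "b", "signals": ["volume:auth-svc"]},
]
matches = recall_similar_sketch(current, history)
```

Here drift "a" shares two of three signals with the current window (overlap 2/3) and is recalled, while drift "b" shares none and is filtered out by the threshold.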
The math the ranker uses, for the curious:
$$ \text{score}(s) = w_s \cdot \text{severity}(s) \cdot \log\!\bigl(1 + |\Delta_s|\bigr) $$
where $w_s$ is the learned per-signal weight that decays toward $1.0$ at rate $\lambda = \ln 2 / 30\text{ days}$ and gets bumped on every confirmed drift:
$$ w_s \leftarrow w_s \cdot (1 + \alpha \cdot \mathbb{1}[\text{outcome} = \text{resolved}]),\quad \alpha = 0.15 $$
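As a rough illustration, the two formulas above translate into a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, severity is taken to be a plain number, and the decay form $w \leftarrow 1 + (w - 1)\,e^{-\lambda t}$ is one natural reading of "decays toward 1.0 with a 30-day half-life", not necessarily the exact update Anchor uses.

```python
import math

HALF_LIFE_DAYS = 30.0
DECAY_RATE = math.log(2) / HALF_LIFE_DAYS  # lambda = ln 2 / 30 days
ALPHA = 0.15  # bump factor applied when a drift's outcome is "resolved"


def score(weight: float, severity: float, delta: float) -> float:
    """score(s) = w_s * severity(s) * log(1 + |delta_s|)."""
    return weight * severity * math.log1p(abs(delta))


def bump(weight: float, resolved: bool) -> float:
    """Multiply the weight by (1 + ALPHA) only when the outcome was resolved."""
    return weight * (1.0 + ALPHA) if resolved else weight


def decay(weight: float, days_elapsed: float) -> float:
    """Assumed decay form: the distance from 1.0 halves every HALF_LIFE_DAYS."""
    return 1.0 + (weight - 1.0) * math.exp(-DECAY_RATE * days_elapsed)
```

Under this reading, a weight of 3.0 left untouched for 30 days falls to 2.0, and after 60 days to 1.5, so a signal that stops being confirmed drifts back toward a neutral prior instead of staying inflated.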
Hosting is a docker-compose Splunk + the Anchor CLI; production deploy notes target Alibaba Cloud (ECS for Splunk, OSS for nightly KV backups, Qwen/DashScope as the default LLM).
Challenges I faced
Cold-start UX. The first compare in a fresh install has no prior drifts to recall, and recall is the whole story of the demo. The fix was two seeded log files (healthy.log + drifted.log) with deterministic anomalies, plus a demo script that splits the drifted window in half: the first half builds the memory, the second half shows recall firing. Without that scripting, the killer feature was invisible.
LLM hallucinations leaking into the report. Early prompts let Qwen "improve" the diff by mentioning signals that weren't in the input. I clamped the prompt to only operate on the structured payload, added a PROMPT_VERSION constant, and now the report's numbers are guaranteed traceable to the SPL queries.
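One way to picture the clamped prompt is a builder that serializes only the structured payload and states the constraint explicitly. The wording of the instructions, the payload keys, and the value of PROMPT_VERSION below are all assumptions made for illustration; only the idea of a versioned template fed with pre-ranked diffs and recalled incidents comes from the write-up.

```python
import json

PROMPT_VERSION = "v2"  # assumed value; the source only says a version constant exists


def build_prompt_sketch(ranked_diffs, recalled_drifts):
    """Build a narrator prompt that contains nothing but the structured evidence."""
    payload = {
        "prompt_version": PROMPT_VERSION,
        "ranked_diffs": ranked_diffs,
        "recalled_drifts": recalled_drifts,
    }
    instructions = (
        "Summarize the drift described by the JSON below.\n"
        "Use only the signals and numbers present in the JSON.\n"
        "Do not invent metrics.\n"
    )
    # sort_keys keeps the prompt byte-stable for identical inputs.
    return instructions + json.dumps(payload, indent=2, sort_keys=True)


prompt = build_prompt_sketch(
    ranked_diffs=[{"signal": "error:PaymentGatewayTimeout", "percent_change": 340.0}],
    recalled_drifts=[],
)
```

Because the payload is the only source of numbers in the prompt, any figure in the generated report can be checked against it after the fact.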
Weight blow-up. Mentioned above — solved by adding decay and a throttle so a noisy day of compares doesn't trigger 50 redundant decay passes.
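The throttle can be sketched as a pure check on the timestamp of the last decay pass. The function names and the idea of passing timestamps in as seconds are assumptions; in a CLI, where each command is a fresh process, the last-run timestamp would have to be persisted somewhere (for example next to the weights) for this check to work.

```python
DECAY_MIN_INTERVAL_SECONDS = 3600.0  # at most one decay pass per hour


def decay_due(last_decay_at, now, min_interval=DECAY_MIN_INTERVAL_SECONDS):
    """True when no decay pass has run yet or the last one is at least an hour old."""
    return last_decay_at is None or (now - last_decay_at) >= min_interval


def count_decay_passes(compare_timestamps):
    """Simulate a series of compares and count how many trigger a decay pass."""
    last_decay_at = None
    passes = 0
    for now in compare_timestamps:
        if decay_due(last_decay_at, now):
            passes += 1
            last_decay_at = now
    return passes


# Fifty compares, one per minute: only the first one triggers a decay pass.
noisy_day = [60.0 * i for i in range(50)]
passes_on_noisy_day = count_decay_passes(noisy_day)
```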
Equal/inverted compare windows returning cryptic HTTP 400 from Splunk. Fixed by validating start < end at the CLI boundary with a clear error message, before any backend call. Lesson: catch input errors at the trust boundary, not three layers down.
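A boundary check of this kind takes only a few lines. The sketch below uses a hypothetical validate_window helper and a plain ValueError so it runs with the standard library alone; inside a Click command the same condition could instead raise click.UsageError to get Click's own error formatting.

```python
from datetime import datetime


def validate_window(start: datetime, end: datetime) -> None:
    """Reject equal or inverted windows before any query is sent to the backend."""
    if start >= end:
        raise ValueError(
            f"invalid window: start ({start.isoformat()}) must be earlier "
            f"than end ({end.isoformat()})"
        )


def window_error(start: datetime, end: datetime):
    """Return the error message for a bad window, or None when it is valid."""
    try:
        validate_window(start, end)
    except ValueError as exc:
        return str(exc)
    return None


monday = datetime(2024, 1, 1)
tuesday = datetime(2024, 1, 2)
```

Calling window_error(tuesday, monday) returns a readable message naming both timestamps, which is what the user sees instead of an HTTP 400 from three layers down.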
Connection churn against Splunk. Every command was opening a new HTTPS session — fine in dev, painful in a recorded demo. Module-level cached connection with a reset_connection() escape hatch fixed it.
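The caching pattern looks roughly like the sketch below. The factory argument is a stand-in introduced for illustration so the example runs without a Splunk instance; in the real client it would presumably be a call such as splunklib.client.connect with the project's credentials. Only the module-level cache and the reset_connection() name come from the write-up.

```python
_connection = None  # module-level cache shared by every caller in the process


def get_connection(factory):
    """Return the cached connection, creating it with factory() on first use."""
    global _connection
    if _connection is None:
        _connection = factory()
    return _connection


def reset_connection():
    """Escape hatch: drop the cached connection so the next call reconnects."""
    global _connection
    _connection = None


# Demonstration with a fake factory that counts how often it is invoked.
open_calls = []


def fake_connect():
    open_calls.append(1)
    return object()


first = get_connection(fake_connect)
second = get_connection(fake_connect)  # served from the cache, no new session
reset_connection()
third = get_connection(fake_connect)   # reconnects after the reset
```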
Time pressure on the 4-minute demo. Solo hackathon → the demo is the project. I rebuilt the script three times before landing on a 3:30 runtime that still showed capture → diagnose → feedback → recall as four distinct beats. Recall is the moment the audience sees memory is real; if you cut it, you're just demoing a fancy diff tool.
Anchor — a MemoryAgent for SRE incident response. Splunk + Qwen + Alibaba Cloud, MIT-licensed, at github.com/faketut/Anchor.