Inspiration

Watch a maintainer's review history and you'll see the same thing, MR after MR: a link to the styleguide, a correction typed by hand, a rule explained to someone who'd never have found it. The rule is often written down. It still doesn't transfer — because knowing a convention and being able to hand someone the rule are different things. Michael Polanyi called it tacit knowledge: we know more than we can tell. A senior maintainer spots a weak title or an un-idiomatic metric name instantly, but couldn't always write you the rule. So it lives in their head, transferred one tired comment at a time.

That's the real bottleneck between a contributor's first MR and their first merge — not the code, the conventions no one ever wrote down.

What if a project's own review history could surface that tacit knowledge and put it to work — greeting every contributor on day one, instead of living only in a maintainer's head?

What it does

Lore looks like a code-review bot. It's really a project's tacit knowledge, made teachable. It mines years of reviewer comments to infer the conventions a project enforces in practice, then pre-reviews every new MR against them — posting evidence-backed feedback minutes after the webhook fires, before a human arrives.

  • Discovers conventions from behavior. It mines merged-MR history and extracts the conventions maintainers repeat, each backed by real reviewer quotes with MR attribution.
  • Reviews in real time, three ways. It flags judgment calls, applies mechanical fixes, or recuses to a recommended CI gate — and cites a real historical MR for every one.
  • Learns from correction. A maintainer's 👎 lowers a convention's confidence until it goes quiet, and the dismissal is stored and recalled on the next similar MR.

Pointed at gitaly, a codebase it had never seen, Lore surfaced 7 conventions from 66 MRs in about 30 minutes for ~$5. Any project with a review history has conventions like these; that 30-minute, $5 run is the entire cost of onboarding one.

Try it out

  • Code: gitlab.com/mbrazinski-group/lore — agent, scripts, docs, 191 tests, MIT licensed.
  • Live demo: gitlab.com/mbrazinski-group/gitlab-runner — open MR #1 to watch Lore flag a title, apply a label, flag a missing test, and recuse on a CI-tier rule; MR #2 shows an all-clear on a clean MR.
  • Verify it's real: the gitaly rule about idiomatic Prometheus metric names is backed by the verbatim reviewer quote in MR !8515 — open it and compare.
  • Read the core insight: docs/graduation-insight.md.

How we built it

The stack is what judges expect; the eval journey is what makes Lore trustworthy — so it goes first. The first working agent flagged everything: 147 false positives across six conventions, completely untrustworthy. Driven by measurement, not vibes, Lore's hardest convention — title quality — went from 87 false positives to 3. Here's the journey.

Discovery. Conventions live in reviewer comments, not config files — the same fix, requested over and over, is the signal. A five-stage pipeline extracts reviewer comments from merged MRs, classifies them, clusters the repeats, synthesizes each cluster into a convention, and reports a dashboard. On gitlab-runner it found six conventions; the strongest signal was raw repetition — the documentation styleguide correction appeared on 14 distinct MRs across five years. On gitaly, never seen before, it found seven — each citing the MRs its evidence came from.
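
The clustering step above can be sketched as follows. This is a minimal illustration, not Lore's pipeline: the `ReviewComment` type, the `topic` field (standing in for the output of the classification stage), the `min_distinct_mrs` threshold, and the sample comments are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    mr_iid: int   # merge request the comment was left on
    topic: str    # hypothetical label from the classification stage
    quote: str    # verbatim reviewer text, kept as evidence


def cluster_repeats(comments, min_distinct_mrs=3):
    """Group comments by topic and keep topics repeated across enough MRs."""
    by_topic = defaultdict(list)
    for c in comments:
        by_topic[c.topic].append(c)

    candidates = []
    for topic, group in by_topic.items():
        # Count distinct MRs, so one chatty thread does not inflate the signal.
        distinct_mrs = {c.mr_iid for c in group}
        if len(distinct_mrs) >= min_distinct_mrs:
            candidates.append({
                "topic": topic,
                "mr_count": len(distinct_mrs),
                "evidence": [(c.mr_iid, c.quote) for c in group],
            })
    # Strongest signal first: raw repetition across distinct MRs.
    return sorted(candidates, key=lambda x: x["mr_count"], reverse=True)


# Made-up sample data, not real reviewer quotes.
comments = [
    ReviewComment(101, "docs-styleguide", "Please follow the docs styleguide here."),
    ReviewComment(102, "docs-styleguide", "Styleguide: use sentence case."),
    ReviewComment(102, "docs-styleguide", "Same styleguide note as above."),
    ReviewComment(103, "docs-styleguide", "See the styleguide on link text."),
    ReviewComment(104, "one-off", "Typo in this variable name."),
]
candidates = cluster_repeats(comments)
```

Counting distinct MRs rather than raw comments mirrors the "14 distinct MRs" measure used for the styleguide convention, and keeping the `(mr_iid, quote)` pairs is what lets each convention cite its evidence.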

The eval. False positives erode trust faster than false negatives ever do, so we built a harness that scores precision and recall per convention against 96 hand-labeled holdout MRs. Three stages drove the number down. Stage 1: five prompt variants on a subset; a two-pass reasoning structure generalized best. Stage 2: the full holdout exposed the real lever. No prompt could fix test-coverage false positives because the model was guessing whether a test file existed, so Lore stopped guessing and started fetching the directory's file tree. Input beats prompting. Stage 3: the stronger model (Gemini 3.1 Pro) surfaced a humbling twist: some "false positives" were real violations the maintainer never corrected. Rather than regrade our own ground truth, we kept the stricter baseline. A final SCREEN→CONFIRM two-pass prompt at temperature 0.2 collapsed run-to-run variance from wildly unstable to nearly fixed. The remaining disagreements are the same edge cases a human would argue about, but at least Lore is consistent on them.

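| Convention        | Tier       | Precision | Recall | Note                                              |
|-------------------|------------|-----------|--------|---------------------------------------------------|
| Title quality     | Judgment   | 0.667     | 0.857  | 3 false positives, down from 87                   |
| Label correctness | Mechanical | 0.500     | 0.800  | "is a type:: label present?"                      |
| Test coverage     | Structural |           |        | correctly silent, 0 violations in the holdout     |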

Label-correctness at 0.50 is the honest number — and per the graduation insight below, it's a mechanical check that belongs in CI, not the LLM. The headline isn't a one-run artifact either: a separate, clean 96-MR run reproduces it exactly against the published ground truth.
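
Per-convention scoring of this kind reduces to set arithmetic over MR ids. The sketch below shows the idea on toy data; the function name, the dictionary layout, and the MR ids are illustrative assumptions, not the harness itself.

```python
def score(flagged, violations):
    """Precision and recall for one convention.

    flagged:    set of MR ids the agent flagged.
    violations: set of MR ids hand-labeled as real violations.
    A metric whose denominator is zero is reported as None.
    """
    tp = len(flagged & violations)   # flagged and truly a violation
    fp = len(flagged - violations)   # flagged but not a violation
    fn = len(violations - flagged)   # a violation the agent missed
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy labels and predictions (hypothetical MR ids, not the 96-MR holdout).
labels = {"title-quality": {1, 2, 3, 4}, "test-coverage": set()}
predictions = {"title-quality": {2, 3, 4, 9}, "test-coverage": set()}

report = {name: score(predictions[name], labels[name]) for name in labels}
```

A convention with no labeled violations and no flags has undefined precision and recall, which is why a "correctly silent" row carries no numbers.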

The graduation insight. Running those evals revealed something we didn't design — conventions sort into three tiers, and the tier dictates how much LLM a convention needs:

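| Tier             | Example                          | What it needs                                                                       | Where it belongs                |
|------------------|----------------------------------|-------------------------------------------------------------------------------------|---------------------------------|
| Mechanical       | Is a type:: label present?       | A set intersection; the LLM adds variance, not accuracy                             | Graduates to CI                 |
| Structural       | Does new code have a test file?  | Repository context: tool augmentation (the file-tree fetch that beat every prompt)  | Agent + tools                   |
| Judgment (tacit) | Is this title changelog-quality? | Irreducibly subjective reasoning                                                    | The LLM, where the ≈29× lives   |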

The payoff is a cost curve that bends down: as mechanical conventions are identified, they graduate out of the model into CI rules, so per-review cost falls as a project matures. The strongest-mined convention — the styleguide, 14 MRs — is one Lore deliberately keeps silent, because it's mechanical. The tiers run explicit to tacit: what can be codified graduates to CI; what can't is exactly where Lore earns its keep.
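
To make the mechanical tier concrete, here is roughly what a graduated check could look like once it leaves the model. The set of accepted labels is an assumption for illustration; a real project would load its own list.

```python
# Assumed label set for illustration only.
TYPE_LABELS = frozenset({"type::bug", "type::feature", "type::maintenance"})


def missing_type_label(mr_labels):
    """True when the MR carries none of the known type:: labels."""
    return not (TYPE_LABELS & set(mr_labels))


needs_label = missing_type_label(["backend", "documentation"])
already_labeled = missing_type_label(["type::bug", "backend"])
```

The whole check is one set intersection with a deterministic answer, so running it in CI costs nothing per review and cannot vary between runs.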

The stack.

        ┌──────────┐   webhook    ┌─────────────────────────────┐
        │ GitLab MR│ ───────────► │ FastAPI receiver (Cloud Run)│
        └──────────┘              │ enqueue → return 200 fast   │
              ▲                    └──────────────┬──────────────┘
   REST writes│                       async       │
   • threads  │                      trampoline   ▼
   • labels   │                              ┌───────────┐
   • summary  │                              │Cloud Tasks│
              │                              └─────┬─────┘
              │                      /process      ▼  (20–25s review)
              │                      ┌──────────────────────────────┐
              └───────────────────── │ Agent — Google ADK            │
                                     │ reasoning: Gemini 3.1 Pro     │
                                     └──────┬─────────────────┬──────┘
                            reads via MCP    │                 │ persists
                                  ┌──────────▼─┐      ┌────────▼──────────────┐
                                  │ GitLab MCP │      │ Vertex AI Memory Bank │
                                  │ MR+history │      │ corrections +         │
                                  └────────────┘      │ confidence values     │
                                                      └───────────────────────┘
   Secret Manager — stores & rotates the GitLab OAuth credentials

| Layer     | Component                          | Job                                                                                  |
|-----------|------------------------------------|--------------------------------------------------------------------------------------|
| Ingress   | FastAPI on Cloud Run               | Receives the webhook, enqueues, returns 200 instantly                                |
| Queue     | Cloud Tasks                        | Deduplicating async trampoline so a 20–25s review never trips the webhook timeout    |
| Reasoning | Google ADK + Gemini 3.1 Pro Preview | The two-pass SCREEN→CONFIRM review and convention reasoning                          |
| Read      | GitLab MCP Server                  | Verified reads: citation enrichment, label validation                                |
| Write     | GitLab REST                        | Resolvable threads, label quick-actions, the summary table                           |
| Memory    | Vertex AI Memory Bank              | Persists maintainer corrections and confidence values                                |
| Secrets   | Secret Manager                     | Stores and rotates the GitLab OAuth credentials                                      |

Lore runs on Google ADK with Gemini 3.1 Pro Preview. A GitLab webhook hits a FastAPI receiver on Cloud Run, which enqueues to Cloud Tasks and returns 200. The queue acts as a deduplicating trampoline (keyed on event_uuid + mr_iid), so a 20–25s review never trips GitLab's webhook timeout. The diff is fetched over REST so the prompt is fully populated before the model runs. Three layers carry their own story:
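
The enqueue-and-return pattern can be sketched without any cloud dependencies. Everything below is a stand-in: `InMemoryQueue` plays the role of Cloud Tasks, and `task_name` and `handle_webhook` are hypothetical names. The only detail taken from the description above is that deduplication uses event_uuid and mr_iid.

```python
import hashlib


def task_name(event_uuid: str, mr_iid: int) -> str:
    """Deterministic name, so a redelivered webhook maps to the same task."""
    raw = f"{event_uuid}:{mr_iid}".encode()
    return "review-" + hashlib.sha256(raw).hexdigest()[:32]


class InMemoryQueue:
    """Stand-in for a task queue that rejects duplicate task names."""

    def __init__(self):
        self._seen = set()
        self.pending = []

    def enqueue(self, name: str, payload: dict) -> bool:
        if name in self._seen:
            return False          # duplicate delivery, drop it
        self._seen.add(name)
        self.pending.append(payload)
        return True


def handle_webhook(queue: InMemoryQueue, event_uuid: str, mr_iid: int) -> int:
    """Enqueue the review and answer immediately; the review runs later."""
    queue.enqueue(task_name(event_uuid, mr_iid), {"mr_iid": mr_iid})
    return 200


queue = InMemoryQueue()
first = handle_webhook(queue, "evt-abc", 7)
retry = handle_webhook(queue, "evt-abc", 7)   # same event redelivered
other = handle_webhook(queue, "evt-abc", 8)   # different MR
```

The handler returns 200 whether or not the task was new, so a retried delivery is acknowledged but never triggers a second review.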

Vertex AI Memory Bank turns corrections into durable, retrievable context. When a maintainer dismisses a flag, Lore writes it to project-scoped memory and retrieves it by similarity on the next MR — so a dismissal on "Fix cleanup bug" surfaces on a later "Fix config bug," changing a live decision. Only a maintainer's correction becomes a memory (verified by project role), so a contributor can never poison what Lore learns.
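
A toy version of that write-then-recall loop, under stated assumptions: token overlap (Jaccard) stands in for Memory Bank's similarity retrieval, the 0.5 threshold is arbitrary, and the role gate uses GitLab's numeric access level for Maintainer (40). None of the names below come from Lore's code.

```python
MAINTAINER_ACCESS = 40  # GitLab's numeric access level for the Maintainer role


def _tokens(title: str) -> frozenset:
    return frozenset(title.lower().split())


def _jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CorrectionMemory:
    """In-memory stand-in for a project-scoped memory store."""

    def __init__(self):
        self._items = []  # (title_tokens, convention, note)

    def record_dismissal(self, access_level, mr_title, convention, note):
        # Only maintainers may teach the system; everyone else is ignored.
        if access_level < MAINTAINER_ACCESS:
            return False
        self._items.append((_tokens(mr_title), convention, note))
        return True

    def recall(self, mr_title, convention, threshold=0.5):
        query = _tokens(mr_title)
        return [note for tokens, conv, note in self._items
                if conv == convention and _jaccard(query, tokens) >= threshold]


memory = CorrectionMemory()
rejected = memory.record_dismissal(30, "Fix cleanup bug", "title-quality",
                                   "contributor says this is fine")
accepted = memory.record_dismissal(40, "Fix cleanup bug", "title-quality",
                                   "maintainer accepted short bug-fix titles")
similar = memory.recall("Fix config bug", "title-quality")
unrelated = memory.recall("Add retry metrics", "title-quality")
```

"Fix cleanup bug" and "Fix config bug" share two of four distinct tokens, so the dismissal is recalled for the second title but not for an unrelated one.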

GitLab MCP Server provides the verified ground truth that prevents broken citations and phantom labels — fetching a cited MR's real title, and confirming a label exists before Lore applies it. Both reads run after detection and fail open, so a slow read never blocks a review; writes go over REST, which the official MCP server doesn't expose.
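
Fail-open behavior can be sketched as a small wrapper around any read. This is a generic pattern, not the MCP client: `fail_open`, the timeout value, and the two lookup functions are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def fail_open(read, fallback, timeout_s=2.0):
    """Run read(); on timeout or any error, return fallback instead."""
    future = _pool.submit(read)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already running call is not interrupted
        return fallback
    except Exception:
        return fallback


def slow_title_lookup():
    time.sleep(0.3)           # simulates a slow enrichment read
    return "real title"


def broken_lookup():
    raise ConnectionError("read failed")


fast = fail_open(lambda: "verified title", None)
slow = fail_open(slow_title_lookup, None, timeout_s=0.05)
broken = fail_open(broken_lookup, "unverified")
```

Because the fallback is returned on both timeout and error, the review proceeds with unenriched data rather than waiting on a read.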

Secret Manager holds the credentials for the MCP server, which accepts OAuth only (a personal access token returns 403); Lore bootstraps and rotates the OAuth credentials there.

Challenges we ran into

Ground-truth circularity. When the agent found genuine violations the maintainer had never corrected, we could have reclassified them, and our precision would have jumped to 0.89. We didn't. Reclassifying would rest on our own judgment of what "should" have been flagged, the exact circularity that makes a number indefensible. We publish the stricter 0.67.
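
The gap between the two figures is easy to reproduce. Only the 3 false positives and the two precision values are reported above; the true-positive count and the number of relabeled MRs below are inferred assumptions, chosen because they are consistent with those figures.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


FP_REPORTED = 3  # from the results table
TP_INFERRED = 6  # assumption: 6 / (6 + 3) = 0.667 matches the published precision
MOVED = 2        # assumption: false positives relabeled as true positives

published = precision(TP_INFERRED, FP_REPORTED)
reclassified = precision(TP_INFERRED + MOVED, FP_REPORTED - MOVED)
print(f"{published:.3f} -> {reclassified:.2f}")  # prints "0.667 -> 0.89"
```

Under these assumptions, relabeling just two MRs moves the headline by more than 0.2, which shows how sensitive the number is to who grades the ground truth.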

The principle was right; the plumbing wasn't. Live verification on a real MR caught that the context enricher had never actually worked in production — it parsed the wrong diff-header format and silently returned no context, invisible because the eval harness shared the same bug. We fixed it; the agent now reasons about real file trees.
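
One way to keep this class of bug from staying invisible is to make the parser fail loudly when a non-empty diff yields no files. The sketch below assumes unified diffs with `diff --git` or `+++ b/` headers; which header format the original enricher mishandled is not stated, so the patterns here are illustrative.

```python
import posixpath
import re

_GIT_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")
_NEW_FILE = re.compile(r"^\+\+\+ b/(.+)$")


def changed_dirs(diff_text: str) -> list:
    """Directories touched by a diff, for a later file-tree fetch."""
    paths = set()
    for line in diff_text.splitlines():
        m = _GIT_HEADER.match(line) or _NEW_FILE.match(line)
        if m:
            # The last group is the new-side path in both patterns.
            paths.add(m.group(m.lastindex))
    if diff_text.strip() and not paths:
        # Fail loudly: an empty result here means the format was not understood.
        raise ValueError("diff was non-empty but no file headers were recognized")
    return sorted({posixpath.dirname(p) or "." for p in paths})


sample = (
    "diff --git a/helpers/log.go b/helpers/log.go\n"
    "--- a/helpers/log.go\n"
    "+++ b/helpers/log.go\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)
dirs = changed_dirs(sample)
```

Raising on an unrecognized format turns "silently returned no context" into an error that both the eval harness and production would surface.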

Write-only is a trap. Two "finished" features updated state that nothing read: the confidence dial moved on a 👎 but the gate never consulted it, and Memory Bank wrote memories that nothing retrieved. We wired both read paths. Verifying that a feature runs is not the same as verifying it changes a decision.
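
The difference between a write path and a wired read path fits in a few lines. The starting confidence, step size, and threshold are made-up values, and `Convention`, `thumbs_down`, and `should_flag` are hypothetical names.

```python
class Convention:
    def __init__(self, name: str, confidence: float = 0.9):
        self.name = name
        self.confidence = confidence


def thumbs_down(conv: Convention, step: float = 0.2) -> None:
    """Write path: a maintainer's dismissal lowers confidence."""
    conv.confidence = max(0.0, round(conv.confidence - step, 2))


def should_flag(conv: Convention, threshold: float = 0.5) -> bool:
    """Read path: the gate consults confidence before posting a flag."""
    return conv.confidence >= threshold


title_rule = Convention("title-quality")
before = should_flag(title_rule)
for _ in range(3):
    thumbs_down(title_rule)
after = should_flag(title_rule)
```

A test that only checks `title_rule.confidence` dropped would pass even with the gate unwired; asserting that `should_flag` flips is what verifies the feature changes a decision.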

Accomplishments we're proud of

Three numbers tell it: a ≈29× false-positive reduction on the hardest convention, 191 tests, ~16,000 lines built solo (2,400 runtime, 3,500 tests).

Beyond the numbers: every prediction cites a real MR from the project's own history, carrying that MR's verified title. Three action modes — flag, apply, recuse — each backed by evidence. And discrimination as a first-class feature: of six conventions, Lore flags or fixes three, recuses on one, and stays silent on two. Choosing not to flag is as much the architecture as flagging.

What we learned

  • The eval is the product — without a precision number, "improving the prompt" is superstition.
  • Input beats prompting for structural rules — the file tree outperformed every prompt rewrite.
  • Write-only is a trap — a feature that runs is not a feature that works.
  • Not everything needs an LLM — mechanical checks belong in CI; the model should graduate out of them.
  • Ground truth is a moving target — sometimes the agent is right and the label is wrong, which is exactly why you don't get to grade your own.
  • Temperature is a classification knob: 0.2 was the difference between an agent that flickered and one a maintainer could trust.

What's next

Automatic graduation. Conventions export themselves as CI rules — Danger plugins, AST checks — so a project's LLM cost falls as its rules mature.

Convention evolution. Periodic re-mining to detect when a project's patterns shift, so Lore enforces today's architecture, not yesterday's.

Memory Bank ablation. Running the eval with and without episodic memory to isolate its statistical impact — the live round-trip is proven; the number is owed.

Lore's bet is simple: the tacit knowledge a project runs on shouldn't live and die in its maintainers' heads — it should greet every contributor on day one.

Built With

  • cloud-run
  • cloud-tasks
  • fastapi
  • gemini-pro-3.1-preview
  • gitlab-mcp
  • google-adk
  • python
  • secret-manager
  • vertex-ai-memory-bank