Inspiration
Watch a maintainer's review history and you'll see the same thing, MR after MR: a link to the styleguide, a correction typed by hand, a rule explained to someone who'd never have found it. The rule is often written down. It still doesn't transfer — because knowing a convention and being able to hand someone the rule are different things. Michael Polanyi called it tacit knowledge: we know more than we can tell. A senior maintainer spots a weak title or an un-idiomatic metric name instantly, but couldn't always write you the rule. So it lives in their head, transferred one tired comment at a time.
That's the real bottleneck between a contributor's first MR and their first merge — not the code, the conventions no one ever wrote down.
What if a project's own review history could surface that tacit knowledge and put it to work — greeting every contributor on day one, instead of living only in a maintainer's head?
What it does
Lore looks like a code-review bot. It's really a project's tacit knowledge, made teachable. It mines years of reviewer comments to infer the conventions a project enforces in practice, then pre-reviews every new MR against them — posting evidence-backed feedback minutes after the webhook fires, before a human arrives.
- Discovers conventions from behavior. It mines merged-MR history and extracts the conventions maintainers repeat, each backed by real reviewer quotes with MR attribution.
- Reviews in real time, three ways. It flags judgment calls, applies mechanical fixes, or recuses to a recommended CI gate — and cites a real historical MR for every one.
- Learns from correction. A maintainer's 👎 lowers a convention's confidence until it goes quiet, and the dismissal is stored and recalled on the next similar MR.
Pointed at gitaly, a codebase it had never seen, Lore surfaced 7 conventions from 66 MRs in about 30 minutes for ~$5. Any project with a review history has conventions like these; that 30-minute, $5 run is the entire cost of onboarding one.
Try it out
- Code: gitlab.com/mbrazinski-group/lore (agent, scripts, docs, 191 tests, MIT licensed).
- Live demo: gitlab.com/mbrazinski-group/gitlab-runner. Open MR #1 to watch Lore flag a title, apply a label, flag a missing test, and recuse on a CI-tier rule; MR #2 shows an all-clear on a clean MR.
- Verify it's real: the gitaly rule about idiomatic Prometheus metric names is backed by the verbatim reviewer quote in MR !8515. Open it and compare.
- Read the core insight: docs/graduation-insight.md.
## How we built it
The stack is what judges expect; the eval journey is what makes Lore trustworthy — so it goes first. The first working agent flagged everything: 147 false positives across six conventions, completely untrustworthy. Driven by measurement, not vibes, Lore's hardest convention — title quality — went from 87 false positives to 3. Here's the journey.
Discovery. Conventions live in reviewer comments, not config files — the same fix, requested over and over, is the signal. A five-stage pipeline extracts reviewer comments from merged MRs, classifies them, clusters the repeats, synthesizes each cluster into a convention, and reports a dashboard. On gitlab-runner it found six conventions; the strongest signal was raw repetition — the documentation styleguide correction appeared on 14 distinct MRs across five years. On gitaly, never seen before, it found seven — each citing the MRs its evidence came from.
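The "cluster the repeats" stage can be pictured with a minimal sketch. It assumes an earlier stage has already tagged each reviewer comment with a convention key; the `ReviewComment` type, the key names, and the `min_mrs` threshold are hypothetical and not taken from Lore's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    mr_iid: int          # merge request the comment was left on
    convention_key: str  # hypothetical tag assigned by the classify stage
    quote: str           # verbatim reviewer text, kept as evidence


def cluster_repeats(comments, min_mrs=3):
    """Group comments by convention and keep those seen on enough distinct MRs."""
    by_key = defaultdict(list)
    for c in comments:
        by_key[c.convention_key].append(c)

    clusters = []
    for key, group in by_key.items():
        distinct_mrs = {c.mr_iid for c in group}
        if len(distinct_mrs) >= min_mrs:
            clusters.append({
                "convention": key,
                "mr_count": len(distinct_mrs),
                "evidence": [(c.mr_iid, c.quote) for c in group],
            })
    # Strongest signal first: raw repetition across distinct MRs.
    return sorted(clusters, key=lambda x: x["mr_count"], reverse=True)
```

Counting distinct MRs rather than raw comments keeps one long thread on a single MR from looking like a project-wide convention.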
The eval. False positives erode trust faster than false negatives ever do, so we built a harness that scores precision and recall per convention against 96 hand-labeled holdout MRs. Three stages drove the number down. Stage 1: five prompt variants on a subset; a two-pass reasoning structure generalized best. Stage 2: the full holdout exposed the real lever. No prompt could fix test-coverage false positives because the model was guessing whether a test file existed, so Lore stopped guessing and started fetching the directory's file tree. Input beats prompting. Stage 3: the stronger model (Gemini 3.1 Pro) surfaced a humbling twist: some "false positives" were real violations the maintainer never corrected. Rather than grade our own ground truth, we kept the stricter baseline. A final SCREEN→CONFIRM two-pass prompt and temperature 0.2 collapsed run-to-run variance from wildly unstable to nearly fixed. What remains are the same edge cases a human would argue about, but at least Lore now calls them consistently.
| Convention | Tier | Precision | Recall | Note |
|---|---|---|---|---|
| Title quality | Judgment | 0.667 | 0.857 | 3 false positives — down from 87 |
| Label correctness | Mechanical | 0.500 | 0.800 | "is a type:: label present?" |
| Test coverage | Structural | — | — | correctly silent — 0 violations in the holdout |
Label-correctness at 0.50 is the honest number — and per the graduation insight below, it's a mechanical check that belongs in CI, not the LLM. The headline isn't a one-run artifact either: a separate, clean 96-MR run reproduces it exactly against the published ground truth.
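For readers who want the arithmetic behind the table, here is a minimal sketch of per-convention scoring. The counts passed in are illustrative: they were chosen because they reproduce the published title-quality figures, and they are not taken from Lore's eval output.

```python
from typing import Optional, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """Return (precision, recall); None where the denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Illustrative counts only (assumed, not from the harness):
# 6 true positives, 3 false positives, 1 false negative.
p, r = precision_recall(tp=6, fp=3, fn=1)
print(round(p, 3), round(r, 3))  # 0.667 0.857

# No violations in the holdout and no flags raised: nothing to score,
# which is why a row can legitimately show dashes instead of numbers.
print(precision_recall(tp=0, fp=0, fn=0))  # (None, None)
```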
The graduation insight. Running those evals revealed something we didn't design — conventions sort into three tiers, and the tier dictates how much LLM a convention needs:
| Tier | Example | What it needs | Where it belongs |
|---|---|---|---|
| Mechanical | Is a type:: label present? | A set intersection; the LLM adds variance, not accuracy | Graduates to CI |
| Structural | Does new code have a test file? | Repository context — tool augmentation (the file-tree fetch that beat every prompt) | Agent + tools |
| Judgment (tacit) | Is this title changelog-quality? | Irreducibly subjective reasoning | The LLM, where the ≈29× false-positive reduction (87 → 3) lives |
The payoff is a cost curve that bends down: as mechanical conventions are identified, they graduate out of the model into CI rules, so per-review cost falls as a project matures. The strongest-mined convention — the styleguide, 14 MRs — is one Lore deliberately keeps silent, because it's mechanical. The tiers run explicit to tacit: what can be codified graduates to CI; what can't is exactly where Lore earns its keep.
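To make the mechanical tier concrete, the label check reduces to a few lines with no model involved. This is a sketch under assumptions: the allow-list below is hypothetical, and a real CI rule would load the project's own scoped labels.

```python
# Hypothetical allow-list; a real project would load its own type:: labels.
TYPE_LABELS = frozenset({"type::bug", "type::feature", "type::maintenance"})


def has_type_label(mr_labels):
    """True when the MR carries at least one known type:: label."""
    return bool(TYPE_LABELS & set(mr_labels))
```

Because the answer is a set intersection, it is deterministic and free to run, which is the argument for graduating it to CI.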
The stack.
      ┌──────────┐   webhook    ┌─────────────────────────────┐
      │ GitLab MR│ ───────────► │ FastAPI receiver (Cloud Run)│
      └──────────┘              │ enqueue → return 200 fast   │
           ▲                    └──────────────┬──────────────┘
REST writes│                             async │
 • threads │                        trampoline ▼
 • labels  │                             ┌───────────┐
 • summary │                             │Cloud Tasks│
           │                             └─────┬─────┘
           │                          /process ▼ (20–25s review)
           │                   ┌───────────────────────────────┐
           └────────────────── │ Agent (Google ADK)            │
                               │ reasoning: Gemini 3.1 Pro     │
                               └──────┬─────────────────┬──────┘
                        reads via MCP │                 │ persists
                           ┌──────────▼─┐      ┌────────▼──────────────┐
                           │ GitLab MCP │      │ Vertex AI Memory Bank │
                           │ MR+history │      │ corrections +         │
                           └────────────┘      │ confidence values     │
                                               └───────────────────────┘
Secret Manager: stores & rotates the GitLab OAuth credentials
| Layer | Component | Job |
|---|---|---|
| Ingress | FastAPI on Cloud Run | Receives the webhook, enqueues, returns 200 instantly |
| Queue | Cloud Tasks | Deduplicating async trampoline so a 20–25s review never trips the webhook timeout |
| Reasoning | Google ADK + Gemini 3.1 Pro Preview | The two-pass SCREEN→CONFIRM review and convention reasoning |
| Read | GitLab MCP Server | Verified reads: citation enrichment, label validation |
| Write | GitLab REST | Resolvable threads, label quick-actions, the summary table |
| Memory | Vertex AI Memory Bank | Persists maintainer corrections and confidence values |
| Secrets | Secret Manager | Stores and rotates the GitLab OAuth credentials |
Lore runs on Google ADK with Gemini 3.1 Pro Preview. A GitLab webhook hits a FastAPI receiver on Cloud Run, which enqueues to Cloud Tasks and returns 200. The queue acts as a deduplicating trampoline (keyed on event_uuid + mr_iid), so a 20–25-second review never trips GitLab's webhook timeout. The diff is fetched over REST so the prompt is fully populated before the model runs. Three layers carry their own story:
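The enqueue-and-return pattern can be sketched without any cloud dependencies. Everything below is an assumption for illustration: `InMemoryQueue` stands in for Cloud Tasks, and hashing `event_uuid` with `mr_iid` into a task key is one plausible way to build the dedup key, not necessarily how Lore does it.

```python
import hashlib


def dedup_key(event_uuid: str, mr_iid: int) -> str:
    """Derive a stable task identifier from the webhook event and the MR."""
    raw = f"{event_uuid}:{mr_iid}".encode("utf-8")
    return "review-" + hashlib.sha256(raw).hexdigest()[:32]


class InMemoryQueue:
    """Stand-in for the real queue: rejects a task whose key was already seen."""

    def __init__(self):
        self._seen = set()
        self.tasks = []

    def enqueue(self, key: str, payload: dict) -> bool:
        if key in self._seen:
            return False  # duplicate delivery, nothing new to do
        self._seen.add(key)
        self.tasks.append(payload)
        return True


def handle_webhook(queue: InMemoryQueue, event_uuid: str, mr_iid: int) -> int:
    """Enqueue the review and answer immediately; the review itself runs later."""
    queue.enqueue(dedup_key(event_uuid, mr_iid), {"mr_iid": mr_iid})
    return 200  # same status whether the delivery was new or a retry
```

Returning the same status for a retry as for a first delivery means a redelivered webhook is acknowledged without queuing a second review.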
Vertex AI Memory Bank turns corrections into durable, retrievable context. When a maintainer dismisses a flag, Lore writes it to project-scoped memory and retrieves it by similarity on the next MR — so a dismissal on "Fix cleanup bug" surfaces on a later "Fix config bug," changing a live decision. Only a maintainer's correction becomes a memory (verified by project role), so a contributor can never poison what Lore learns.
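The write-then-recall loop can be illustrated with a toy in-process store. This sketch does not use the Vertex AI Memory Bank API: token overlap is a crude stand-in for its similarity retrieval, the 0.5 threshold is hypothetical, and treating "maintainer or above" as GitLab access level 40+ is an assumption about how the role check might be done.

```python
from dataclasses import dataclass

MAINTAINER_ACCESS_LEVEL = 40  # GitLab's numeric access level for the Maintainer role


@dataclass
class Dismissal:
    mr_title: str
    convention: str


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase title tokens (stand-in for semantic search)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


class CorrectionMemory:
    def __init__(self):
        self._items = []

    def record(self, dismissal: Dismissal, access_level: int) -> bool:
        """Store a dismissal only when it comes from a maintainer or above."""
        if access_level < MAINTAINER_ACCESS_LEVEL:
            return False
        self._items.append(dismissal)
        return True

    def recall(self, mr_title: str, threshold: float = 0.5):
        """Return stored dismissals whose MR title resembles the new one."""
        return [d for d in self._items if similarity(d.mr_title, mr_title) >= threshold]
```

With this toy metric, "Fix cleanup bug" and "Fix config bug" share two of four distinct tokens, so a dismissal recorded on the first is recalled for the second, while a developer-level attempt to record one is ignored.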
GitLab MCP Server provides the verified ground truth that prevents broken citations and phantom labels — fetching a cited MR's real title, and confirming a label exists before Lore applies it. Both reads run after detection and fail open, so a slow read never blocks a review; writes go over REST, which the official MCP server doesn't expose.
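"Fail open" here means an enrichment read may be skipped but may never abort the review. A minimal sketch, assuming the reader is passed in as a callable that raises on timeout or error; `fetch_title` and both function names are hypothetical, not part of any MCP client API.

```python
from typing import Callable, Optional


def enrich_citation(fetch_title: Callable[[int], str], mr_iid: int) -> Optional[str]:
    """Return the cited MR's verified title, or None if the read fails."""
    try:
        return fetch_title(mr_iid)
    except Exception:
        # Fail open: the review proceeds without the enrichment.
        return None


def format_citation(mr_iid: int, title: Optional[str]) -> str:
    """Cite with the verified title when available, otherwise by number only."""
    return f'!{mr_iid} "{title}"' if title else f"!{mr_iid}"
```

The design choice is that a missing title degrades the citation rather than the review: the comment still posts, just with less decoration.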
Secret Manager handles the OAuth-only MCP server (a personal access token returns 403): Lore bootstraps and rotates the OAuth credentials there.
Challenges we ran into
Ground-truth circularity. When the agent found genuine violations the maintainer had never corrected, we could have reclassified them, and our precision would have jumped to 0.89. We didn't. Reclassifying would rest on our own judgment of what "should" have been flagged, which is exactly the circularity that makes a number indefensible. We publish the stricter 0.67.
The principle was right; the plumbing wasn't. Live verification on a real MR caught that the context enricher had never actually worked in production — it parsed the wrong diff-header format and silently returned no context, invisible because the eval harness shared the same bug. We fixed it; the agent now reasons about real file trees.
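As an illustration of what real file-tree context buys, a structural check can become a lookup instead of a guess. This sketch assumes a Go repository and a `foo.go` / `foo_test.go` pairing in the same directory; that pairing is a common convention, not something the source states Lore relies on, and the function name is hypothetical.

```python
import posixpath


def has_sibling_test(changed_path: str, directory_listing) -> bool:
    """True when a changed Go file has a matching _test.go in the same directory."""
    name = posixpath.basename(changed_path)
    if not name.endswith(".go") or name.endswith("_test.go"):
        return True  # not a source file this check applies to
    expected = name[: -len(".go")] + "_test.go"
    return expected in set(directory_listing)
```

The diff alone cannot answer this question when the test file exists but was not touched, which is why the directory listing has to be fetched.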
Write-only is a trap. Two "finished" features updated state nothing read — the confidence dial moved on a 👎 but the gate never consulted it, and Memory Bank wrote memories nothing retrieved. We wired both read paths. Verifying that a feature runs is not the same as verifying it changes a decision.
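The confidence fix comes down to making the write path and the read path touch the same value. A minimal sketch, with a hypothetical threshold and penalty rather than Lore's actual settings:

```python
class ConventionGate:
    """Tracks per-convention confidence and decides whether a flag is posted."""

    def __init__(self, threshold: float = 0.5, penalty: float = 0.25):
        self.threshold = threshold  # hypothetical cutoff below which Lore stays quiet
        self.penalty = penalty      # hypothetical drop applied per thumbs-down
        self._confidence = {}

    def confidence(self, convention: str) -> float:
        return self._confidence.get(convention, 1.0)

    def thumbs_down(self, convention: str) -> None:
        """Write path: a maintainer's dismissal lowers confidence, floored at 0."""
        self._confidence[convention] = max(0.0, self.confidence(convention) - self.penalty)

    def should_flag(self, convention: str) -> bool:
        """Read path: the gate consults the same value the write path updates."""
        return self.confidence(convention) >= self.threshold
```

A test that only asserts `thumbs_down` changes the stored number would have passed against the original bug; asserting that `should_flag` eventually flips is what verifies the feature changes a decision.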
Accomplishments we're proud of
Three numbers tell it: a ≈29× false-positive reduction on the hardest convention, 191 tests, ~16,000 lines built solo (2,400 runtime, 3,500 tests).
Beyond the numbers: every prediction cites a real MR from the project's own history, carrying that MR's verified title. Three action modes — flag, apply, recuse — each backed by evidence. And discrimination as a first-class feature: of six conventions, Lore flags or fixes three, recuses on one, and stays silent on two. Choosing not to flag is as much the architecture as flagging.
What we learned
- The eval is the product — without a precision number, "improving the prompt" is superstition.
- Input beats prompting for structural rules — the file tree outperformed every prompt rewrite.
- Write-only is a trap — a feature that runs is not a feature that works.
- Not everything needs an LLM — mechanical checks belong in CI; the model should graduate out of them.
- Ground truth is a moving target — sometimes the agent is right and the label is wrong, which is exactly why you don't get to grade your own.
- Temperature is a classification knob: 0.2 was the difference between an agent that flickered and one a maintainer could trust.
## What's next
Automatic graduation. Conventions export themselves as CI rules — Danger plugins, AST checks — so a project's LLM cost falls as its rules mature.
Convention evolution. Periodic re-mining to detect when a project's patterns shift, so Lore enforces today's architecture, not yesterday's.
Memory Bank ablation. Running the eval with and without episodic memory to isolate its statistical impact — the live round-trip is proven; the number is owed.
Lore's bet is simple: the tacit knowledge a project runs on shouldn't live and die in its maintainers' heads — it should greet every contributor on day one.
Built With
- cloud-run
- cloud-tasks
- fastapi
- gemini-pro-3.1-preview
- gitlab-mcp
- google-adk
- python
- secret-manager
- vertex-ai-memory-bank