The Problem

Here's something that happens all the time. A developer opens a merge request, changes a function that looked isolated, and two days later something breaks in production that nobody expected. The function had callers nobody knew about. The callers fed into a pipeline nobody remembered was there. The reviewer looked at the diff and approved it because the diff looked fine.

This isn't a process failure. It's an information failure. The tools developers use don't give them structural context at the moment they need it.

Git blame tells you who changed a line, not what depends on it. CI pipelines tell you what failed after the fact. Code review is a human reading a diff with no map of the surrounding system. And when a security vulnerability gets reported, teams spend days manually tracing how far it reaches because nothing connects the dots automatically.

The scale of this is documented. Code churn doubled between 2021 and 2024 according to GitClear's analysis of 153 million lines of code. 63% of developers spend more than 30 minutes a day just searching for answers about their own codebase. DORA research found that quality documentation and structural knowledge predict 40% higher delivery performance. The information gap is real, and it's getting worse as codebases grow and teams move faster.

GitLab Orbit changes what's possible here. It indexes an entire codebase into a queryable property graph — functions, callers, imports, classes, pipelines, users, merge requests, vulnerabilities, all connected. You can ask it things like "what functions call this one?" or "what pipelines deploy files in this directory?" and get real answers from the actual structure of the code, not inferences from text.

Sankofa is the layer on top of that graph that puts it to work where developers are already working: in their merge requests, in their issues, and in their security workflows.

What it does

Sankofa is three agents on the GitLab Duo Agent Platform, each triggered at a different moment in the development lifecycle.

Sankofa Radar — blast radius analysis

Radar runs when a merge request is created or updated. The question it answers is: what does this change actually affect?

When you @mention the Sankofa Radar flow on an MR, it reads the diff to identify the changed files and functions. It then runs four Orbit queries in sequence: first, a traversal query to find every function that calls the changed code, up to three hops out; second, a neighbors query to find every pipeline that deploys the changed files; third, an aggregation query to count the total downstream dependents and produce an impact score; and fourth, a pathfinding query to check whether any known vulnerabilities are reachable from the changed code.

It synthesizes all of that into a structured comment on the MR. The comment shows a risk level (Low through Critical), an impact score out of 100, a table of downstream callers with file paths and owners, affected pipelines with their environments, any vulnerability connections, and a plain-language recommendation. The reviewer sees all of this before approving, not after something breaks.
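To make the scoring step concrete, here is a minimal sketch of how a heuristic based on caller count and pipeline environment could map to a 0-100 score and a risk level. The weights, thresholds, and the names BlastRadius, impact_score, and risk_level are assumptions made up for illustration. They are not the values Radar actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical weights and thresholds, chosen only for illustration.
CALLER_WEIGHT = 2.5
VULN_WEIGHT = 20
ENV_WEIGHT = {"production": 3.0, "staging": 1.5, "development": 1.0}
RISK_BANDS = [(75, "Critical"), (50, "High"), (25, "Medium"), (0, "Low")]


@dataclass
class BlastRadius:
    caller_count: int                      # downstream callers found by traversal
    pipeline_envs: list[str] = field(default_factory=list)  # environments of affected pipelines
    reachable_vulns: int = 0               # vulnerabilities reachable from the change


def impact_score(radius: BlastRadius) -> int:
    """Combine caller count, riskiest environment, and vulnerabilities into 0-100."""
    env_factor = max(
        (ENV_WEIGHT.get(env, 1.0) for env in radius.pipeline_envs),
        default=1.0,
    )
    raw = radius.caller_count * CALLER_WEIGHT * env_factor
    raw += radius.reachable_vulns * VULN_WEIGHT
    return min(100, round(raw))


def risk_level(score: int) -> str:
    """Map a score to the first band whose floor it reaches."""
    for floor, label in RISK_BANDS:
        if score >= floor:
            return label
    return "Low"


score = impact_score(BlastRadius(caller_count=8, pipeline_envs=["production"]))
print(score, risk_level(score))
```

With these made-up weights, 8 callers behind a production pipeline score 60 and land in the High band. A real deployment would tune or replace the weights with data.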

The practical difference: a one-line change to a utility function might touch 12 downstream callers and two production pipelines. Without Radar, the reviewer sees one changed file and approves. With Radar, they see the full picture and know to loop in the teams who own those callers.

Sankofa Guide — contextual onboarding

Guide runs when a developer is assigned an issue in code they haven't worked in before. The question it answers is: where do I start?

When triggered on an issue, it reads the issue title and description to identify the relevant code area. It then runs Orbit queries to find the key files and their function definitions, the top contributors to those files in the last 90 days, recent merge requests that touched the same area, and the test files connected to those source files.

It posts a knowledge brief on the issue with six sections:

  • Key files: a table listing each file's purpose, primary author, and when it was last changed
  • Architecture context: how the files connect and what role they play in the system
  • Who to ask: the most relevant contributors and why they're relevant
  • Recent MRs: context on what's been changing in this area
  • Test coverage: which test files exist and how to run them
  • Getting started: the specific function or entry point to start reading

What used to take half a day of exploration — reading files, running git blame, asking colleagues — happens in under a minute. The developer lands on the issue already oriented.

Sankofa Shield — vulnerability propagation tracing

Shield does something neither of the other agents can: it traces a vulnerability forward through the entire codebase to find every code path that reaches it.

When triggered on a vulnerability issue, it extracts the affected file and function from the issue description. It then runs Orbit pathfinding queries to follow every caller chain from the vulnerable function outward, continuing until it reaches public endpoints or runs out of hops (up to 10 deep). It also runs traversal queries to find dependent services and neighbor queries to find owning users.

From that data it builds a full propagation tree showing every code path from the vulnerability to public exposure. It calculates how many public endpoints are exposed, how many services are affected, and which teams own the affected code. It posts an exposure report on the vulnerability issue with the full propagation tree, an exposure summary table, a containment strategy explaining the minimal fix points, and an affected teams table with priority levels.

Then it creates child issues. One per affected team, each with the specific propagation path that affects that team, the required remediation steps for their code, and acceptance criteria so they know when the fix is complete.

In our demo, Shield traced a CVE for an MD5 collision vulnerability in a crypto utility. The propagation went four hops deep: weak_hash() in the crypto module fed into generate_token() in the auth layer, which fed into login() and handle_token_refresh(), which fed into five public API endpoints. The billing module used the same vulnerable function for invoice checksums. Shield posted the full tree, identified three affected teams, and created targeted containment issues for each within about 90 seconds.
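The idea behind the propagation tree can be sketched as a breadth-first walk over a reverse call graph, stopping at public endpoints or at a hop limit. This is an illustrative sketch only, not Shield's implementation. The function names come from the demo, but the edges, the endpoint labels, and sign_invoice are assumptions.

```python
from collections import deque

# Reverse call graph: callee -> direct callers.
# Edges and endpoint labels are illustrative assumptions.
CALLERS = {
    "weak_hash": ["generate_token", "sign_invoice"],
    "generate_token": ["login", "handle_token_refresh"],
    "login": ["POST /login"],
    "handle_token_refresh": ["POST /token/refresh"],
    "sign_invoice": ["POST /invoices"],
}
PUBLIC = {"POST /login", "POST /token/refresh", "POST /invoices"}


def propagation_paths(source, callers, public, max_hops=10):
    """Return every caller chain from source to a public node within max_hops."""
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in public:
            paths.append(path)      # reached public exposure, stop extending
            continue
        if len(path) - 1 >= max_hops:
            continue                # ran out of hops on this chain
        for caller in callers.get(node, []):
            if caller not in path:  # guard against call cycles
                queue.append(path + [caller])
    return paths


for chain in propagation_paths("weak_hash", CALLERS, PUBLIC):
    print(" -> ".join(chain))
```

Grouping the resulting chains by the owner of each node is one way to arrive at the per-team child issues described above.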

How we built it

Agent platform

The three agents are YAML definitions registered with the GitLab AI Catalog via the ai-catalog/catalog-sync CI component. Each has a system prompt, a toolset, and a flow definition. When someone @mentions a flow's service account in an issue or MR comment, the platform provisions a sandboxed Docker container, injects the flow config and goal, and runs the duo-workflow executor. The executor handles the AI reasoning loop — making tool calls, interpreting results, and running the next tool — until the agent posts its final output.

The service accounts are named after the flows: @ai-sankofa-radar-flow-gitlab-ai-hackathon, @ai-sankofa-guide-flow-gitlab-ai-hackathon, and @ai-sankofa-shield-flow-gitlab-ai-hackathon. There's also a unified flow (@ai-sankofa-unified-gitlab-ai-hackathon) that routes to the correct agent based on context — useful when you just want to @mention a single bot and let it figure out what to do.
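As a rough illustration of context-based routing, the sketch below picks an agent from the kind of object the bot was mentioned on. The context dictionary shape, the label and keyword checks, and the route_flow name are all hypothetical. The unified flow itself makes this decision through its prompt, not through code like this.

```python
def route_flow(context):
    """Pick a Sankofa agent name from a hypothetical mention context."""
    # Merge requests go to blast radius analysis.
    if context.get("type") == "merge_request":
        return "radar"

    # Issues that look like vulnerability reports go to propagation tracing.
    labels = {label.lower() for label in context.get("labels", [])}
    text = f"{context.get('title', '')} {context.get('description', '')}".lower()
    if "vulnerability" in labels or "cve-" in text:
        return "shield"

    # Any other issue gets an onboarding brief.
    return "guide"


print(route_flow({"type": "issue", "title": "CVE-2024-0001 in weak_hash"}))
```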

Orbit integration

The Orbit queries run via run_command, which invokes glab orbit remote query - (the trailing - tells the CLI to read the JSON query from stdin). Here's what the pathfinding query for Shield looks like:

{
  "query": {
    "query_type": "pathfinding",
    "source": {
      "entity": "Definition",
      "filters": {
        "file_path": {"op": "eq", "value": "src/crypto/hash.py"},
        "name": {"op": "eq", "value": "weak_hash"}
      }
    },
    "target": {
      "entity": "Definition",
      "filters": {
        "visibility": {"op": "eq", "value": "public"}
      }
    },
    "max_hops": 10
  }
}

And the traversal query Radar uses to find callers:

{
  "query": {
    "query_type": "traversal",
    "node": {
      "id": "d",
      "entity": "Definition",
      "filters": {
        "file_path": {"op": "eq", "value": "src/auth/handler.py"},
        "name": {"op": "eq", "value": "authenticate"}
      }
    },
    "edges": [{"type": "called_by", "direction": "incoming", "hops": 3}],
    "limit": 50
  }
}
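
For anyone calling the CLI outside the agent, the following sketch builds the traversal payload above and pipes it to the command on stdin. It assumes glab is installed and authenticated and that the trailing - means "read the query from stdin". The helper names build_caller_query and run_orbit_query are made up for illustration. The raw stdout is returned unparsed because the output format is not specified here.

```python
import json
import subprocess

# Command as described in the text; the meaning of the trailing "-" is an assumption.
ORBIT_CMD = ["glab", "orbit", "remote", "query", "-"]


def build_caller_query(file_path, name, hops=3, limit=50):
    """Build the traversal payload shown above for one definition."""
    return {
        "query": {
            "query_type": "traversal",
            "node": {
                "id": "d",
                "entity": "Definition",
                "filters": {
                    "file_path": {"op": "eq", "value": file_path},
                    "name": {"op": "eq", "value": name},
                },
            },
            "edges": [{"type": "called_by", "direction": "incoming", "hops": hops}],
            "limit": limit,
        }
    }


def run_orbit_query(query, cmd=ORBIT_CMD, timeout=60):
    """Send the query as JSON on stdin and return the command's raw stdout."""
    result = subprocess.run(
        cmd,
        input=json.dumps(query),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    payload = build_caller_query("src/auth/handler.py", "authenticate")
    print(json.dumps(payload, indent=2))
```

Keeping the command injectable through the cmd parameter makes the helper easy to exercise in tests without a live Orbit index.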

One thing worth noting: when Orbit queries return empty (the project graph may not be fully indexed), the agents fall back to static analysis using grep and file reads on the repository. Shield ran this way in our demo. It tried the Orbit pathfinding query, got nothing back, then grepped the source files for import patterns and function calls. It found the same propagation chain the graph would have returned. The fallback isn't as fast but it means the agents are useful immediately, before Orbit has had time to build a full index.
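
The static fallback can be approximated with a simple scan of the Python sources for apparent call sites. This is a sketch of the general technique, not the commands the agent ran. The find_call_sites name and the regex are assumptions, and text matching like this can miss aliased imports or match names inside strings.

```python
import re
from pathlib import Path


def find_call_sites(root, function_name):
    """Return (relative path, line number, stripped line) for apparent calls."""
    root = Path(root)
    # Match the bare name followed by "(", e.g. weak_hash( or hash.weak_hash(
    pattern = re.compile(rf"\b{re.escape(function_name)}\s*\(")
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            stripped = line.strip()
            if stripped.startswith(("def ", "#")):
                continue  # skip function definitions and comment lines
            if pattern.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, stripped))
    return hits


if __name__ == "__main__":
    for rel_path, lineno, text in find_call_sites("src", "weak_hash"):
        print(f"{rel_path}:{lineno}: {text}")
```

Running the scan again on each caller it finds yields the same kind of chain the graph query would return, one hop at a time.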

Cloud Run backend

We built a FastAPI service on Cloud Run with three endpoints — /radar, /guide, and /shield — that accept context data and return Gemini-generated analysis. The service uses the google-genai SDK with gemini-2.5-flash. In the current implementation the Duo Agent Platform handles most of the reasoning directly, but the Cloud Run service is there for cases where you want to call the analysis outside the agent context, or where you need to pass large amounts of Orbit data to Gemini without going through the agent's context window.

Repository structure

The demo project has a realistic multi-module Python application that Orbit can actually index:

  • src/crypto/hash.py — weak_hash() using MD5, plus a safe strong_hash() using SHA-256
  • src/auth/tokens.py — token generation using weak_hash for fingerprinting
  • src/auth/handler.py — authentication middleware, calls generate_token()
  • src/auth/session.py — session management, calls refresh_token()
  • src/api/routes.py — five public API endpoints calling into the auth and billing layers
  • src/billing/invoice.py — invoice creation and signing, also uses weak_hash
  • src/notifications/alerts.py — security alerting

This gives Orbit a real dependency graph to traverse. The weak_hash function has actual downstream callers in production code paths that reach public endpoints — which is what made the Shield demo meaningful rather than contrived.

The project has 28 tests across all modules, all passing.

Challenges

Orbit DSL in beta. The query language is documented, but the documentation is sparse. We relied on schema introspection and trial and error to get the entity types and filter operators right. The pathfinding query in particular took several iterations to get the source/target structure correct.

WebSocket timeout. The platform drops connections around the 2-minute mark. Shield's full run — read issue, three Orbit queries, grep fallback, post the propagation report, create three child issues — runs right at that boundary. We saw it drop on the last child issue create and complete on retry. The session log shows exactly this: the billing team child issue was being created when the WebSocket closed with code 1006, the platform retried, and it completed successfully. For a production deployment you'd split the flow into a report stage and a create-issues stage, or handle the timeout more gracefully.

Service account discovery. The catalog-sync component creates service accounts automatically but doesn't surface their usernames anywhere obvious. We had to inspect the runner logs to find the @mention handles for each flow. Once found they worked exactly as expected, but it wasn't documented.

What's next

The most useful improvement is connecting Orbit's MCP interface directly so agents receive structured JSON graph data rather than parsing CLI text output. The current approach works but the agent has to interpret text representations of graph results, which is less reliable than operating on the structure directly.

The Radar impact score is a heuristic built on caller count and pipeline environment. With real incident data — which changes caused production failures and which didn't — you could build a much more accurate signal. Radar would go from "this touches 8 callers in a production pipeline, so it's High" to "changes with this pattern have caused incidents 23% of the time historically."

Shield could be extended to suggest the fix directly. It knows the vulnerable function and every caller. With a few more Orbit queries on function signatures and test coverage, it could generate the patch at the source and open an MR automatically.
