Codebase Onboarding Agent

## Inspiration

Every engineering team has the same silent tax: the week a new developer disappears into a codebase, reading files, tracing imports, and interrupting senior engineers with "where should I even start?" The onboarding problem isn't a lack of documentation — it's that the knowledge lives inside the graph of relationships between code, people, and history, and there's no easy way to surface it.

When I saw that GitLab Orbit indexes the entire SDLC — projects, files, modules, merge requests, pipelines, and ownership — into a queryable knowledge graph, the idea was immediate: what if you could turn that graph into a personalized onboarding guide in minutes?

## What It Does

Codebase Onboarding Agent takes a GitLab group or project path and produces a complete onboarding package grounded entirely in Orbit's Knowledge Graph:

Architecture overview — plain-English description of what the system does and how it's organized
Key files — the most central files in the graph, each with its role and a direct GitLab source link
Module dependency map — an interactive React Flow graph of DEPENDS_ON edges so new engineers can identify safe change zones
Setup & local run — steps derived from the CI/compose files Orbit indexed, not from assumptions
Ownership table — "who do I ask before touching core?" answered from Orbit's OWNS/CODEOWNER_OF edges
Non-obvious risks — high-centrality files and critical paths flagged with severity badges
Learning path — first day / first week / first PR tasks, each pointing at relevant files in the graph
Ask the codebase — a grounded Q&A chat where the agent refuses when Orbit has no context, rather than hallucinating

The grounding constraint is a deliberate design choice. Every summary, every answer, every citation traces back to a node or edge in the Orbit graph. An unanswered question surfaces a refusal, not a confident guess.

## How I Built It

The architecture is a three-layer pipeline:

$$ \text{Orbit Graph} \xrightarrow{\text{Cypher-like queries}} \text{Context Bundle} \xrightarrow{\text{Claude}} \text{Structured Summary} \xrightarrow{\text{React}} \text{Onboarding UI} $$

Backend — FastAPI + Python

An OrbitClient runs six targeted graph queries against the Orbit REST API for any given target:

MATCH (p:Project {path: $target})-[:CONTAINS]->(f:File) WHERE f.centrality > 0.4 OR f.is_entrypoint = true RETURN f.path, f.role, f.centrality, f.web_url ORDER BY f.centrality DESC LIMIT 20

The results are bundled into a structured context payload. An LLMService then passes that payload to Anthropic Claude with a strict grounding prompt — Claude's job is to synthesize the graph data into readable prose and structured JSON fields (architecture overview, risks, learning path), not to answer from training knowledge.

Results are cached in MongoDB (via the async motor driver) with a configurable TTL, so repeated visits to the same target are instant.

Graceful degradation in two layers

I wanted the app to work in any environment — no GitLab token, no Anthropic key, no problem:

ORBIT_API_URL not set → OrbitClient switches to a realistic mock graph that simulates a payments service with modules, owners, MRs, and pipelines
ANTHROPIC_API_KEY not set → LLMService falls back to deterministic template functions that still produce Orbit-grounded output

Zero code changes are needed to switch between modes — it's all driven by environment config.

Frontend — React + Vite

Six pages built as a SPA: Landing, Dashboard, Repo Map (React Flow), Learning Path, Ask the Codebase (chat), and Settings. A TargetContext persists the active project across pages via localStorage. Dark/light mode is supported via a ThemeContext.

The Vite dev server proxies /api to the FastAPI backend, keeping the frontend origin-agnostic between development and production.

## Challenges

Staying strictly grounded

The hardest design challenge was enforcing grounding as a hard constraint, not a guideline. The temptation is to let the LLM "fill in gaps" when Orbit returns sparse data — but a hallucinated file path or a fabricated owner is worse than no answer. I solved this by making the refusal path a first-class feature: if fetch_context_bundle returns an empty graph, the service raises OrbitError('empty') rather than proceeding, and the UI surfaces a clear, actionable message.

Citation integrity

Every file path and MR link in the UI had to trace back to an actual Orbit node. I built the citations list directly from the raw query rows — f.web_url, mr.web_url, pl.web_url — rather than constructing URLs from string templates, so there's no risk of a citation pointing to a non-existent resource.

Port contention in development

A less glamorous challenge: Docker Desktop was binding 0.0.0.0:8000 on my machine (for an unrelated container), which intercepted API calls from the Vite proxy before they could reach the FastAPI backend on 127.0.0.1:8000. Debugging this required reading Windows netstat output, identifying the conflicting PID, and pinning the Vite proxy target to 127.0.0.1:8000 explicitly.

Structuring LLM output reliably

Claude returns JSON inside the structured fields, but LLM output isn't perfectly reliable — code fences, trailing text, encoding artifacts. I wrote a _strip_fences() helper and wrapped every json.loads() in a try/except that falls back to the deterministic template, so a malformed LLM response degrades gracefully rather than breaking the onboarding flow.

## What I Learned

Orbit's Knowledge Graph is genuinely rich. The DEPENDS_ON, CONTAINS, OWNS, and pipeline edges together tell a story about a codebase that static analysis alone can't — because it includes people and history, not just code.
Refusals are a product feature. Users trust a tool more when it says "I don't know" cleanly than when it guesses plausibly. Making the agent refuse on missing context was the right call.
Graceful degradation should be designed in from day one. Building mock → template → LLM as three independent layers meant the app was always demoable, even when working on live Orbit integration.
Graph centrality is a surprisingly good proxy for "where should I look first." Files with high centrality scores — many inbound and outbound edges — almost always turn out to be the entry points, routers, or core domain objects a new engineer should read before anything else.

Built With

anthropicpythonsdk
docker
fastapi
github
graceful-degradation
graph-querying
httpx
mongodb
motor
pydantic
python
rag-style-grounding
react18
reactflow
ttl-caching-mongo-db
uvicorn
vanila-css
vite
vite-dev-proxy

Updates

Murali Krishna Konduru started this project — Jun 22, 2026 03:58 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.