Inspiration

We were inspired by the growing number of LLM coding agents in the world, such as Claude Code and Codex. We wanted to loop agents in on the extrinsic data that human developers rely on but agents are blind to.

What it does

OCTO_DOC scans all of the information about a repo, such as contributors, issue history, PR discussions, PR statuses, and more, and exposes it through an MCP server with natural-language output.

How we built it

We architected the project with an ingress side and an egress side.

For ingress, we used the Claude API to generate vector embeddings of the repo contents. We then stored these embeddings in a ChromaDB server.
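
The ingest flow can be sketched roughly as follows. Everything here is a simplified stand-in: `toy_embed` replaces the real Claude-side embedding call, and the plain dict mimics a ChromaDB collection's `add()`; all names are hypothetical.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    # Stand-in for the real embedding call: hash each word into a
    # fixed-size bag-of-words vector, then L2-normalize it.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# In-memory stand-in for a ChromaDB collection: id -> (embedding, document).
store: dict[str, tuple[list[float], str]] = {}

def ingest(doc_id: str, text: str) -> None:
    # Mirrors collection.add(ids=..., embeddings=..., documents=...).
    store[doc_id] = (toy_embed(text), text)

ingest("pr-12", "Fix race condition in the ingest worker")
ingest("issue-7", "Docs page returns 404 after deploy")
```

In the real pipeline the embeddings come from the API and the store is a persistent ChromaDB server, but the shape of the data flow is the same.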

For egress, we again used the Claude API to generate a vector embedding of the query, then ran a similarity search against the ChromaDB collection to find the best-matching chunks. Finally, we used FastMCP to expose two tools, query and query_and_summarize, for agents to call.
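
The matching step on the egress side is a nearest-neighbor lookup; for normalized vectors, cosine similarity reduces to a dot product. A minimal sketch under the same stand-in assumptions as above (`toy_embed` replaces the real embedding call, the `store` dict replaces ChromaDB, and all names are hypothetical):

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    # Stand-in for the real embedding call (hashed bag-of-words, L2-normalized).
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def query(store: dict, text: str, n_results: int = 1) -> list[str]:
    # Rank stored documents by dot product with the query embedding,
    # mirroring a ChromaDB nearest-neighbor query.
    q = toy_embed(text)
    scored = sorted(
        store.items(),
        key=lambda item: -sum(a * b for a, b in zip(q, item[1][0])),
    )
    return [doc for _, (_, doc) in scored[:n_results]]

store = {
    doc_id: (toy_embed(text), text)
    for doc_id, text in [
        ("pr-12", "Fix race condition in the ingest worker"),
        ("issue-7", "Docs page returns 404 after deploy"),
    ]
}
top = query(store, "Fix race condition in the ingest worker")
```

The FastMCP tools then wrap this lookup: query returns the raw matches, and query_and_summarize passes them through the model for a natural-language answer.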

Challenges we ran into

The main challenges we ran into were latency and redundant information. We addressed this by filtering code chunks out of PR bodies and issues, to avoid polluting the API output.
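
The filtering step can be sketched with a simple regex that drops fenced code blocks from PR and issue bodies before they are embedded. This is an illustrative sketch, not the project's actual filter; the function name and example body are hypothetical.

```python
import re

# Matches a fenced code block (``` ... ```), including the language tag line.
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)

def strip_code_chunks(body: str) -> str:
    # Remove fenced code blocks so repeated diffs and snippets don't
    # pollute the embeddings or the tool output.
    return CODE_FENCE.sub("", body).strip()

issue_body = "Steps to reproduce:\n```python\nprint('boom')\n```\nExpected: no crash"
cleaned = strip_code_chunks(issue_body)
```

A production version would also handle indented code blocks and inline diffs, but the fenced-block case covers most GitHub-flavored markdown.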

Accomplishments that we're proud of

We're proud of the work we did in "tagging" repo chunks to nudge the embeddings toward matching relevant queries. We're also proud of the work we did in pre-summarizing and chunking the codebase, which made the ingest step about 3x faster.
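
The chunking-and-tagging idea can be sketched like this. The chunk sizes, tag format, and paths below are hypothetical, and the real pipeline additionally pre-summarizes files before embedding.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Fixed-size overlapping character chunks, so context isn't lost
    # at chunk boundaries.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def tag_chunk(source_type: str, path: str, chunk: str) -> str:
    # Hypothetical tag format: a provenance prefix nudges the chunk's
    # embedding toward queries about that source type.
    return f"[{source_type}] [{path}] {chunk}"

chunks = [tag_chunk("code", "src/ingest.py", c) for c in chunk_text("x" * 500)]
```

Because the tags are embedded together with the chunk text, a query mentioning "PR discussion" or a file path lands closer to chunks carrying the matching prefix.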

What we learned

We learned about retrieval-augmented generation (RAG) and vector databases like Chroma.

What's next for OCTO_DOC

Next steps for us include:

  • Auth and roles. By allowing for roles, we can set restrictions on which docs and PRs a user is allowed to read.
  • More "extrinsic sources". We want to add context from Slack, meeting notes, and more, to further our goal of allowing agents to fully understand project context.
