🧠 Inspiration
Our experiences with open-source programming inspired this project. We often wished for a semantic way to search for specific code snippets, files, or features. This RAG tool is a direct response to that challenge: it lets developers query an LLM about their entire code repository directly, so they can find what they need quickly and accurately.
💻 What it does
First, the user inputs a GitHub link, which we parse using LlamaIndex's GithubRepositoryReader. Next, we use Jina AI's code embeddings API to create embeddings for the files we read and store them in an index. Once the index is built, the user can ask questions about the code, and the RAG pipeline responds with answers grounded in the codebase.
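The core retrieval idea can be shown with a minimal, dependency-free sketch. This is not our actual implementation: real embeddings come from Jina AI's API, and here a toy bag-of-words vector stands in for them so the example runs anywhere.

```python
# Illustrative sketch of embedding-based retrieval over code snippets.
# The `embed` function is a toy stand-in for a real code-embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': token counts (placeholder for a real embedding API)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, snippets: list[str], k: int = 1) -> list[str]:
    """Return the k snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

snippets = [
    "function that parses the github repo files",
    "streamlit code for the frontend sidebar",
]
best = retrieve("which function parses the repo", snippets)
```

In the real tool, the retrieved snippets are passed to the LLM as context rather than returned directly.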
🔨 How we built it
We built the backend on LlamaIndex, using components such as TiDBVectorStore and VectorStoreIndex. The framework handles our overall flow: parsing the repository, generating embeddings, building the index, creating a query engine, and returning a response. For the frontend, we built the interface with Streamlit, which also lets us host the application.
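The backend flow above (parse → embed → index → query → respond) can be sketched in plain Python. Every component here is a stand-in: `read_repository` mimics what GithubRepositoryReader does, `embed` replaces the Jina API, and `Index` replaces TiDBVectorStore plus the query engine, so the shape of the pipeline is clear without any credentials.

```python
# Stand-in pipeline sketch: none of these are the real LlamaIndex classes.
from dataclasses import dataclass

@dataclass
class Document:
    path: str
    text: str

def read_repository(url: str) -> list[Document]:
    # Stand-in for GithubRepositoryReader.load_data()
    return [Document("README.md", "This repo implements semantic code search.")]

def embed(text: str) -> list[float]:
    # Stand-in for the Jina code-embedding API call
    return [float(ord(c) % 7) for c in text[:16]]

class Index:
    """Stand-in for the vector store + query engine pair."""
    def __init__(self, docs: list[Document]):
        self.entries = [(embed(d.text), d) for d in docs]

    def query(self, question: str) -> str:
        # Nearest document by dot product, then a canned answer template
        q = embed(question)
        best = max(self.entries,
                   key=lambda e: sum(x * y for x, y in zip(q, e[0])))
        return f"Based on {best[1].path}: {best[1].text}"

index = Index(read_repository("https://github.com/example/repo"))
answer = index.query("What does this repo do?")
```

In production, the template step is replaced by the LLM generating an answer from the retrieved context.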
🏃🏻 Challenges we ran into
- Parsing repositories
- Defining a scope (what did we want the project to do, and what did we want it to be used for?)
- Coordinating across time zones
📌 Accomplishments that we're proud of
- Frontend design with Streamlit
- Fantastic responses with deepseek-coder
- Seamless repo parsing and vector generation
📖 What we learned
- LlamaIndex's framework
- Vector Embeddings/TiDB
- Ollama
- Streamlit
🚀 What's next for Semantic Search Engine for Code Repositories
- Reducing wait time (possibly with parallel computing via Ray)
- Optimizing embeddings and LLM settings
- User auth and saving queries by repository
- Creating a networked Ollama server
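The first roadmap item above (parallelizing the embedding step to cut wait time) can be sketched without Ray. This hedged example uses the standard library's `concurrent.futures` as a stand-in, since embedding calls are I/O-bound API requests that benefit from concurrency; a Ray version would swap the pool for remote tasks.

```python
# Sketch of parallel embedding; `embed_file` is a placeholder for the real
# (slow, I/O-bound) embedding API call.
from concurrent.futures import ThreadPoolExecutor

def embed_file(text: str) -> list[float]:
    # Placeholder: a real implementation would call the embedding API here
    return [float(len(text))]

def embed_all(files: list[str]) -> list[list[float]]:
    """Embed many files concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(embed_file, files))

vectors = embed_all(["def a(): pass", "def longer(): pass"])
```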