Inspiration

Code search is, for me at least, one of the most important tools when coding. I've noticed that I spend far more time searching and navigating through code than actually writing it.

When I'm already familiar with a codebase, exact search (for example, regular expressions and case-insensitive search) is what I need. Because I already know the naming scheme, I can reasonably predict how to formulate a search query. An example of this is the search tool in Visual Studio Code. It's my daily driver because it's fast, accurate, and reliable.

But when I'm not familiar with a codebase, or even with a new folder within an existing large project, searching becomes more difficult because it turns into trial and error. A tool like CodeSnippetSearch lets me explore unfamiliar code by focusing on the semantics without getting bogged down in the syntax. This is especially useful when onboarding a new developer onto a project, because it can significantly boost their productivity.

Outside of a work environment, we encounter unfamiliar code in the form of GitHub repositories. Semantic search tools would provide a faster way for users to find answers to their issues directly in the code. Consequently, it would lessen the burden on maintainers to provide these answers. Quickly locating the source of their problems would hopefully also encourage users to contribute to the repository.

What it does

CodeSnippetSearch is a web application and a web extension that allows you to search GitHub repositories using natural language queries and code itself.

How I built it

The main data source for CodeSnippetSearch is GitHub's CodeSearchNet project. It contains approximately 6 million functions from 6 programming languages (Go, Python, PHP, Java, Ruby, and JavaScript). CodeSearchNet also provides various baseline implementations of neural code search in TensorFlow. My implementation was inspired by their "Neural Bag of Words" baseline implementation. Before the hackathon, I had written CodeSnippetSearch in Keras and it was only able to search through the CodeSearchNet dataset. Due to difficulties when developing and deploying the models, I decided to switch to PyTorch when I wanted to add support for searching GitHub repositories.

CodeSnippetSearch works by using joint embeddings of code and queries to implement a neural search system. The training objective is to map code and corresponding queries onto vectors that are close to each other. With this, we can embed a natural language query and then use nearest neighbor search to return a set of matching code snippets. During training, we use function docstrings as substitutes for natural language queries.
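
To make the retrieval step concrete, here is a minimal sketch in plain Python. It is not the project's code: the snippet names and the 3-dimensional vectors are invented for illustration, and the exact brute-force loop stands in for the approximate nearest neighbor index a real deployment would use.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def nearest_snippets(query_vec, code_vecs, k=2):
    """Return the ids of the k snippets whose embeddings are closest to the query.

    Exact brute-force search; an approximate index would replace this loop
    when searching millions of functions.
    """
    q = normalize(query_vec)
    scored = []
    for snippet_id, vec in code_vecs.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, snippet_id))
    scored.sort(reverse=True)  # highest similarity first
    return [snippet_id for _, snippet_id in scored[:k]]

# Hypothetical embeddings; a trained model would produce these.
code_vecs = {
    "read_json_file": [0.9, 0.1, 0.0],
    "send_http_request": [0.0, 0.2, 0.9],
    "parse_yaml_config": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedding of "load json from disk"
results = nearest_snippets(query_vec, code_vecs, k=2)
```

Because code and queries share one vector space, the same function works whether the query vector comes from natural language text or from a code snippet.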

To learn the embeddings I combine a set of sequence encoders (weighted bag of words in this case) to encode the inputs. The loss function can be intuitively explained as maximizing the inner product between the corresponding code and query pairs while minimizing the inner product between non-corresponding pairs.
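
The encoder and loss described above can be sketched roughly as follows, in plain Python rather than PyTorch. This is one common way to write such an objective (a softmax cross-entropy over the inner products in a batch) and is an assumption about the general shape, not a copy of the project's loss. The token tables and weights are hypothetical.

```python
import math

def encode(tokens, embeddings, weights):
    """Weighted bag of words: a weighted average of the token embeddings."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue  # this sketch simply skips out-of-vocabulary tokens
        w = weights.get(tok, 1.0)  # unweighted tokens default to 1.0
        weight_sum += w
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
    return [x / weight_sum for x in total] if weight_sum else total

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_loss(code_vecs, query_vecs):
    """Mean cross-entropy where query i should select code i out of the batch."""
    loss = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(c, q) for c in code_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(query_vecs)

# Hypothetical embedding tables, one per encoder.
code_emb = {"open": [2.0, 0.0], "json": [2.0, 0.0],
            "socket": [0.0, 2.0], "send": [0.0, 2.0]}
query_emb = {"read": [2.0, 0.0], "file": [2.0, 0.0],
             "http": [0.0, 2.0], "request": [0.0, 2.0]}

code_vecs = [encode(["open", "json"], code_emb, {}),
             encode(["socket", "send"], code_emb, {})]
query_vecs = [encode(["read", "file"], query_emb, {}),
              encode(["http", "request"], query_emb, {})]

aligned = softmax_loss(code_vecs, query_vecs)        # matching pairs line up
swapped = softmax_loss(code_vecs, query_vecs[::-1])  # pairs deliberately mismatched
```

The loss is small when each query has a large inner product with its own code and a small one with every other code in the batch, which is why `aligned` comes out lower than `swapped`.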

To train a repository model I simply take the model that was trained on the CodeSearchNet data, extract the embedding weights, and fine-tune them on a repository-specific dataset that was extracted separately.
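
The warm-start idea can be illustrated with a small plain-Python sketch. Everything here is a simplification for illustration: the token tables are hypothetical, the encoder is an unweighted average, and a squared-distance loss stands in for the real training objective so the gradient can be written by hand. The point is only that the repository model starts from a copy of the base weights and is then updated on repository-specific pairs.

```python
import copy
import math

def mean_vec(tokens, table):
    """Unweighted bag of words: average the embeddings of known tokens."""
    dim = len(next(iter(table.values())))
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fine_tune(base_code_table, base_query_table, pairs, lr=0.5, epochs=20):
    """Start from the base embeddings and pull matching pairs together."""
    code_table = copy.deepcopy(base_code_table)    # base model stays untouched
    query_table = copy.deepcopy(base_query_table)
    for _ in range(epochs):
        for code_tokens, query_tokens in pairs:
            c = mean_vec(code_tokens, code_table)
            q = mean_vec(query_tokens, query_table)
            diff = [a - b for a, b in zip(c, q)]
            known_c = [t for t in code_tokens if t in code_table]
            known_q = [t for t in query_tokens if t in query_table]
            # Gradient step on 0.5 * ||c - q||^2 for each token embedding.
            for t in known_c:
                code_table[t] = [x - lr * d / len(known_c)
                                 for x, d in zip(code_table[t], diff)]
            for t in known_q:
                query_table[t] = [x + lr * d / len(known_q)
                                  for x, d in zip(query_table[t], diff)]
    return code_table, query_table

# Hypothetical base weights and one repository-specific (code, docstring) pair.
base_code = {"fetch": [1.0, 0.0], "user": [0.0, 1.0]}
base_query = {"get": [-1.0, 0.0], "account": [0.0, -1.0]}
repo_pairs = [(["fetch", "user"], ["get", "account"])]

code_toks, query_toks = repo_pairs[0]
before = distance(mean_vec(code_toks, base_code), mean_vec(query_toks, base_query))
repo_code, repo_query = fine_tune(base_code, base_query, repo_pairs)
after = distance(mean_vec(code_toks, repo_code), mean_vec(query_toks, repo_query))
```

Copying the base tables first means one base model can seed any number of repository models without being modified itself.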

I built the neural model in PyTorch and I'm using AnnoyIndex for nearest neighbor search. The web application backend is written in Django and the frontend of the web application and web extension is written in Vue.

Challenges I ran into

  • Providing fast search results
  • Using GitHub's tree-sitter to parse and extract functions from GitHub repositories
  • Using base models to fine-tune the repository models

Accomplishments that I'm proud of

During the project, I discovered a bug in GitHub's CodeSearchNet. The problem was in the configuration of the nearest neighbor search. Fixing the bug nearly doubled the final evaluation metrics.

What I learned

I learned how to handle large machine learning tasks end to end, from preprocessing the data to deploying trained PyTorch models into production.

What's next for CodeSnippetSearch.net

  • Adding support for more programming languages
  • Adding more repositories to the web application
  • Improving the search results
