Inspiration

My file system is a mess! I store things in weird places and give files names like finalreport_finalFINAL.pdf. I haven't been able to find a file search app that does what I want, specifically terminal-based semantic search, so I decided to create it myself!

What it does

Findir performs semantic search in your file system through your terminal. Set which directories you'd like to track, and it will index all the files into a vector database. Search based on meaning or description, instead of file name or location! Indexes will be updated automatically when you add or change files in tracked folders.

How I built it

The terminal app is built in Go using the Bubble Tea TUI framework. Text embeddings are generated in Python with Hugging Face's MiniLM-L6-v2, a small but high-performing sentence transformer model. Search results are found by comparing vector similarity between the search phrase and the indexed embeddings. A lightweight daemon, also written in Go, uses the Linux inotify subsystem to update indexes whenever files are written.
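To make the similarity step concrete, here is a minimal Go sketch of ranking indexed files by cosine similarity against a query embedding. It assumes a brute-force scan over an in-memory slice; the Entry and Result types and the search function are hypothetical names for illustration, not Findir's actual schema or its vector database query path.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Entry pairs an indexed file path with its embedding vector.
// Hypothetical type for illustration only.
type Entry struct {
	Path      string
	Embedding []float32
}

// Result is a search hit with its similarity score.
type Result struct {
	Path  string
	Score float64
}

// cosine returns the cosine similarity of a and b, or 0 if the
// lengths differ or either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// search scores every entry against the query and returns the k
// highest-scoring results, best first.
func search(query []float32, index []Entry, k int) []Result {
	results := make([]Result, 0, len(index))
	for _, e := range index {
		results = append(results, Result{Path: e.Path, Score: cosine(query, e.Embedding)})
	}
	sort.SliceStable(results, func(i, j int) bool {
		return results[i].Score > results[j].Score
	})
	if k >= 0 && k < len(results) {
		results = results[:k]
	}
	return results
}

func main() {
	// Toy 3-dimensional vectors; real MiniLM embeddings are much longer.
	index := []Entry{
		{Path: "notes/taxes.txt", Embedding: []float32{1, 0, 0}},
		{Path: "code/server.go", Embedding: []float32{0, 1, 0}},
		{Path: "reports/budget.pdf", Embedding: []float32{0.9, 0.1, 0}},
	}
	for _, r := range search([]float32{1, 0, 0}, index, 2) {
		fmt.Printf("%.3f  %s\n", r.Score, r.Path)
	}
}
```

A linear scan like this is fine for a few thousand files; a dedicated vector index becomes worthwhile as the number of embeddings grows.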

I also implemented different custom parsers for different file types, detailed below.

Plaintext (.txt, .log, .csv)

Indexes text data directly with minimal cleaning and trimming.

Markdown (.md)

Indexes text content, stripping formatting syntax using the AST generated by goldmark.

PDF (.pdf)

Uses poppler-utils for robust PDF text extraction. Scanned PDFs requiring OCR cannot be indexed.
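As a rough sketch of what the extraction call could look like from Go, the example below shells out to pdftotext (part of poppler-utils) and writes the text to stdout by passing "-" as the output file. The extractPDFText function and the errNoText error are hypothetical names, and treating empty output as a probable scanned PDF is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// errNoText signals that pdftotext ran but produced no text, which is
// what typically happens with image-only (scanned) PDFs.
var errNoText = errors.New("no extractable text (possibly a scanned PDF)")

// extractPDFText runs `pdftotext -enc UTF-8 <path> -` and returns the
// trimmed text. It requires pdftotext to be installed and on PATH.
func extractPDFText(path string) (string, error) {
	out, err := exec.Command("pdftotext", "-enc", "UTF-8", path, "-").Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext %s: %w", path, err)
	}
	text := strings.TrimSpace(string(out))
	if text == "" {
		return "", errNoText
	}
	return text, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pdftext FILE.pdf")
		return
	}
	text, err := extractPDFText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(text)
}
```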

Code (.c, .h, .cpp, .cc, .cxx, .hpp, .hxx, .java, .js, .jsx, .go, .rs)

Extracts and indexes definitions of methods/functions, classes, and structs, as well as comments, for the languages above. Code files are parsed using Tree-sitter.

Markup (.html, .xhtml, .xml, .svg, .docx, .pptx, .xlsx)

Extracts and indexes text content from tags, alt text, and metadata. Microsoft Office formats (.docx, .pptx, .xlsx) are unzipped before parsing.
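The unzip-then-parse step for Office files can be sketched with Go's standard library alone. This example handles only .docx and only the main document part (word/document.xml); extractDocxText, xmlText, and sampleDocx are hypothetical helpers written for illustration, and the in-memory sample archive stands in for a real file.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// extractDocxText opens a .docx (a zip archive) and returns the text of
// word/document.xml, one line per paragraph.
func extractDocxText(r io.ReaderAt, size int64) (string, error) {
	zr, err := zip.NewReader(r, size)
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		return xmlText(rc)
	}
	return "", errors.New("word/document.xml not found")
}

// xmlText collects character data inside <t> elements and writes a
// newline at the end of each <p> element.
func xmlText(r io.Reader) (string, error) {
	dec := xml.NewDecoder(r)
	var sb strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			inText = t.Name.Local == "t"
		case xml.EndElement:
			inText = false
			if t.Name.Local == "p" {
				sb.WriteByte('\n')
			}
		case xml.CharData:
			if inText {
				sb.Write(t)
			}
		}
	}
	return strings.TrimSpace(sb.String()), nil
}

// sampleDocx builds a tiny in-memory archive shaped like a .docx.
func sampleDocx() *bytes.Reader {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		panic(err)
	}
	io.WriteString(w, `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>`+
		`<w:p><w:r><w:t>Quarterly budget</w:t></w:r></w:p>`+
		`<w:p><w:r><w:t>Draft </w:t></w:r><w:r><w:t>two</w:t></w:r></w:p>`+
		`</w:body></w:document>`)
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return bytes.NewReader(buf.Bytes())
}

func main() {
	r := sampleDocx()
	text, err := extractDocxText(r, r.Size())
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

The same pattern would apply to .pptx and .xlsx, which keep their text in different parts of the archive (for example, slide XML files and the shared strings table).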

LaTeX (.tex)

Indexes text content after stripping math and commands using regex.
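Here is a minimal sketch of regex-based stripping in Go. The rule list and the stripLatex function are assumptions made for illustration, not Findir's actual patterns. It keeps the arguments of text commands (so \section{Results} contributes "Results") and does not handle every edge case, such as escaped dollar signs or nested braces.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// latexRules are applied in order. Math is removed before generic
// commands so that math bodies disappear along with their delimiters.
var latexRules = []struct {
	re   *regexp.Regexp
	repl string
}{
	// Comments: an unescaped % through end of line.
	{regexp.MustCompile(`(?m)(^|[^\\])%.*$`), "${1}"},
	// Display math environments.
	{regexp.MustCompile(`(?s)\\begin\{(equation|align|displaymath)\*?\}.*?\\end\{(equation|align|displaymath)\*?\}`), " "},
	// $$...$$, \[...\], \(...\), and inline $...$.
	{regexp.MustCompile(`(?s)\$\$.*?\$\$`), " "},
	{regexp.MustCompile(`(?s)\\\[.*?\\\]`), " "},
	{regexp.MustCompile(`(?s)\\\(.*?\\\)`), " "},
	{regexp.MustCompile(`\$[^$]*\$`), " "},
	// Commands whose argument is not prose: drop the argument too.
	{regexp.MustCompile(`\\(documentclass|usepackage|begin|end|label|ref|cite)\*?(\[[^\]]*\])?\{[^}]*\}`), " "},
	// Any other command name plus an optional [..] argument.
	{regexp.MustCompile(`\\[a-zA-Z]+\*?(\[[^\]]*\])?`), " "},
	// Leftover braces, then collapse whitespace.
	{regexp.MustCompile(`[{}]`), ""},
	{regexp.MustCompile(`\s+`), " "},
}

// stripLatex returns the prose content of src with math, comments, and
// command syntax removed.
func stripLatex(src string) string {
	for _, rule := range latexRules {
		src = rule.re.ReplaceAllString(src, rule.repl)
	}
	return strings.TrimSpace(src)
}

func main() {
	src := `\section{Results} The \textbf{mean} is $\mu = 3$ and grows. % note to self`
	fmt.Println(stripLatex(src))
}
```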

Challenges I ran into

The main challenge was coming up with the architecture! There were many different options for languages and approaches, each with its own tradeoffs. I ended up choosing Go for its ease of implementation and robust TUI framework, and Python for its mature ML libraries.

Accomplishments that I'm proud of

I'm proud of completing my first solo hackathon and making a project I will definitely use myself!

What we learned

I learned about embeddings and vector similarity, how to make a persistent daemon in Go, and how to design an efficient app for an open-ended problem.

What's next for Findir

  • More file formats
  • Better code search
  • Improved UX
