Google is a huge company built on traditional information retrieval technology: it returns a list of webpages, and you often still have to hunt through them for the information you’re looking for. In the future, I believe information retrieval will be more conversational: the search engine will understand your questions better than it does now and reply directly with an answer.
I met someone a few months ago at a hack night at the HF0 house in Taiwan who told me that ColBERT is (was?) the state-of-the-art language model for these sorts of applications, so I built a project to try it out and see how well it works.
What it does
Answers questions about Miami Hack Week based on a corpus of messages from our Slack workspace.
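For context, ColBERT ranks passages with "late interaction": each query token embedding is matched to its most similar token embedding in the message, and those maxima are summed. The sketch below illustrates only that scoring rule in plain Python. The tiny hand-written vectors and the names `maxsim_score` and `rank` are made up for illustration and are not part of the ColBERT codebase, which computes the embeddings with a BERT encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best-matching
    document token, then sum those maxima over all query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max((sum(a * b for a, b in zip(qv, dv)) for dv in d), default=0.0)
        for qv in q
    )


def rank(query_vecs, docs):
    """Return document names ordered from highest to lowest score."""
    scored = [(maxsim_score(query_vecs, vecs), name) for name, vecs in docs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy 2-dimensional "token embeddings" (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "msg_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "msg_b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query token
}
ranking = rank(query, docs)
```

Here `msg_a` ranks first because every query token finds a close match in it, while `msg_b` matches only one of the two.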
How we built it
- Built a Slack scraper in Python to dump messages from the Miami Hack Week Slack workspace into a SQLite database
- Downloaded the ColBERT code from https://github.com/stanford-futuredata/ColBERT
- Downloaded a pretrained ColBERT model from https://huggingface.co/vespa-engine/colbert-medium/tree/main
- Wrote some glue code to index the Slack messages and a frontend to do retrieval
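
The storage side of the first step could look roughly like the sketch below, which uses only Python's built-in `sqlite3` module. The table layout and the function names `open_db`, `save_messages`, and `load_corpus` are assumptions for illustration, not the project's actual schema. It also assumes each scraped message is a dict with `ts`, `user`, and `text` keys, as in Slack's message payloads.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the messages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               channel TEXT NOT NULL,
               ts      TEXT NOT NULL,
               user    TEXT,
               text    TEXT NOT NULL,
               PRIMARY KEY (channel, ts)
           )"""
    )
    return conn


def save_messages(conn, channel, messages):
    """Insert messages, skipping ones without text.

    Returns the number of rows submitted; rows whose (channel, ts) already
    exists are silently ignored, so re-running the scraper is safe.
    """
    rows = [
        (channel, m["ts"], m.get("user"), m["text"])
        for m in messages
        if m.get("text") and "ts" in m
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


def load_corpus(conn):
    """Return message texts in a stable order, ready to hand to an indexer."""
    cur = conn.execute("SELECT text FROM messages ORDER BY channel, ts")
    return [row[0] for row in cur]


# Hypothetical sample data standing in for scraped Slack messages.
conn = open_db()
sample = [
    {"ts": "1.0", "user": "U1", "text": "Where is the kickoff party?"},
    {"ts": "2.0", "user": "U2", "text": "At the house on Friday."},
    {"ts": "3.0", "user": "U3", "text": ""},  # empty text is skipped
]
save_messages(conn, "general", sample)
save_messages(conn, "general", sample)  # second run adds no duplicates
corpus = load_corpus(conn)
```

Keying on `(channel, ts)` is a design choice in this sketch: it lets the scraper be re-run without creating duplicate rows.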
Challenges we ran into
- The pretrained model was not quite in the format that the ColBERT code expected, and it took quite a bit of messing around to get it working.
- My MacBook Pro can’t run CUDA, so I had to comment out a bunch of CUDA code to get the model running on my CPU.
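
An alternative to commenting out CUDA calls by hand is to choose the device at runtime and fall back to the CPU. This is a minimal sketch of that pattern, not what was done in this project. The helper name `pick_device` is hypothetical, and the ColBERT code may need further changes wherever it calls `.cuda()` directly.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; report CPU so callers can still decide.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Typical use with a PyTorch model or tensor, instead of a hard-coded .cuda():
#     model = model.to(device)
#     batch = batch.to(device)
```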
Accomplishments that we're proud of
I think the result quality is reasonably good, and would be even better with a larger corpus of messages.
What we learned
- Maybe I would’ve been better off working in a Colab or a VM or something that can run CUDA
- Using pretrained models isn’t always as straightforward as one might hope
- If I’d partied less, I maybe could’ve built a prettier frontend
What's next for Miami Hack Week 2021 Q&A bot
A couple of possible directions if people think this shows promise:
- Finish the Slack app and embed it into the Slack UI
- Build a version for Notion documents
- (long shot) Try to make the language model more conversational, like GPT-3, and build a search engine on top of it