Document CoFinder

A cross-lingual semantic search tool

Inspiration

This project was inspired by Elle Neal's work on semantic search. A member of the Cohere community, she wrote a short tutorial on how she built a semantic search tool for that community. It got us thinking about other ways to apply a similar idea to internal documents at an organizational level.

What it does

Document CoFinder uses Cohere's multilingual embeddings and Rerank API to perform semantic searches across documents in different languages and surface the best results for a given query.

How we built it

We used Cohere's multilingual embeddings to build a local index of the documents. Using Annoy, we ran a nearest-neighbour search over that index, then translated the top results to English and shortlisted them further with Rerank. Finally, we used Cohere's generative models to compose a well-worded response to the user's query. The user interface was built and hosted with Streamlit.
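The retrieve, translate, and rerank stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: a brute-force cosine search stands in for the Annoy index, the final generative step is left out, and `embed`, `translate`, and `rerank` are hypothetical callables (with toy stand-ins below) rather than real Cohere or translation-library calls.

```python
import math

# Toy stand-ins so the sketch runs offline. In the real tool these roles are
# played by Cohere's multilingual embeddings, a translation package, and Rerank.
CONCEPTS = {"invoice": 0, "factura": 0, "holiday": 1, "vacaciones": 1}
TRANSLATIONS = {
    "factura de marzo": "march invoice",
    "calendario de vacaciones": "holiday calendar",
}


def toy_embed(texts):
    """Hypothetical embedder: maps words from either language onto shared axes."""
    vectors = []
    for text in texts:
        vector = [0.0, 0.0]
        for word in text.lower().split():
            if word in CONCEPTS:
                vector[CONCEPTS[word]] += 1.0
        vectors.append(vector)
    return vectors


def toy_translate(text):
    """Hypothetical translator: dictionary lookup, falls back to the input."""
    return TRANSLATIONS.get(text, text)


def toy_rerank(query, texts):
    """Hypothetical reranker: returns indices ordered by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(
        range(len(texts)),
        key=lambda i: len(query_words & set(texts[i].lower().split())),
        reverse=True,
    )


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, embed, translate, rerank, k=10, top_n=3):
    """Retrieve k nearest documents, translate them, and keep the top_n after reranking."""
    doc_vectors = embed(docs)  # assumption: the real index is built once and reused
    query_vector = embed([query])[0]

    # Stage 1: nearest-neighbour search (brute force here, Annoy in the project).
    nearest = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vector, doc_vectors[i]),
        reverse=True,
    )[:k]

    # Stage 2: translate the candidates to English.
    english = [translate(docs[i]) for i in nearest]

    # Stage 3: rerank the English candidates and keep the best few.
    return [english[i] for i in rerank(query, english)[:top_n]]


DOCS = ["factura de marzo", "calendario de vacaciones", "invoice template"]
results = search("invoice march", DOCS, toy_embed, toy_translate, toy_rerank, k=2, top_n=1)
print(results)
```

Passing the three stages in as callables keeps the sketch independent of any particular client library; swapping the brute-force search for an approximate index would only change stage 1.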

Challenges we ran into

  • The generative endpoints are rate limited on a trial key, which slowed down testing.
  • Finding a good, stable translation package was hard because the previously popular packages are no longer maintained.
  • We tried to use other open-source models, but most were either too heavy or required an overhaul of our existing prompts to come up to par with the performance we wanted.
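A common way to soften rate limits like the one mentioned above is to retry with exponential backoff. The sketch below is only an illustration of that general technique, not a description of what the project does: `RateLimited` is a hypothetical stand-in for whatever error a client raises when it is throttled, and `call_with_backoff` is an illustrative helper name.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the error a client raises when throttled."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Wait base_delay * 2**attempt seconds, plus jitter so that
            # several clients do not all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests; in normal use the default `time.sleep` applies.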

Accomplishments that we're proud of

We ended up with a working product that is genuinely useful. Rate limits aside, it works really well.

What we learned

Simple solutions can sometimes outperform complex ones, and it helps to start early and build incrementally.

What's next for Document CoFinder

We would like to expand the types of documents it works on beyond PDFs, explore other models, and allow users to run queries and get responses in their own languages.

Built With

  • annoy
  • cohere
  • python
  • streamlit
  • translators