A cross-lingual semantic search tool
This project was inspired by Elle Neal's work on semantic search. She is a member of the Cohere community who wrote a small tutorial on how she created a semantic search tool for the community. It got me thinking of other ways to implement a similar idea for internal documents at an organizational level.
What it does
Document CoFinder uses Cohere's multilingual embeddings and Rerank API to run semantic searches across documents in different languages and surface the best results for a given query.
How we built it
We used Cohere's multilingual embeddings to build a local index for the documents. Using Annoy, we performed a nearest neighbour search, then translated the top results to English for further shortlisting using Rerank. We then used Cohere's generative models to build a well-worded response to the user's query. The user interface was built and hosted using Streamlit.
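The retrieve, translate, and rerank flow described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `search`, `toy_embed`, and `toy_rerank` are hypothetical names, the embedding and reranking steps are toy stand-ins for the Cohere calls, and an exact brute-force cosine search is used in place of Annoy's approximate index so the example runs with only the standard library.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    query: str,
    docs: List[str],
    embed: Callable[[List[str]], List[List[float]]],
    rerank: Callable[[str, List[str]], List[int]],
    translate: Callable[[str], str] = lambda text: text,
    k: int = 10,
    top_n: int = 3,
) -> List[str]:
    """Two-stage search: nearest neighbours first, then rerank the shortlist."""
    # Stage 1: embed everything and keep the k documents closest to the query.
    # (Assumption: exact search here; the project used an Annoy index instead.)
    doc_vecs = embed(docs)
    query_vec = embed([query])[0]
    ranked = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    shortlist = ranked[:k]

    # Stage 2: translate the shortlist, then let the reranker reorder it.
    translated = [translate(docs[i]) for i in shortlist]
    order = rerank(query, translated)  # indices into the shortlist, best first
    return [docs[shortlist[j]] for j in order[:top_n]]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding model: letter-frequency vectors."""
    vectors = []
    for text in texts:
        counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
        vectors.append([float(counts[chr(ord("a") + i)]) for i in range(26)])
    return vectors


def toy_rerank(query: str, candidates: List[str]) -> List[int]:
    """Hypothetical stand-in for a reranker: order by shared-word count."""
    query_words = set(query.lower().split())
    overlap = [len(query_words & set(c.lower().split())) for c in candidates]
    return sorted(range(len(candidates)), key=lambda j: overlap[j], reverse=True)


docs = [
    "Vacation policy for employees",
    "Server maintenance schedule",
    "Employee vacation request form",
    "Quarterly sales report",
]
results = search("vacation policy", docs, toy_embed, toy_rerank, k=3, top_n=2)
print(results)
```

Passing the embedding, translation, and reranking steps in as callables keeps the pipeline shape visible while leaving room to swap in real services for each stage.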
Challenges we ran into
- Rate limiting for the generative endpoints while using a trial key was a challenge.
- Finding a good and stable translation package was also hard since the previously popular packages are no longer maintained.
- We tried to use other open-source models, but most were either too heavy or required an overhaul of our existing prompts to come up to par with the performance we wanted.
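One common way to soften rate limits like those in the first challenge is to retry with exponential backoff. The sketch below is an assumption about how that could look, not something the project is known to have used; `call_with_backoff` and `is_rate_limit` are hypothetical names, and the caller decides which exceptions count as a rate-limit error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_rate_limit: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limit(exc) or attempt == max_attempts - 1:
                raise
            # Double the wait on each attempt and add jitter to spread retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

A generative call could then be wrapped as `call_with_backoff(lambda: generate(prompt), is_rate_limit=...)`, where `generate` stands for whatever client call is being throttled.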
Accomplishments that we're proud of
We built a working product that is genuinely useful. Outside of rate limits, it works really well.
What we learned
Simple can sometimes perform better than complex solutions, and it helps to start early and build incrementally.
What's next for Document Cofinder
We would like to expand the types of documents it works on beyond PDFs, explore other models, and allow users to run queries and get responses in their own languages.