Inspiration
This project stems from a desire to make legal information more accessible and user-friendly. The BOE (Boletín Oficial del Estado) is a crucial source of legal documentation in Spain, containing important laws, regulations, and government announcements. However, navigating the BOE and extracting relevant information from it can be a daunting task for many people. With this Assistant, we aim to simplify that process so users can easily find answers to their questions about the BOE. Whether the goal is to understand a specific legal provision, clarify a regulation, or access an official announcement, the project seeks to bridge the gap between users and the vast repository of information in the BOE, helping people stay informed and make well-informed decisions.
What it does
From the beginning, we wanted to create a product that is both very simple and very powerful. It indexes all the documents published each day (ideally in real time), keeps them well structured, and exposes them through a very simple interface. The power of the project lies in the simplicity shown to the end user, even though extracting data from official sources is complex internally.
How we built it
All BOE articles are embedded as vectors and stored in a vector database. When a question is asked, it is embedded in the same latent space, and the embedded question is used to query the vector database for the most relevant pieces of text. The retrieved text is then sent to the LLM to construct an answer.
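The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, `InMemoryVectorStore` is a hypothetical stand-in for the vector database, the sample sentences are made up, and the final LLM call is left out, so the sketch stops at the prompt.

```python
import hashlib
import math
import re

DIM = 256  # toy embedding size (assumption); real models use their own dimension


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class InMemoryVectorStore:
    """Hypothetical stand-in for the vector database."""

    def __init__(self) -> None:
        self._records: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self._records.append((embed(text), text))

    def query(self, question: str, top_k: int = 2) -> list[str]:
        q = embed(question)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), t) for v, t in self._records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [t for _, t in scored[:top_k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up sample passages, not real BOE content.
store = InMemoryVectorStore()
store.add("The minimum wage is updated by royal decree each year.")
store.add("Public holidays are published in the official calendar.")
store.add("Grants for renewable energy projects open in March.")

question = "When is the minimum wage updated?"
passages = store.query(question, top_k=1)
prompt = build_prompt(question, passages)
```

In a real deployment the toy `embed` would be replaced by an actual embedding model and the store by the vector database; the shape of the pipeline (embed, query, build prompt) stays the same.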
Challenges we ran into
The main challenge was analyzing every document and its structure in the original XML so that we could make good use of the embeddings and their metadata, and then run advanced searches over the vectors that minimize search time and maximize the quality of the results.
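As a rough illustration of turning one XML document into chunks that carry metadata, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`documento`, `metadatos`, `texto`, `p`) and the sample content are assumptions made for this example and may not match the real BOE schema.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the element names are assumptions, not the real BOE schema.
SAMPLE_XML = """
<documento>
  <metadatos>
    <identificador>DOC-0001</identificador>
    <titulo>Sample resolution</titulo>
    <fecha>2023-06-01</fecha>
  </metadatos>
  <texto>
    <p>First paragraph of the document.</p>
    <p>Second paragraph of the document.</p>
  </texto>
</documento>
"""


def xml_to_chunks(xml_text: str) -> list[dict]:
    """Split one document into per-paragraph chunks, each carrying the metadata."""
    root = ET.fromstring(xml_text.strip())
    meta_node = root.find("metadatos")
    meta = {}
    if meta_node is not None:
        meta = {child.tag: (child.text or "").strip() for child in meta_node}
    chunks = []
    for index, paragraph in enumerate(root.iter("p")):
        text = "".join(paragraph.itertext()).strip()
        if text:
            # Copy the document-level metadata onto every chunk so it can be
            # used later for filtering.
            chunks.append({"text": text, "metadata": {**meta, "paragraph": index}})
    return chunks


chunks = xml_to_chunks(SAMPLE_XML)
```

Each chunk can then be embedded on its own, with its `metadata` dictionary stored next to the vector.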
We are also studying how to cache the results.
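One possible direction for caching, sketched here with the standard library's `functools.lru_cache`. The `answer_uncached` function is a placeholder for the expensive retrieve-and-generate step, and the normalisation rule is an assumption; this is not the approach the project has settled on.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive path actually runs


def answer_uncached(question: str) -> str:
    """Placeholder for the expensive retrieve-and-generate step."""
    calls["count"] += 1
    return f"answer to: {question}"


def normalise(question: str) -> str:
    # Collapse case and whitespace so trivially different phrasings share a key.
    return " ".join(question.lower().split())


@lru_cache(maxsize=1024)
def _cached_answer(normalised_question: str) -> str:
    return answer_uncached(normalised_question)


def answer(question: str) -> str:
    return _cached_answer(normalise(question))


first = answer("What is the BOE?")
second = answer("  what is   the BOE? ")  # served from the cache
```

Because new bulletins arrive every day, cached answers can go stale; calling `_cached_answer.cache_clear()` after each ingestion run is one simple way to invalidate them.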
Accomplishments that we're proud of
The system is very modular in how data sources are organized and how embeddings are inserted into Pinecone. This makes it easy to bring in more contributors who each focus on a specific source, because the metadata attached to the embeddings lets us filter by source.
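To illustrate filtering by source, the sketch below imitates in pure Python the Mongo-style operators (`$eq`, `$in`) that Pinecone's metadata filters use. It does not call the Pinecone client; the `matches` helper, the record layout, and the `regional-bulletin` source name are all hypothetical.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for field, condition in flt.items():
        value = metadata.get(field)
        if "$eq" in condition and value != condition["$eq"]:
            return False
        if "$in" in condition and value not in condition["$in"]:
            return False
    return True


# Hypothetical records: each embedding carries the source it came from.
records = [
    {"id": "a", "metadata": {"source": "BOE"}},
    {"id": "b", "metadata": {"source": "regional-bulletin"}},
    {"id": "c", "metadata": {"source": "BOE"}},
]

boe_only = [r["id"] for r in records if matches(r["metadata"], {"source": {"$eq": "BOE"}})]
either = [
    r["id"]
    for r in records
    if matches(r["metadata"], {"source": {"$in": ["BOE", "regional-bulletin"]}})
]
```

With this layout, adding a new source only means tagging its embeddings with a new `source` value; existing queries are unaffected.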
What we learned
We've learned a lot about Pinecone. Its indexing algorithms enable fast and accurate similarity search. The fact that Pinecone can handle billions of high-dimensional vectors and scale horizontally makes it a good fit for our goal, since the corpus of Spanish bulletins is huge. We do not have to worry about maintenance or infrastructure, which lets us focus solely on developing and deploying.
What's next for BOE Assistant
We are evaluating and learning about real-time data ingestion. Government bulletins can arrive at different times of the day, so we would like to index new content as soon as we receive the webhook announcing it.
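A minimal sketch of what webhook-driven ingestion could look like, assuming a hypothetical payload shape (`{"documents": [{"id": ..., "text": ...}]}`) and a placeholder `index_document` function. The main point is idempotency: a webhook may be delivered more than once, so documents that were already indexed are skipped.

```python
indexed_ids: set[str] = set()   # IDs already in the index
index_log: list[str] = []       # records the order in which documents were indexed


def index_document(doc_id: str, text: str) -> None:
    """Placeholder for embedding the text and upserting it into the vector database."""
    index_log.append(doc_id)


def handle_webhook(payload: dict) -> int:
    """Index the documents in one notification and return how many were new."""
    new_count = 0
    for doc in payload.get("documents", []):
        doc_id = doc["id"]
        if doc_id in indexed_ids:
            continue  # repeated delivery or overlap with an earlier notification
        index_document(doc_id, doc["text"])
        indexed_ids.add(doc_id)
        new_count += 1
    return new_count


# Hypothetical payloads; the second overlaps with the first.
first = handle_webhook({"documents": [{"id": "DOC-1", "text": "..."}, {"id": "DOC-2", "text": "..."}]})
second = handle_webhook({"documents": [{"id": "DOC-2", "text": "..."}, {"id": "DOC-3", "text": "..."}]})
```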
Built With
- langchain
- pinecone
- python