Inspiration
The inspiration came from my experience opening a small business for the first time, a couple of years back, before ChatGPT went mainstream. I had a lot of legal, financial, and accounting questions, but I didn't know where to start looking for answers myself. I took my accountant's word for granted, but I wished I could have researched the topics on my own. That is why I decided to create Themis.
What it does
It answers legal & fiscal questions from people in general, and business owners in particular, offering detailed answers along with exact references to the articles used to compose them. It works as a starting point for legal research: if the answer is sufficient and self-evident, there's no need for the user to pay for a lawyer; if in doubt, the user can still consult one at any time.
How we built it
The project consists of multiple parts:
Infrastructure
Comprised of an Elasticsearch server, Kibana, and an MCP server (described below)
Law Ingestion
- it takes laws, extracts the individual articles, and saves them to Elasticsearch
- for every article, it identifies links to other articles, then saves them as bidirectional links
- for every law it creates an article index (containing the raw articles) and a segment index (used for semantic + text search)
- the segments saved in the segment index are sentences, individual paragraphs, and subparagraphs of the law. This segmentation improves the accuracy of semantic searches
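The segmentation step above can be sketched roughly as follows; the regexes and the segment shape are my assumptions for illustration, not Themis's actual schema:

```python
import re

def segment_article(article_text: str) -> list[dict]:
    """Split one article into paragraph- and sentence-level segments
    for the segment index (illustrative sketch only)."""
    segments = []
    # Paragraphs in Romanian laws are often numbered "(1)", "(2)", ...
    paragraphs = re.split(r"\n(?=\(\d+\))", article_text.strip())
    for p_idx, para in enumerate(paragraphs, start=1):
        para = para.strip()
        if not para:
            continue
        segments.append({"type": "paragraph", "ref": f"para-{p_idx}", "text": para})
        # Naive sentence split on terminal punctuation followed by whitespace
        for s_idx, sent in enumerate(re.split(r"(?<=[.;])\s+", para), start=1):
            if sent.strip():
                segments.append({"type": "sentence",
                                 "ref": f"para-{p_idx}-sent-{s_idx}",
                                 "text": sent.strip()})
    return segments
```

Each segment would then be indexed as its own document, so a semantic query matches the smallest relevant unit rather than a whole article.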
MCP Server
The MCP server exposes Elasticsearch data to the AI agent. It has 2 tools:
- themis_search_law_ro - performs the semantic + text search based on a sentence passed by the AI. The result is a list of articles alongside some metadata
- themis_find_related_articles - for a given law id + article number, it finds other articles linked to the given article (using the bidirectional links mentioned above)
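The behaviour of the two tools can be sketched as below. In the real system these functions would be registered with an MCP server and query Elasticsearch; here an in-memory stand-in replaces the indices, and the field names and link structure are illustrative guesses:

```python
# Stand-in for the segment index and the bidirectional link store.
SEGMENT_INDEX = [
    {"law_id": "L227/2015", "article": 316, "text": "VAT registration rules ..."},
]
LINKS = {
    ("L227/2015", 316): [("L227/2015", 317)],
}

def themis_search_law_ro(query: str) -> list[dict]:
    """Search tool: in production this would run a hybrid
    semantic + full-text query against the segment index."""
    terms = query.lower().split()
    return [seg for seg in SEGMENT_INDEX
            if any(t in seg["text"].lower() for t in terms)]

def themis_find_related_articles(law_id: str, article: int) -> list[tuple]:
    """Related-articles tool: follow the stored bidirectional links."""
    return LINKS.get((law_id, article), [])
```

The agent calls the search tool first, then optionally expands context with the related-articles tool before composing its answer.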
Challenges we ran into
Article Extraction
Themis takes PDFs containing law articles as input. Basically, it receives unstructured data that I need to make sense of. Worse, there wasn't a single format for laws & articles at the start, but several. What I had to do was:
- extract the articles from a PDF, alongside metadata like article name, chapter, and title
- the PDFs spanned multiple pages and contained headers, footers, and page numbers; these all had to be removed so they would not appear in article texts
- extract the articles linked from the current article
These were all hurdles I needed to work through to get the data into the shape I needed. I initially thought of passing a law PDF to an LLM and having it return the list of article texts; however, laws can span dozens or even hundreds of pages, so I worried whether that was technically feasible, and especially about the cost. In the end, I devised multiple algorithms to remove metadata such as headers, footers, and page numbers, then extract article text along with metadata such as article name, title, and chapter.
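One common heuristic for stripping page furniture, sketched here under my own assumptions (Themis's actual algorithms may differ), is to drop lines that repeat across most pages plus bare page numbers:

```python
import re
from collections import Counter

def strip_page_furniture(pages: list[list[str]]) -> list[str]:
    """Remove headers/footers (lines repeated on most pages) and bare
    page numbers before article extraction. Threshold is illustrative."""
    # Count each distinct line once per page it appears on
    counts = Counter(line for page in pages for line in set(page))
    threshold = max(2, int(0.6 * len(pages)))
    cleaned = []
    for page in pages:
        for line in page:
            if counts[line] >= threshold:         # repeated header/footer
                continue
            if re.fullmatch(r"\s*\d+\s*", line):  # bare page number
                continue
            cleaned.append(line)
    return cleaned
```

The surviving lines can then be fed to the article extractor, which looks for "Art." markers and chapter/title headings.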
E2E tests
Unit tests were not enough; I also needed tests for the agent itself, to make sure the answers follow a specific pattern. However, I can't test for an exact answer, as the wording changes every time I ask the same question. After some thought, I decided to first test the references an answer mentions:
- mandatory references - that need to always be present
- optional references - connected articles that could be mentioned in an answer
- no unrelated references - I need to make sure that when I ask a question about VAT codes, the answer doesn't reference an article about last wills. I do this by listing the mandatory and optional references, then expecting no others
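The three reference rules above boil down to simple set logic; a minimal sketch (the function and reference format are hypothetical, not Themis's test code):

```python
def check_references(cited: set[str], mandatory: set[str], optional: set[str]) -> list[str]:
    """Return a list of failure messages for the three reference rules:
    all mandatory references present, and nothing cited outside
    mandatory + optional."""
    failures = []
    missing = mandatory - cited
    if missing:
        failures.append(f"missing mandatory references: {sorted(missing)}")
    unrelated = cited - mandatory - optional
    if unrelated:
        failures.append(f"unrelated references: {sorted(unrelated)}")
    return failures  # empty list means the answer passes
```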
Additionally, I decided to test a response by checking whether a specific idea is mentioned. Since I can't check for this with regexes, I use an LLM to test for the presence of that idea in the final response.
I decided that Cucumber was the best way to implement these tests.
Accomplishments that we're proud of
This is by far the most complicated AI project I've worked on, and I managed to get it up & running by myself.
What we learned
- I learnt to work with external inference endpoints
- learnt about Elasticsearch's semantic_text field type (previously I used dense_vector fields directly)
- learnt how to build MCP servers from scratch & how to manually test them
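For context on the semantic_text point above: the field type delegates chunking and embedding to a configured inference endpoint, replacing hand-managed dense_vector fields. A sketch of what such a segment-index mapping might look like (field names and the inference endpoint id are my assumptions):

```python
# Illustrative mapping only; Themis's real index settings may differ.
segment_mapping = {
    "mappings": {
        "properties": {
            "law_id":  {"type": "keyword"},
            "article": {"type": "integer"},
            # semantic_text handles chunking + embedding via the
            # inference endpoint, instead of manual dense_vector fields
            "text": {"type": "semantic_text",
                     "inference_id": "my-embedding-endpoint"},
        }
    }
}
# With an Elasticsearch client, this could be applied via
# es.indices.create(index="themis-segments", **segment_mapping)
```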
What's next for themis
- improve ingestion - the ingestion mechanism currently expects a PDF with the actual law. It could be significantly improved by accepting a URL to the law and extracting the data from there
- more laws - I have manually added several laws so far; the system needs many more to answer a wider variety of questions
- multiple countries - the system was built first for the Romanian legal system, but it works the same way for laws from other countries. This would be especially useful for SMEs looking to expand into other EU countries, giving them a starting point for their legal & accounting questions
Built With
- docker
- elasticsearch
- mcp
- openai
- python