As students navigating hackathons, internships, and early-stage startups, we constantly ran into dense legal documents—NDAs, term sheets, and incorporation papers—that were difficult to interpret without legal help. We thought: what if we could build an AI agent that reads legal docs for you, finds what matters, and explains it clearly—instantly? That's how ClauseSense was born.

ClauseSense is an AI-powered legal document analyzer that ingests legal files, retrieves relevant clauses using semantic search, and rewrites complex legal language as actionable, plain-English reports.

Document Ingestion: We used python-docx to extract text from model legal documents (Voting Agreement, SPA, IRA, etc.).
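A minimal sketch of the ingestion step, assuming the documents only need their body paragraphs. The function names `join_paragraphs` and `extract_text` are hypothetical; `Document` and `.paragraphs` are the python-docx calls.

```python
from typing import Iterable


def join_paragraphs(paragraphs: Iterable[str]) -> str:
    """Join non-empty paragraph strings with newlines, trimming whitespace."""
    cleaned = (p.strip() for p in paragraphs)
    return "\n".join(p for p in cleaned if p)


def extract_text(path: str) -> str:
    """Read a .docx file and return its body paragraphs as plain text."""
    # Imported lazily so join_paragraphs works without python-docx installed.
    from docx import Document  # provided by the python-docx package

    document = Document(path)
    return join_paragraphs(paragraph.text for paragraph in document.paragraphs)
```

Note that `document.paragraphs` covers body paragraphs only; text inside tables, headers, and footers would need separate handling.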

Clause Segmentation: Using regex and heuristics based on legal structure (e.g., "WHEREAS", "Section"), we split long documents into clause-sized chunks.
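The segmentation idea can be sketched roughly as follows. The boundary pattern is an assumption for illustration (lines beginning with "WHEREAS" or "Section N"), not the exact heuristics we used, and `split_clauses` is a hypothetical name.

```python
import re
from typing import List

# Assumed boundary rule: a clause starts at a line that begins with a recital
# ("WHEREAS") or a numbered section heading ("Section 3").
CLAUSE_START = re.compile(r"^(?=WHEREAS\b|Section\s+\d+)", re.MULTILINE)


def split_clauses(text: str) -> List[str]:
    """Split a document into chunks at each detected clause boundary."""
    # Splitting on a zero-width match keeps the heading inside its own chunk.
    chunks = CLAUSE_START.split(text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Anchoring the pattern to the start of a line means an inline cross-reference such as "as set out in Section 2" does not trigger a split, although a wrapped line that happens to begin with "Section 2" still would.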

Vectorization & Indexing (RAG):

We embedded all clauses using sentence-transformers (all-MiniLM-L6-v2) and indexed them in FAISS.

User queries are embedded and matched against this index to retrieve semantically similar clauses.
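To show the shape of the retrieval step without any model downloads, here is a dependency-free stand-in. It is not our actual stack: a toy bag-of-words vector takes the place of the all-MiniLM-L6-v2 embedding, and a brute-force cosine scan takes the place of the FAISS index. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy embedding: unit-length word counts (stand-in for a sentence encoder)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {word: v / norm for word, v in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """For unit-length vectors, the dot product equals cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def build_index(clauses: List[str]) -> List[Tuple[str, Vector]]:
    """Embed every clause once, up front (the role FAISS plays in the real app)."""
    return [(clause, embed(clause)) for clause in clauses]


def search(index: List[Tuple[str, Vector]], query: str, k: int = 3) -> List[Tuple[float, str]]:
    """Embed the query and return the k highest-scoring (score, clause) pairs."""
    q = embed(query)
    scored = [(cosine(q, vec), clause) for clause, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A word-count vector only matches exact tokens, so "investor" and "investors" do not overlap here. Closing that gap is the reason to use a learned sentence embedding instead.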

Simplifier Agent: Each retrieved clause is passed to a T5-based summarizer, which simplifies complex legal language into understandable text.
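A sketch of how the simplifier call might look with the Hugging Face `transformers` summarization pipeline. The `t5-small` checkpoint, the length limits, and the function names are assumptions for illustration, and the sketch presumes a `transformers` release that still provides the `"summarization"` pipeline task.

```python
from typing import Callable, Dict, List

# A summarizer takes text plus generation options and returns a list of
# dicts with a "summary_text" key, matching the transformers pipeline output.
Summarizer = Callable[..., List[Dict[str, str]]]


def load_t5_summarizer(model_name: str = "t5-small") -> Summarizer:
    """Build a summarization pipeline; the model name is an assumed default."""
    from transformers import pipeline  # lazy import: heavy dependency

    return pipeline("summarization", model=model_name)


def simplify_clause(clause: str, summarizer: Summarizer, max_length: int = 60) -> str:
    """Return a shorter restatement of one clause using the given summarizer."""
    result = summarizer(clause, max_length=max_length, min_length=5, do_sample=False)
    return result[0]["summary_text"].strip()
```

Passing the summarizer in as an argument keeps `simplify_clause` easy to test with a stub and makes it simple to swap in a fine-tuned model later.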

FastAPI Backend: We wrapped the logic into an API endpoint (/report) that accepts queries and returns legal reports.
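The endpoint could be wired up along these lines. The POST method, the `query` request field, the response shape, and the `build_report` / `create_app` names are assumptions for illustration; only the `/report` path comes from our implementation.

```python
from typing import Callable, Dict, List


def build_report(
    query: str,
    retrieve: Callable[[str], List[str]],
    simplify: Callable[[str], str],
) -> Dict[str, object]:
    """Retrieve clauses for the query and pair each with its simplified text."""
    findings = [
        {"clause": clause, "plain_english": simplify(clause)}
        for clause in retrieve(query)
    ]
    return {"query": query, "findings": findings}


def create_app(retrieve: Callable[[str], List[str]], simplify: Callable[[str], str]):
    """Create a FastAPI app exposing the report logic at POST /report."""
    # Lazy imports so build_report stays usable without the web stack installed.
    from fastapi import FastAPI
    from pydantic import BaseModel

    class ReportRequest(BaseModel):
        query: str  # assumed request body: {"query": "..."}

    app = FastAPI(title="ClauseSense")

    @app.post("/report")
    def report(request: ReportRequest):
        return build_report(request.query, retrieve, simplify)

    return app
```

Keeping `build_report` free of any web framework code means the retrieval and simplification flow can be exercised directly, without starting a server.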

Frontend Interface: We built a lightweight HTML/JavaScript UI that interacts with the API, lets users type legal questions, and displays the resulting reports.

Challenges we ran into

Model Accuracy: Pretrained models often hallucinated or oversimplified legal clauses. We plan to fine-tune the summarizer on domain-specific datasets for better results.

Document Parsing Complexity: Legal documents don’t follow consistent formatting. Segmenting them cleanly without breaking context was tricky.

CORS and Frontend Integration: Connecting a local HTML interface to our FastAPI backend required handling CORS issues and formatting requests correctly.
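One way to handle the CORS side is FastAPI's `CORSMiddleware`. The sketch below assumes the frontend sends JSON via POST from a known local origin; the origin value in the usage note and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def cors_options(allowed_origins: List[str]) -> Dict[str, Any]:
    """Keyword arguments for CORSMiddleware, limited to what a JSON POST needs."""
    return {
        "allow_origins": allowed_origins,
        "allow_methods": ["POST"],
        "allow_headers": ["Content-Type"],
    }


def enable_cors(app, allowed_origins: List[str]):
    """Attach CORSMiddleware to an existing FastAPI app and return the app."""
    from fastapi.middleware.cors import CORSMiddleware  # lazy import

    app.add_middleware(CORSMiddleware, **cors_options(allowed_origins))
    return app
```

For example, `enable_cors(app, ["http://localhost:5500"])` would allow a page served from that local address to call the API. A request with `Content-Type: application/json` triggers a browser preflight, which is why the header has to be listed.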

Result

ClauseSense can take a legal document in .docx format, extract its clauses, semantically retrieve the most relevant ones for a user’s query, and rewrite them in simpler language, all within seconds.
