ClauseWise - AI Legal Document Analyzer 🚀 COMPLETED FEATURES

  • Multi-Format Document Upload
  • Document Text Extraction
  • Clause Segmentation & Breakdown
  • Named Entity Recognition (NER)
  • Clause Simplification
  • Document Type Classification
  • AI Q&A ChatBot
  • Lease Timeline Visualization-CSV Download
  • Text-to-Speech (TTS)
  • DOCX Export & Reprocessing
  • IBM AI Integration
  • Multi-Format Document Upload Supported: PDF, DOCX, TXT Flow: File → Text Extraction → Processing Pipeline
  1. Document Text Extraction PDF: Selectable text + OCR fallback DOCX: Direct paragraph extraction TXT: UTF-8/Latin-1 encoding support Flow: Upload → Extract → Clean → Store

  2. Clause Segmentation & Breakdown Method: Regex patterns + heading detection Fallback: Paragraph-based segmentation Output: {id, title, text} for each clause Flow: Clean text → Pattern matching → Clause list

  3. Named Entity Recognition (NER) Entities: Parties, Dates, Money, Obligations, Emails, URLs Methods: IBM Granite + spaCy (fallback) Flow: Text → Entity extraction → DataFrame display

  4. Clause Simplification Methods: IBM Granite (primary) + Heuristic rules (fallback) Rules: Legal → Plain language conversion Flow: Clause text → AI processing → Simplified output

  5. Document Type Classification Types: NDA, Lease, Employment Agreement, Service Agreement Methods: IBM Granite + Heuristic rules Flow: Document text → Classification → Confidence score

  6. Lease Timeline Visualization Features: Start date, Term, End date extraction Flow: Lease detection → Date parsing → Timeline information

  7. Text-to-Speech (TTS) Engine: pyttsx3 (local) Usage: Play simplified clauses Flow: Text → Audio synthesis → WAV playback

  8. DOCX Export & Reprocessing Export: Full text + Clause-structured Reprocess: Upload exported DOCX → Re-analyze Flow: Extract → Export → Upload → Reprocess

  9. OCR Cleanup Pipeline Features: De-hyphenation, Line merging, Noise removal Methods: Regex + Pattern matching Flow: Raw OCR → Cleanup → Structured text

  10. IBM AI Integration Watson NLU: Entity extraction + Classification Granite: Text generation + Zero-shot classification Fallback: Local models when IBM unavailable

  11. TECHNICAL FLOW

  12. UPLOAD → File validation & type detection

  13. EXTRACT → Text extraction (native + OCR)

  14. CLEAN → Text preprocessing & normalization

  15. SEGMENT → Clause detection & breakdown

  16. ANALYZE → NER, classification, simplification

  17. VISUALIZE → Timeline diagrams (lease docs)

  18. EXPORT → DOCX generation & reprocessing 📋 USAGE INSTRUCTIONS For Scanned PDFs: Upload PDF in sidebar Enable "Force OCR" checkbox Select languages (e.g., English + Tamil) Lower "Min chars" to 20-50 Wait for OCR processing Check "Preview extracted text" For Text-based PDFs: Upload PDF Keep OCR settings default Text extracted automatically 🚨 TROUBLESHOOTING "No text extracted" Error: Check "🔧 Debug Info" expander Verify OCR engines are available Enable "Force OCR" in sidebar Lower "Min chars" threshold

Built With

Share this project:

Updates