Document Intelligence for Hitachi (DIH)

Inspiration

The challenge of regulatory compliance in document management inspired us to build DIH. Organizations struggle with manually classifying thousands of documents into compliance categories (Public, Confidential, Highly Sensitive, Unsafe), which is time-consuming, error-prone, and doesn't scale. We saw an opportunity to leverage AI and OCR technology to automate this process while maintaining transparency through evidence-based classification with page-level citations.

What it does

DIH is an AI-powered Streamlit application that automatically classifies documents for regulatory compliance. It uses OCR to extract text from PDFs and images, then employs LLM analysis to categorize documents into four compliance levels: Public, Confidential, Highly Sensitive, or Unsafe. The system provides evidence-based classification with page-level citations, showing exactly which excerpts and page numbers influenced each decision. Key features include:

  • Interactive and Batch Processing: Analyze single documents or process multiple files simultaneously
  • Configurable Rule System: YAML-based prompt library allows organizations to customize classification criteria
  • Evidence-Based Results: Every classification includes reasoning, confidence scores, and specific page-level evidence with region citations
  • Multi-Modal Support: Handles PDFs, images (PNG, JPG), and text files
  • Comprehensive Content Safety Evaluation: Advanced multi-layered detection for child safety, hate speech, exploitative content, violence, criminal activity, political news, and cyber-threats
  • Confidence-Based Human-in-the-Loop (HITL) Reduction: Automatic routing of high-confidence classifications while flagging uncertain cases for human review
  • Audit-Ready Reports: Detailed classification reports with full reasoning, evidence logs, and exportable formats for compliance audits

How we built it

We built DIH using a Python-based architecture with the following key components:

Frontend: Streamlit for rapid UI development and interactive document upload/classification interface

OCR & Text Extraction:

  • Tesseract OCR for extracting text from PDF pages and images
  • pdf2image for converting PDF pages to images
  • PIL/Pillow for image processing

AI/ML Pipeline:

  • Primary Classification Model: GPT-4o-mini (via the OpenRouter API), selected for its balance of accuracy, speed, and cost. This lightweight model classified documents well in our testing while keeping processing times and API costs lower than larger models.
  • Optional Dual-LLM Consensus: GPT-4-turbo available for high-stakes classifications requiring dual-model validation
  • Sentence Transformers (all-MiniLM-L6-v2) for semantic similarity matching in unsafe content detection
  • Computer Vision (OpenCV) for blood/violence detection in images with region highlighting
  • Better Profanity library for explicit language detection
  • CLIP (ViT-B-32) for image-based content safety evaluation

Rule Engine:

  • YAML-based prompt library for configurable classification rules
  • Dynamic prompt construction that incorporates document text and classification criteria
  • JSON-structured responses with evidence logs
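
As an illustration of dynamic prompt construction, here is a minimal sketch, not the project's actual code. It assumes the YAML prompt library has already been loaded into a plain dictionary (for example with a YAML parser); the `RULES` contents, the `build_prompt` name, and the `max_chars_per_page` truncation limit are all hypothetical.

```python
import json

# Hypothetical rule set, shaped like what a YAML prompt library might load into.
# The category names come from the write-up; the criteria text is illustrative.
RULES = {
    "categories": {
        "Public": "Marketing material or information already released publicly.",
        "Confidential": "Internal business information not meant for release.",
        "Highly Sensitive": "Personal data, credentials, or regulated records.",
        "Unsafe": "Content that violates content-safety policy.",
    },
    "response_fields": ["category", "confidence", "reasoning", "evidence"],
}


def build_prompt(rules: dict, pages: list[str], max_chars_per_page: int = 2000) -> str:
    """Combine classification criteria and page-tagged document text into one prompt."""
    criteria = "\n".join(
        f"- {name}: {desc}" for name, desc in rules["categories"].items()
    )
    # Tag each page so the model can cite page numbers in its evidence.
    body = "\n\n".join(
        f"[Page {i}]\n{text[:max_chars_per_page]}"
        for i, text in enumerate(pages, start=1)
    )
    fields = json.dumps(rules["response_fields"])
    return (
        "Classify the document into exactly one category.\n\n"
        f"Categories:\n{criteria}\n\n"
        f"Respond with a JSON object containing the keys {fields}. "
        "Each evidence item must cite a page number and a short excerpt.\n\n"
        f"Document:\n{body}"
    )


prompt = build_prompt(RULES, ["Quarterly press release.", "Contact: press@example.com"])
```

Because the criteria live in data rather than in the function, changing a category description only requires editing the rule file.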

Backend Processing:

  • Batch processing pipeline for multiple documents
  • Evidence extraction and citation tracking
  • Confidence scoring and reasoning generation

The system processes documents through a multi-stage pipeline: OCR extraction → LLM analysis → Evidence compilation → Classification output with citations.

Challenges we ran into

  1. Classification Accuracy & Precision: Achieving high precision/recall on test cases required careful prompt engineering and validation. We addressed this by implementing structured evidence logs with page-level citations and region-level citations for images, ensuring every classification decision is traceable.

  2. OCR Accuracy & Region Citations: Poor-quality scans produced incomplete text extraction. We implemented page-level readability checks and developed a citation system that tracks both page numbers and specific text excerpts/regions that influenced classification.

  3. HITL Reduction: Balancing automation with accuracy required reliable confidence scoring. We implemented confidence scores (0-100%) with a routing threshold and an optional dual-LLM consensus mode, so that high-confidence cases are routed automatically while uncertain classifications are flagged for human review.

  4. Processing Speed with Lightweight Models: Optimizing for speed while maintaining accuracy led us to select GPT-4o-mini as our primary model. We implemented efficient text truncation, batch processing optimizations, and lightweight sentence transformer models to meet business SLA requirements.

  5. LLM Response Parsing: Getting consistent JSON-structured responses with reasoning was challenging. We implemented robust JSON extraction with regex fallbacks, ensuring the reasoning module always provides clear explanations for categorization decisions.

  6. Content Safety Evaluation: Comprehensive detection of all safety categories (child safety, hate speech, exploitative, violent, criminal, political news, cyber-threats) required multi-layered detection combining semantic similarity, computer vision, and keyword matching with fine-tuned thresholds.

  7. Audit-Ready Reporting: Creating clear, exportable reports with region highlights for images required careful UI design and data structure planning to ensure all evidence is visible and traceable.
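
The JSON extraction with regex fallbacks mentioned in challenge 5 could look roughly like the sketch below. This is an assumption about the general approach, not the project's code: the `extract_json` name and the order of the fallbacks are hypothetical. It tries the raw reply first, then a fenced code block, then the widest brace-delimited span.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def extract_json(raw: str) -> dict:
    """Parse a JSON object out of an LLM reply, tolerating extra text around it."""
    candidates = [raw.strip()]  # 1. the reply may already be pure JSON

    # 2. The JSON may sit inside a Markdown code fence.
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, raw, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))

    # 3. Last resort: everything from the first "{" to the last "}".
    braces = re.search(r"\{.*\}", raw, re.DOTALL)
    if braces:
        candidates.append(braces.group(0))

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in model reply")
```

Raising an error when every fallback fails lets the caller decide whether to retry the request or send the document to human review.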

Accomplishments that we're proud of

Classification Accuracy (50% Weight)

  1. Precision/Recall Optimization: Achieved high accuracy on test cases through structured prompt engineering and evidence-based classification with page-level and region-level citations
  2. Correct Category Mapping: Implemented YAML-based rule system ensuring accurate mapping to Public, Confidential, Highly Sensitive, and Unsafe categories
  3. Clear Citations: Every classification includes specific page numbers, text excerpts, and image region highlights that influenced the decision

Reducing HITL Involvement (20% Weight)

  1. Confidence Scoring System: Implemented 0-100% confidence scores for every classification, enabling automatic routing of high-confidence cases (>70%) while flagging uncertain cases for review
  2. Dual-LLM Consensus Mode: Optional GPT-4-turbo validation for critical classifications, reducing false positives and manual review needs
  3. Clear Reviewer Queue: Batch mode automatically identifies and prioritizes documents requiring human review based on confidence thresholds
  4. Manual Review Time Reduction: Automated processing of high-confidence cases significantly reduces manual review workload
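
A minimal sketch of confidence-based routing under these rules follows. The 70% threshold comes from the list above; everything else is an assumption made for illustration, including the `Verdict` type, the `route` and `review_queue` names, and the choice to send any disagreement between the two models to human review.

```python
from dataclasses import dataclass
from typing import Optional

AUTO_APPROVE_THRESHOLD = 70  # percent; cases above this are routed automatically


@dataclass
class Verdict:
    category: str
    confidence: int  # 0-100


def route(primary: Verdict, secondary: Optional[Verdict] = None,
          threshold: int = AUTO_APPROVE_THRESHOLD) -> str:
    """Return "auto" when the result can skip review, otherwise "review"."""
    if primary.confidence <= threshold:
        return "review"
    # Assumed consensus rule: if a second model disagrees, a human decides.
    if secondary is not None and secondary.category != primary.category:
        return "review"
    return "auto"


def review_queue(results: dict[str, Verdict]) -> list[str]:
    """List files needing review, least confident first."""
    flagged = [name for name, verdict in results.items() if route(verdict) == "review"]
    return sorted(flagged, key=lambda name: results[name].confidence)
```

Sorting the queue by ascending confidence puts the most uncertain documents in front of reviewers first.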

Processing Speed (10% Weight)

  1. Lightweight Model Selection: GPT-4o-mini is the primary model, selected for its balance of accuracy, inference speed, and cost. It classifies accurately while keeping processing times under a minute for typical documents.
  2. Efficient Pipeline: Optimized OCR, text extraction, and LLM calls to meet business SLA requirements for both interactive and batch modes
  3. Cost-Effective Architecture: Lightweight sentence transformer models (all-MiniLM-L6-v2) for semantic matching, reducing overall processing costs

User Experience & UI (10% Weight)

  1. Clear Explanations: Every classification includes a detailed summary and a reasoning module explaining why the document was assigned its category
  2. Audit-Ready Reports: Exportable CSV reports with full evidence logs, confidence scores, and reasoning for compliance audits
  3. Region Highlights: Image-based classifications include visual region highlighting for unsafe content detection
  4. Straightforward File Management: Intuitive upload interface supporting drag-and-drop, batch processing, and clear file status indicators
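
For the exportable CSV reports, one possible shape is sketched below using only the standard library. The column names and the `export_report` function are hypothetical, and storing the nested evidence log as a JSON string in a single cell is an assumption about how it might fit a flat file.

```python
import csv
import io
import json


def export_report(results: list[dict]) -> str:
    """Serialize classification results to CSV text, one row per document."""
    columns = ["filename", "category", "confidence", "reasoning", "evidence"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for result in results:
        row = {key: result.get(key, "") for key in columns}
        # Nested evidence is stored as a JSON string so it fits in one cell.
        row["evidence"] = json.dumps(result.get("evidence", []))
        writer.writerow(row)
    return buffer.getvalue()


report_csv = export_report([
    {
        "filename": "contract.pdf",
        "category": "Confidential",
        "confidence": 88,
        "reasoning": "Contains internal pricing terms.",
        "evidence": [{"page": 2, "excerpt": "unit price, not for distribution"}],
    }
])
```

In a Streamlit app, the returned string could be passed to a download button so reviewers can save the report.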

Content Safety Evaluation (10% Weight)

  1. Comprehensive Safety Detection: Multi-layered system detecting all required categories:
    • Child Safety: Semantic similarity + keyword detection for CSAM and exploitation content
    • Hate Speech: Sentence transformer matching for protected group targeting
    • Exploitative Content: Pattern recognition for exploitation indicators
    • Violence: Computer vision (OpenCV blood detection) + semantic analysis for graphic violence
    • Criminal Activity: Detection of illegal activity instructions
    • Political News: Identification of current political/election content
    • Cyber-Threats: Pattern matching for malware, exploits, and security threats
  2. Always-On Safety Validation: Every document is automatically validated for safety before final classification
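
The sketch below shows one way per-layer scores might be combined with per-layer thresholds. It is illustrative only: the function names and threshold values are hypothetical, and the embedding vectors are passed in as plain lists (in the real system they would presumably come from a sentence-transformer model such as all-MiniLM-L6-v2).

```python
import math
import re


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(text: str, keywords: set[str]) -> float:
    """1.0 if any keyword appears as a whole word, else 0.0."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return 1.0 if words & keywords else 0.0


def semantic_score(doc_vec: list[float], reference_vecs: list[list[float]]) -> float:
    """Highest similarity between the document and any unsafe reference example."""
    return max((cosine(doc_vec, ref) for ref in reference_vecs), default=0.0)


def triggered_layers(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of the layers whose score meets or exceeds its threshold."""
    return [name for name, score in scores.items() if score >= thresholds[name]]
```

A document would be treated as unsafe when `triggered_layers` returns a non-empty list, and the returned names indicate which layer raised the flag.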

What we learned

  1. Model Selection Matters: GPT-4o-mini proved to be the sweet spot for our use case, providing strong accuracy while keeping processing fast and costs low. Larger models did not improve accuracy enough to justify their higher cost and latency.

  2. Evidence-Based Classification is Essential: Building transparency with page-level citations and reasoning from the start made the system auditable and trustworthy. Users need to understand why a document was classified a certain way, not just the classification itself.

  3. Confidence Scoring Enables Automation: Implementing robust confidence scoring allowed us to automatically process high-confidence cases while intelligently flagging uncertain ones, significantly reducing HITL involvement.

  4. Multi-Layered Safety Detection: Combining semantic similarity, computer vision, and keyword matching provided more robust safety detection than any single method alone. Each layer catches different types of unsafe content.

  5. Prompt Engineering for Reasoning: Crafting prompts that explicitly request reasoning and evidence citations resulted in more reliable and auditable classifications. The YAML-based approach allows iterative refinement without code changes.

  6. OCR Quality Impacts Everything: Implementing readability checks early in the pipeline prevents downstream classification errors. Poor OCR quality requires fallback mechanisms and clear user communication.

  7. Batch Processing Optimization: Efficient batch processing requires careful resource management, progress indicators, and error handling to maintain user experience while processing multiple documents.
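
To make lesson 6 concrete, here is a minimal sketch of a page-level readability check. The heuristic (a minimum word count plus a minimum share of alphanumeric characters) and its default thresholds are assumptions chosen for illustration, not the checks used in DIH.

```python
def is_readable(page_text: str, min_words: int = 20,
                min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic check that OCR output looks like real text rather than noise."""
    words = page_text.split()
    stripped = "".join(words)  # the text with all whitespace removed
    if not stripped:
        return False
    alpha_ratio = sum(ch.isalnum() for ch in stripped) / len(stripped)
    return len(words) >= min_words and alpha_ratio >= min_alpha_ratio


def unreadable_pages(pages: list[str], **kwargs) -> list[int]:
    """1-based page numbers whose OCR text fails the readability check."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if not is_readable(text, **kwargs)
    ]
```

Pages returned by `unreadable_pages` could be reported to the user or retried with different OCR settings before any classification is attempted.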

What's next for Document Intelligence for Hitachi (DIH)

  1. Enhanced HITL Workflow: Implement a complete feedback loop where human reviewers can correct classifications, with this feedback used to fine-tune models and improve accuracy over time, further reducing manual review needs.

  2. Model Fine-Tuning: Fine-tune GPT-4o-mini on Hitachi's domain-specific labeled data to improve precision/recall for their specific compliance requirements.

  3. Advanced Vision Models: Integrate YOLO for object detection to improve violence and weapon detection in images, enhancing region-level citations.

  4. Real-Time API Integration: Build REST APIs to integrate DIH with existing document management systems for seamless workflow integration.

  5. Enhanced Reporting Dashboard: Develop comprehensive compliance dashboards with historical tracking, trend analysis, and automated regulatory audit report generation.

  6. Multi-Language Support: Extend OCR and classification capabilities to support multiple languages for global compliance requirements.

  7. Performance Optimization: Implement caching, parallel processing, and optimized OCR settings to further reduce processing time for large document batches.

  8. Continuous Learning: Implement active learning to automatically identify edge cases and improve model performance based on reviewer feedback.

Submission Details

Model Used

Primary Classification Model: GPT-4o-mini (via OpenRouter API)

  • Selected for optimal balance of accuracy, speed, and cost-effectiveness
  • Provides excellent classification performance with fast inference times
  • Cost-effective for batch processing while maintaining high accuracy

Supporting Models:

  • Sentence Transformers: all-MiniLM-L6-v2 (semantic similarity)
  • CLIP: ViT-B-32 (image-based content safety)
  • Tesseract OCR: Text extraction from documents

Reasoning Module

Every classification includes:

  • Summary: Brief overview of the document and classification decision
  • Reasoning: Detailed explanation of why the document was assigned its category
  • Evidence Log: Page-level citations with specific excerpts and text snippets that influenced the decision
  • Confidence Score: 0-100% confidence rating for the classification

Demo Video

The end-to-end demo video showcases:

  1. Document upload (single and batch modes)
  2. Real-time processing with progress indicators
  3. Classification results with reasoning and evidence
  4. Evidence log display with page-level citations
  5. Export functionality for audit-ready reports
  6. Content safety evaluation results
  7. Confidence-based routing demonstration

Built With

  • better-profanity
  • clip (vit-b-32)
  • json
  • opencv
  • openrouter api (gpt-4o-mini, gpt-4-turbo)
  • pandas
  • pdf2image
  • pil/pillow
  • pypdf2
  • python
  • sentence-transformers (all-minilm-l6-v2)
  • streamlit
  • tesseract-ocr
  • yaml