Inspiration
We were frustrated with the limitations of traditional PDF-to-HTML converters. They produce static, inaccessible output that loses the semantic meaning of the original document: tables are flattened into plain text, data stays locked away, and sensitive information gets sent to cloud services without any privacy protection.
When we discovered ERNIE's multimodal vision capabilities on Novita AI, we saw an opportunity to build something revolutionary: a system that doesn't just extract text, but actually sees and understands document structure — detecting tables that should become charts, Q&A sections that could be interactive quizzes, and timelines that deserve visualization.
We also wanted to solve the privacy problem. Why should sensitive documents be sent to cloud AI services? Our solution: process locally first, scrub PII, then only send sanitized content to the cloud.
What it does
DocuMorph AI transforms PDF documents into interactive, responsive HTML webpages through an intelligent pipeline:
1. Privacy-First Local Processing
- Extracts text using PaddleOCR (runs entirely on your machine)
- Detects and redacts PII (emails, phone numbers, SSNs, credit cards) using Presidio
- Nothing sensitive ever leaves your computer
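As a minimal sketch of the local scrub step, the patterns below are an illustrative stand-in for Presidio's recognizers (the real pipeline uses Presidio's analyzer and anonymizer, which cover far more entity types):

```python
import re

# Simplified stand-in for the Presidio step: in the real pipeline,
# Presidio finds PII spans and an anonymizer replaces them.
# These patterns are illustrative, not production-grade.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before any cloud call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```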
2. ERNIE Vision Analysis
- Sends page images to ERNIE-4.5-VL (vision model)
- AI "sees" the document layout and detects:
- Tables → Suggests bar/line/pie charts
- Q&A sections → Suggests interactive quizzes
- Date sequences → Suggests timeline widgets
- Locations → Suggests interactive maps
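The per-page vision request can be sketched as follows. The OpenAI-compatible chat format, the model name, and the `response_format` hint are assumptions about the Novita AI endpoint; the prompt wording is illustrative:

```python
import base64

# Illustrative detection prompt; the production prompt is more detailed.
DETECT_PROMPT = (
    "Analyze this document page image. Return JSON with keys "
    "tables, quizzes, timelines, maps, confidence, description."
)

def build_vision_request(page_png: bytes, model: str = "baidu/ernie-4.5-vl-28b") -> dict:
    # Embed the rendered page as a base64 data URL, per the
    # OpenAI-compatible multimodal message format.
    image_url = "data:image/png;base64," + base64.b64encode(page_png).decode()
    return {
        "model": model,  # assumed model identifier
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": DETECT_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "response_format": {"type": "json_object"},
    }
```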
3. Human-in-the-Loop Co-Design
- Review all AI suggestions before generation
- Edit OCR errors, approve/reject PII redactions
- Choose themes, toggle widgets on/off
- You're in control, not the AI
4. Interactive HTML Generation
- ERNIE-4.5 generates semantic HTML with proper structure
- Chart.js visualizations for data
- Interactive quiz widgets
- Responsive design (desktop/tablet/mobile)
- WCAG AA accessible
5. One-Click Deployment
- Deploy to GitHub Pages, Netlify, Vercel, or AWS S3
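As one concrete example of step 4, a detected table can be emitted as a Chart.js block in the generated page. This is a sketch; the element id, dataset label, and config are illustrative, not the exact production template:

```python
import json

def table_to_chartjs(chart_id: str, chart_type: str, labels: list, values: list) -> str:
    """Render one extracted table as a Chart.js canvas plus init script."""
    config = {
        "type": chart_type,
        "data": {"labels": labels,
                 "datasets": [{"label": "Extracted data", "data": values}]},
        "options": {"responsive": True},
    }
    return (
        f'<canvas id="{chart_id}" role="img" aria-label="Chart of extracted table"></canvas>\n'
        f"<script>new Chart(document.getElementById('{chart_id}'), "
        f"{json.dumps(config)});</script>"
    )
```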
How we built it
Architecture
┌─────────────────────────────────────────────────────────────┐
│ LOCAL PROCESSING (Privacy Zone) │
│ PDF → PaddleOCR → Presidio PII Detection → Content Blocks │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ CO-DESIGN LAYER (Human Review) │
│ Edit blocks • Approve PII • Select theme • Choose widgets │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ CLOUD AI (Novita AI - ERNIE) │
│ ERNIE-4.5 (Text) → Theme analysis, HTML generation │
│ ERNIE-4.5-VL (Vision) → Page image analysis, detection │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ Interactive HTML • Charts • Quizzes • Responsive Design │
└─────────────────────────────────────────────────────────────┘
Tech Stack
| Layer | Technologies |
|---|---|
| Backend | FastAPI, Python 3.11, Uvicorn |
| OCR | PaddleOCR (local processing) |
| PII Detection | Microsoft Presidio + spaCy |
| AI Models | ERNIE-4.5 + ERNIE-4.5-VL via Novita AI |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS |
| Real-time | WebSocket for live progress updates |
| Visualization | Chart.js, Recharts |
ERNIE Model Integration
We use two ERNIE models for true multimodal processing:
ERNIE-4.5-21B (Text Model)
- Analyzes document content for theme suggestions
- Generates semantic HTML structure
- Creates accessible, responsive output
ERNIE-4.5-VL-28B (Vision Model)
- Analyzes PDF page images (not just text!)
- Detects visual patterns: tables, charts, Q&A layouts
- Suggests appropriate widget transformations
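The HTML-generation call to the text model can be sketched like this. The model identifier and prompt wording are assumptions; the key point is that only sanitized, user-approved blocks are ever included:

```python
def build_html_request(blocks: list, theme: str, widgets: list,
                       model: str = "baidu/ernie-4.5-21b-a3b") -> dict:
    """Build a chat request asking ERNIE-4.5 for accessible, themed HTML.

    `blocks` must already be PII-scrubbed and approved by the user.
    """
    prompt = (
        f"Generate semantic, WCAG AA accessible, responsive HTML using the "
        f"'{theme}' theme. Include these widgets: {', '.join(widgets)}.\n\n"
        + "\n\n".join(blocks)
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}
```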
Real API responses from our vision analysis:
Page 0 - Quiz Detection:
{
"tables": [],
"quizzes": [{"question_count": 2, "type": "multiple_choice|true_false"}],
"timelines": [],
"maps": [],
"confidence": 1.0,
"description": "A quiz document with multiple choice and true/false questions about CPU and Python."
}
Page 1 - Timeline Detection:
{
"tables": [{"chart_type": "line", "data_summary": "The data shows chronological project milestones over time, from project kickoff to final release."}],
"quizzes": [],
"timelines": [{"event_count": 5}],
"maps": [],
"confidence": 1.0,
"description": "The page contains a project timeline with dates and milestones for a project from kickoff to final release."
}
Page 2 - Geographic Location Detection:
{
"tables": [],
"quizzes": [],
"timelines": [],
"maps": [{"location_count": 4}],
"confidence": 1.0,
"description": "The page lists geographic locations and addresses of offices in Bengaluru, Mumbai, New York, and London."
}
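The co-design layer turns detection JSON like the three examples above into widget suggestions for the user to approve. A minimal sketch (the confidence threshold is an illustrative choice):

```python
def suggest_widgets(detection: dict, min_confidence: float = 0.7) -> list:
    """Map one page's vision-analysis JSON to proposed widgets."""
    if detection.get("confidence", 0.0) < min_confidence:
        return []  # low-confidence pages get no automatic suggestions
    suggestions = []
    for table in detection.get("tables", []):
        suggestions.append({"widget": "chart",
                            "chart_type": table.get("chart_type", "bar")})
    if detection.get("quizzes"):
        suggestions.append({"widget": "quiz"})
    if detection.get("timelines"):
        suggestions.append({"widget": "timeline"})
    if detection.get("maps"):
        suggestions.append({"widget": "map"})
    return suggestions
```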
Challenges we ran into
1. PaddleOCR Compatibility: The newer PaddleX release changed its APIs significantly. We had to implement fallback methods and timeout protection to handle different versions gracefully.
2. Real-Time Processing: Coordinating WebSocket updates across OCR, PII detection, and AI analysis required careful async handling to provide smooth progress feedback.
3. Vision Model Prompt Engineering: Getting ERNIE-VL to return structured JSON for component detection took iteration. We needed specific prompts to reliably distinguish tables from ordinary text blocks.
4. Privacy vs. Functionality Balance: We wanted maximum privacy (local processing) but also needed cloud AI capabilities. The solution: process and sanitize locally, and only send cleaned content to ERNIE.
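One defensive pattern that addresses challenge 3: models sometimes wrap JSON in prose or markdown fences, so parse the response in stages. This is a sketch of the idea, not the exact production code:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Parse model output that may wrap JSON in prose or code fences."""
    # Try the whole response first, then fall back to the outermost braces.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```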
Accomplishments that we're proud of
✅ **True Multimodal AI Usage:** Not just text. We actually use ERNIE Vision to analyze page images and detect visual patterns.
✅ **Privacy-First Architecture:** Sensitive documents stay local. PII is scrubbed before any cloud calls.
✅ **Complete Working Pipeline:** End-to-end: Upload → OCR → PII → Vision Analysis → HTML Generation → Deployment.
✅ **Human-in-the-Loop Design:** AI suggests, humans decide. No black-box automation.
✅ **Real-Time Feedback:** WebSocket updates show exactly what's happening at each step.
🧠 Auto-Generated Knowledge Graph
This feature sets DocuMorph AI apart from most PDF converters. Instead of just extracting text, we build an interactive knowledge graph that shows how concepts, sections, and entities in your document relate to each other. Every knowledge graph is generated live using ERNIE AI. Here are actual server logs from a real document processing run:
2026-01-01 22:16:07 | INFO | Generating knowledge graph for document 2bb7c96c-6602-4684-b54b-ebcb53438b97
2026-01-01 22:16:24 | INFO | Generated graph with 23 nodes and 16 edges
POST /api/knowledge-graph/2bb7c96c-6602-4684-b54b-ebcb53438b97/generate HTTP/1.1 200 OK
POST /api/knowledge-graph/2bb7c96c-6602-4684-b54b-ebcb53438b97/simplify HTTP/1.1 200 OK
The AI extracts entities and relationships in ~17 seconds, creating a fully interactive graph visualization.
How It Works
┌─────────────────────────────────────────────────────────────────────────────┐
│ KNOWLEDGE GRAPH PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PDF Document │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ENTITY EXTRACTION (ERNIE AI) │ │
│ │ ───────────────────────────── │ │
│ │ • Sections: Chapter 1, Section 2.1, etc. │ │
│ │ • Concepts: Key terms, theories, definitions │ │
│ │ • People: Authors, researchers, historical figures │ │
│ │ • Dates: Important dates, time periods │ │
│ │ • Locations: Places, countries, cities │ │
│ │ • Tables/Figures: Data summaries │ │
│ │ • Organizations: Companies, institutions │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RELATIONSHIP DETECTION (AI-Powered) │ │
│ │ ───────────────────────────────────── │ │
│ │ • references: "See Section 3 for details" │ │
│ │ • builds_on: "Building on concepts from Chapter 1" │ │
│ │ • summarizes: "Table X summarizes data from Section Y" │ │
│ │ • defines: Section defines a concept │ │
│ │ • contains: Parent-child hierarchy │ │
│ │ • related_to: Semantic similarity │ │
│ │ • contrasts / supports: Argument relationships │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ INTERACTIVE VISUALIZATION (vis.js Compatible) │ │
│ │ ───────────────────────────────────────────── │ │
│ │ • Color-coded nodes by entity type │ │
│ │ • Force-directed layout for natural clustering │ │
│ │ • Click-to-navigate: Jump to any section instantly │ │
│ │ • Hover tooltips with full context │ │
│ │ • Simplify mode for co-design preview │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why This Matters
Traditional PDF converters give you a flat Table of Contents. We give you a semantic map of your document:
| Traditional ToC | DocuMorph Knowledge Graph |
|---|---|
| Linear list of sections | Interactive node-edge visualization |
| No relationships shown | AI-detected references, dependencies |
| Static navigation | Click any node to jump to content |
| Text-only | Color-coded by entity type |
| Manual creation | Auto-generated by ERNIE AI |
Real API Response
When we process a research paper, the knowledge graph API returns real node-and-edge data:
{
"document_id": "abc123",
"nodes": [
{"id": "entity_section_1", "label": "Introduction", "type": "section", "color": "#4e79a7"},
{"id": "entity_ai_0", "label": "Machine Learning", "type": "concept", "color": "#f28e2c"},
{"id": "entity_ai_1", "label": "Neural Networks", "type": "concept", "color": "#f28e2c"},
{"id": "entity_date_1234", "label": "January 2024", "type": "date", "color": "#76b7b2"}
],
"edges": [
{"from": "entity_section_1", "to": "entity_ai_0", "type": "defines", "label": "defines"},
{"from": "entity_ai_1", "to": "entity_ai_0", "type": "builds_on", "label": "builds on"}
],
"metadata": {
"total_nodes": 24,
"total_edges": 31,
"entity_types": ["section", "concept", "person", "date", "table"],
"relationship_types": ["contains", "precedes", "references", "builds_on", "defines"]
}
}
Entity Types & Visual Encoding
| Entity Type | Color | Shape | Example |
|---|---|---|---|
| Section | Blue (#4e79a7) | Box | "Chapter 1: Introduction" |
| Concept | Orange (#f28e2c) | Ellipse | "Machine Learning" |
| Person | Red (#e15759) | Circle | "Alan Turing" |
| Date | Teal (#76b7b2) | Diamond | "January 2024" |
| Location | Green (#59a14f) | Triangle | "San Francisco" |
| Table | Yellow (#edc949) | Square | "Table 3: Results" |
| Organization | Brown (#9c755f) | Database | "OpenAI" |
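The visual encoding above can be sketched as a simple lookup when building vis.js node objects. The fallback style for unknown types is an illustrative choice:

```python
# Color/shape scheme mirroring the entity-type table above.
ENTITY_STYLE = {
    "section":      {"color": "#4e79a7", "shape": "box"},
    "concept":      {"color": "#f28e2c", "shape": "ellipse"},
    "person":       {"color": "#e15759", "shape": "circle"},
    "date":         {"color": "#76b7b2", "shape": "diamond"},
    "location":     {"color": "#59a14f", "shape": "triangle"},
    "table":        {"color": "#edc949", "shape": "square"},
    "organization": {"color": "#9c755f", "shape": "database"},
}

def make_node(node_id: str, label: str, entity_type: str) -> dict:
    """Build one vis.js-compatible node dict for the graph payload."""
    style = ENTITY_STYLE.get(entity_type, {"color": "#bab0ab", "shape": "dot"})
    return {"id": node_id, "label": label, "type": entity_type, **style}
```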
Human-in-the-Loop: Simplify Before Export
Users can preview and simplify the graph in the co-design layer:
# Simplify to top 15 nodes, focusing on sections and concepts
POST /api/knowledge-graph/{id}/simplify
{
"max_nodes": 15,
"entity_types": ["section", "concept"]
}
This ensures the final HTML doesn't overwhelm readers with a 100-node graph — they see a clean, navigable overview.
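The simplify step can be sketched as: keep only the requested entity types, cap the node count, and drop edges whose endpoints were removed. Ranking survivors by degree is an illustrative heuristic, not necessarily the exact production logic:

```python
def simplify_graph(graph: dict, max_nodes: int = 15,
                   entity_types: tuple = ("section", "concept")) -> dict:
    """Return a reduced graph for the co-design preview."""
    nodes = [n for n in graph["nodes"] if n["type"] in entity_types]
    # Rank nodes by how connected they are, so hubs survive the cut.
    degree = {n["id"]: 0 for n in nodes}
    for e in graph["edges"]:
        for end in (e["from"], e["to"]):
            if end in degree:
                degree[end] += 1
    nodes.sort(key=lambda n: degree[n["id"]], reverse=True)
    kept = {n["id"] for n in nodes[:max_nodes]}
    return {
        "nodes": nodes[:max_nodes],
        "edges": [e for e in graph["edges"]
                  if e["from"] in kept and e["to"] in kept],
    }
```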
Frontend Integration
The React frontend renders the knowledge graph using vis.js with click-to-navigate:
// Click any node to jump to that section in the document
network.on('click', (params) => {
  if (params.nodes.length > 0) {
    const node = graph.nodes.find((n) => n.id === params.nodes[0]);
    // Guard against nodes without a linked block and missing DOM targets
    const target = node?.data?.block_id
      ? document.getElementById(node.data.block_id)
      : null;
    if (target) {
      target.scrollIntoView({ behavior: 'smooth' });
    }
  }
});
What we learned
- ERNIE Vision is powerful: It can understand document layouts, not just read text
- Privacy and AI can coexist: Local preprocessing + cloud AI is a viable pattern
- Human oversight matters: AI suggestions with human approval beats full automation
- Novita AI makes ERNIE accessible: Easy API access with generous free credits
What's next for DocuMorph AI
- More widget types: Flowcharts, org charts, Gantt charts
- Batch processing: Handle multiple PDFs at once
- Template marketplace: Share and download HTML themes
- Collaboration: Multiple users co-designing the same document
- Mobile app: Process PDFs directly from phone camera
Built With
- ernie
- ernie-4.5-vl
- novita-ai
- fastapi
- python
- react
- typescript
- paddleocr
- presidio
- websocket
- tailwindcss
- chartjs