📖 FastBook – Automating Problem-Solution Linking in PDFs

🚀 Inspiration

As a student, I found constantly flipping between problem sets and their solutions in textbooks inefficient and frustrating. This led me to create FastBook, an automated tool that hyperlinks problems to their corresponding solutions within PDFs, streamlining navigation and enhancing the study experience.

🛠️ How I Built It

FastBook is a Streamlit-powered web application that processes PDFs using PyPDF2, pdfplumber, and PyMuPDF (fitz) to extract text, identify problem-solution pairs, and insert hyperlinks. The core workflow includes:

PDF Upload & Preview
- Users upload a textbook PDF, which is temporarily stored.
- PyMuPDF (fitz) renders a preview of the first few pages for verification.
Text Extraction & Parsing
- pdfplumber is used to extract text while preserving word positions.
- Regex-based parsing detects numbered problems (e.g., 1.) and distinguishes the problem section from the solution section.
- A heuristic determines when the solution section starts (e.g., detecting a sudden drop in numbering).
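To make the parsing step concrete, here is a minimal sketch of the idea, not FastBook's actual code. It assumes problems are labeled like "1." at the start of a line and that the solution section begins at the first number lower than the one before it. The function names and sample lines are hypothetical.

```python
import re

# Matches a line that starts with a number followed by a period, e.g. "12. Find x".
NUMBERED_LINE = re.compile(r"^\s*(\d+)\.\s+")


def find_numbered_items(lines):
    """Return (line_index, number) for every line that starts like '12. '."""
    items = []
    for index, line in enumerate(lines):
        match = NUMBERED_LINE.match(line)
        if match:
            items.append((index, int(match.group(1))))
    return items


def split_sections(items):
    """Split items into (problems, solutions) at the first drop in numbering.

    Assumption: problems are numbered in increasing order, so the first
    number lower than its predecessor marks the start of the solutions.
    If no drop is found, everything is treated as a problem.
    """
    for position in range(1, len(items)):
        if items[position][1] < items[position - 1][1]:
            return items[:position], items[position:]
    return items, []


# Hypothetical extracted text, one string per line.
lines = [
    "Exercises",
    "1. Solve x + 2 = 5.",
    "2. Factor x^2 - 1.",
    "3. Differentiate sin(x).",
    "Solutions",
    "1. x = 3",
    "2. (x - 1)(x + 1)",
    "3. cos(x)",
]
items = find_numbered_items(lines)
problems, solutions = split_sections(items)
```

A drop-based split like this would misfire on books that restart numbering in every chapter, so a real implementation probably needs extra signals such as a "Solutions" heading.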
Bounding Box Computation
- Each extracted word carries (x0, y0, x1, y1) coordinates, which are mapped to the PDF page dimensions.
- A small padding is applied to ensure clickable areas are easily accessible.
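The coordinate mapping can be sketched as follows. This is an illustration under stated assumptions rather than the project's exact code: it assumes word boxes arrive as dicts with x0, x1, top, and bottom keys measured from the top-left of the page (the shape pdfplumber uses for extracted words), while link rectangles are expressed in PDF user space with the origin at the bottom-left. The function name and the 2-point default padding are hypothetical.

```python
def to_pdf_rect(word, page_width, page_height, padding=2.0):
    """Convert a top-left-origin word box into a padded bottom-left-origin rect.

    Returns [x_left, y_bottom, x_right, y_top], clamped to the page so the
    padding never pushes the clickable area outside the page.
    """
    x_left = max(0.0, word["x0"] - padding)
    x_right = min(page_width, word["x1"] + padding)
    # Flip the vertical axis: distance from the top becomes height above the bottom.
    y_top = min(page_height, page_height - word["top"] + padding)
    y_bottom = max(0.0, page_height - word["bottom"] - padding)
    return [x_left, y_bottom, x_right, y_top]


# Hypothetical word box on a US Letter page (612 x 792 points).
word = {"text": "1.", "x0": 72.0, "x1": 84.0, "top": 100.0, "bottom": 112.0}
rect = to_pdf_rect(word, page_width=612.0, page_height=792.0)
```

Clamping to the page keeps a padded box near the margin from producing coordinates outside the page.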
Hyperlink Injection
- PyPDF2 modifies the PDF by adding /Link annotations.
- Each problem is linked to its solution, and vice versa.
- PyPDF2's DictionaryObject, RectangleObject, and ArrayObject types define each link's metadata.
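The two-way pairing can be sketched in plain Python before any PDF objects are involved. This is a hypothetical illustration: Anchor, LinkSpec, and build_link_specs are names made up for this example, and each resulting spec stands in for one /Link annotation that would later be written to the source page.

```python
from typing import Dict, List, NamedTuple, Tuple

Rect = Tuple[float, float, float, float]


class Anchor(NamedTuple):
    page: int   # zero-based index of the page where the number appears
    rect: Rect  # clickable area around the number


class LinkSpec(NamedTuple):
    source_page: int   # page that receives the link annotation
    source_rect: Rect  # clickable area on the source page
    target_page: int   # page the link jumps to


def build_link_specs(problems: Dict[int, Anchor],
                     solutions: Dict[int, Anchor]) -> List[LinkSpec]:
    """Create problem -> solution and solution -> problem links.

    Only numbers found in both sections are linked; unmatched ones are skipped.
    """
    specs = []
    for number in sorted(problems.keys() & solutions.keys()):
        problem, solution = problems[number], solutions[number]
        specs.append(LinkSpec(problem.page, problem.rect, solution.page))
        specs.append(LinkSpec(solution.page, solution.rect, problem.page))
    return specs


# Hypothetical anchors: problem 3 has no detected solution, so it gets no link.
problems = {
    1: Anchor(0, (70.0, 678.0, 86.0, 694.0)),
    2: Anchor(0, (70.0, 640.0, 86.0, 656.0)),
    3: Anchor(1, (70.0, 678.0, 86.0, 694.0)),
}
solutions = {
    1: Anchor(10, (70.0, 500.0, 86.0, 516.0)),
    2: Anchor(10, (70.0, 460.0, 86.0, 476.0)),
}
specs = build_link_specs(problems, solutions)
```

Keeping the pairing separate from the PDF writing makes it easy to check which problems were left unlinked before modifying the file.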
Progress Feedback & Download
- A progress bar updates as linking progresses.
- The processed PDF is stored and made available for download.

🎯 Challenges Faced

Handling Diverse PDF Formats
- Different textbooks use various fonts, layouts, and text encodings.
- Some PDFs contain scanned page images rather than selectable text; these would require OCR, which is not yet implemented.
Accurate Problem-Solution Detection
- Problem numbers appear in different styles (1., (1), Q1).
- Some problems span multiple lines, requiring careful text flow analysis.
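One way to handle the label styles listed above is a single pattern with one alternative per style. The sketch below is an assumption about how this could be done, not the pattern FastBook uses; it covers only the three styles mentioned (1., (1), Q1) and does not attempt multi-line problems.

```python
import re

# One alternative per label style; verbose mode allows inline comments.
PROBLEM_LABEL = re.compile(
    r"""^\s*
    (?:
        (?P<dotted>\d+)\.(?!\d)    # "1." but not a decimal such as "3.14"
      | \((?P<paren>\d+)\)         # "(1)"
      | Q\s*(?P<q>\d+)\b[.:)]?     # "Q1", optionally followed by . : or )
    )
    \s*""",
    re.VERBOSE,
)


def parse_problem_number(line):
    """Return the problem number at the start of `line`, or None if absent."""
    match = PROBLEM_LABEL.match(line)
    if not match:
        return None
    digits = match.group("dotted") or match.group("paren") or match.group("q")
    return int(digits)


# Hypothetical lines showing each supported style and two non-matches.
samples = ["1. Solve for x", "(12) Prove the claim", "Q3 Find the limit",
           "3.14 is roughly pi", "See problem 2."]
numbers = [parse_problem_number(line) for line in samples]
```

Normalizing every style to a plain integer means the later pairing step can match problems and solutions even when the two sections use different label styles.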
Precise Link Placement
- Bounding boxes needed adjustments due to variations in font size and page scaling.
- Ensuring clickable areas were not too small or misplaced required fine-tuning.
Memory Management
- Large PDFs required efficient handling to prevent excessive memory usage.
- Streaming file operations minimized RAM consumption.
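As an illustration of streaming file handling, the sketch below copies an upload to a temporary file in fixed-size chunks instead of reading it into one large bytes object. It assumes the upload is any binary file-like object; the function name and the 1 MiB chunk size are hypothetical choices, not values taken from FastBook.

```python
import io
import os
import shutil
import tempfile


def save_upload_to_temp(upload, chunk_size=1024 * 1024):
    """Copy a binary file-like object to a temporary .pdf file in chunks.

    Returns the path of the temporary file; the caller must delete it.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
        # copyfileobj reads and writes at most chunk_size bytes at a time.
        shutil.copyfileobj(upload, temp_file, chunk_size)
        return temp_file.name


# Hypothetical upload: an in-memory stand-in for a PDF of about 3 MB.
data = b"%PDF-1.4\n" + b"0" * (3 * 1024 * 1024)
path = save_upload_to_temp(io.BytesIO(data))
size_on_disk = os.path.getsize(path)
os.remove(path)  # clean up the temporary file once processing is done
```

Later stages can then open the file from its path, so only the chunk being copied needs to be held by this step at any time.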

🎓 What I Learned

- Deep PDF processing techniques using PyPDF2, pdfplumber, and PyMuPDF.
- Text recognition and pattern matching using regex for structured data extraction.
- Coordinate mapping and hyperlinking in PDFs using annotation objects.
- Building user-friendly interfaces with Streamlit and handling file uploads efficiently.

FastBook 📖 makes textbooks interactive, reducing study friction and saving time. 🚀