Document logical structure extraction from certain types of pdf documents. Not any and all pdf docs! This is a very heuristic, rule-based approach to extracting logical structure such as section hierarchy; figure & table objects and their captions; headers & footers. Currently targeted to single-column technical documents. Uses pdfplumber for raw pdf extraction, then table detection & parsing relies on camelot. Important for RAG to extract meaningful chunks both for retrieval and LLM context. Not yet released---would love more examples from real use cases.

Built With

Share this project:

Updates