Inspiration
In the age of AI, it is almost impossible for ordinary users to prevent their private information from being used for model training. That creates a trust problem. We wanted to make AI workflows safer.
What it does
Ink identifies sensitive information in PDF files, including information contained in scanned pages. It allows the user to review all detected items, keep the ones that should remain, and redact the rest. The output is a safe-to-use redacted PDF plus a key that enables restoration. When the user needs the original information back, they can upload the redacted file and the key, and the app restores the content.
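The redact-then-restore idea can be illustrated on a plain string. This is only a minimal sketch under stated assumptions, not Ink's actual PDF pipeline: the `Span` and `RestorationKey` shapes and the `redact` and `restore` functions are hypothetical names invented for the example, and the key is kept as plain data here, whereas a real key would likely need protection.

```typescript
// Hypothetical types for illustration; the project's real key format is not described.
interface Span {
  start: number; // inclusive offset into the text
  end: number;   // exclusive offset
}

interface RestorationKey {
  entries: { start: number; original: string }[];
}

// Replace each span with block characters and record the hidden text in the key.
function redact(text: string, spans: Span[]): { redacted: string; key: RestorationKey } {
  const sorted = [...spans].sort((a, b) => a.start - b.start);
  const entries: RestorationKey["entries"] = [];
  let redacted = "";
  let cursor = 0;
  for (const span of sorted) {
    // Clamp overlapping spans so no character is redacted twice.
    const start = Math.max(span.start, cursor);
    if (span.end <= start) continue;
    const original = text.slice(start, span.end);
    redacted += text.slice(cursor, start) + "\u2588".repeat(original.length);
    entries.push({ start, original });
    cursor = span.end;
  }
  redacted += text.slice(cursor);
  return { redacted, key: { entries } };
}

// Put every recorded original back at its offset; lengths are unchanged by redact().
function restore(redacted: string, key: RestorationKey): string {
  let result = redacted;
  for (const { start, original } of key.entries) {
    result = result.slice(0, start) + original + result.slice(start + original.length);
  }
  return result;
}

const sample = "Call Jane at 555-0100";
const { redacted: sampleRedacted, key: sampleKey } = redact(sample, [{ start: 5, end: 9 }]);
const sampleRestored = restore(sampleRedacted, sampleKey);
```

Because the redacted string keeps the same length as the original, the offsets stored in the key stay valid during restoration. Without the key, the redacted output carries no trace of the hidden text.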
How we built it
We combined PDF text extraction, OCR, AI-based entity recognition, and regex-based pattern matching into a single local workflow. The application analyzes text and images locally, presents detections in a review panel, and then applies redaction directly to the PDF. A paired restoration flow uses the saved key to rebuild the original document.
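As a rough illustration of the regex-based part of that workflow, the sketch below scans extracted text for two example patterns and returns detections with offsets that a review panel could list. The `Detection` shape, the `PATTERNS` table, and `findPatternMatches` are assumptions made for this example; the patterns Ink actually uses are not listed here.

```typescript
// Hypothetical detection shape; not taken from the project's code.
interface Detection {
  label: string;
  start: number; // inclusive offset into the text
  end: number;   // exclusive offset
  value: string;
}

// Illustrative patterns only; real coverage would need many more and stricter ones.
const PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { label: "ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
];

function findPatternMatches(text: string): Detection[] {
  const detections: Detection[] = [];
  for (const { label, pattern } of PATTERNS) {
    // Build a fresh global regex per scan so lastIndex state is never shared.
    const re = new RegExp(pattern.source, "g");
    let match: RegExpExecArray | null;
    while ((match = re.exec(text)) !== null) {
      detections.push({
        label,
        start: match.index,
        end: match.index + match[0].length,
        value: match[0],
      });
    }
  }
  // Sort by position so the results read in document order.
  return detections.sort((a, b) => a.start - b.start);
}

const found = findPatternMatches("Mail jane@example.com or SSN 123-45-6789.");
```

Regex matching like this suits items with a fixed shape, while free-form items such as names are left to the entity recognition model. Both kinds of result can share one detection shape so the review step treats them the same way.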
Challenges we ran into
PDF is a format with many internal variations. This makes it difficult to build an application that can standardize extracted text and correctly identify the information that needs to be redacted. Later, the redacted information must also be restored correctly while preserving the document’s original appearance. This became an even greater challenge because, although the team had experience with JavaScript, TypeScript was new to us.
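One small piece of standardizing extracted text can be sketched as follows. This is an assumption about what such a step might look like, not the project's code, and `normalizeExtractedText` is a hypothetical helper name. It uses the standard `String.prototype.normalize` method with the NFKC form, which expands ligatures and maps non-breaking spaces to ordinary spaces, and then collapses whitespace runs.

```typescript
// Hypothetical helper: bring text from different extraction paths into one comparable form.
function normalizeExtractedText(raw: string): string {
  return raw
    .normalize("NFKC")     // e.g. the "fi" ligature becomes the letters "f" and "i"
    .replace(/\s+/g, " ")  // collapse line breaks and repeated spaces into one space
    .trim();
}

const normalized = normalizeExtractedText("\uFB01le\u00A0name\n  2024");
```

Normalization changes string length, so a pipeline that works this way would have to map offsets in the normalized text back to positions in the PDF before redacting.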
Accomplishments that we're proud of
We are proud of how we combined PDF text extraction across multiple PDF subformats and made both text-based and OCR-based redaction work reliably. We also ensured that the entire workflow runs locally, so all uploaded data remains with the user, while still providing the ability to reverse the redactions.
What we learned
The most valuable thing we learned was how PDFs are constructed and how to process them at a low level. We learned how to interact with streams and how to redact content in a way that still allows it to be restored to its original form. In addition, we gained more technical experience working with Git, TypeScript, JavaScript, and several libraries for manipulating PDF files in different ways.
What's next for Ink
Next, we would like to add a way for users to manually select text or parts of an image for redaction, in addition to relying on the transformer model to automatically identify sensitive information. This would help ensure that everything the user intends to hide is protected from external AI models.
Built With
- gliner
- hugging-face-transformers
- pdf-lib
- pdf.js
- react
- tailwind-css
- tesseract.js
- typescript
- vite
- zustand