DuoCrypt

Inspiration

As students and professionals who have worked with sensitive data in financial institutions, our group has used AI models to automate workflows and work more efficiently, both on the job and while studying. However, this often comes at the expense of privacy: sensitive information is fed into AI systems and may be retained on the provider's servers as long-term memory, which poses significant risks to data security. With AI increasingly adopted in workplaces for automation, there is a pressing need for a model that can conceal personally identifiable information (PII) without altering the core meaning of a document. Such a model would allow AI systems to treat PII as abstract variables, enabling workflow automation without ever accessing the actual sensitive content.

While researching, we realised that most existing privacy tools simply redact or mask PII, which makes text harder to read and less usable for downstream AI models. DuoCrypt instead encrypts PII, keeping it safe but still structurally present in the text. Because the encrypted tokens remain embedded in the text, AI systems can still understand the context of a document and run their analysis without being “blinded” by redactions.

What it does

Our model encrypts sensitive PII in documents while preserving all non-sensitive content, maintaining readability and context for secure data handling and AI processing.

How we built it

We drew inspiration from an existing model hosted on Hugging Face, fine-tuned from Microsoft’s DeBERTa model: https://huggingface.co/iiiorg/piiranha-v1-detect-personal-information. This model identifies a specific set of PII types (names, contact numbers, locations, etc.) and redacts each identified item by replacing it with its category. We decided to enhance the model by encrypting PII instead of redacting it. DuoCrypt identifies PII and encrypts it using AES-256, converting it into unreadable ciphertext. Authorised users can decrypt the ciphertext and recover the original PII by using the same 32-byte (256-bit) key and 16-byte (128-bit) pseudorandom initialisation vector (IV) that were used to encrypt the plaintext.
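As a rough illustration of the encrypt-in-place idea, here is a minimal Python sketch. It is not our exact implementation: it assumes the third-party cryptography package, AES-256 in CBC mode with PKCS7 padding, and a made-up token format of [LABEL:ciphertext]. The function names and the hand-written character spans are hypothetical; in practice the spans would come from the PII detection model.

```python
import base64
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY_BYTES = 32  # AES-256 key length: 256 bits
IV_BYTES = 16   # AES block size: 128 bits


def encrypt_pii(plaintext: str, key: bytes, iv: bytes) -> str:
    """Encrypt one PII string and return URL-safe base64 ciphertext."""
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext.encode("utf-8")) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    ciphertext = encryptor.update(padded) + encryptor.finalize()
    return base64.urlsafe_b64encode(ciphertext).decode("ascii")


def decrypt_pii(token: str, key: bytes, iv: bytes) -> str:
    """Reverse encrypt_pii using the same key and IV."""
    ciphertext = base64.urlsafe_b64decode(token.encode("ascii"))
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return (unpadder.update(padded) + unpadder.finalize()).decode("utf-8")


def encrypt_spans(text: str, spans, key: bytes, iv: bytes) -> str:
    """Replace each (start, end, label) span with an encrypted token.

    Spans are character offsets and are assumed not to overlap.
    Everything outside the spans is copied through unchanged.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}:{encrypt_pii(text[start:end], key, iv)}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    key = os.urandom(KEY_BYTES)
    iv = os.urandom(IV_BYTES)
    text = "Contact Jane Doe at 555-0100."
    # Hypothetical detector output, written by hand for this example.
    spans = [(8, 16, "NAME"), (20, 28, "PHONE")]
    print(encrypt_spans(text, spans, key, iv))
```

The sketch reuses one IV for every value to mirror the single key and IV described above. With CBC, a fresh random IV per encrypted value stored alongside its ciphertext is generally the safer design, since identical plaintexts would otherwise produce identical tokens.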

Challenges we ran into

Deciding how to fine-tune the model

After deciding on the project and topic, we delved into research on PII and found many Hugging Face models that were already able to detect and mask it. This disheartened us, as we felt that what we wanted to create already existed. However, after analysing the project requirements, we realised that we could build on existing models, and we went on to research the features we wanted to add to improve on them.

Limited free GPU testing

Google Colab's free tier offers limited GPU time, and we lost our code several times because we kept getting disconnected. To work around this, we saved our work frequently and switched Google accounts to access a fresh free runtime for testing.

Accomplishments that we're proud of

Our model has the potential to be adopted by a wide range of organisations that handle millions of sensitive PII records, from financial groups to militaries that run AI-simulated war games to evaluate tactics. DuoCrypt's versatility keeps PII private while still allowing AI models to run on the protected text.

What we learned

Through our research, we gained a deep understanding of PII and its relevance to AI today. PII is sensitive by nature, and our growing reliance on AI reinforced the view that humans are the weakest link in security because of human error. On the flip side, we also learned how LLM products such as OpenAI’s ChatGPT store user input in long-term memory for personalisation. This long-term memory is stored on OpenAI’s servers and could potentially be vulnerable to attacks such as prompt injection or LLM poisoning. This shows that threats are constantly evolving and that we need to be more vigilant against attacks that trick us into revealing our data.

What's next for DuoCrypt

We aim to expand the prototype to support customisable PII categories, allowing users to choose which types of data to encrypt. A simple, intuitive UI will let non-technical users protect sensitive information without hassle, making privacy accessible to everyone.