Complyr

Inspiration 🌠

Navigating the sea of data to ensure GDPR compliance is a herculean task for banks. The urgency to flag and handle sensitive information accurately and efficiently served as the catalyst for Complyr. We aimed to tackle the need for automated solutions in a world where manual efforts just can't keep up.

What it does 🤖

Complyr is a robust data crawler designed to sift through a filesystem filled with varied data types, identifying and flagging files that contain sensitive information. Using multiple techniques like Named Entity Recognition (NER) and regular expressions, Complyr sorts files into three categories: compliant ✅, non-compliant ❌, and those that require human review 👀.
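To make the three-way sorting concrete, here is a minimal sketch of how regex hits could be turned into a verdict. The patterns, the `Verdict` enum, and the rule "strong hit means non-compliant, weak hit means review" are assumptions for illustration only, not Complyr's actual rules, and the NER side is left out entirely.

```python
import re
from enum import Enum


class Verdict(Enum):
    COMPLIANT = "compliant"
    NON_COMPLIANT = "non-compliant"
    REVIEW = "review"


# Hypothetical patterns, deliberately simple; real PII rules would be stricter.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")
PHONE_RE = re.compile(r"\+?\d[\d ()-]{7,}\d")  # noisy: also matches dates and IDs


def classify_text(text: str) -> Verdict:
    """Map regex evidence to one of the three categories (assumed rule set)."""
    strong_hits = len(EMAIL_RE.findall(text)) + len(IBAN_RE.findall(text))
    weak_hits = len(PHONE_RE.findall(text))
    if strong_hits:
        return Verdict.NON_COMPLIANT  # unambiguous identifier found
    if weak_hits:
        return Verdict.REVIEW  # ambiguous digits: let a human decide
    return Verdict.COMPLIANT
```

Treating noisy patterns as "needs review" rather than "non-compliant" is one way to keep false positives from flooding the non-compliant bucket.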

How we built it 🔨

We took a two-pronged approach. First, we tackled data preparation 📊, facing more than 15 different data types in our sample of ~800 files. We developed specialized functions to read, clean, and prepare these files for classification. Next came the machine learning model 🧠. We opted for an ensemble approach using spaCy for NER, a pretrained token-classification model from the `transformers` library, and good old regex for certain specific fields. All of this was neatly packaged into a Docker container 🐳 for easy deployment.
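The data preparation step can be pictured as a dispatch table from file extension to a reader function that returns plain text. The sketch below covers only three formats with the standard library; the function names, the `READERS` table, and the choice to return `None` for unsupported types are assumptions, not the project's actual code.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def read_plain(path: Path) -> str:
    # Replace undecodable bytes instead of failing on mixed encodings.
    return path.read_text(encoding="utf-8", errors="replace")


def read_csv(path: Path) -> str:
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(" ".join(row) for row in csv.reader(handle))


def read_json(path: Path) -> str:
    def leaves(node):
        # Walk nested dicts and lists, yielding every scalar value as text.
        if isinstance(node, dict):
            for value in node.values():
                yield from leaves(value)
        elif isinstance(node, list):
            for item in node:
                yield from leaves(item)
        else:
            yield str(node)

    return " ".join(leaves(json.loads(read_plain(path))))


# Hypothetical registry: one reader per supported extension.
READERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain,
    ".md": read_plain,
    ".csv": read_csv,
    ".json": read_json,
}


def extract_text(path: Path) -> Optional[str]:
    """Return the file's text, or None when no reader is registered."""
    reader = READERS.get(path.suffix.lower())
    return reader(path) if reader else None
```

A `None` result gives the caller a natural hook for routing unsupported files to human review instead of silently marking them compliant.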

Challenges we ran into 🛠️

The sheer variety of data types and formats was a significant hurdle 🚧. Another challenge was optimizing the trade-off between accuracy and efficiency ⏱️, particularly when chaining multiple techniques together, like NER and Regex.
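One common way to manage this trade-off is a cascade: run the cheap regex stage first and call the expensive NER stage only when regex finds nothing. The write-up does not say whether Complyr chains its stages this way, so treat the following as a sketch under that assumption; `ner_stage` is a hypothetical callable standing in for a spaCy or `transformers` model.

```python
import re
from typing import Callable, List

# Illustrative pattern only; a real deployment would check many field types.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def regex_stage(text: str) -> List[str]:
    """Cheap first pass: a single compiled regex over the text."""
    return EMAIL_RE.findall(text)


def scan(text: str, ner_stage: Callable[[str], List[str]]) -> List[str]:
    """Return findings, skipping the slow NER stage when regex already hit."""
    hits = regex_stage(text)
    if hits:
        return hits  # the file is already flagged; no model call needed
    return ner_stage(text)  # fall back to the slower, more general model
```

The short-circuit saves model calls on files that are already flagged, at the cost of not listing every entity in them.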

Accomplishments that we're proud of 🏆

We're thrilled that we managed to build a working prototype in a time-crunched environment ⏰. The modular code design allows for the easy addition of new classifiers and tools 🛠️, showing promise for future scalability. Additionally, our choice to blend multiple techniques allowed for a more accurate identification of sensitive data 👍.
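A modular design of this kind is often built around a small registry that new detectors plug into. The sketch below shows the idea; the `register` decorator, the `DETECTORS` table, and `run_all` are hypothetical names, not taken from the Complyr codebase.

```python
import re
from typing import Callable, Dict, List

Detector = Callable[[str], List[str]]

# Hypothetical registry: detector name -> function returning matched strings.
DETECTORS: Dict[str, Detector] = {}


def register(name: str) -> Callable[[Detector], Detector]:
    """Decorator that adds a detector to the registry under the given name."""
    def decorator(func: Detector) -> Detector:
        DETECTORS[name] = func
        return func
    return decorator


@register("email")
def find_emails(text: str) -> List[str]:
    return re.findall(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", text)


def run_all(text: str) -> Dict[str, List[str]]:
    """Run every registered detector and keep only those with findings."""
    results: Dict[str, List[str]] = {}
    for name, detector in DETECTORS.items():
        found = detector(text)
        if found:
            results[name] = found
    return results
```

With this layout, adding a classifier means writing one decorated function; the crawling and reporting code stays untouched.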

What we learned 🎓

We gained insights into the complexities of data privacy laws like GDPR 📜 and the challenges banks face in becoming compliant. On the technical side, we learned the power of combining various NLP and machine learning techniques to create a more robust solution 💪.

What's next for Complyr 🚀

The road ahead involves tuning our existing models for higher accuracy 🎯 and incorporating additional machine learning algorithms to expand the range of data types we can process 📈. We also plan to explore real-world applications and scalability options for Complyr 🌐.

Submitted to

HackZurich 2023

Created by

I mostly worked on data analysis, the NER model, and regex matching. NER was new to us, and optimizing the app was also very difficult. The challenge was well defined and covered many areas of computer science. It was demanding, and we learned a lot.

Mustafa Soner IŞIK
I focused on data preprocessing and file crawling, working with various datatypes to clean and prepare data for our classification model. In parallel, I collaborated with the team on other tasks to fine-tune our solution.

Volodymyr Moskalenko
I worked on leveraging the `transformers` Python library and picking an appropriate pretrained NER model to identify PII in text. Additionally, I did my best to help structure the code into modular pieces.

Vasyl Haievyi
ekke Aichholzer
Jakob Spohler

Updates

Volodymyr Moskalenko started this project — Sep 17, 2023 02:06 AM EDT
