Navigating the sea of data to ensure GDPR compliance is a herculean task for banks. The urgency to flag and handle sensitive information accurately and efficiently served as the catalyst for Complyr. We aimed to tackle the need for automated solutions in a world where manual efforts just can't keep up.
What it does 🤖
Complyr is a robust data crawler designed to sift through a filesystem filled with varied data types, identifying and flagging files that contain sensitive information. Using multiple techniques like Named Entity Recognition (NER) and regular expressions, Complyr sorts files into three categories: compliant ✅, non-compliant ❌, and those that require human review 👀.
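To make the three-way sorting concrete, here is a minimal sketch of how such a triage step could look. The decision policy is an assumption for illustration (a regex hit marks a file non-compliant, an NER-only hit sends it to human review), and the names `Verdict`, `triage`, and `ner_person_hits` are hypothetical rather than taken from Complyr's code.

```python
import re
from enum import Enum


class Verdict(Enum):
    COMPLIANT = "compliant"
    NON_COMPLIANT = "non-compliant"
    REVIEW = "needs human review"


# Illustrative patterns only; a real deployment would need a vetted pattern set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def triage(text: str, ner_person_hits: int = 0) -> Verdict:
    """Sort one file's text into a category.

    ner_person_hits is assumed to come from a separate NER step
    (for example, the number of PERSON entities found in the text).
    """
    regex_hits = len(EMAIL_RE.findall(text)) + len(IBAN_RE.findall(text))
    if regex_hits > 0:
        # Structured identifiers are treated as a definite finding (assumed policy).
        return Verdict.NON_COMPLIANT
    if ner_person_hits > 0:
        # NER alone is treated as less certain, so a person decides (assumed policy).
        return Verdict.REVIEW
    return Verdict.COMPLIANT
```

Keeping the policy in one small function makes it easy to adjust which signals count as definite and which ones go to a reviewer.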
How we built it 🔨
We took a two-pronged approach. First, we tackled data preparation 📊, facing an array of over 15 different data types in our sample of ~800 files. We developed specialized functions to read, clean, and prepare these files for classification. Next came the machine learning model 🧠. We opted for an ensemble approach using spaCy for NER, a token-classification model from Hugging Face transformers, and good old regex for certain specific fields. All of this was neatly packaged into a Docker container 🐳 for easy deployment.
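The data preparation step can be pictured as a lookup from file extension to a reader function. The sketch below covers only three formats using the Python standard library; the function names and the `READERS` table are hypothetical, and the actual project handled many more types.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def read_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def read_csv(path: Path) -> str:
    # Flatten every row into plain text so downstream classifiers see one string.
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(" ".join(row) for row in csv.reader(handle))


def read_json(path: Path) -> str:
    # Re-serialize so keys and values both end up in the text to be scanned.
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, ensure_ascii=False)


# Hypothetical registry: one reader per supported extension.
READERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_txt,
    ".csv": read_csv,
    ".json": read_json,
}


def extract_text(path: Path) -> Optional[str]:
    """Return whitespace-normalized text, or None for unsupported file types."""
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        return None
    return " ".join(reader(path).split())
```

Returning `None` for unknown extensions lets the caller decide what to do with such files, for example routing them to human review instead of guessing.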
Challenges we ran into 🛠️
The sheer variety of data types and formats was a significant hurdle 🚧. Another challenge was optimizing the trade-off between accuracy and efficiency ⏱️, particularly when chaining multiple techniques together, like NER and Regex.
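One common way to manage this trade-off is a cascade: run the cheap check first and call the expensive one only when the cheap check finds nothing. The sketch below illustrates the idea under assumptions; `CountingNer` is a hypothetical stand-in for a real NER model, used here only to show how often the slow stage runs.

```python
import re
from typing import Callable, Iterable

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_stage(text: str) -> bool:
    """Cheap check: True as soon as an email-like string is found."""
    return EMAIL_RE.search(text) is not None


class CountingNer:
    """Hypothetical stand-in for an expensive NER model; it counts its calls."""

    def __init__(self, known_names: Iterable[str]) -> None:
        self.known_names = set(known_names)
        self.calls = 0

    def __call__(self, text: str) -> bool:
        self.calls += 1
        return any(name in text for name in self.known_names)


def cascade(
    text: str,
    cheap: Callable[[str], bool],
    expensive: Callable[[str], bool],
) -> bool:
    """Return True if either stage flags the text.

    `or` short-circuits, so the expensive stage is skipped whenever
    the cheap stage has already produced a hit.
    """
    return cheap(text) or expensive(text)
```

For example, with `ner = CountingNer(["Jane Doe"])`, calling `cascade("write to jane@example.com", regex_stage, ner)` flags the text without touching the NER stand-in, so `ner.calls` stays at 0.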
Accomplishments that we're proud of 🏆
We're thrilled that we managed to build a working prototype in a time-crunched environment ⏰. The modular code design allows for the easy addition of new classifiers and tools 🛠️, showing promise for future scalability. Additionally, our choice to blend multiple techniques allowed for a more accurate identification of sensitive data 👍.
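A modular design of this kind is often built around a small registry, so a new detector can be added without changing the scanning loop. The following is a sketch of that pattern, not Complyr's actual code; `register`, `DETECTORS`, and `scan` are hypothetical names.

```python
import re
from typing import Callable, Dict, List

Detector = Callable[[str], List[str]]

# Hypothetical registry mapping a detector name to its function.
DETECTORS: Dict[str, Detector] = {}


def register(name: str) -> Callable[[Detector], Detector]:
    """Decorator that adds a detector to the registry under `name`."""

    def decorator(fn: Detector) -> Detector:
        DETECTORS[name] = fn
        return fn

    return decorator


EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@register("email")
def find_emails(text: str) -> List[str]:
    return EMAIL_RE.findall(text)


def scan(text: str) -> Dict[str, List[str]]:
    """Run every registered detector and keep only those that found something."""
    findings: Dict[str, List[str]] = {}
    for name, detector in DETECTORS.items():
        hits = detector(text)
        if hits:
            findings[name] = hits
    return findings
```

Adding another technique, such as an NER-backed detector, would then only require writing one function and decorating it with `@register("person")`.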
What we learned 🎓
We gained insights into the complexities of data privacy laws like GDPR 📜 and the challenges banks face in becoming compliant. On the technical side, we learned the power of combining various NLP and machine learning techniques to create a more robust solution 💪.
What's next for Complyr 🚀
The road ahead involves tuning our existing models for higher accuracy 🎯 and incorporating additional machine learning algorithms to expand the range of data types we can process 📈. We also plan to explore real-world applications and scalability options for Complyr 🌐.