Inspiration
The increasing importance of data privacy and the rise in regulations like GDPR and CCPA inspired us to create DataBlind. We wanted to develop a tool that helps organizations manage Personally Identifiable Information (PII) more effectively, ensuring that sensitive data is masked or anonymized before sharing or processing.
What it does
DataBlind automatically detects and masks PII in text files. It supports various types of sensitive information, including email addresses, phone numbers, social security numbers, credit card numbers, names, dates, and locations. Users can upload single files or batches, choose which types of PII to mask, and download the processed files along with a report summarizing the detected PII.
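To make the detect, mask, and report flow concrete, here is a minimal sketch that uses only regular expressions from the Python standard library. It is not DataBlind's actual implementation: the pattern set, the `PII_PATTERNS` table, and the `mask_pii` function are hypothetical names chosen for illustration, and the patterns cover only a few simple US-style formats.

```python
import re

# Illustrative patterns only (an assumption, not DataBlind's rules).
# Real-world PII formats vary far more than these cover.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_pii(text, enabled=None):
    """Mask the enabled PII types and return (masked_text, report).

    `enabled` is a list of keys from PII_PATTERNS; None means all types.
    The report maps each PII type to the number of matches replaced.
    """
    labels = enabled if enabled is not None else list(PII_PATTERNS)
    report = {}
    for label in labels:
        # subn returns the rewritten text plus the number of substitutions.
        text, count = PII_PATTERNS[label].subn(f"[{label}]", text)
        if count:
            report[label] = count
    return text, report


if __name__ == "__main__":
    masked, report = mask_pii(
        "Contact jane@example.com or 555-123-4567. SSN 123-45-6789."
    )
    print(masked)  # Contact [EMAIL] or [PHONE]. SSN [SSN].
    print(report)
```

Passing `enabled=["EMAIL"]` masks only email addresses, which mirrors the option to choose which PII types to mask. Names, dates, and locations generally need a statistical model rather than patterns like these.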
How we built it
We built DataBlind using Flask for the backend and integrated a spaCy-based NLP model for PII detection. The frontend was developed using Streamlit for an interactive user experience. We used Pandas for file processing, which allowed efficient reading and manipulation of data. The system also includes a chatbot component that uses OpenAI's GPT for natural language processing and interaction.
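An NER model typically reports each entity as a character span with a label. In spaCy, for example, each entity in `doc.ents` exposes `start_char`, `end_char`, and `label_`. The sketch below shows one way such spans could be turned into masked text. It is an assumption about how this step might work, not the project's code: the `Span` tuple and `apply_masks` function are hypothetical, and the spans are hard-coded so the example runs without a model installed.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A detected entity: character offsets plus a label (hypothetical type)."""
    start: int
    end: int
    label: str


def apply_masks(text, spans):
    """Replace each non-overlapping span with a [LABEL] placeholder.

    When spans overlap, the earliest one wins; if two start at the same
    offset, the longer one wins. Returns (masked_text, spans_applied).
    """
    chosen = []
    for span in sorted(spans, key=lambda s: (s.start, -(s.end - s.start))):
        if chosen and span.start < chosen[-1].end:
            continue  # overlaps a span that was already accepted
        chosen.append(span)
    # Replace from the end of the text so earlier offsets stay valid.
    for span in reversed(chosen):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text, chosen


if __name__ == "__main__":
    # Hand-written spans standing in for model output.
    spans = [Span(0, 11, "PERSON"), Span(0, 5, "PERSON"), Span(21, 26, "GPE")]
    masked, applied = apply_masks("Alice Smith lives in Paris.", spans)
    print(masked)  # [PERSON] lives in [GPE].
```

The `applied` list can feed the summary report directly, since it records exactly which entities were masked.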
Challenges we ran into
One of the major challenges was ensuring the accuracy of PII detection, especially when dealing with unstructured text. We also faced issues with processing batch files in different formats, and integrating the masking and reporting features seamlessly.
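One common way to handle batches of mixed formats is to dispatch on the file extension and reduce every file to a list of text strings that a detector can scan. The sketch below illustrates that idea with the standard library only. It is not how DataBlind does it (the project used Pandas for file handling), and `READERS` and `load_batch` are hypothetical names.

```python
import csv
import json
from pathlib import Path


def _read_txt(path):
    # A plain text file is treated as a single unit of text.
    return [path.read_text(encoding="utf-8")]


def _read_csv(path):
    # Every cell, including the header row, becomes its own unit.
    with path.open(newline="", encoding="utf-8") as handle:
        return [cell for row in csv.reader(handle) for cell in row]


def _read_json(path):
    # Collect every string value, however deeply nested; keys are ignored.
    def walk(node):
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)

    return list(walk(json.loads(path.read_text(encoding="utf-8"))))


# Map of supported extensions to reader functions (illustrative set).
READERS = {".txt": _read_txt, ".csv": _read_csv, ".json": _read_json}


def load_batch(paths):
    """Return ({file name: [text units]}, [names of unsupported files])."""
    loaded, skipped = {}, []
    for raw in paths:
        path = Path(raw)
        reader = READERS.get(path.suffix.lower())
        if reader is None:
            skipped.append(path.name)
            continue
        loaded[path.name] = reader(path)
    return loaded, skipped
```

Reporting unsupported files instead of raising an error lets one bad file in a batch fail without stopping the rest.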
Accomplishments that we're proud of
We successfully implemented a robust system that can handle a wide variety of PII types and formats. Our model's ability to detect and mask PII with high accuracy is something we're particularly proud of. We also developed a user-friendly interface that allows users to interact with the system easily, whether they are processing a single file or a batch of files. Finally, we used OpenAI's GPT-4 model to power our conversational RAG chatbot.
What we learned
Throughout the development of DataBlind, we learned a lot about natural language processing, data privacy regulations, and file handling in web applications. We also gained experience in optimizing performance for batch processing and integrating AI models into a Flask-based web application.
What's next for DataBlind
We plan to expand DataBlind's capabilities to support more file types and formats, such as PDFs and images. We also aim to further improve the accuracy and speed of PII detection and to add features like automated compliance reporting and integration with cloud storage solutions. Additionally, we hope to incorporate more sophisticated data anonymization techniques to enhance privacy protection. Finally, we plan to integrate big data technologies so that the solution can handle larger datasets and process them faster.
Built With
- flask
- openai
- rag
- spacy
- streamlit