Confidex

Inspiration

The inspiration behind Confidex came from the rising concerns surrounding data privacy in an age where AI tools are rapidly gaining traction. With employees increasingly using AI tools like ChatGPT and third-party SaaS platforms, organisations are at risk of accidentally sharing confidential and sensitive information, which may lead to privacy breaches, fraud, or legal consequences. We wanted to create a system that would automatically detect sensitive data, flag it, and explain why it's sensitive based on the company’s data privacy policies, allowing companies to be proactive about data protection.

What it does

Confidex is an AI-powered system designed to help organisations safeguard sensitive data by identifying, flagging, and explaining why specific data is confidential. It uses a fine-tuned language model (DistilBERT) to detect a variety of sensitive data types within user inputs, such as NRIC numbers, bank account numbers, email addresses, and credit card details.

Once data is flagged, the system links it to the relevant policy clauses using a Retrieval-Augmented Generation (RAG) pipeline. This allows Confidex to generate context-specific explanations of why the flagged data is sensitive, ensuring transparency and compliance with organisational data privacy rules.
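
To make the generation step concrete, here is a minimal sketch of how a flagged data type and its retrieved clauses might be combined into a prompt for the explanation model. The function name, the prompt wording, and the `(clause_id, text)` clause format are assumptions for illustration, not the project's actual implementation.

```python
from typing import List, Tuple


def build_explanation_prompt(label: str, clauses: List[Tuple[str, str]]) -> str:
    """Combine a flagged data type with retrieved policy clauses.

    `clauses` holds hypothetical (clause_id, clause_text) pairs returned
    by the retrieval step.
    """
    # Prefix each clause with its identifier so the explanation can cite it.
    context = "\n".join(f"[{cid}] {text}" for cid, text in clauses)
    return (
        "You are explaining a data privacy flag to an employee.\n"
        f"Flagged data type: {label}\n"
        "Relevant policy clauses:\n"
        f"{context}\n"
        "Explain briefly why this data is sensitive, "
        "citing the clause identifiers above."
    )
```

Only the data type is placed in the prompt here, not the flagged value itself. That is a design choice in this sketch, made so the sensitive value is not forwarded to the generator.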

Additionally, Confidex offers an interactive dashboard for employers to monitor employee interactions with sensitive data. It tracks the most commonly flagged data, provides real-time logs of employee behaviour, and identifies high-risk employees, helping to prevent potential data leaks and breaches.
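
The dashboard metrics described above could be derived from a simple log of flag events. The sketch below assumes a hypothetical log format (one dictionary per flagged input, with `employee_id` and `label` keys) and a hypothetical threshold for what counts as high risk. The write-up does not specify either.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical log record: {"employee_id": ..., "label": ...}
FlagEvent = Dict[str, str]


def top_labels(events: List[FlagEvent], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequently flagged data types with their counts."""
    return Counter(e["label"] for e in events).most_common(n)


def high_risk_employees(events: List[FlagEvent], threshold: int = 3) -> List[str]:
    """Return IDs of employees with at least `threshold` flags, sorted by ID."""
    counts = Counter(e["employee_id"] for e in events)
    return sorted(emp for emp, c in counts.items() if c >= threshold)
```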

How we built it

Confidex was built by combining AI, backend, and frontend technologies. We began by selecting DistilBERT and fine-tuning it to recognise a wide variety of sensitive data labels, such as NRIC, Bank Account Number, Credit Card, and Salary. To make each flag explainable, we added a Retrieval-Augmented Generation (RAG) step, where relevant clauses from a company’s data protection policy are retrieved based on the flagged sensitive data and used to explain why it is sensitive.
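
As an illustration of what a label set like this can look like in practice, the sketch below assumes the detector is framed as token classification with BIO tags. The write-up does not state this framing, so treat it as one plausible setup. The helper names are hypothetical. Maps of this shape are what one would typically hand to a Hugging Face model configuration as `id2label` and `label2id`.

```python
from typing import Dict, List, Optional, Tuple

# Entity types named in the write-up; the BIO scheme itself is an assumption.
ENTITY_TYPES = ["NRIC", "BANK_ACCOUNT", "CREDIT_CARD", "SALARY"]


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id<->label maps with an 'O' tag plus B-/I- tags per entity type."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"B-{t}", f"I-{t}"]
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def decode_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    buf: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and label == tag[2:]:
            buf.append(token)  # continue the current span
            continue
        if label is not None:
            spans.append((label, " ".join(buf)))  # close the open span
        if tag.startswith("B-"):
            label, buf = tag[2:], [token]  # start a new span
        else:
            label, buf = None, []  # 'O' or a stray I- tag: no open span
    if label is not None:
        spans.append((label, " ".join(buf)))
    return spans
```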

We used FAISS for efficient similarity-based search to find the most relevant policy clauses. The backend is powered by Flask, which handles the API requests for querying the model and retrieving information from the policy. On the frontend, we built a React app with TailwindCSS to create a seamless, user-friendly interface for employees to interact with the system. The dashboard also provides employers with key insights, such as common sensitive terms being flagged and real-time logs of employee data inputs.
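
The retrieval idea can be shown without any dependencies. The sketch below is a deliberately simplified stand-in, not the FAISS setup itself: it swaps dense embeddings for word counts and the FAISS index for a brute-force cosine-similarity scan. All function names are hypothetical. In the real pipeline, a sentence encoder and a FAISS index would take the place of `embed` and the scoring loop.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a dense encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_clauses(query: str, clauses: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every clause against the query and return the k best matches."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in clauses]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

A brute-force scan like this is only practical for a handful of clauses. An index such as FAISS matters once the policy corpus grows and lookups must stay fast.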

Challenges we ran into

One of the primary challenges we encountered was fine-tuning the DistilBERT model. The goal was to make it accurate across different contexts and able to flag a wide range of sensitive data types. To achieve this, we had to train the model with enough diverse examples and fine-tune it to recognise context-specific terminology, which required significant experimentation.

Integrating the RAG pipeline with the model also proved to be challenging. We had to ensure that the system could not only retrieve the correct sections from a large policy document but also do so in real time while maintaining a smooth and fast user experience. Additionally, creating a robust backend system with Flask and managing the data flow between the AI model and the frontend presented multiple hurdles, especially when it came to ensuring the system could scale efficiently.

Accomplishments that we're proud of

We’re proud of the strides we’ve made in fine-tuning DistilBERT to accurately identify a wide range of sensitive data. This wasn’t just about training a model; it was about ensuring the model could understand context and apply real-world data protection principles. Seeing it flag data and provide clear, understandable explanations based on company policy feels like a true success for us.

Another highlight was implementing the Retrieval-Augmented Generation (RAG) pipeline. We spent a lot of time refining the integration of document retrieval with policy explanation generation, and we’re excited to see it working smoothly. The idea that our system can immediately pull up the right policy clause and explain why certain data is sensitive, in real time, is something we’re really proud of.

Lastly, the dashboard is one of the features we’ve enjoyed working on the most. It’s gratifying to see how it allows employers to track, monitor, and understand employee behaviour around sensitive data. Knowing that we’ve built something that can help prevent data leaks and protect companies from potential breaches is deeply rewarding. For us, it’s not just about building a tool; it’s about creating something that genuinely adds value to businesses and contributes to a more secure digital environment.

What we learned

Throughout the development of Confidex, we gained a deeper understanding of several advanced AI concepts, particularly fine-tuning pre-trained models for specific tasks. One of the most important lessons was learning how to fine-tune DistilBERT for sensitive data detection, improving its ability to detect personal identifiers, financial data, and authentication credentials in diverse contexts. Moreover, we learned how Retrieval-Augmented Generation (RAG) pipelines work and how to integrate them to retrieve relevant policy clauses to provide context-sensitive answers.

On the technical side, we deepened our knowledge of vector search using FAISS and how it can be used to quickly retrieve relevant documents from large datasets. We also got hands-on experience in building a complete solution, using a Flask backend for API integration and React with TailwindCSS for the frontend.

What's next for Confidex

Moving forward, we plan to extend Confidex with additional AI models to improve the accuracy of sensitive data detection and expand the types of data we can recognise. We will also improve the real-time alert system, allowing employers to receive instant notifications about potential data breaches or high-risk actions. Furthermore, we aim to integrate Confidex with more enterprise tools, so that data privacy and compliance are built directly into everyday workflows. With growing concerns around data security, we envision Confidex as a solution that can scale to meet the needs of both small businesses and large corporations, providing comprehensive protection for sensitive data.