Inspiration
In today’s data-driven world, ensuring the accuracy and security of enterprise data is paramount. While working on a previous project involving sales rep claims analysis, I saw firsthand the potential risks of undetected anomalies in large datasets. This inspired me to create a scalable, automated anomaly detection system that could adapt to various types of data and ensure that any inconsistencies are flagged early on.
What it does
By leveraging machine learning algorithms, specifically Isolation Forest and Extended Isolation Forest, the system identifies outliers in the data that could indicate potential fraud or irregular patterns.
After reducing dimensionality, the system provides visual representations of the anomalies, making it easier for users to interpret results and investigate flagged records.
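As a rough sketch of the detection step, here is what the Isolation Forest portion might look like with scikit-learn. The file name, column names, and contamination rate are illustrative assumptions, not the project's actual configuration:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical input file and feature columns; real claims data would
# have domain-specific fields.
df = pd.read_csv("claims.csv")
features = df[["claim_amount", "num_line_items", "days_to_submit"]]

# contamination is the expected share of anomalies; 1% here is only a guess.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
df["anomaly"] = model.fit_predict(features)  # -1 = outlier, 1 = inlier

flagged = df[df["anomaly"] == -1]
print(f"{len(flagged)} claims flagged for review")
```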
How we built it
Data Preprocessing & Cleaning: The first step involved cleaning and standardizing data across different sources to ensure consistency.
Dimensionality Reduction: To make the datasets more manageable and to speed up processing, I applied PCA and t-SNE to reduce dimensions while preserving the most important information.
Model Training: I trained Isolation Forest models on various datasets to detect anomalies. To further improve results, I integrated Extended Isolation Forest models and compared their performance with standard Isolation Forests.
Visualization: Using the reduced dimensions, I visualized anomalies alongside normal data points, which helped me interpret the results and fine-tune the models; a rough end-to-end sketch of these steps follows.
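Put together, the steps above might look like the following sketch: standardize, reduce with PCA, fit an Isolation Forest, then project to two dimensions with t-SNE for plotting. All data and parameter values here are synthetic placeholders, and the Extended Isolation Forest comparison is omitted because it lives in a separate package (e.g., eif or isotree):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # stand-in for a cleaned feature matrix
X[:10] += 6                       # inject a few synthetic outliers

# Step 1: standardize so no single feature dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Step 2: PCA shrinks dimensionality while keeping most of the variance.
X_pca = PCA(n_components=10).fit_transform(X_scaled)

# Step 3: fit the detector on the reduced data.
labels = IsolationForest(n_estimators=200, random_state=0).fit_predict(X_pca)

# Step 4: t-SNE projection purely for plotting; its axes are not interpretable.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=(labels == -1), cmap="coolwarm", s=10)
plt.title("Flagged anomalies vs. normal points")
plt.show()
```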
Challenges we ran into
Data Volume: Handling large datasets across multiple systems posed a challenge; I optimized model performance by reducing dimensionality before training.
Feature Engineering: Each dataset required a customized approach, which involved testing different features to identify the most relevant ones for anomaly detection.
Model Tuning: Balancing false positives and false negatives in the anomaly detection process required careful tuning of hyperparameters for both Isolation Forest and Extended Isolation Forest models (see the sketch below).
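One way to approach that tuning, sketched below under the assumption that a small audited (labeled) sample of claims exists, is to sweep the contamination parameter and compare precision and recall on that sample:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-ins: X is the feature matrix, y_true marks known anomalies.
# In practice both would come from a small audited sample of claims.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
X[:15] += 5
y_true = np.zeros(500, dtype=int)
y_true[:15] = 1

for contamination in (0.005, 0.01, 0.03, 0.05):
    preds = IsolationForest(contamination=contamination, random_state=1).fit_predict(X)
    y_pred = (preds == -1).astype(int)  # map -1/1 labels to 1 = anomaly
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"contamination={contamination:.3f}  precision={p:.2f}  recall={r:.2f}")
```

A higher contamination value flags more rows, trading precision for recall; the audited sample makes that trade-off measurable.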
Accomplishments that we're proud of
Created a detailed reporting system that tracks and records anomalies, making it easier for business leaders to review flagged transactions, audit data, and investigate potential issues.
What we learned
This project was a deep dive into both technical and domain challenges:
Dimensionality Reduction: I explored how to reduce the complexity of high-dimensional datasets while retaining essential variance, using techniques like PCA and t-SNE.
Isolation Forest Algorithms: I learned the inner workings of Isolation Forests, in particular that anomalies are isolated in fewer random splits and so end up at shorter average tree depths, and how to extend them to handle different types of data (see the sketch after this list).
Feature Engineering: The importance of selecting the right features for anomaly detection became clear, as different datasets required tailored approaches for optimal results.
Handling Large Data: Managing the sheer volume of data, particularly across various systems like CRM, Grants, MIRF, and Concur, tested my ability to optimize performance and scalability.
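A toy sketch of that path-length intuition: a far-out point is isolated quickly, so scikit-learn's score_samples (where lower means more anomalous) separates it cleanly from the cluster. The data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
normal = rng.normal(0, 1, size=(300, 2))
outlier = np.array([[8.0, 8.0]])   # far from the main cluster
X = np.vstack([normal, outlier])

forest = IsolationForest(n_estimators=100, random_state=2).fit(X)
scores = forest.score_samples(X)   # derived from average path length; lower = more anomalous

print("median score (normal):", np.median(scores[:-1]).round(3))
print("score (outlier):", scores[-1].round(3))
```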
What's next for Anomaly Detection of claims data
Explore the use of advanced machine learning models like Graph Neural Networks (GNNs) and autoencoders to improve the accuracy and precision of anomaly detection, particularly for complex fraud patterns that are hard to capture with traditional models; a rough autoencoder sketch follows.
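As a hedged illustration of the autoencoder direction, not an implementation of the planned system, the sketch below trains a small Keras autoencoder on presumed-normal rows and flags rows with high reconstruction error; every architecture and threshold choice is a placeholder:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(3)
X_train = rng.normal(size=(1000, 20)).astype("float32")   # presumed-normal data
X_test = np.vstack([rng.normal(size=(95, 20)),
                    rng.normal(5, 1, size=(5, 20))]).astype("float32")

# A small symmetric autoencoder; layer sizes are arbitrary placeholders.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(20),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=20, batch_size=32, verbose=0)

# Rows the model reconstructs poorly are candidate anomalies.
errors = np.mean((X_test - autoencoder.predict(X_test, verbose=0)) ** 2, axis=1)
threshold = np.percentile(errors, 95)   # threshold choice is a judgment call
print("flagged rows:", np.where(errors > threshold)[0])
```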
Built With
- Languages: Python
- Frameworks: scikit-learn, PyTorch, TensorFlow
- Cloud services: AWS (S3 for storage, EC2 for compute, Lambda for serverless functions), Databricks, Neptune
- Visualization: Power BI