🛡️ Project Story: SentinelAI
Inspiration
Data is the lifeblood of modern AI, but "Dirty Data" is its silent killer. During our research, we realized that traditional anomaly detection tools have a massive blind spot: they only understand Mathematics, not Logic.
A standard detector will accept a 2-year-old with a PhD or a "New" car with 500,000 miles because each number, taken on its own, falls within a statistically normal range. We were inspired to bridge this gap and build a tool that doesn't just find the unusual, but detects the impossible.
What it does
SentinelAI (Hybrid-AI-Powered-Semantic-Anomaly-Detector) is a domain-agnostic integrity engine. It performs a dual-layer audit:
- Statistical Layer: Uses Machine Learning (Isolation Forest) and Macro-Statistics (Z-Scores, IQR) to find numerical outliers and distribution shifts.
- Semantic Layer: Uses Google Gemini 2.5 Flash to audit the "Logical DNA" of the dataset, identifying contradictions that violate real-world constraints.
It concludes with a Visual Dashboard, including an Ensemble Convergence Matrix that shows exactly where Math and AI logic agree or disagree on data quality.
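As a rough illustration of the Statistical Layer, the sketch below flags numeric outliers with Z-scores, the IQR rule, and Scikit-Learn's Isolation Forest. The function names, the thresholds (3.0 and 1.5), and the contamination rate are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:  # constant column: nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    return np.abs((values - values.mean()) / std) > threshold


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return (values < q1 - k * spread) | (values > q3 + k * spread)


def isolation_forest_outliers(rows, contamination=0.05, seed=0):
    """Flag rows that the Isolation Forest labels as anomalies (-1)."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    return model.fit_predict(np.asarray(rows, dtype=float)) == -1


# Toy column: 200 plausible ages plus one impossible value at the end.
ages = list(range(30, 50)) * 10 + [400]
print(np.flatnonzero(zscore_outliers(ages)))
print(np.flatnonzero(iqr_outliers(ages)))
print(isolation_forest_outliers(np.array(ages).reshape(-1, 1))[-1])
```

Note that all three checks look only at numeric position. A value such as age 2 next to a PhD would pass them, which is the gap the Semantic Layer is meant to cover.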
How we built it
We engineered a Hybrid Triad Architecture:
- The DNA Extractor: Built with Pandas and NumPy, this component extracts a macroscopic profile (Statistical DNA) of the dataset, capturing means, skewness, and correlations between columns.
- The ML Engine: We integrated Scikit-Learn’s Isolation Forest to handle unsupervised row-level anomaly detection without needing labeled training data.
- The Semantic Brain: We integrated the Gemini Flash API. By sending only the "DNA Profile" instead of raw rows, we let the model reason over the entire dataset's context at once.
- The Visualizer: Developed using Matplotlib and Seaborn to translate complex audit logs into an intuitive Ensemble Convergence Matrix.
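To make the DNA Extractor step more concrete, here is a minimal sketch of how a dataset profile might be assembled with Pandas. The function name `extract_profile`, the profile keys, and the 0.8 correlation cutoff are assumptions made for illustration; the project's actual profile format is not shown in this write-up.

```python
import json

import pandas as pd


def extract_profile(df, corr_threshold=0.8, top_k=3):
    """Summarize a DataFrame as a small, JSON-serializable profile."""
    numeric = df.select_dtypes(include="number")
    profile = {
        "n_rows": int(len(df)),
        "numeric": {},
        "categorical": {},
        "strong_correlations": [],
    }

    # Per-column summary statistics for numeric columns.
    for col in numeric.columns:
        series = numeric[col].dropna()
        profile["numeric"][col] = {
            "mean": float(series.mean()),
            "std": float(series.std()),
            "min": float(series.min()),
            "max": float(series.max()),
            "skew": float(series.skew()),
        }

    # Most frequent values for the remaining (non-numeric) columns.
    for col in df.columns.difference(numeric.columns):
        counts = df[col].value_counts().head(top_k)
        profile["categorical"][col] = {str(k): int(v) for k, v in counts.items()}

    # Column pairs whose absolute Pearson correlation exceeds the cutoff.
    corr = numeric.corr()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= corr_threshold:
                profile["strong_correlations"].append(
                    {"columns": [a, b], "r": round(float(r), 3)}
                )
    return profile


df = pd.DataFrame({
    "age": [2, 30, 45, 60],
    "income": [0, 30000, 45000, 60000],
    "degree": ["PhD", "BSc", "BSc", "MSc"],
})
print(json.dumps(extract_profile(df), indent=2))
```

The resulting JSON string is the kind of compact summary that could be placed in a prompt in place of raw rows.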
Challenges we ran into
The biggest hurdle was API Scalability. Sending millions of rows to an LLM is slow and prohibitively expensive. We solved this by developing a Macro-Statistical Profiling method: instead of raw rows, we send a mathematical summary that represents the data's behavior.
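The size argument can be illustrated with a small experiment. The sketch below compares the character count of a synthetic dataset serialized row by row with the character count of a per-column summary. Character count is only a crude proxy for tokens, the data is synthetic, and the helper names are hypothetical, so this shows the shape of the saving rather than reproducing the project's measurements.

```python
import json
import statistics


def summarize(rows):
    """Collapse numeric records into per-column summary statistics."""
    profile = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        profile[col] = {
            "mean": statistics.fmean(values),
            "stdev": statistics.pstdev(values),
            "min": min(values),
            "max": max(values),
        }
    return profile


def payload_reduction(rows):
    """Fraction of characters saved by sending the summary instead of rows."""
    raw_chars = len(json.dumps(rows))
    summary_chars = len(json.dumps(summarize(rows)))
    return 1 - summary_chars / raw_chars


# Synthetic dataset: 10,000 records with two numeric columns.
rows = [{"age": i % 90, "income": 1000 * (i % 50)} for i in range(10_000)]
print(f"{payload_reduction(rows):.2%} fewer characters")
```

Because the summary size depends on the number of columns rather than the number of rows, the saving grows as the dataset gets longer.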
Another challenge was JSON Consistency. LLMs sometimes return malformed or inconsistently structured output. We addressed this by using Gemini's native JSON schema mode together with a "Heuristic Priming" strategy in our prompts so that responses parse reliably.
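Even with schema-constrained output, it can be worth validating a response before trusting it. The sketch below is a defensive parser using only the standard library. It does not call the Gemini API, and the finding fields (`column`, `issue`, `severity`) are a hypothetical schema chosen for illustration.

```python
import json

# Hypothetical schema for one finding; the project's real schema is not shown.
REQUIRED_FIELDS = {"column": str, "issue": str, "severity": str}
ALLOWED_SEVERITIES = {"low", "medium", "high"}
FENCE = "`" * 3  # Markdown code-fence marker that some models wrap JSON in


def strip_fence(text):
    """Remove a surrounding Markdown fence and optional language tag."""
    text = text.strip()
    fenced = text.startswith(FENCE) and text.endswith(FENCE)
    if fenced and len(text) >= 2 * len(FENCE):
        inner = text[len(FENCE):-len(FENCE)]
        first_line, _, rest = inner.partition("\n")
        tag = first_line.strip()
        if tag == "" or tag.isalpha():  # e.g. a "json" language tag
            inner = rest
        return inner.strip()
    return text


def parse_findings(raw_text):
    """Parse and validate a model response; raise ValueError if malformed."""
    try:
        data = json.loads(strip_fence(raw_text))
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of findings")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"finding {i} is not an object")
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(item.get(field), expected):
                raise ValueError(f"finding {i}: bad or missing '{field}'")
        if item["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"finding {i}: unknown severity")
    return data


reply = '[{"column": "age", "issue": "PhD holder aged 2", "severity": "high"}]'
print(parse_findings(reply)[0]["issue"])
```

A caller could catch `ValueError` and retry the request, which keeps malformed responses out of the audit log.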
Accomplishments that we're proud of
- 99% Efficiency Gain: We successfully audited datasets with thousands of records using a single API call, reducing token costs by over 99% compared to traditional row-by-row LLM scanning.
- The Convergence Matrix: We are proud of our "Ensemble Intelligence" approach, which categorizes anomalies into a matrix, allowing users to see when an error is a "Math outlier" versus a "Logical failure."
- Universal Applicability: Our tool is completely domain-agnostic. It works as effectively on a Hospital database as it does on a Sneaker marketplace.
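The idea behind the Convergence Matrix can be pictured as a 2x2 cross-tabulation of the two layers' verdicts. The sketch below assumes each audited item (a row or a column) carries one boolean flag from the statistical layer and one from the semantic layer. The function name and the "clean"/"flagged" labels are illustrative, and the project's actual matrix and plotting code may differ.

```python
import pandas as pd


def convergence_matrix(stat_flags, semantic_flags):
    """Count how often the two layers agree or disagree on each item."""
    if len(stat_flags) != len(semantic_flags):
        raise ValueError("both layers must score the same items")
    labels = ["clean", "flagged"]
    stat = pd.Series(
        [labels[1] if f else labels[0] for f in stat_flags], name="statistical"
    )
    sem = pd.Series(
        [labels[1] if f else labels[0] for f in semantic_flags], name="semantic"
    )
    table = pd.crosstab(stat, sem)
    # Keep all four cells even when a combination never occurs.
    return table.reindex(index=labels, columns=labels, fill_value=0)


stat_flags = [True, True, False, False, False]
semantic_flags = [True, False, True, False, False]
print(convergence_matrix(stat_flags, semantic_flags))
```

In this layout, the "flagged by statistics only" cell corresponds to a math outlier, the "flagged by semantics only" cell to a logical failure, and the cell where both flag an item marks the strongest signal. A table like this could be passed to a Seaborn heatmap for display.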
What we learned
We learned that Generative AI is far more powerful as a Reasoner than as a Brute-Force Scanner. When given high-level statistical context (the "DNA"), Gemini can deduce complex relationships that would take a human auditor hours to find. We also deepened our understanding of unsupervised machine learning and how it can act as a "first responder" for AI-led audits.
What's next for HYBRID-AI-POWERED-SEMANTIC-ANOMALY-DETECTOR
- Auto-Remediation: Developing an AI-driven "Auto-Fix" module that not only finds anomalies but generates the Python code to clean and repair the dataset automatically.
- Real-Time Streaming: Integrating with Kafka or Spark to audit live data streams for financial fraud or industrial sensor failures.
- Multi-Model Forensic Audit: Implementing an ensemble of Gemini 2.5 Pro and Flash to perform deeper "forensic" dives into critical data columns.
Built With
- google-gemini-api
- json
- kaggle
- matplotlib
- numpy
- pandas
- python
- scikit-learn
- seaborn