Deepfake-detection

🧠 Building deepfake-agentic-ai: An Engineering Journey

A multi-agent deepfake detection pipeline built by a 2nd year CS student over 4 months, across midnight Docker sessions, a confusion matrix of zeros, and every layer of the modern ML backend stack.

💡 The Spark: Why I Built This

It started during state election season. I was scrolling through social media and noticed it: fabricated faces, synthetic voices, diffusion-generated images passed off as real people saying real things. Misinformation wasn't abstract anymore. It had a face. A very convincing, AI-generated face.

What bothered me more than the fakes themselves was the state of the tools that existed to fight them. Every system I found gave you a binary answer, real or fake. No reasoning. No signal breakdown. No explanation of why. Just a label, handed down like a verdict from a black box.

In a world where deepfakes are being weaponised to manipulate elections and deceive voters, "we think it's fake" is not good enough. A system that can be wrong, silently and confidently wrong, is dangerous.

So I decided to build something different.

🏗️ What I Built

deepfake-agentic-ai is a multi-agent deepfake verification pipeline: not a classifier, but an auditing system.

- 3 independent FastAPI services communicating over HTTP
- Every module produces a score AND a reliability rating
- A weighted aggregator combines all signals
- A decider routes the final verdict with explainability

Pipeline Flow

Upload → API → ML service → agents → aggregator → decider → verdict stored in the database

Three Verdict States

| Verdict | Condition | Meaning |
| --- | --- | --- |
| FAKE | score ≥ 0.7 | High confidence: deepfake detected |
| REAL | score ≤ 0.3 | High confidence: authentic |
| FLAG_FOR_REVIEW | 0.3 < score < 0.7 | Uncertain: escalated |

The middle zone isn't a failure. It's honesty. The system knows when it doesn't know.
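The aggregation and routing steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's code: the `Signal` dataclass stands in for the shared Pydantic model, the names `aggregate` and `decide` are hypothetical, and the normalized reliability-weighted mean (sum of weight × reliability × score, divided by the sum of weight × reliability) is an assumed formula. Only the 0.7 and 0.3 thresholds come from the verdict table; the example signal values are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Signal:
    """One module's output (hypothetical shape, not the project's Pydantic model)."""
    name: str
    score: float          # 0.0 = authentic, 1.0 = fake
    reliability: float    # 0.0 to 1.0
    weight: float = 1.0   # module base weight (assumed default)


def aggregate(signals: Iterable[Signal]) -> float:
    """Reliability-weighted mean of the scores (assumed formula)."""
    signals = list(signals)
    total = sum(s.weight * s.reliability for s in signals)
    if total == 0:
        return 0.5  # no usable evidence: stay neutral (assumption)
    return sum(s.weight * s.reliability * s.score for s in signals) / total


def decide(score: float, fake_at: float = 0.7, real_at: float = 0.3) -> str:
    """Route a combined score into one of the three verdict states."""
    if score >= fake_at:
        return "FAKE"
    if score <= real_at:
        return "REAL"
    return "FLAG_FOR_REVIEW"


# Illustrative values only.
signals = [
    Signal("ml", score=0.9, reliability=0.9),
    Signal("source", score=0.8, reliability=0.8),
    Signal("logs", score=0.5, reliability=0.1),
]
final = aggregate(signals)
print(round(final, 3), decide(final))
```

Because a low-reliability signal contributes little to both the numerator and the denominator, it can neither drag a confident verdict into the middle zone nor push an uncertain one out of it.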

⚙️ The Architecture

Signal Contract

Every module speaks the same language, enforced via a shared Pydantic model. This contract enforced hard separation of concerns from day one:

- 🚫 ML service never touches agent logic
- 🚫 Agent service never re-downloads files
- 🚫 API never performs analysis
- ✅ Everything has exactly one job

The Decision Math

The aggregator computes a reliability-weighted score across all signals from three per-signal quantities:

- w_i = module base weight
- r_i = reliability of signal i
- s_i = score from signal i

Path 3a: Middle Zone Uniform Boost (~45–55% score range)

Path 3b: 70/30 Conflict Detection. When signals are split 70/30, only the conflicting modules receive targeted weight adjustment, so precision is applied where the disagreement actually lives.

The Three Agents

🔍 ML Service

- RetinaFace (RetinaNetMobileNetV1) for face detection
- prithivMLmods/deepfake-detector-model-v1 (SiglipForImageClassification), 94.4% benchmark accuracy
- Score clamped to [0.0, 1.0]
- Reliability scales with face coverage ratio
- No faces detected → score=0.5, reliability=0.1 (a neutral signal, not a failure)

🛡️ Source Verifier

- Pure Python forensics, zero ML dependency
- Checks: metadata stripping, file size anomalies, extension mismatches, hash presence
- Reliability fixed at 0.8 (deterministic logic = no model uncertainty)

🤖 Log Analyser

- Filtered structured logs → Gemma (gemma-3-12b-it) via SambaNova API
- Returns: anomaly score + natural language explanation
- Hardened with:
  - SHA256-keyed in-memory cache
  - Two-attempt retry with temperature fallback (0.1 → 0.0)
  - Regex JSON extraction (strips markdown fences)
  - Rule-based fallback when LLM is unavailable

The LLM explains the verdict. It does not set it.

🔥 The Hardest Parts

Midnight Docker and Version Hell

The ML service dependency stack is a maze. There were nights I was rebuilding images at midnight, changing one line in requirements.txt, watching a 10-minute build fail, changing it back, and wondering if I'd made any progress at all.

Making Services Talk

Getting startup order right across three services, with healthchecks, cold-start race conditions, and depends_on chains, was the kind of problem with no dramatic solution. It's just careful, tedious configuration that either works or silently breaks at 2am.
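The same "wait until the dependency is actually ready" idea that Compose healthchecks express can also be written in Python, for example in an evaluation or smoke-test script. The sketch below is an assumption-laden illustration rather than the project's setup: `wait_until_healthy` and `http_probe` are hypothetical names, and the retry counts and timeouts are arbitrary.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 30,
                       delay_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeout, or HTTP error while the service starts.
            pass
        if attempt < attempts:
            sleep(delay_s)
    return False


def http_probe(url: str, timeout_s: float = 2.0) -> Callable[[], bool]:
    """Build a probe that reports True when the URL answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    return probe
```

In a script this might be called as `wait_until_healthy(http_probe("http://ml-service:8000/health"))`, where the host, port, and path are placeholders. Passing the probe and the sleep function as arguments keeps the retry loop testable without a running container.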

The Confusion Matrix of Zeros

The most humbling moment of the project. I ran evaluate.py on a balanced dataset (140k Real and Fake Faces from Kaggle), seeded into PostgreSQL. The confusion matrix came back all zeros.

Not wrong predictions. No predictions. Everything was landing in the middle zone (0.3–0.7) and being flagged for review. The thresholds were too conservative for the actual score distribution the model was producing.

Fix: temporarily widen the verdict zones (FAKE ≥ 0.55, REAL ≤ 0.45), observe where the scores cluster, then tune the thresholds against measured F1.

ML systems don't care about your architectural elegance. They care about whether the numbers line up.
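The failure mode is easy to reproduce on toy data. The sketch below is not the project's evaluate.py: the function names are hypothetical and the six scores are made up to mimic a model whose outputs cluster in the middle zone. It counts confusion-matrix cells plus a separate review bucket, so the effect of moving the thresholds is visible directly.

```python
from collections import Counter
from typing import Sequence


def verdict(score: float, fake_at: float, real_at: float) -> str:
    if score >= fake_at:
        return "FAKE"
    if score <= real_at:
        return "REAL"
    return "FLAG_FOR_REVIEW"


def evaluate(scores: Sequence[float], labels: Sequence[str],
             fake_at: float, real_at: float) -> Counter:
    """Count tp/fp/tn/fn (FAKE is the positive class) and flagged samples."""
    counts = Counter()
    for score, label in zip(scores, labels):
        v = verdict(score, fake_at, real_at)
        if v == "FLAG_FOR_REVIEW":
            counts["review"] += 1
        elif v == "FAKE":
            counts["tp" if label == "FAKE" else "fp"] += 1
        else:
            counts["tn" if label == "REAL" else "fn"] += 1
    return counts


def f1(counts: Counter) -> float:
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else 0.0


# Made-up scores that all fall between 0.3 and 0.7.
scores = [0.62, 0.58, 0.66, 0.41, 0.38, 0.44]
labels = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]

strict = evaluate(scores, labels, fake_at=0.7, real_at=0.3)
relaxed = evaluate(scores, labels, fake_at=0.55, real_at=0.45)
print(dict(strict), dict(relaxed), f1(relaxed))
```

With the strict thresholds every sample lands in the review bucket and the four confusion cells stay at zero; with the relaxed ones the same scores produce actual predictions that can be scored.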

CI Setup & GitHub Actions

Five CI workflows now run on every push:

| Workflow | What it checks |
| --- | --- |
| pytest (API) | Smoke tests, import checks |
| pytest (ML) | 7 unit tests, mocked model, CPU torch |
| flake8 (Agents) | Lint, max-line-length=100 |
| Network Audit | Health, upload, MinIO reachability |
| Schema Validation | Response field shapes via Pydantic |

Getting all of these green, across three services with different dependency stacks, was its own project.

📚 What I Learned

Four months. Breaks in between. But every layer of the modern backend stack got touched.

Technical Stack Covered

- ML / CV: RetinaFace, SiglipForImageClassification, HuggingFace Transformers, PyTorch CPU builds, Docker volume model caching
- System Design: Service contracts, stateless aggregation, hard-blocked reanalysis loops, separation of concerns
- DevOps: GitHub Actions CI, GHCR image publishing, Docker Compose orchestration, Dozzle log monitoring
- Databases: PostgreSQL + SQLAlchemy, MinIO object storage, schema migrations
- Agents & LLMs: Retry logic, SHA256 caching, rule-based fallbacks, SambaNova / Gemma integration

I also completed 5 Kaggle certifications and courses across multiple platforms, not in advance, but as I needed them. Reactive learning. It sticks better.

The Biggest Mindset Shift

Multi-service systems are not harder than single scripts. They're more honest. A single script hides complexity inside itself. A service boundary forces you to name your assumptions, define your interfaces, and own your failures explicitly.

✨ The Moment It Clicked

There's a specific feeling when a multi-service pipeline runs end-to-end for the first time. You upload a file to the API. You watch the logs scroll in Dozzle. You see the ML service respond. The agents pick it up. The aggregator computes. The decider routes. A verdict appears in the database.

It's not dramatic. It's a few lines of JSON. But after months of each service working in isolation, seeing the whole thing move together felt like watching a machine wake up.
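One of the agents in that end-to-end run is the log analyser, and its hardening pattern (SHA256-keyed cache, two-attempt retry with a temperature fallback, regex JSON extraction, rule-based fallback) can be sketched as follows. This is an illustration under assumptions, not the project's code: `call_llm` is a hypothetical callable standing in for the SambaNova client, the result fields `score` and `explanation` are assumed names, and the ERROR-line heuristic in `rule_based` is invented for the example.

```python
import hashlib
import json
import re
from typing import Callable, Dict

_cache: Dict[str, dict] = {}
_JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse the {...} span of a reply, ignoring any markdown fences around it."""
    match = _JSON_OBJECT.search(text)
    if match is None:
        raise ValueError("no JSON object in reply")
    return json.loads(match.group(0))


def rule_based(logs: str) -> dict:
    """Fallback heuristic (made up for illustration): share of ERROR lines."""
    lines = logs.splitlines() or [""]
    errors = sum("ERROR" in line for line in lines)
    return {"score": min(1.0, errors / len(lines)),
            "explanation": "rule-based fallback"}


def analyse(logs: str, call_llm: Callable[[str, float], str]) -> dict:
    """Return an anomaly result, preferring cached or LLM output over rules."""
    key = hashlib.sha256(logs.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]
    for temperature in (0.1, 0.0):  # two attempts, the second fully greedy
        try:
            result = extract_json(call_llm(logs, temperature))
        except (ValueError, OSError):
            continue  # unparseable reply or network failure: try again
        _cache[key] = result
        return result
    return rule_based(logs)  # not cached, so the LLM is retried next time
```

Only successful LLM answers are cached here, which is a design choice of this sketch: a fallback result produced during an outage is not reused once the model is reachable again.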

🆚 What Makes This Different

Most deepfake detectors are binary classifiers. They output a label. No reasoning, no confidence stratification, no handling of ambiguity.

This system is built on a different premise: uncertainty is information. A score of 0.85 from three independent high-reliability signals is very different from a score of 0.85 from one signal with low face coverage and stripped metadata. The aggregation captures that difference.

| What | Role |
| --- | --- |
| ML Detection | Scores the content |
| Source Verifier | Audits the origin |
| Log Analyser (LLM) | Explains, never decides |
| Aggregator | Weights by reliability |
| Decider | Routes with traceable logic |

Every verdict can be traced back through its contributing signals, weights, and reliability scores. That's the explainability existing systems don't offer.

🛣️ What's Left

- Phase 3b: Path 3a end-to-end wiring + conflict detection (Path 3b)
- Phase 3c: Audio deepfake detection (RawNet2 / wav2vec)
- Phase 4: Streamlit UI → MLflow tracking → Oracle Cloud deployment → public demo

The hard parts are done. The architecture held. The pipeline runs. The rest is finishing what I started.

Built by Santhosh, 2nd year CS Engineering student, building production-grade systems before most people have decided what to build.

Built With

architecture
deepfake
docker
fastapi
for
github-actions
hugging-face-transformers
minio
ml
multi-signal
opencv
postgresql
production-grade
python
pytorch
sambanova-llm-api-?-with-a-multi-agent

Created by

I designed and implemented the end-to-end system architecture for a production-grade multi-service deepfake detection platform, focusing on moving beyond single-model classification to a reliability-driven multi-signal decision system. I built the FastAPI-based API layer for secure media ingestion, validation, and orchestration, and structured the backend into modular microservices covering API, ML inference, and agent-based reasoning using Docker Compose for reproducibility. On the ML side, I integrated RetinaFace for face detection and a pretrained SigLIP-based deepfake classifier from Hugging Face, combining frame-level preprocessing with OpenCV and designing a reliability scoring mechanism based on face coverage and signal confidence. I also developed the core multi-signal fusion system by defining a standardized signal contract and implementing a weighted aggregator and threshold-based decider that routes outputs into real, fake, or review states depending on uncertainty. In addition, I built an agent layer that includes deterministic source verification logic and an LLM-based log analysis system using Gemma via SambaNova for anomaly detection and system introspection, with fallback reasoning for robustness. From an MLOps perspective, I integrated PostgreSQL for structured metadata storage, MinIO for object storage, and implemented structured logging across all services, along with CI pipelines for ML testing, API schema validation, and system health checks. I also developed an evaluation framework for end-to-end testing, confusion matrix generation, and threshold calibration, while resolving production issues such as Docker race conditions, model loading stability, and dependency version conflicts. Overall, my contribution focused on system architecture, multi-signal decision design, ML integration, and ensuring the system behaves as a reliable, production-ready AI service rather than a standalone model.

Santhosh P
Shrinithi Krishnamoorthy
TILAK D
NIVETHA A
ARUNKUMAR M
