🧠 Building deepfake-agentic-ai: An Engineering Journey

A multi-agent deepfake detection pipeline built by a 2nd year CS student over 4 months, across midnight Docker sessions, a confusion matrix of zeros, and every layer of the modern ML backend stack.

💡 The Spark: Why I Built This

It started during state election season. I was scrolling through social media and noticed it: fabricated faces, synthetic voices, diffusion-generated images passed off as real people saying real things. Misinformation wasn't abstract anymore. It had a face. A very convincing, AI-generated face.

What bothered me more than the fakes themselves was the state of the tools that existed to fight them. Every system I found gave you a binary answer. No reasoning. No signal breakdown. No explanation of why. Just a label, handed down like a verdict from a black box.

In a world where deepfakes are being weaponised to manipulate elections and deceive voters, "we think it's fake" is not good enough. A system that can be wrong (silently, confidently wrong) is dangerous.

So I decided to build something different.

🏗️ What I Built

deepfake-agentic-ai is a multi-agent deepfake verification pipeline: not a classifier, but an auditing system.

  • 3 independent FastAPI services communicating over HTTP
  • Every module produces a score AND a reliability rating
  • A weighted aggregator combines all signals
  • A decider routes the final verdict with explainability

Pipeline Flow

File upload → API → ML service → agents → aggregator → decider → verdict in the database

Three Verdict States

| Verdict | Condition | Meaning |
| --- | --- | --- |
| FAKE | score ≥ 0.7 | High confidence: deepfake detected |
| REAL | score ≤ 0.3 | High confidence: authentic |
| FLAG_FOR_REVIEW | 0.3 < score < 0.7 | Uncertain: escalated |

The middle zone isn't a failure; it's honesty. The system knows when it doesn't know.
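The scoring and routing described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Signal` field names, the reliability-weighted mean, and the neutral 0.5 fallback are assumptions made for the example. Only the 0.7 and 0.3 thresholds come from the verdict table.

```python
from dataclasses import dataclass
from typing import Iterable

# Thresholds taken from the verdict table above.
FAKE_THRESHOLD = 0.7
REAL_THRESHOLD = 0.3


@dataclass(frozen=True)
class Signal:
    """Hypothetical shape of one module's output (field names are assumed)."""
    module: str
    score: float          # 0.0 = looks authentic, 1.0 = looks fake
    reliability: float    # how much the module trusts its own score
    weight: float = 1.0   # base weight of the module in the aggregate


def aggregate(signals: Iterable[Signal]) -> float:
    """Reliability-weighted mean of the scores (an assumed formula)."""
    numerator = 0.0
    denominator = 0.0
    for signal in signals:
        effective_weight = signal.weight * signal.reliability
        numerator += effective_weight * signal.score
        denominator += effective_weight
    if denominator == 0.0:
        # Assumption: with no usable signal, stay neutral instead of guessing.
        return 0.5
    return numerator / denominator


def decide(score: float) -> str:
    """Route an aggregated score to one of the three verdict states."""
    if score >= FAKE_THRESHOLD:
        return "FAKE"
    if score <= REAL_THRESHOLD:
        return "REAL"
    return "FLAG_FOR_REVIEW"


# Illustrative values only, not measurements from the project.
example_signals = [
    Signal("ml", score=0.9, reliability=0.9),
    Signal("source", score=0.8, reliability=0.8),
    Signal("logs", score=0.7, reliability=0.5),
]
example_verdict = decide(aggregate(example_signals))
```

Because each score is scaled by its reliability before averaging, a confident module pulls the result further than an unsure one, and anything that stays between the two thresholds is escalated rather than forced into a label.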
⚙️ The Architecture

Signal Contract

Every module speaks the same language, enforced via a shared Pydantic model. This contract enforced hard separation of concerns from day one:

  • 🚫 ML service never touches agent logic
  • 🚫 Agent service never re-downloads files
  • 🚫 API never performs analysis
  • ✅ Everything has exactly one job

The Decision Math

The aggregator computes a reliability-weighted score across all signals:

$$S = \frac{\sum_i w_i \, r_i \, s_i}{\sum_i w_i \, r_i}$$

Where:

  • $w_i$ = module base weight
  • $r_i$ = reliability of signal $i$
  • $s_i$ = score from signal $i$

Path 3a: Middle Zone Uniform Boost (scores in roughly the 45–55% range)

Path 3b: 70/30 Conflict Detection

When signals are split 70/30, only the conflicting modules receive a targeted weight adjustment: precision where the disagreement actually lives.

The Three Agents

🔍 ML Service

  • RetinaFace (RetinaNetMobileNetV1) for face detection
  • prithivMLmods/deepfake-detector-model-v1 (SiglipForImageClassification), 94.4% benchmark accuracy
  • Score clamped to [0.0, 1.0]
  • Reliability scales with face coverage ratio
  • No faces detected → score=0.5, reliability=0.1: a neutral signal, not a failure

🛡️ Source Verifier

  • Pure Python forensics, zero ML dependency
  • Checks: metadata stripping, file size anomalies, extension mismatches, hash presence
  • Reliability fixed at 0.8 (deterministic logic = no model uncertainty)

🤖 Log Analyser

  • Filtered structured logs → Gemma (gemma-3-12b-it) via SambaNova API
  • Returns: anomaly score + natural language explanation
  • Hardened with:
      - SHA256-keyed in-memory cache
      - Two-attempt retry with temperature fallback (0.1 → 0.0)
      - Regex JSON extraction (strips markdown fences)
      - Rule-based fallback when the LLM is unavailable

The LLM explains the verdict; it does not set it.

🔥 The Hardest Parts

  1. Midnight Docker at Version Hell

The ML service dependency stack is a maze. There were nights I was rebuilding images at midnight, changing one line in requirements.txt, watching a 10-minute build fail, changing it back, and wondering if I'd made any progress at all.
  2. Making Services Talk

Getting startup order right across three services, with healthchecks, cold-start race conditions, and depends_on chains, was the kind of problem with no dramatic solution. It's just careful, tedious configuration that either works or silently breaks at 2am.
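The cold-start race can also be guarded on the application side, by polling a dependency's health endpoint before sending it real work. The sketch below is one hedged way to do that in Python with only the standard library. It is not the project's code, and the function names, the retry budget, and the URL in the usage comment are all hypothetical.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections, and timeouts are all OSError subclasses.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or the attempt budget runs out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next try
    return False


# Usage (hypothetical host, port, and path):
# ready = wait_until_healthy(lambda: http_ok("http://ml-service:8000/health"))
```

Passing the probe and the sleep function in as arguments keeps the retry loop testable without a network or real delays.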
  3. The Confusion Matrix of Zeros

The most humbling moment of the project. I ran evaluate.py on a balanced dataset: 140k Real and Fake Faces from Kaggle, seeded into PostgreSQL. The confusion matrix came back all zeros.

Not wrong predictions. No predictions. Everything was landing in the middle zone (0.3–0.7) and being flagged for review. The thresholds were too conservative for the actual score distribution the model was producing.

Fix: temporarily widen the decision bands (FAKE at score ≥ 0.55, REAL at score ≤ 0.45), observe where the scores cluster, then tune the thresholds against real F1 data.

ML systems don't care about your architectural elegance. They care about whether the numbers line up.
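A small Python sketch shows how an all-zero confusion matrix can arise. The scores below are made up for illustration and are not results from the project, and `verdict` and `tally` are hypothetical helpers rather than functions from evaluate.py. Only the two threshold pairs (0.7/0.3 and 0.55/0.45) come from the text above.

```python
from collections import Counter
from typing import Iterable, Tuple


def verdict(score: float, fake_thr: float, real_thr: float) -> str:
    """Three-way routing: FAKE, REAL, or escalate to review."""
    if score >= fake_thr:
        return "FAKE"
    if score <= real_thr:
        return "REAL"
    return "FLAG_FOR_REVIEW"


def tally(samples: Iterable[Tuple[float, str]], fake_thr: float, real_thr: float) -> Counter:
    """Count confusion-matrix cells, keeping flagged samples out of the matrix."""
    counts: Counter = Counter()
    for score, label in samples:
        predicted = verdict(score, fake_thr, real_thr)
        if predicted == "FLAG_FOR_REVIEW":
            counts["flagged"] += 1
        elif predicted == "FAKE":
            counts["TP" if label == "FAKE" else "FP"] += 1
        else:
            counts["TN" if label == "REAL" else "FN"] += 1
    return counts


# Made-up scores that cluster in the middle zone, paired with true labels.
samples = [
    (0.62, "FAKE"), (0.58, "FAKE"), (0.51, "FAKE"),
    (0.41, "REAL"), (0.38, "REAL"), (0.56, "REAL"),
]

strict = tally(samples, fake_thr=0.7, real_thr=0.3)    # original thresholds
wide = tally(samples, fake_thr=0.55, real_thr=0.45)    # temporary diagnostic thresholds
```

With the strict pair, every sample is flagged and all four matrix cells stay at zero. With the wider bands, predictions start to appear, which is what makes the score clusters visible enough to tune against.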
  4. CI Setup & GitHub Actions

Five CI workflows now run on every push:

| Workflow | What it checks |
| --- | --- |
| pytest - API | Smoke tests, import checks |
| pytest - ML | 7 unit tests, mocked model, CPU torch |
| flake8 - Agents | Lint, max-line-length=100 |
| Network Audit | Health, upload, MinIO reachability |
| Schema Validation | Response field shapes via Pydantic |

Getting all of these green, across three services with different dependency stacks, was its own project.

📚 What I Learned

Four months. Breaks in between. But every layer of the modern backend stack got touched.

Technical Stack Covered

  • ML / CV: RetinaFace, SiglipForImageClassification, HuggingFace Transformers, PyTorch CPU builds, Docker volume model caching
  • System Design: service contracts, stateless aggregation, hard-blocked reanalysis loops, separation of concerns
  • DevOps: GitHub Actions CI, GHCR image publishing, Docker Compose orchestration, Dozzle log monitoring
  • Databases: PostgreSQL + SQLAlchemy, MinIO object storage, schema migrations
  • Agents & LLMs: retry logic, SHA256 caching, rule-based fallbacks, SambaNova / Gemma integration

I also completed 5 Kaggle certifications and courses across multiple platforms, not in advance, but as I needed them. Reactive learning. It sticks better.

The Biggest Mindset Shift

Multi-service systems are not harder than single scripts. They're more honest. A single script hides complexity inside itself. A service boundary forces you to name your assumptions, define your interfaces, and own your failures explicitly.

✨ The Moment It Clicked

There's a specific feeling when a multi-service pipeline runs end-to-end for the first time. You upload a file to the API. You watch the logs scroll in Dozzle. You see the ML service respond. The agents pick it up. The aggregator computes. The decider routes. A verdict appears in the database.

It's not dramatic. It's a few lines of JSON.
But after months of each service working in isolation, seeing the whole thing move together felt like watching a machine wake up.

🆚 What Makes This Different

Most deepfake detectors are binary classifiers. They output a label. No reasoning, no confidence stratification, no handling of ambiguity.

This system is built on a different premise: uncertainty is information.

A score of 0.85 from three independent high-reliability signals is very different from a score of 0.85 from one signal with low face coverage and stripped metadata. The aggregation captures that difference.

| What | Role |
| --- | --- |
| ML Detection | Scores the content |
| Source Verifier | Audits the origin |
| Log Analyser (LLM) | Explains, never decides |
| Aggregator | Weights by reliability |
| Decider | Routes with traceable logic |

Every verdict can be traced back through its contributing signals, weights, and reliability scores. That's the explainability existing systems don't offer.

🛣️ What's Left

  • Phase 3b: Path 3a end-to-end wiring + conflict detection (Path 3b)
  • Phase 3c: Audio deepfake detection (RawNet2 / wav2vec)
  • Phase 4: Streamlit UI → MLflow tracking → Oracle Cloud deployment → public demo

The hard parts are done. The architecture held. The pipeline runs. The rest is finishing what I started.

Built by Santhosh, 2nd year CS Engineering student, building production-grade systems before most people have decided what to build.

Built With

  • architecture
  • deepfake
  • docker
  • fastapi
  • for
  • github-actions
  • hugging-face-transformers
  • minio
  • ml
  • multi-signal
  • opencv
  • postgresql
  • production-grade
  • python
  • pytorch
  • sambanova-llm-api-?-with-a-multi-agent