What ArgusAI is
ArgusAI is a forensic media investigation platform for images, video, and audio. It helps users determine whether content is authentic by producing an evidence trail they can inspect, question, and audit, rather than a single score. It is also designed to become more reliable over time by monitoring its own detectors and adjusting how much it trusts each one.
The design in one line:
The forensic pipeline investigates the media. A second agent investigates the forensic pipeline.
Who it is for
ArgusAI is built for the people who have to decide whether a piece of media can be trusted before they act on it: journalists and fact-checkers verifying a viral clip before publishing, trust-and-safety and content-moderation teams reviewing reported uploads, and anyone assessing media that might be used as evidence. For them a single confidence score is not enough. They need to see what the evidence is, question it, and escalate uncertain cases to a human. ArgusAI is designed around that workflow: upload media, read the evidence trail, ask follow-up questions, and send low-confidence or flagged cases to a human-review queue.
Architecture at a glance
    Upload (image / video / audio)
                |
                v
    Forensic Pipeline ---- Gemini (semantic, OSINT, explanations)
    (only the signals that fit the media type)
                |
                v
    Evidence Trail ----> user reads, questions, gives feedback
                |
+---------------+-------------------+
|                                   |
v                                   v
Arize Phoenix                       Firestore
(traces: how it                     (confirmed outcomes:
behaved - latency,                  was the verdict
tokens, errors)                     actually right)
|                                   |
+----------------+------------------+
                 v
    Reliability Agent (Google Agent Builder + Gemini)
    reads Phoenix via Arize Phoenix MCP, fuses with Firestore,
    ranks detectors by value-for-cost, detects drift
                 |
                 v
    Weight overrides written to Firestore
                 |
                 v
    Verdict engine reads them on every FUTURE analysis
    (the loop: observability changes behavior)
The key idea: observability is not passive. The agent uses Phoenix telemetry to make decisions that change future verdicts.
How the detection works
ArgusAI runs a panel of independent forensic signals, using only the ones that make sense for the media type.
- Images: a fine-tuned spectral model, metadata and provenance checks, sensor-noise analysis, lighting and physical-consistency reasoning through Gemini, error-level analysis, and a live OSINT research agent that searches the web for prior documentation or debunking.
- Video: the relevant image signals plus temporal-coherence and temporal-noise checks, and analysis of the embedded audio track.
- Audio: a voice-authenticity model, acoustic micro-signature analysis, semantic listening through Gemini, and OSINT.
Each signal is shown on its own card with what it checked, what it found, how much it influenced the verdict, and what else could explain the result. We do not rely on surface appearance; instead we target the deeper artifacts that come from how generators build content. No single signal is treated as decisive. If the strong signals disagree, the system returns inconclusive instead of guessing.
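The "inconclusive instead of guessing" rule can be illustrated with a minimal sketch. This is not ArgusAI's actual verdict engine: the `Signal` class, the 0-to-1 score scale, and every threshold below are assumptions chosen only to show the idea of weighted signals with a disagreement check.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    score: float   # assumed scale: 0.0 = looks authentic, 1.0 = looks synthetic
    weight: float  # assumed non-negative influence on the verdict

# Hypothetical thresholds, chosen for illustration only.
STRONG_WEIGHT = 0.25  # signals at or above this weight count as "strong"
HIGH = 0.7            # score at or above this leans synthetic
LOW = 0.3             # score at or below this leans authentic

def verdict(signals):
    """Combine weighted signals, refusing to guess when strong ones disagree."""
    active = [s for s in signals if s.weight > 0]
    if not active:
        return "inconclusive"
    strong = [s for s in active if s.weight >= STRONG_WEIGHT]
    says_synthetic = any(s.score >= HIGH for s in strong)
    says_authentic = any(s.score <= LOW for s in strong)
    if says_synthetic and says_authentic:
        return "inconclusive"  # strong signals point in opposite directions
    total = sum(s.weight for s in active)
    combined = sum(s.score * s.weight for s in active) / total
    if combined >= HIGH:
        return "likely synthetic"
    if combined <= LOW:
        return "likely authentic"
    return "inconclusive"
```

In this sketch a weak dissenting signal only shifts the weighted average, while two strong signals on opposite sides short-circuit to inconclusive before any averaging happens.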
How Arize Phoenix and Agent Builder work here
Every analysis is recorded as an Arize Phoenix trace using OpenInference instrumentation. Detector runs are tagged as tool spans, the analysis itself is a chain span, and every Gemini call is an LLM span with token counts, latency, and error status. When a person marks a verdict correct or incorrect, that judgment is posted back to Phoenix as a real span annotation and stored as a confirmed outcome in Firestore.
That gives the system two separate sources of truth: Phoenix tells us how it behaved, Firestore tells us whether it was right. Together they let the system learn not just which detectors are accurate, but which detectors are actually worth trusting.
On top of that we built a reliability agent on Google Agent Builder, powered by Gemini. When it runs, it:
- Reads per-detector telemetry (runs, errors, latency, token cost) from Phoenix through the Arize Phoenix MCP.
- Fuses that with confirmed accuracy from Firestore into a value-for-cost ranking of every detector.
- Checks for drift, meaning a detector whose recent confirmed accuracy has dropped below its historical baseline.
- Acts. It writes a weight override to Firestore, which the verdict engine reads on every future analysis, so the detector's evidence genuinely counts for less going forward. A detector that is both unreliable and expensive can be taken out of the verdict entirely until a human reactivates it.
That last decision depends on latency and cost data that lives only in Arize Phoenix, so it is a call the accuracy-only side of the system could not make on its own. Phoenix here is not a log store. It is the evidence the agent acts on, and that action changes what the system trusts next time.
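The agent's decision steps can be sketched roughly as follows. Everything here is an assumption for illustration: the `DetectorStats` fields, the value-for-cost formula, the drift margin, and the bench threshold are hypothetical, and the real agent reads its inputs from Phoenix and Firestore rather than from in-memory objects.

```python
from dataclasses import dataclass

@dataclass
class DetectorStats:
    name: str
    runs: int                 # telemetry (Phoenix side)
    errors: int               # telemetry (Phoenix side)
    avg_latency_s: float      # telemetry (Phoenix side)
    avg_tokens: float         # telemetry (Phoenix side)
    lifetime_accuracy: float  # confirmed outcomes (Firestore side)
    recent_accuracy: float    # confirmed outcomes (Firestore side)

# Hypothetical tuning constants.
DRIFT_MARGIN = 0.10     # recent accuracy this far below baseline counts as drift
BENCH_THRESHOLD = 0.20  # value-for-cost below this benches a drifting detector

def value_for_cost(d):
    """Assumed formula: reliable accuracy divided by a latency and token cost."""
    error_rate = d.errors / d.runs if d.runs else 1.0
    cost = 1.0 + d.avg_latency_s + d.avg_tokens / 1000.0
    return d.recent_accuracy * (1.0 - error_rate) / cost

def has_drifted(d):
    return d.lifetime_accuracy - d.recent_accuracy > DRIFT_MARGIN

def propose_override(d):
    """Return None (no action), 0.0 (bench), or a weight multiplier below 1."""
    if not has_drifted(d):
        return None
    if value_for_cost(d) < BENCH_THRESHOLD:
        return 0.0  # unreliable and expensive: remove from the verdict
    return round(d.recent_accuracy / d.lifetime_accuracy, 2)

def rank(detectors):
    """Detector names ordered from best to worst value-for-cost."""
    return [d.name for d in sorted(detectors, key=value_for_cost, reverse=True)]
```

The bench branch is the one that needs latency and token data: with accuracy alone, the sketch could only down-weight a drifting detector, not decide that it is too costly to keep running.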
Design details that make it robust
- Two distinct weight mechanisms, on purpose. A slow passive loop reacts to a detector's lifetime confirmed accuracy. The agent reacts to recent drift plus Phoenix telemetry and writes an override that takes precedence. They work on different signals and timescales, so the agent is not a duplicate of the passive loop, and its action is provably its own.
- Drift is a real production concern, not a gimmick. A fixed detector does not change internally, but its real-world accuracy genuinely decays as new generators appear. Monitoring confirmed accuracy and down-weighting a degrading detector is a legitimate way to keep a detection system honest over time.
- Human oversight is built in. Every agent action is reversible. A benched detector can be reactivated by a person from the operator console at any time.
- Real Agent Builder and MCP usage. The agent is built on Google Agent Builder and reads Phoenix through the official @arizeai/phoenix-mcp server. We verified it calling Phoenix MCP tools against our own Phoenix instance, and its actions change future verdict behavior.
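The precedence between the two weight mechanisms, and the reversibility of an agent action, can be sketched as below. The function names, the in-memory `overrides` dict standing in for the Firestore collection, and the choice to treat an override as a multiplier on the base weight are all assumptions made for illustration.

```python
def passive_weight(base_weight, lifetime_accuracy):
    # Slow loop (assumed form): scale the base weight by lifetime confirmed accuracy.
    return base_weight * lifetime_accuracy

def effective_weight(name, base_weight, lifetime_accuracy, overrides):
    """An agent override, when present, takes precedence over the passive weight."""
    override = overrides.get(name)
    if override is not None:
        return base_weight * override  # 0.0 means the detector is benched
    return passive_weight(base_weight, lifetime_accuracy)

def reactivate(name, overrides):
    """Human action: drop the override so the passive weight applies again."""
    overrides.pop(name, None)
```

For example, with `overrides = {"spectral": 0.0}` the spectral detector contributes nothing to the verdict until `reactivate("spectral", overrides)` is called, after which its passive weight applies again.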
How we built it
FastAPI backend that detects media type from magic bytes and routes each type through the right pipeline. Gemini for semantic reasoning, OSINT research, and explanations. Arize Phoenix, self-hosted, for observability. A cross-store layer that fuses Phoenix REST telemetry with Firestore outcomes. A single React frontend, with a consumer side for the evidence trail and follow-up chat, and an operator console where the reliability agent runs, streams its steps, and visibly updates a detector's weight when it recalibrates.
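Detecting media type from magic bytes, rather than from the file extension, might look like the sketch below. The function name and the set of formats covered are assumptions; the byte signatures themselves are the standard ones for each format.

```python
def detect_media_type(header: bytes) -> str:
    """Classify a file as image, video, audio, or unknown from its first bytes."""
    if header.startswith(b"\xff\xd8\xff"):        # JPEG
        return "image"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):   # PNG
        return "image"
    if header[:6] in (b"GIF87a", b"GIF89a"):      # GIF
        return "image"
    if header[:4] == b"RIFF" and len(header) >= 12:
        form = header[8:12]                       # RIFF form type
        if form == b"WEBP":
            return "image"
        if form == b"WAVE":
            return "audio"
        if form == b"AVI ":
            return "video"
    if header[4:8] == b"ftyp":                    # ISO base media (MP4, MOV)
        # Simplification: audio-only files such as M4A share this signature.
        return "video"
    if header.startswith(b"\x1a\x45\xdf\xa3"):    # Matroska / WebM (EBML)
        return "video"
    if header.startswith(b"ID3"):                 # MP3 with an ID3v2 tag
        return "audio"
    if header.startswith(b"fLaC"):                # FLAC
        return "audio"
    if header.startswith(b"OggS"):                # Ogg (treated as audio here)
        return "audio"
    return "unknown"
```

A router would read the first dozen or so bytes of the upload, call this function, and reject or flag anything that comes back as unknown before any detector runs.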