Project Story

Inspiration

Every moderation tool on Reddit fires at one moment: submission. AutoModerator, Post Guidance, Crowd Control, and the Harassment Filter all inspect content before it goes live.

We kept coming back to one unsettling fact, straight from AutoModerator's own documentation: it "will not act on content already approved or removed by a moderator" and "cannot react to a user's edits." In other words, the instant a moderator approves a post, every tool on the platform stops watching it.

That's not a small gap. It's an attack surface.

The highest-leverage moment to inject a scam link is not at submission, when scrutiny is highest; it is after approval, once a post has climbed the feed and every eye is on it. A wholesome post gets approved, trends, and then the author quietly edits in a bit.ly scam link, an affiliate code, or an off-platform "DM me to buy." Mods discover it hours later, from user reports, after the damage is done.

The research backed up our hunch:

  • Cornell's CSCW 2025 study calls AI-driven content a "very disruptive" triple threat
  • The 2026 CHI modqueue study found moderators "juggle multiple interfaces and third-party tools"
  • No existing tool covered the post-approval timeline

So we built one.


What it does

Tripwire is moderation’s rear-view mirror, the only tool that watches what happens after approval.

  1. Capture: snapshots exactly what was approved (title, body, links, domains, approving mod)

  2. Watch: detects edits and compares them against the approved snapshot

  3. Score: computes a drift score across:

    • Links
    • Off-platform solicitation
    • Obfuscation
    • Structural changes
  4. Act: based on the score threshold, does one of the following:

    • Re-queue content
    • Notify moderators
    • Log silently
  5. Review: Drift Log dashboard with:

    • Severity
    • Signals triggered
    • Author + approving mod
    • One-click actions (View / Restore / Remove)

When a scam link is injected post-approval, Tripwire catches it in seconds, automatically and explainably.
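
The Capture and Watch steps can be sketched roughly as follows. This is a minimal illustration, not Tripwire's actual code: the `ApprovalSnapshot` shape, the function names, and the URL pattern are all assumptions made for the example. Only links with an explicit `http(s)` scheme are extracted, so a bare `bit.ly/test-link` is treated as plain text.

```typescript
// Hypothetical snapshot shape; field names are illustrative only.
interface ApprovalSnapshot {
  postId: string;
  title: string;
  body: string;
  links: string[];
  domains: string[];
  approvedBy: string;
}

// Matches only URLs that carry an explicit http:// or https:// scheme.
const URL_PATTERN = /https?:\/\/[^\s<>()\[\]"']+/gi;

function extractLinks(text: string): string[] {
  const matches: string[] = text.match(URL_PATTERN) || [];
  // Trim trailing punctuation that commonly follows a URL in prose, then dedupe.
  return Array.from(new Set(matches.map((m) => m.replace(/[.,;:!?]+$/, ""))));
}

function hostOf(link: string): string | null {
  const m = /^https?:\/\/([^\/?#\s]+)/i.exec(link);
  if (!m) return null;
  // Drop any userinfo ("user@") and port (":8080") from the authority.
  const authority = m[1];
  const host = authority.slice(authority.lastIndexOf("@") + 1).replace(/:\d+$/, "");
  return host.toLowerCase();
}

function extractDomains(links: string[]): string[] {
  const domains = new Set<string>();
  for (const link of links) {
    const host = hostOf(link);
    if (host) domains.add(host);
  }
  return Array.from(domains);
}

// Capture: record what the moderator actually approved.
function captureSnapshot(
  postId: string,
  title: string,
  body: string,
  approvedBy: string
): ApprovalSnapshot {
  const links = extractLinks(`${title}\n${body}`);
  return { postId, title, body, links, domains: extractDomains(links), approvedBy };
}

// Watch: list domains present after an edit that were absent at approval.
function newDomainsSince(snapshot: ApprovalSnapshot, title: string, body: string): string[] {
  const before = new Set(snapshot.domains);
  return extractDomains(extractLinks(`${title}\n${body}`)).filter((d) => !before.has(d));
}
```

Comparing domains rather than raw strings means that reordering or rewording the approved links does not register as drift, while a newly introduced host does.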


How we built it

Tripwire is a Devvit app (TypeScript) built on @devvit/public-api.

Core Components

  • Triggers

    • ModAction: approvals
    • PostUpdate / CommentUpdate: edits
    • AppInstall: onboarding
  • Storage (Redis)

    • Approval snapshots
    • Watchlist (sorted set)
    • Daily pruning via zRemRangeByScore
  • Reddit API

    • Remove content
    • Send modmail
    • Add mod notes
  • UI

    • Devvit Blocks: Drift Log dashboard
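
The watchlist and its daily pruning might look something like the sketch below. The `WatchlistStore` interface is a hand-written stand-in modeled on the sorted-set calls named above, not a copy of the Devvit Redis client; the key name and the 30-day retention window are likewise hypothetical. Exact signatures should be checked against the Devvit documentation.

```typescript
// Assumed minimal subset of a Redis-like client (illustrative, not the Devvit type).
interface WatchlistStore {
  zAdd(key: string, ...members: { member: string; score: number }[]): Promise<number>;
  zRemRangeByScore(key: string, min: number, max: number): Promise<number>;
}

const WATCHLIST_KEY = "tripwire:watchlist"; // hypothetical key name
const DAY_MS = 24 * 60 * 60 * 1000;

// Approvals at or before this timestamp are pruned.
function pruneCutoff(nowMs: number, retentionDays: number): number {
  return nowMs - retentionDays * DAY_MS;
}

// Add a post to the watchlist, scored by its approval time in milliseconds.
async function watchPost(
  store: WatchlistStore,
  postId: string,
  approvedAtMs: number
): Promise<void> {
  await store.zAdd(WATCHLIST_KEY, { member: postId, score: approvedAtMs });
}

// Drop every entry approved at or before the retention cutoff; returns the count removed.
async function pruneWatchlist(
  store: WatchlistStore,
  nowMs: number,
  retentionDays: number = 30 // hypothetical default
): Promise<number> {
  return store.zRemRangeByScore(WATCHLIST_KEY, 0, pruneCutoff(nowMs, retentionDays));
}
```

Scoring members by approval timestamp is what lets a single range removal expire old entries without scanning the whole set.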

Drift Scoring Engine

A deterministic system combining signals using a noisy-OR model:

\[ \text{score} = 1 - \prod_{i}(1 - c_i) \]

This ensures:

  • Weak signals reinforce each other
  • No single category dominates
  • Works without labeled training data
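
A minimal sketch of the noisy-OR combination, assuming each \(c_i\) is a per-signal confidence in the range 0 to 1 (the source does not define \(c_i\) explicitly, and the function name is illustrative):

```typescript
// Combine independent signal confidences with a noisy-OR:
// score = 1 - product over i of (1 - c_i).
function noisyOr(confidences: number[]): number {
  let allMiss = 1; // probability that every signal is a false alarm
  for (const c of confidences) {
    if (!(c >= 0 && c <= 1)) {
      throw new RangeError(`confidence must be within [0, 1], got ${c}`);
    }
    allMiss *= 1 - c;
  }
  return 1 - allMiss;
}
```

As a worked example, three moderate signals at 0.5, 0.4, and 0.3 combine to \(1 - (0.5)(0.6)(0.7) = 0.79\), which is higher than any one of them alone. An empty signal list yields 0.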

Security and Abuse Defenses

Built to match real-world adversarial behavior:

  • URL canonicalization (Google Safe Browsing)
  • Unicode UTS-39 homoglyph detection
  • Trojan Source defense (CVE-2021-42574)
  • Punycode decoding (RFC 3492)
  • Typosquat detection (edit distance + deglyphing)
  • Public Suffix List validation
  • Link cloaking detection
  • Dilution-resistant diffing:

\[ \frac{|B \setminus A|}{|B|} \]
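
The ratio above can be illustrated with a short sketch. The source does not define \(A\) and \(B\); this example assumes \(A\) is the set of whitespace-separated tokens in the approved text and \(B\) the set in the edited text. The tokenizer and function names are hypothetical.

```typescript
// Lowercased, whitespace-separated tokens as a set (a deliberately simple tokenizer).
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((t) => t.length > 0));
}

// Fraction of the edited token set that was absent from the approved text:
// |B \ A| / |B|, defined as 0 when the edited text has no tokens.
function noveltyRatio(approved: string, edited: string): number {
  const a = tokenize(approved);
  const b = tokenize(edited);
  if (b.size === 0) return 0;
  let added = 0;
  b.forEach((token) => {
    if (!a.has(token)) added += 1;
  });
  return added / b.size;
}
```

For instance, editing "my dog is cute" into "my dog is cute buy now" gives two new tokens out of six, a ratio of 1/3, while an unchanged post scores 0 and a fully replaced one scores 1.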


Challenges we ran into

  • No native AI support

    • Only Gemini available, costly at scale
    • Decision: go fully deterministic
  • Adversarial evasion

    • Unicode tricks, hidden characters, cloaked URLs
    • Required deep defensive engineering
  • Precision vs Recall

    • False positives are worse than misses
    • Auto-action requires ≥ 0.85 confidence
  • Real vs demo gap

    • Example: bit.ly/test-link (no scheme) should not trigger
    • Avoiding over-flagging was as hard as detection
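
The precision-first policy can be sketched as a simple threshold mapping. Only the 0.85 auto-action threshold comes from the text above; the 0.5 notify threshold, the action names, and the choice of re-queueing as the automatic action are assumptions for illustration.

```typescript
type DriftAction = "requeue" | "notify" | "log";

const AUTO_ACTION_THRESHOLD = 0.85; // stated above
const NOTIFY_THRESHOLD = 0.5; // hypothetical value

// Map a drift score to the least intrusive action that still fits its confidence.
function chooseAction(score: number): DriftAction {
  if (score >= AUTO_ACTION_THRESHOLD) return "requeue";
  if (score >= NOTIFY_THRESHOLD) return "notify";
  return "log";
}
```

Under these assumed thresholds, a score of 0.79 would notify moderators without touching the post, leaving the decision to a human.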

Accomplishments that we're proud of

  • Identified and solved a previously unaddressed gap
  • 135 unit tests, including adversarial cases
  • Fully validated on a live subreddit
  • Zero-config, free, and scalable
  • Fully explainable decisions, no black boxes

What we learned

  • The biggest problems are not always solved by smarter models; sometimes the fix is watching an unwatched surface

  • Deterministic systems can outperform AI in:

    • Reliability
    • Cost
    • Explainability
  • Moderators trust tools that show their reasoning

  • Precision is more important than raw capability in moderation systems


What's next for Tripwire

  • Link rot and domain takeover detection
  • Sleeper account behavior tracking
  • Per-mod accountability analytics
  • Optional AI semantic drift detection (opt-in, non-critical path)
  • Score calibration using real moderator feedback
