Project Story

Inspiration

Every moderation tool on Reddit fires at one moment: submission. AutoModerator, Post Guidance, Crowd Control, and the Harassment Filter all inspect content before it goes live.

We kept coming back to one unsettling fact, straight from AutoModerator's own documentation: it "will not act on content already approved or removed by a moderator" and "cannot react to a user's edits." In other words, the instant a moderator approves a post, every tool on the platform stops watching it.

That's not a small gap. It's an attack surface.

The highest-leverage moment to inject a scam link is not at submission, when scrutiny is highest. It is after approval, once a post has climbed the feed and every eye is on it. A wholesome post gets approved, trends, and then the author quietly edits in a bit.ly scam link, an affiliate code, or an off-platform "DM me to buy." Mods discover it hours later, from user reports, after the damage is done.

The research backed up our hunch:

Cornell's CSCW 2025 study calls AI-driven content a "very disruptive" triple threat
The 2026 CHI modqueue study found moderators "juggle multiple interfaces and third-party tools"
No existing tool covered the post-approval timeline

So we built one.

What it does

Tripwire is moderation’s rear-view mirror: the only tool that watches what happens after approval.

Capture: Snapshots exactly what was approved (title, body, links, domains, approving mod)
Watch: Detects edits and compares them against the approved snapshot
Score: Computes a drift score across:

Links
Off-platform solicitation
Obfuscation
Structural changes

Act: Takes one of three actions, depending on which threshold the drift score crosses:

Re-queue content
Notify moderators
Log silently

Review: Drift Log dashboard with:

Severity
Signals triggered
Author + approving mod
One-click actions (View / Restore / Remove)

When a scam link is injected post-approval, Tripwire catches it in seconds, automatically and explainably.
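
The Act step can be pictured as a small threshold ladder. The sketch below is illustrative, not Tripwire's actual code: the `DriftAction` and `Thresholds` names and the 0.5 notify level are assumptions, and only the 0.85 auto-action bar comes from this write-up (see Challenges).

```typescript
// Hypothetical action tiers mirroring the three outcomes listed above.
type DriftAction = "requeue" | "notify" | "log";

interface Thresholds {
  requeue: number; // at or above: send the post back to the mod queue
  notify: number;  // at or above: alert the moderators
}

// Assumed defaults: 0.85 is stated in this write-up, 0.5 is purely illustrative.
const DEFAULT_THRESHOLDS: Thresholds = { requeue: 0.85, notify: 0.5 };

// Map a drift score in [0, 1] to the strongest action whose threshold it meets.
function chooseAction(score: number, t: Thresholds = DEFAULT_THRESHOLDS): DriftAction {
  if (score >= t.requeue) return "requeue";
  if (score >= t.notify) return "notify";
  return "log";
}
```

Checking the strictest tier first means a borderline score always falls through to the gentler action.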

How we built it

Tripwire is a Devvit app (TypeScript) built on @devvit/public-api.

Core Components

Triggers
- ModAction: approvals
- PostUpdate / CommentUpdate: edits
- AppInstall: onboarding
Storage (Redis)
- Approval snapshots
- Watchlist (sorted set)
- Daily pruning via zRemRangeByScore
Reddit API
- Remove content
- Send modmail
- Add mod notes
UI
- Devvit Blocks: Drift Log dashboard
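
To illustrate the watchlist pruning, here is a minimal in-memory stand-in for a sorted set keyed by approval time. It is an assumption-laden sketch: `InMemoryWatchlist`, `pruneWatchlist`, and the 30-day retention window are hypothetical. In the app itself the equivalent calls would go through the Devvit Redis client, which is asynchronous.

```typescript
// In-memory stand-in for a Redis sorted set: member (post ID) -> score (approval time in ms).
class InMemoryWatchlist {
  private scores = new Map<string, number>();

  // Like ZADD: insert a member or update its score.
  add(postId: string, approvedAtMs: number): void {
    this.scores.set(postId, approvedAtMs);
  }

  // Like ZREMRANGEBYSCORE: remove members with min <= score <= max, return how many were removed.
  removeRangeByScore(min: number, max: number): number {
    let removed = 0;
    this.scores.forEach((score, member) => {
      if (score >= min && score <= max) {
        this.scores.delete(member);
        removed++;
      }
    });
    return removed;
  }

  // Members ordered by ascending score, as a sorted set would return them.
  members(): string[] {
    const entries: Array<[string, number]> = [];
    this.scores.forEach((score, member) => entries.push([member, score]));
    return entries.sort((a, b) => a[1] - b[1]).map((e) => e[0]);
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical daily job: stop watching posts approved more than `retentionDays` ago.
function pruneWatchlist(list: InMemoryWatchlist, nowMs: number, retentionDays = 30): number {
  return list.removeRangeByScore(0, nowMs - retentionDays * DAY_MS);
}
```

Using the approval timestamp as the score is what makes pruning a single range delete instead of a scan over every snapshot.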

Drift Scoring Engine

A deterministic system that combines per-signal confidences c_i (each between 0 and 1) using a noisy-OR model:

\[ \text{score} = 1 - \prod_{i}(1 - c_i) \]

This ensures:

Weak signals reinforce each other
No single category dominates
Works without labeled training data
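
The noisy-OR combination is short enough to sketch directly. The `Signal` shape and the example signal names below are illustrative assumptions, and clamping out-of-range confidences is a defensive choice of this sketch, not something the write-up specifies.

```typescript
// One detector's output: a confidence in [0, 1] that the edit is malicious.
interface Signal {
  name: string;
  confidence: number;
}

// Noisy-OR: score = 1 - product over signals of (1 - c_i).
function noisyOr(signals: Signal[]): number {
  let probabilityNoneFired = 1;
  for (const s of signals) {
    // Clamp defensively so a misbehaving detector cannot push the score outside [0, 1].
    const c = Math.min(1, Math.max(0, s.confidence));
    probabilityNoneFired *= 1 - c;
  }
  return 1 - probabilityNoneFired;
}

// Example: two moderate signals reinforce each other.
const combined = noisyOr([
  { name: "new-shortener-link", confidence: 0.5 },
  { name: "off-platform-solicitation", confidence: 0.6 },
]);
```

With confidences of 0.5 and 0.6, the combined score is 1 - (0.5 × 0.4) = 0.8, higher than either signal alone, and an empty signal list scores 0.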

Security and Abuse Defenses

Built to match real-world adversarial behavior:

URL canonicalization (Google Safe Browsing)
Unicode UTS-39 homoglyph detection
Trojan Source defense (CVE-2021-42574)
Punycode decoding (RFC 3492)
Typosquat detection (edit distance + deglyphing)
Public Suffix List validation
Link cloaking detection
Dilution-resistant diffing, scoring the share of the edited content B that is absent from the approved snapshot A:

\[ \frac{|B \setminus A|}{|B|} \]
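
The diffing ratio above can be sketched as a set computation. This reads A as the set of items (for example tokens or domains) in the approved snapshot and B as the set in the edited version, which is an interpretation and not something the write-up spells out. The function name `addedFraction` is hypothetical.

```typescript
// Fraction of the edited version's distinct items that were not in the approved snapshot:
// |B \ A| / |B|, with A = approved items and B = edited items (assumed reading).
function addedFraction(approved: string[], edited: string[]): number {
  const a = new Set(approved);
  const b = new Set(edited);
  if (b.size === 0) return 0; // an empty edit adds nothing
  let added = 0;
  b.forEach((item) => {
    if (!a.has(item)) added++;
  });
  return added / b.size;
}

// Example: one new domain appears alongside two that were already approved.
const share = addedFraction(["reddit.com", "imgur.com"], ["reddit.com", "imgur.com", "bit.ly"]);
```

In this example one of three distinct domains is new, so the ratio is 1/3. Working on sets means that repeating already-approved items does not change the result.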

Challenges we ran into

No native AI support
- Only Gemini available, costly at scale
- Decision: go fully deterministic
Adversarial evasion
- Unicode tricks, hidden characters, cloaked URLs
- Required deep defensive engineering
Precision vs Recall
- False positives are worse than misses
- Auto-action requires ≥ 0.85 confidence
Real vs demo gap
- Example: bit.ly/test-link (no scheme) should not trigger
- Avoiding over-flagging was as hard as detection
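
The scheme-less `bit.ly/test-link` case can be illustrated with a conservative link extractor that only counts URLs carrying an explicit `http://` or `https://` scheme. This is a sketch under that assumption, not Tripwire's actual parser. The function name and the regular expression are hypothetical, and the captured host is not canonicalized (it may still include a port or userinfo).

```typescript
// Return the lowercased host part of every URL in `text` that has an explicit http(s) scheme.
// Bare strings such as "bit.ly/test-link" are deliberately ignored.
function extractLinkHosts(text: string): string[] {
  const pattern = /\bhttps?:\/\/([^\s\/?#<>()\[\]"']+)/gi;
  const hosts: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Strip sentence punctuation that may trail a URL at the end of a clause.
    const host = match[1].replace(/[.,;:!]+$/, "").toLowerCase();
    if (host.length > 0) hosts.push(host);
  }
  return hosts;
}

const schemeless = extractLinkHosts("demo text mentioning bit.ly/test-link");
const explicit = extractLinkHosts("Buy here: https://bit.ly/test-link now");
```

Here `schemeless` is empty and `explicit` contains `"bit.ly"`. Requiring a scheme trades a little recall for precision, in line with the point above that false positives cost more than misses.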

Accomplishments that we're proud of

Identified and solved a previously unaddressed gap
135 unit tests, including adversarial cases
Fully validated on a live subreddit
Zero-config, free, and scalable
Fully explainable decisions, no black boxes

What we learned

The biggest problems do not always call for smarter models. Sometimes they are unwatched surfaces
Deterministic systems can outperform AI in:
- Reliability
- Cost
- Explainability
Moderators trust tools that show their reasoning
Precision is more important than raw capability in moderation systems

What's next for Tripwire

Link rot and domain takeover detection
Sleeper account behavior tracking
Per-mod accountability analytics
Optional AI semantic drift detection (opt-in, non-critical path)
Score calibration using real moderator feedback

Built With

devvit
devvit-blocks
reddit-developer-platform
redis
typescript

Updates

Kaustubh Pardeshi started this project — May 27, 2026 04:15 PM EDT
