Inspiration

Discord moderation is outdated. Most tools scan text for slurs, ignore voice entirely, and only act after harm has already happened. Yet the most damaging conflicts — shouting matches, harassment, coordinated pile-ons — happen in voice channels, where no moderation tool is listening.

We built Echo to change that: an autonomous community guardian that understands both text and voice, reasons about community dynamics over time, and intervenes only when it actually helps. With Gemini 3’s long-context reasoning and Gemini Live’s real-time audio understanding, Echo doesn’t react to keywords. It reads the room.

What it does

Echo is an autonomous moderation agent for Discord servers. It continuously observes text and voice activity, maintains a live understanding of community mood and context, and decides when — and how — to intervene.

From a member’s perspective, Echo feels like a calm, human moderator; from a moderator’s perspective, it provides structured, real-time community intelligence.

Echo does not punish by default. It facilitates, de-escalates, and escalates to humans only when safety is at risk.

How we built it

We built Echo as a full end-to-end system designed for real-time operation, restraint, and safety.

Core system

  • Discord integration: Node.js + Discord.js for text and voice
  • Audio pipeline: Opus → PCM → 48 kHz mono with VAD and backpressure handling (see the capture sketch after this list)
  • State persistence: MySQL for live and historical server state
  • Real-time CLI dashboard: powered by direct DB access
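
As a concrete illustration of the capture path, here is a minimal sketch assuming @discordjs/voice for per-user Opus streams and prism-media for decoding; the handlePcmChunk callback is a placeholder for the VAD, buffering, and downmix-to-mono stage.

```ts
import { EndBehaviorType, VoiceConnection } from '@discordjs/voice';
import prism from 'prism-media';

// Subscribe to one speaking user's Opus stream and decode it to 48 kHz PCM.
// handlePcmChunk stands in for the downstream VAD / buffering / downmix-to-mono stage.
function captureUserAudio(
  connection: VoiceConnection,
  userId: string,
  handlePcmChunk: (pcm: Buffer) => void,
) {
  const opusStream = connection.receiver.subscribe(userId, {
    end: { behavior: EndBehaviorType.AfterSilence, duration: 500 }, // close after 500 ms of silence
  });

  // Discord delivers 20 ms Opus frames; decode them to 16-bit PCM at 48 kHz.
  const decoder = new prism.opus.Decoder({ rate: 48000, channels: 2, frameSize: 960 });

  opusStream.pipe(decoder).on('data', (chunk: Buffer) => handlePcmChunk(chunk));
}
```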

AI layer

  • Gemini 3 Flash:
    • Batch text analysis
    • Cross-modal reasoning
    • Long-context server state modeling
  • Gemini 2.5 Flash Live:
    • Real-time voice semantic analysis via WebSocket (see the session sketch after this list)
    • Ephemeral processing with no audio storage
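
To show the shape of that real-time path, here is a sketch using the @google/genai Live API; the model id, response handling, and 16 kHz PCM input format are assumptions based on the public SDK examples, not our exact configuration.

```ts
import { GoogleGenAI, Modality } from '@google/genai';

// Sketch of the real-time voice path. Model id, sample rate, and message field
// access are assumptions drawn from the public Live API examples.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function analyzeVoice(onText: (text: string) => void) {
  const session = await ai.live.connect({
    model: 'gemini-2.5-flash-live',                  // illustrative model id
    config: { responseModalities: [Modality.TEXT] }, // we only want text back
    callbacks: {
      onmessage: (msg) => { if (msg.text) onText(msg.text); },
      onerror: (e) => console.error('live session error', e),
      onclose: () => console.log('live session closed'),
    },
  });

  // PCM chunks from the Discord pipeline are forwarded as base64 and never written to disk.
  return (pcm: Buffer) =>
    session.sendRealtimeInput({
      audio: { data: pcm.toString('base64'), mimeType: 'audio/pcm;rate=16000' },
    });
}
```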

Echo fuses text sentiment and voice tension into a single server state, enabling interventions based on patterns, not isolated messages.
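
Conceptually, the fused state per channel looks something like the sketch below; the field names, weights, and thresholds are illustrative, not our exact schema.

```ts
// Illustrative shape of the fused per-channel state (not the exact production schema).
interface ChannelState {
  textSentiment: number; // -1 (hostile) .. +1 (positive), from batched Gemini text analysis
  voiceTension: number;  //  0 (calm)    ..  1 (heated),  from the Live audio session
  updatedAt: number;     // epoch ms of the last observation
}

// Combine both signals into one escalation score; the weights are assumptions.
function escalationScore(state: ChannelState): number {
  const textRisk = Math.max(0, -state.textSentiment); // only negative sentiment adds risk
  return 0.6 * state.voiceTension + 0.4 * textRisk;
}

// Echo acts on sustained patterns, e.g. a score staying above a threshold across
// several updates, rather than on any single message or utterance.
```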

Challenges we ran into

  • Real-time audio was difficult:
    • Discord audio arrives as Opus packets, while Gemini Live requires raw PCM
    • We built a custom decoding and buffering pipeline to avoid silent failures
  • Gemini Live disconnects after a few turns:
    • We implemented auto-reconnect with exponential backoff and rolling context summaries (see the reconnect sketch after this list)
  • Over-intervention harmed trust:
    • Early versions felt invasive
    • Confidence scoring, temporal decay, and cooldowns were required (see the decay and cooldown sketch after this list)
  • Safety cannot rely on AI alone:
    • We added multilingual regex-based safety detection that bypasses Gemini entirely (see the safety-check sketch after this list)
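
A minimal sketch of the reconnect loop: connectLive is a hypothetical helper that opens a fresh Live session seeded with the rolling context summary, and the backoff constants are illustrative.

```ts
// Reconnect with exponential backoff and jitter. connectLive() is a hypothetical
// helper that opens a new Gemini Live session seeded with a rolling context summary.
async function reconnectWithBackoff(connectLive: () => Promise<void>, maxAttempts = 6) {
  let delayMs = 500;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await connectLive();
      return; // connected
    } catch (err) {
      console.warn(`live reconnect attempt ${attempt} failed`, err);
      const jitter = Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs + jitter));
      delayMs = Math.min(delayMs * 2, 15_000); // cap the backoff
    }
  }
  throw new Error('Gemini Live reconnect failed after maximum attempts');
}
```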
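
The restraint mechanics reduce to a decaying escalation score plus a per-channel cooldown; the half-life, cooldown window, and thresholds below are illustrative assumptions.

```ts
// Exponentially decay an escalation score toward zero between observations,
// and refuse to intervene again within a cooldown window. Constants are illustrative.
const HALF_LIFE_MS = 5 * 60_000; // tension halves after 5 quiet minutes
const COOLDOWN_MS = 10 * 60_000; // at most one intervention per channel per 10 minutes

function decayedScore(score: number, lastUpdateMs: number, nowMs: number): number {
  const halfLives = (nowMs - lastUpdateMs) / HALF_LIFE_MS;
  return score * Math.pow(0.5, halfLives);
}

function shouldIntervene(
  score: number,
  confidence: number,
  lastInterventionMs: number,
  nowMs: number,
): boolean {
  const onCooldown = nowMs - lastInterventionMs < COOLDOWN_MS;
  return !onCooldown && score > 0.7 && confidence > 0.8; // thresholds are assumptions
}
```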
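
The deterministic safety layer runs before any model call; the patterns below are harmless placeholders standing in for the real multilingual lists.

```ts
// Deterministic safety check that runs before (and independently of) any Gemini call.
// These patterns are harmless placeholders; the real lists are multilingual and curated.
const SAFETY_PATTERNS: RegExp[] = [
  /\bplaceholder-threat-term\b/iu,
  /\bplaceholder-self-harm-term\b/iu,
];

function isSafetyCritical(message: string): boolean {
  return SAFETY_PATTERNS.some((pattern) => pattern.test(message));
}

// If this returns true, Echo escalates straight to human moderators,
// regardless of what the probabilistic layer would have decided.
```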

Accomplishments that we’re proud of

  • Hearing Echo calmly de-escalate a live voice argument
  • Successfully linking text sentiment and voice tension into higher-quality decisions
  • Achieving truly ephemeral voice processing with zero storage
  • Building a dashboard that visualizes invisible community health in real time

What we learned

  • Facilitation is more effective than enforcement
  • Voice carries critical context that text alone misses
  • Longitudinal trends matter more than single messages
  • Safety systems must be deterministic, not probabilistic

What’s next

Near-term

  • Speaker diarization to identify who is escalating
  • User reputation and pattern tracking
  • Graduated automated actions (nudge → warning → mute)

Longer-term

  • Predictive intervention before conflicts erupt
  • Multi-turn mediation with follow-up
  • Community Health Graphs to detect cliques, bridges, and isolation
  • Expansion beyond Discord into Slack and live collaboration platforms

Built With

Node.js, Discord.js, MySQL, Gemini 3 Flash, Gemini 2.5 Flash Live