Inspiration

Every DevOps engineer knows the pain: endless pings, false alarms, and sleepless nights. Traditional monitoring tools flood you with noise — CPU spikes during deployments, transient timeouts, irrelevant warnings. The real signal gets buried.

We wanted an agent that understands context — that learns your system’s behavior, filters out the noise, and escalates only what truly matters. So we built CloudWatchman, an autonomous, multi-agent system that runs entirely inside your AWS account.


What it does

CloudWatchman is a team of AI agents that collaborate to detect, analyze, and report real issues — not just raw alerts.

🤖 Multi-Agent System

  • Supervisor Agent: Orchestrates analysis and final decisions
  • Log Analyst: Performs semantic log search and root-cause inference
  • Metrics Analyst: Detects metric anomalies and correlates patterns

Agents communicate asynchronously through SNS/SQS, forming a resilient, self-improving monitoring layer.

🧠 Reinforcement Learning Loop

  • Learns from your one-click feedback on each alert: "helpful" vs. "false positive"
  • Adjusts thresholds using Q-learning based on time, service, and error context
  • Evolves with every incident you rate
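
The feedback loop above can be sketched as a small tabular Q-learning update. This is a minimal illustration, not CloudWatchman's actual code: the state buckets, the two actions, the reward values, and the learning rate are all assumptions chosen for the example, and each rated alert is treated as a one-step episode.

```python
from collections import defaultdict

ACTIONS = ("alert", "suppress")   # assumed action space
ALPHA = 0.3                       # learning rate (assumed)
GAMMA = 0.9                       # discount factor, only used when a next state is given (assumed)

# Hypothetical mapping from one-click feedback to a reward.
FEEDBACK_REWARD = {"helpful": 1.0, "false_positive": -1.0}

# Q-table keyed by (state, action); unseen entries default to 0.0.
q_table = defaultdict(float)


def make_state(hour, service, error_type):
    """Bucket the hour so the state space stays small and well-bounded."""
    bucket = "business" if 9 <= hour < 18 else "off_hours"
    return (bucket, service, error_type)


def update(state, action, reward, next_state=None):
    """Standard Q-learning update; next_state=None marks a terminal step."""
    best_next = 0.0
    if next_state is not None:
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    key = (state, action)
    q_table[key] += ALPHA * (reward + GAMMA * best_next - q_table[key])


def choose_action(state):
    """Greedy choice; ties resolve to 'alert' because it is listed first."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


if __name__ == "__main__":
    s = make_state(3, "payments", "timeout")
    print(choose_action(s))                       # no feedback yet
    update(s, "alert", FEEDBACK_REWARD["false_positive"])
    print(choose_action(s), q_table[(s, "alert")])
```

Defaulting ties to "alert" means an unseen situation is still escalated, and only repeated "false positive" ratings push a state toward suppression. A production table would live in DynamoDB rather than in memory.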

🔍 Intelligent Analysis

  • Semantic search on CloudWatch logs via SageMaker embeddings
  • Supports natural-language queries like: “show me payment timeouts in the last hour”
  • Correlates logs + metrics to detect causal relationships
  • Identifies new anomalies vs historical noise
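
To make the semantic search step concrete, here is a minimal sketch of ranking log lines by cosine similarity against a query embedding. The three-dimensional vectors are toy values standing in for real embeddings from a SageMaker endpoint, and `top_k` is a hypothetical helper name, not part of the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_logs, k=3):
    """Return the k (score, log_line) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), line) for line, vec in indexed_logs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings (assumed); a real system would fetch these from the embedding endpoint.
    indexed_logs = [
        ("payment gateway timeout", [0.9, 0.1, 0.0]),
        ("user login ok", [0.0, 0.2, 0.9]),
        ("checkout timeout upstream", [0.8, 0.3, 0.1]),
    ]
    query = [1.0, 0.0, 0.0]  # stands in for the embedding of "payment timeouts"
    for score, line in top_k(query, indexed_logs, k=2):
        print(f"{score:.3f}  {line}")
```

This brute-force scan is linear in the number of stored vectors, which is why scaling it to millions of log lines shows up later as a challenge.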

📧 Smart Alerting

  • Email alerts with contextual summaries and one-click feedback
  • Optional Jira integration for issue tracking
  • PDF summaries with evidence and confidence scores
  • Feedback automatically improves the Q-table for smarter alerts next time

How we built it

AI Core

  • Amazon Bedrock (Amazon Nova Micro) → reasoning & decision layer
  • Amazon SageMaker → embedding model endpoints for semantic search
  • DynamoDB → Q-learning tables + agent state storage

Agent Infrastructure

  • ECS Fargate → scalable container deployment
  • SNS / SQS → inter-agent messaging
  • Lambda → data prep & metric extraction
  • EventBridge → periodic autonomous scans

Data & Integrations

  • Kinesis Firehose → real-time ingestion
  • CloudWatch Logs + S3 → storage & retrieval
  • SES → alert delivery
  • Secrets Manager → secure Jira credentials
  • React + Ink CLI → interactive terminal dashboard

Challenges we ran into

  1. Agent Coordination: defining a supervisor–specialist pattern that avoids message loops.
  2. Q-Learning in Production: designing meaningful state/action spaces with sparse human feedback.
  3. Semantic Search Scaling: embedding millions of log lines and running cosine-similarity search efficiently over vectors stored in DynamoDB, which has no native vector index.
  4. Streaming LLM Responses: showing real-time output without blocking the CLI.
  5. Dynamic Config Refresh: getting agents to reload newly added log groups automatically every few tasks.
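
One way to guard against the message loops mentioned in the first challenge is to carry a trace of handlers inside each message and refuse to forward when a specialist would see the same task twice. The sketch below is an assumption about how such a guard could look, not the project's implementation: the `Envelope` fields, the agent names, and the `MAX_HOPS` limit are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_HOPS = 4  # assumed upper bound on how many agents may handle one task


@dataclass
class Envelope:
    task_id: str
    sender: str
    recipient: str
    body: dict
    trace: list = field(default_factory=list)  # agents that already handled this task


def forward(envelope, next_recipient, body):
    """Build the envelope for the next agent, or return None if forwarding would loop.

    The current recipient is appended to the trace. A specialist may handle a
    task only once; the supervisor may be revisited until MAX_HOPS is reached.
    """
    trace = envelope.trace + [envelope.recipient]
    if len(trace) >= MAX_HOPS:
        return None
    if next_recipient in trace and next_recipient != "supervisor":
        return None
    return Envelope(envelope.task_id, envelope.recipient, next_recipient, body, trace)


if __name__ == "__main__":
    e0 = Envelope("task-1", "supervisor", "log_analyst", {"query": "payment timeouts"})
    e1 = forward(e0, "supervisor", {"finding": "timeouts in gateway logs"})
    e2 = forward(e1, "metrics_analyst", {"check": "latency"})
    print(e2.trace)                            # handlers so far
    print(forward(e2, "log_analyst", {}))      # rejected: log_analyst already handled it
```

In an SNS/SQS setup the envelope would be serialized into the message body, and a rejected forward would typically be logged or routed to a dead-letter queue rather than silently dropped.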

Accomplishments

  • ✨ Fully autonomous multi-agent system (Supervisor + Specialists)
  • 🎯 Q-Learning that visibly reduces false positives after ~20 rated alerts
  • 🔍 Working semantic log search with natural-language queries
  • 🏗️ Production-grade AWS deployment with modular IaC
  • 📊 Auto-refreshing agent configuration
  • 🎨 CLI dashboard and onboarding flow
  • 🔄 Closed feedback loop: feedback → learning → improved accuracy

What we learned

  • Multi-Agent Design: Roles matter more than raw intelligence — clear specialization wins.
  • Practical RL: Q-learning works when the state space is well-bounded and user feedback exists.
  • Embeddings for Logs: Vector similarity beats regex for noisy data.
  • Amazon Bedrock Tool Use: Streaming + function calling simplifies orchestration.
  • IaC Discipline: Abstractions for ECS/SQS/SNS reduce friction in multi-service deployments.

What’s next

  • 🚀 New Agents: Security, Cost, and Performance specialists
  • 🔗 More Integrations: Slack, PagerDuty, Datadog
  • 🧪 Advanced RL: DQN + multi-armed bandits for adaptive strategies
  • 💡 Explainable AI: visualize Q-table decisions and confidence metrics

CloudWatchman transforms AWS monitoring from noisy alerts to intelligent collaboration. Agents that think, learn, and act — all inside your AWS account.
