Inspiration
Every DevOps engineer knows the pain: endless pings, false alarms, and sleepless nights. Traditional monitoring tools flood you with noise — CPU spikes during deployments, transient timeouts, irrelevant warnings. The real signal gets buried.
We wanted an agent that understands context — that learns your system’s behavior, filters out the noise, and escalates only what truly matters. So we built CloudWatchman, an autonomous, multi-agent system that runs entirely inside your AWS account.
What it does
CloudWatchman is a team of AI agents that collaborate to detect, analyze, and report real issues — not just raw alerts.
🤖 Multi-Agent System
- Supervisor Agent: Orchestrates analysis and final decisions
- Log Analyst: Performs semantic log search and root-cause inference
- Metrics Analyst: Detects metric anomalies and correlates patterns
Agents communicate asynchronously through SNS/SQS, forming a resilient, self-improving monitoring layer.
🧠 Reinforcement Learning Loop
- Learns from your one-click feedback on each alert: helpful or false positive
- Adjusts thresholds using Q-learning based on time, service, and error context
- Evolves with every incident you rate
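A minimal sketch of how a tabular Q-learning update over rated alerts could look. The state fields (time bucket, service, error class), the three threshold actions, the in-memory `Map` standing in for the DynamoDB table, and the `alpha`/`gamma` values are all assumptions made for illustration, not CloudWatchman's actual schema or hyperparameters.

```typescript
// Hypothetical action set: how the alert threshold is adjusted for a state.
type Action = "raise_threshold" | "keep_threshold" | "lower_threshold";
const ACTIONS: Action[] = ["raise_threshold", "keep_threshold", "lower_threshold"];

// Hypothetical state: the time, service, and error context of an alert.
interface AlertState {
  hourBucket: "business" | "off_hours";
  service: string;
  errorClass: string;
}

// In-memory stand-in for a DynamoDB-backed Q-table.
type QTable = Map<string, Record<Action, number>>;

const stateKey = (s: AlertState): string =>
  `${s.hourBucket}|${s.service}|${s.errorClass}`;

// Fetch the Q-values for a state, creating a zeroed row on first sight.
function qRow(table: QTable, s: AlertState): Record<Action, number> {
  const key = stateKey(s);
  let row = table.get(key);
  if (!row) {
    row = { raise_threshold: 0, keep_threshold: 0, lower_threshold: 0 };
    table.set(key, row);
  }
  return row;
}

// Standard tabular Q-learning update:
// Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
function updateQ(
  table: QTable,
  state: AlertState,
  action: Action,
  reward: number, // assumed: +1 for "helpful", -1 for "false positive"
  nextState: AlertState,
  alpha = 0.1,
  gamma = 0.9,
): number {
  const row = qRow(table, state);
  const nextRow = qRow(table, nextState);
  const bestNext = Math.max(...ACTIONS.map((a) => nextRow[a]));
  row[action] += alpha * (reward + gamma * bestNext - row[action]);
  return row[action];
}

// Greedy choice; ties fall back to keeping the threshold unchanged.
function bestAction(table: QTable, state: AlertState): Action {
  const row = qRow(table, state);
  return ACTIONS.reduce<Action>(
    (best, a) => (row[a] > row[best] ? a : best),
    "keep_threshold",
  );
}
```

Keeping the state to a few coarse fields keeps the table small, which matters when each cell is only updated by an occasional human rating.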
🔍 Intelligent Analysis
- Semantic search on CloudWatch logs via SageMaker embeddings
- Supports natural-language queries like: “show me payment timeouts in the last hour”
- Correlates logs and metrics to surface likely causal relationships
- Identifies new anomalies vs historical noise
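The semantic search step can be sketched as a cosine-similarity ranking over pre-computed embeddings. This is a brute-force illustration under stated assumptions: the vectors are presumed to come from an embedding endpoint already, and `EmbeddedLogLine` and `topMatches` are hypothetical names, not part of the project's code.

```typescript
// Hypothetical shape: a log line paired with its embedding vector.
interface EmbeddedLogLine {
  message: string;
  vector: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every stored line against the query vector and keep the k best.
function topMatches(
  queryVector: number[],
  lines: EmbeddedLogLine[],
  k: number,
): Array<{ message: string; score: number }> {
  return lines
    .map((line) => ({
      message: line.message,
      score: cosineSimilarity(queryVector, line.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A natural-language query such as "show me payment timeouts in the last hour" would first be embedded with the same model as the log lines, then passed in as `queryVector`. A full scan like this is linear in the number of stored lines, so it only illustrates the ranking, not a scalable index.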
📧 Smart Alerting
- Email alerts with contextual summaries and one-click feedback
- Optional JIRA integration for issue tracking
- PDF summaries with evidence and confidence scores
- Feedback automatically improves the Q-table for smarter alerts next time
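One way the one-click feedback could be wired up is to encode the verdict in the link itself and turn it into a reward on click. The sketch below assumes a hypothetical feedback URL with `alertId` and `verdict` query parameters and a +1/-1 reward mapping; none of these names or values come from the project.

```typescript
type Verdict = "helpful" | "false_positive";

interface Feedback {
  alertId: string;
  verdict: Verdict;
  reward: number; // assumed mapping: helpful = +1, false positive = -1
}

// Build the link embedded in the alert email for one verdict button.
function buildFeedbackLink(baseUrl: string, alertId: string, verdict: Verdict): string {
  const url = new URL(baseUrl);
  url.searchParams.set("alertId", alertId);
  url.searchParams.set("verdict", verdict);
  return url.toString();
}

// Parse a clicked link back into a feedback record, rejecting malformed input.
function parseFeedback(link: string): Feedback {
  const params = new URL(link).searchParams;
  const alertId = params.get("alertId");
  const verdict = params.get("verdict");
  if (!alertId) {
    throw new Error("Missing alertId");
  }
  if (verdict !== "helpful" && verdict !== "false_positive") {
    throw new Error(`Unknown verdict: ${verdict}`);
  }
  return { alertId, verdict, reward: verdict === "helpful" ? 1 : -1 };
}
```

The resulting `reward` is the value a Q-table update would consume for the state that produced the alert.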
How we built it
AI Core
- Amazon Bedrock (Amazon Nova Micro) → reasoning & decision layer
- Amazon SageMaker → embedding model endpoints for semantic search
- DynamoDB → Q-learning tables + agent state storage
Agent Infrastructure
- ECS Fargate → scalable container deployment
- SNS / SQS → inter-agent messaging
- Lambda → data prep & metric extraction
- EventBridge → periodic autonomous scans
Data & Integrations
- Kinesis Firehose → real-time ingestion
- CloudWatch Logs + S3 → storage & retrieval
- SES → alert delivery
- Secrets Manager → secure JIRA credentials
- React + Ink CLI → interactive terminal dashboard
Challenges we ran into
- Agent Coordination: defining a supervisor–specialist pattern that avoids message loops.
- Q-Learning in Production: designing meaningful state and action spaces with sparse human feedback.
- Semantic Search Scaling: embedding millions of log lines and running cosine-similarity search efficiently over vectors stored in DynamoDB.
- Streaming LLM Responses: streaming model output in real time without blocking the CLI.
- Dynamic Config Refresh: getting agents to reload new log groups automatically every few tasks.
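As an illustration of the coordination problem, one common way to rule out message loops in a supervisor and specialist setup is to allow only supervisor-to-specialist and specialist-to-supervisor routes and to cap the hop count per task. The sketch below is an assumption about how that could look, not the project's actual protocol; `AgentMessage`, `MAX_HOPS`, and `forward` are hypothetical names.

```typescript
type AgentName = "supervisor" | "log_analyst" | "metrics_analyst";

// Hypothetical envelope carried on every inter-agent message.
interface AgentMessage {
  taskId: string;
  from: AgentName;
  to: AgentName;
  hops: number; // number of deliveries so far for this task
  body: string;
}

const MAX_HOPS = 4; // assumed budget per task

// Star topology: the supervisor sits on exactly one end of every message,
// so specialists can never message each other directly.
function isRouteAllowed(from: AgentName, to: AgentName): boolean {
  return (from === "supervisor") !== (to === "supervisor");
}

// Build the follow-up message the receiver of `msg` wants to send.
// Returns null when the route is not allowed or the hop budget is spent.
function forward(msg: AgentMessage, to: AgentName, body: string): AgentMessage | null {
  const from = msg.to; // the receiver becomes the sender
  if (!isRouteAllowed(from, to)) {
    return null;
  }
  if (msg.hops + 1 > MAX_HOPS) {
    return null;
  }
  return { taskId: msg.taskId, from, to, hops: msg.hops + 1, body };
}
```

Dropped messages return `null` here for simplicity; with a real queue they could instead be routed to a dead-letter queue for inspection.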
Accomplishments
- ✨ Fully autonomous multi-agent system (Supervisor + Specialists)
- 🎯 Q-Learning that visibly reduces false positives after ~20 rated alerts
- 🔍 Working semantic log search with natural-language queries
- 🏗️ Production-grade AWS deployment with modular IaC
- 📊 Auto-refreshing agent configuration
- 🎨 CLI dashboard and onboarding flow
- 🔄 Closed feedback loop: feedback → learning → improved accuracy
What we learned
- Multi-Agent Design: Roles matter more than raw intelligence — clear specialization wins.
- Practical RL: Q-learning works when the state space is well-bounded and user feedback exists.
- Embeddings for Logs: Vector similarity beats regex for noisy data.
- Amazon Bedrock Tool Use: Streaming plus function calling simplifies orchestration.
- IaC Discipline: Abstractions for ECS/SQS/SNS reduce friction in multi-service deployments.
What’s next
- 🚀 New Agents: Security, Cost, and Performance specialists
- 🔗 More Integrations: Slack, PagerDuty, Datadog
- 🧪 Advanced RL: DQN + multi-armed bandits for adaptive strategies
- 💡 Explainable AI: visualize Q-table decisions and confidence metrics
CloudWatchman transforms AWS monitoring from noisy alerts to intelligent collaboration. Agents that think, learn, and act — all inside your AWS account.
Built With
- amazon-dynamodb
- amazon-web-services
- bedrock
- cloudwatch
- embeddings
- eventbridge
- ink
- lambda
- node.js
- nova
- q-learning
- reinforcement-learning
- ses
- sqs
- terraform
- titan
- typescript