EC2 Disk Monitor with AI Agent - Complete Documentation
Project Overview
The EC2 Disk Monitor is an intelligent infrastructure monitoring solution that combines AWS services with AI-powered analysis to provide real-time disk space monitoring, automated alerting, and natural language interaction capabilities.
Key Value Propositions
- Proactive Monitoring: Prevents disk space issues before they impact services
- AI-Powered Insights: Natural language commands and intelligent analysis
- Multi-Interface Access: Web dashboard, CLI, and chat interfaces
- Automated Alerting: Email notifications via SNS when thresholds are exceeded
- Historical Tracking: S3-based data persistence for trend analysis
System Architecture
Architecture Pattern: Event-Driven Monitoring with Real-Time Dashboard
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ User Layer │ │ Application │ │ AWS Services │
│ │ │ Layer │ │ Layer │
│ • Web Dashboard │◄───►│ • AI Agent │◄───►│ • EC2 Instances │
│ • Chat Interface│ │ • NLP Engine │ │ • SSM Agent │
│ • CLI Tools │ │ • Analysis │ │ • S3 Storage │
│ │ │ Engine │ │ • SNS Alerts │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Data Flow Architecture
- Discovery: Tag-based EC2 instance identification
- Collection: SSM-based remote command execution (df -h)
- Analysis: Local AI processing with intelligent recommendations
- Storage: S3 persistence for historical data
- Alerting: SNS-based email notifications
- Visualization: Real-time web dashboard updates
Technical Stack
Core Technologies
- Language: Python 3.8+
- Web Framework: Streamlit (Interactive dashboards)
- AWS SDK: Boto3 (AWS service integration)
- Data Visualization: Plotly (Interactive charts)
- Data Processing: Pandas (Data manipulation)
AWS Services Integration
- EC2: Target instance management and discovery
- SSM (Systems Manager): Remote command execution
- S3: Historical data storage and report archiving
- SNS: Email alert delivery system
- IAM: Security and access management
Project Structure & Components
File Organization
ec2-disk-monitor/
├── 📊 streamlit_app.py # Web dashboard (16KB)
├── 🤖 ai_agent.py # AI chat interface (18KB)
├── ⚙️ ec2_disk_monitor.py # Core monitoring engine (17KB)
├── 🚀 launch_streamlit.py # Streamlit launcher (2KB)
├── 📝 config_template.py # Configuration template
├── 📦 requirements.txt # Python dependencies
├── 📖 README.md # Project documentation
├── 🚫 .gitignore # Git exclusion rules
└── 📜 LICENSE # MIT license
Component Details
1. Core Monitoring Engine (ec2_disk_monitor.py)
- Purpose: Central monitoring logic and AWS service orchestration
- Key Classes: EC2DiskMonitor
- Responsibilities:
- AWS credential management and service client initialization
- EC2 instance discovery via tag filtering
- SSM command execution with timeout handling
- Disk usage data parsing and validation
- AI-powered analysis with fallback mechanisms
- S3 data persistence and SNS alert delivery
2. AI Agent (ai_agent.py)
- Purpose: Natural language processing and conversational interface
- Key Classes: EC2MonitoringAgent
- Capabilities:
- Natural language command interpretation
- Context-aware response generation
- Multi-format output (detailed/summary views)
- Error handling and user guidance
- Integration with core monitoring functions
3. Web Dashboard (streamlit_app.py)
- Purpose: Interactive web-based monitoring interface
- Features:
- Real-time instance status overview
- Interactive charts and visualizations
- Configuration management (threshold adjustment)
- Integrated AI chat interface
- Historical report viewing
- SNS alert testing capabilities
Core Workflows
1. Instance Discovery Workflow
# Tag-based filtering
Filters=[
    {'Name': 'tag:MonitorDiskSpace', 'Values': ['true']},
    {'Name': 'instance-state-name', 'Values': ['running']}
]
- Scans all EC2 instances in configured region
- Filters by MonitorDiskSpace=true tag
- Validates instance state (running only)
- Returns instance metadata (ID, type, tags)
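The discovery steps above can be sketched as a small helper built on the EC2 `DescribeInstances` paginator. The function names here are illustrative, not the project's actual API; only `extract_instances` is pure and testable without AWS credentials.

```python
def extract_instances(reservations):
    """Flatten DescribeInstances reservations into instance metadata."""
    instances = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            instances.append({
                "id": inst["InstanceId"],
                "type": inst.get("InstanceType"),
                "name": tags.get("Name", ""),
            })
    return instances

def discover_monitored_instances(region="us-east-1"):
    """Find running instances tagged MonitorDiskSpace=true."""
    import boto3  # imported lazily so the parsing helper works without AWS deps
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    reservations = []
    for page in paginator.paginate(Filters=[
        {"Name": "tag:MonitorDiskSpace", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]):
        reservations.extend(page["Reservations"])
    return extract_instances(reservations)
```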
2. Data Collection Workflow
# SSM command execution
ssm_client.send_command(
    InstanceIds=[instance_id],
    DocumentName='AWS-RunShellScript',
    Parameters={'commands': ['df -h']},
    TimeoutSeconds=30
)
- Executes df -h command via SSM
- Implements 30-second timeout protection
- Parses filesystem data with validation
- Handles command failures gracefully
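The timeout protection and graceful-failure handling described above hinge on polling `GetCommandInvocation` until a terminal status appears. A minimal sketch, assuming an SSM client and a hypothetical `wait_for_command` helper:

```python
import time

def wait_for_command(ssm, command_id, instance_id, timeout=30, poll_interval=2):
    """Poll GetCommandInvocation until the command reaches a terminal state
    or the local timeout expires; returns (status, stdout)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id)
        status = result["Status"]
        if status in ("Success", "Failed", "Cancelled", "TimedOut"):
            return status, result.get("StandardOutputContent", "")
        time.sleep(poll_interval)
    return "TimedOut", ""
```

The caller checks the returned status and skips parsing when the command did not succeed, which is what "handles command failures gracefully" amounts to in practice.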
3. Analysis Workflow
# Intelligent severity assessment
if usage >= 90:
    severity = 'CRITICAL'
elif usage >= 80:
    severity = 'HIGH'
elif usage >= 60:
    severity = 'MEDIUM'
else:
    severity = 'LOW'
- Analyzes disk usage patterns
- Generates severity classifications
- Provides actionable recommendations
- Creates structured analysis reports
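The parsing and classification steps above can be sketched as pure functions over raw `df -h` output. The helper names and record shape are illustrative, not the project's actual code:

```python
def classify(usage):
    """Map a usage percentage to a severity label."""
    if usage >= 90:
        return "CRITICAL"
    if usage >= 80:
        return "HIGH"
    if usage >= 60:
        return "MEDIUM"
    return "LOW"

def parse_df_output(output):
    """Parse `df -h` output into per-filesystem records with severity."""
    records = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 6 or not parts[4].endswith("%"):
            continue  # skip wrapped or malformed lines
        usage = int(parts[4].rstrip("%"))
        records.append({
            "filesystem": parts[0],
            "size": parts[1],
            "used": parts[2],
            "available": parts[3],
            "usage_pct": usage,
            "mount": parts[5],
            "severity": classify(usage),
        })
    return records
```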
4. Alert Workflow
# Threshold-based alerting
if max_usage > threshold:
    sns_client.publish(
        TopicArn=sns_topic_arn,
        Subject=f"⚠️ EC2 Disk Alert: {instance_id}",
        Message=formatted_alert_message
    )
- Evaluates usage against configurable thresholds
- Formats comprehensive alert messages
- Delivers via SNS to email subscribers
- Includes dashboard links and recommendations
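The "comprehensive alert message" referenced above might be assembled like this. `format_alert` and its record shape are assumptions for illustration, not the project's actual code:

```python
def format_alert(instance_id, records, threshold, dashboard_url=None):
    """Build a human-readable SNS alert body listing filesystems
    at or above the threshold, with an optional dashboard link."""
    offenders = [r for r in records if r["usage_pct"] >= threshold]
    lines = [f"Disk usage alert for {instance_id} (threshold: {threshold}%)", ""]
    for r in offenders:
        lines.append(
            f"  {r['mount']}: {r['usage_pct']}% used "
            f"({r['available']} available) [{r['severity']}]"
        )
    if dashboard_url:
        lines += ["", f"Dashboard: {dashboard_url}"]
    return "\n".join(lines)
```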
Configuration Management
Configuration Structure (config.py)
CONFIG = {
    # AWS Authentication
    'region': 'us-east-1',
    'aws_access_key_id': 'YOUR_ACCESS_KEY',
    'aws_secret_access_key': 'YOUR_SECRET_KEY',

    # Storage Configuration
    's3_bucket': 'your-bucket-name',

    # Alerting Configuration
    'sns_topic_arn': 'arn:aws:sns:region:account:topic',

    # Monitoring Parameters
    'threshold': 80,  # Percentage threshold for alerts
}
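As an alternative to hard-coding credentials, the same values can be read from environment variables (with boto3 falling back to its default credential chain, e.g. an IAM role). The variable names below are hypothetical, not ones the project defines:

```python
import os

def load_config():
    """Build CONFIG from environment variables with safe defaults.

    Omitting explicit keys lets boto3 resolve credentials from its
    default chain (env vars, shared config, or an attached IAM role).
    """
    return {
        "region": os.environ.get("AWS_REGION", "us-east-1"),
        "s3_bucket": os.environ.get("MONITOR_S3_BUCKET", ""),
        "sns_topic_arn": os.environ.get("MONITOR_SNS_TOPIC_ARN", ""),
        "threshold": int(os.environ.get("MONITOR_THRESHOLD", "80")),
    }
```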
Security Considerations
- Credentials stored in excluded config.py file
- Template-based configuration for safe sharing
- IAM role usage recommended over access keys
- S3 bucket permissions should follow least-privilege principle
Security & Permissions
Required IAM Permissions
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ssm:SendCommand",
                "ssm:GetCommandInvocation",
                "ssm:DescribeInstanceInformation",
                "s3:PutObject",
                "s3:GetObject",
                "s3:CreateBucket",
                "s3:HeadBucket",
                "sns:Publish"
            ],
            "Resource": "*"
        }
    ]
}
Security Best Practices
- Credential Rotation: Regular AWS key rotation
- Least Privilege: Minimal required permissions only
- Network Security: VPC-based instance isolation
- Data Encryption: S3 server-side encryption enabled
- Access Logging: CloudTrail integration for audit trails
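The encryption practice above can be enforced at upload time by passing server-side encryption parameters to `put_object`. A sketch assuming a hypothetical `build_report_put_kwargs` helper (the project's actual upload code may differ):

```python
import json

def build_report_put_kwargs(bucket, key, report, kms_key_id=None):
    """Assemble S3 put_object kwargs with server-side encryption enabled.

    Uses SSE-KMS when a key ID is supplied, otherwise SSE-S3 (AES256).
    """
    kwargs = {
        "Bucket": bucket,
        "Key": key,
        "Body": json.dumps(report).encode("utf-8"),
        "ContentType": "application/json",
        "ServerSideEncryption": "aws:kms" if kms_key_id else "AES256",
    }
    if kms_key_id:
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs

# Usage (assumes a boto3 s3 client):
# s3.put_object(**build_report_put_kwargs(CONFIG["s3_bucket"], key, report))
```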
Deployment & Operations
Installation Process
- Environment Setup: Install Python 3.8 or later
- Dependency Installation: pip install -r requirements.txt
- Configuration: Copy config_template.py to config.py and customize the values
- AWS Setup: Configure IAM permissions and tag target instances with MonitorDiskSpace=true
- Launch: Run python launch_streamlit.py
Operational Workflows
Daily Operations
- Monitor dashboard for threshold violations
- Review S3-stored historical reports
- Validate SNS alert delivery
- Check instance tag compliance
Maintenance Tasks
- Credential rotation (monthly)
- Threshold adjustment based on usage patterns
- S3 storage cleanup (automated lifecycle policies)
- Performance optimization reviews
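The automated lifecycle cleanup mentioned above can be expressed as a single S3 lifecycle rule; the prefix and retention period below are example values, not the project's settings:

```python
def lifecycle_rule(prefix="reports/", expire_after_days=90):
    """S3 lifecycle configuration that expires old monitoring reports
    under a given key prefix after a retention window."""
    return {
        "Rules": [{
            "ID": "expire-old-reports",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Expiration": {"Days": expire_after_days},
        }]
    }

# Applied once via boto3 (assumes an s3 client):
# s3.put_bucket_lifecycle_configuration(
#     Bucket=CONFIG["s3_bucket"],
#     LifecycleConfiguration=lifecycle_rule())
```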
Troubleshooting Procedures
- Connection Issues: Verify AWS credentials and permissions
- Missing Data: Check SSM agent status on target instances
- Alert Failures: Validate SNS topic configuration and subscriptions
- Performance Issues: Review S3 access patterns and optimize queries
Monitoring & Analytics
Key Metrics Tracked
- Disk Usage Percentage: Per filesystem monitoring
- Available Space: Absolute values in GB/TB
- Usage Trends: Historical growth patterns
- Alert Frequency: Threshold violation rates
- System Performance: Command execution times
Reporting Capabilities
- Real-time Dashboard: Live usage visualization
- Historical Reports: JSON-formatted detailed analysis
- Trend Analysis: Usage pattern identification
- Capacity Planning: Predictive insights for storage expansion
Data Retention
- S3 Storage: Configurable lifecycle policies
- Local Reports: Automatic cleanup after 30 days
- Dashboard Cache: Real-time data with 5-minute refresh
API & Integration
AI Agent Command Interface
# Natural language processing examples
agent.process_command("list instances")
agent.process_command("show disk usage of i-1234567890abcdef0")
agent.process_command("monitor all servers")
agent.process_command("set threshold to 85%")
Programmatic Access
# Direct monitoring integration
from ec2_disk_monitor import EC2DiskMonitor
monitor = EC2DiskMonitor(**CONFIG)
results = monitor.run_monitoring_cycle()
Web API Endpoints
- Streamlit does not expose a public REST API; the dashboard communicates with its server over WebSocket connections
- Real-time data updates are pushed to the browser on each script rerun
- Configuration changes via interactive UI components
Testing & Quality Assurance
Testing Strategy
- Unit Tests: Core function validation
- Integration Tests: AWS service connectivity
- End-to-End Tests: Complete workflow validation
- Performance Tests: Scalability under load
Quality Metrics
- Code Coverage: >90% for critical paths
- Response Time: <30 seconds for SSM commands
- Availability: 99.9% uptime target
- Error Rate: <1% for normal operations
Validation Procedures
- Configuration Validation: Startup credential verification
- Data Validation: Disk usage parsing accuracy
- Alert Validation: SNS delivery confirmation
- UI Validation: Cross-browser compatibility testing
Performance & Scalability
Performance Characteristics
- Instance Capacity: Supports 100+ instances per region
- Concurrent Operations: Parallel SSM command execution
- Data Processing: Real-time analysis with <5 second latency
- Storage Efficiency: Compressed JSON reports in S3
Scalability Considerations
- Horizontal Scaling: Multi-region deployment support
- Vertical Scaling: Configurable timeout and batch sizes
- Auto-scaling: Dynamic threshold adjustment based on patterns
- Load Distribution: Round-robin SSM command scheduling
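The parallel SSM execution noted above can be sketched with a thread pool, which suits the workload since each collection is I/O-bound. `collect_fn` stands in for whatever collects one instance's disk data (e.g. the SSM send/poll cycle); this helper is an illustration, not the project's scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_in_parallel(instance_ids, collect_fn, max_workers=10):
    """Run a per-instance collection callable concurrently.

    Results are returned in the same order as instance_ids,
    so downstream reporting stays deterministic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(collect_fn, instance_ids))
```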
Optimization Strategies
- Caching: Dashboard data caching for improved response times
- Batching: Grouped SSM operations for efficiency
- Compression: S3 storage optimization with gzip
- Indexing: Efficient S3 object naming for quick retrieval
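The compression and indexing strategies above can be combined in two small helpers: a date-partitioned key layout that makes prefix listing cheap, and gzip compression of the JSON body. The key scheme is an example, not the project's actual naming:

```python
import gzip
import json
from datetime import datetime, timezone

def report_key(instance_id, when=None):
    """Date-partitioned S3 key (reports/YYYY/MM/DD/...) so a single
    ListObjects prefix query retrieves one day's reports."""
    when = when or datetime.now(timezone.utc)
    return (f"reports/{when:%Y/%m/%d}/"
            f"{instance_id}-{when:%H%M%S}.json.gz")

def compress_report(report):
    """Gzip a JSON-serializable report before uploading to S3."""
    return gzip.compress(json.dumps(report).encode("utf-8"))
```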
Future Enhancements
Planned Features
- Machine Learning: Predictive capacity planning
- Multi-Cloud: Azure and GCP integration
- Advanced Analytics: Anomaly detection algorithms
- Mobile Interface: Responsive design optimization
- API Gateway: RESTful API for third-party integrations
Extensibility Points
- Custom Analyzers: Pluggable analysis modules
- Alert Channels: Slack, Teams, PagerDuty integration
- Data Sources: Additional system metrics collection
- Visualization: Custom dashboard components
Support & Maintenance
Documentation Resources
- README.md: Quick start and basic usage
- Inline Comments: Detailed code documentation
- Configuration Guide: Setup and customization instructions
- Troubleshooting Guide: Common issues and solutions
Community & Support
- GitHub Issues: Bug reports and feature requests
- Documentation Wiki: Extended usage examples
- Community Forums: User discussions and best practices
- Professional Support: Enterprise consulting available
Built With
- amazon-web-services
- bedrock
- boto3
- ec2
- iam
- pandas
- plotly
- python
- s3
- streamlit