EC2 Disk Monitor with AI Agent - Complete Documentation

Project Overview

The EC2 Disk Monitor is an intelligent infrastructure monitoring solution that combines AWS services with AI-powered analysis to provide real-time disk space monitoring, automated alerting, and natural language interaction capabilities.

Key Value Propositions

  • Proactive Monitoring: Prevents disk space issues before they impact services
  • AI-Powered Insights: Natural language commands and intelligent analysis
  • Multi-Interface Access: Web dashboard, CLI, and chat interfaces
  • Automated Alerting: Email notifications via SNS when thresholds are exceeded
  • Historical Tracking: S3-based data persistence for trend analysis

System Architecture

Architecture Pattern: Event-Driven Monitoring with Real-Time Dashboard

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  User Layer     │     │  Application    │     │  AWS Services   │
│                 │     │  Layer          │     │  Layer          │
│ • Web Dashboard │◄───►│ • AI Agent      │◄───►│ • EC2 Instances │
│ • Chat Interface│     │ • NLP Engine    │     │ • SSM Agent     │
│ • CLI Tools     │     │ • Analysis      │     │ • S3 Storage    │
│                 │     │   Engine        │     │ • SNS Alerts    │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Data Flow Architecture

  1. Discovery: Tag-based EC2 instance identification
  2. Collection: SSM-based remote command execution (df -h)
  3. Analysis: Local AI processing with intelligent recommendations
  4. Storage: S3 persistence for historical data
  5. Alerting: SNS-based email notifications
  6. Visualization: Real-time web dashboard updates

Technical Stack

Core Technologies

  • Language: Python 3.8+
  • Web Framework: Streamlit (Interactive dashboards)
  • AWS SDK: Boto3 (AWS service integration)
  • Data Visualization: Plotly (Interactive charts)
  • Data Processing: Pandas (Data manipulation)

AWS Services Integration

  • EC2: Target instance management and discovery
  • SSM (Systems Manager): Remote command execution
  • S3: Historical data storage and report archiving
  • SNS: Email alert delivery system
  • IAM: Security and access management

Project Structure & Components

File Organization

ec2-disk-monitor/
├── 📊 streamlit_app.py          # Web dashboard (16KB)
├── 🤖 ai_agent.py                # AI chat interface (18KB)
├── ⚙️ ec2_disk_monitor.py        # Core monitoring engine (17KB)
├── 🚀 launch_streamlit.py        # Streamlit launcher (2KB)
├── 📝 config_template.py         # Configuration template
├── 📦 requirements.txt           # Python dependencies
├── 📖 README.md                  # Project documentation
├── 🚫 .gitignore                 # Git exclusion rules
└── 📜 LICENSE                    # MIT license

Component Details

1. Core Monitoring Engine (ec2_disk_monitor.py)

  • Purpose: Central monitoring logic and AWS service orchestration
  • Key Classes: EC2DiskMonitor
  • Responsibilities:
    • AWS credential management and service client initialization
    • EC2 instance discovery via tag filtering
    • SSM command execution with timeout handling
    • Disk usage data parsing and validation
    • AI-powered analysis with fallback mechanisms
    • S3 data persistence and SNS alert delivery

2. AI Agent (ai_agent.py)

  • Purpose: Natural language processing and conversational interface
  • Key Classes: EC2MonitoringAgent
  • Capabilities:
    • Natural language command interpretation
    • Context-aware response generation
    • Multi-format output (detailed/summary views)
    • Error handling and user guidance
    • Integration with core monitoring functions

3. Web Dashboard (streamlit_app.py)

  • Purpose: Interactive web-based monitoring interface
  • Features:
    • Real-time instance status overview
    • Interactive charts and visualizations
    • Configuration management (threshold adjustment)
    • Integrated AI chat interface
    • Historical report viewing
    • SNS alert testing capabilities

Core Workflows

1. Instance Discovery Workflow

# Tag-based filtering (ec2_client is a boto3 EC2 client)
response = ec2_client.describe_instances(
    Filters=[
        {'Name': 'tag:MonitorDiskSpace', 'Values': ['true']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ]
)

  • Scans all EC2 instances in the configured region
  • Filters by MonitorDiskSpace=true tag
  • Validates instance state (running only)
  • Returns instance metadata (ID, type, tags)
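
The discovery steps above can be sketched as a small function. This is a minimal illustration, not the project's actual implementation: the name discover_instances and the shape of the returned dictionaries are assumptions. It takes a boto3 EC2 client as an argument and uses a paginator, since describe_instances returns results in pages.

```python
def discover_instances(ec2_client):
    """Return metadata for running instances tagged MonitorDiskSpace=true.

    `ec2_client` is expected to be a boto3 EC2 client. The function name
    and the returned dict layout are illustrative assumptions.
    """
    filters = [
        {"Name": "tag:MonitorDiskSpace", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
    instances = []
    # describe_instances is paginated; the paginator follows NextToken for us.
    paginator = ec2_client.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=filters):
        for reservation in page.get("Reservations", []):
            for inst in reservation.get("Instances", []):
                # Flatten the [{"Key": ..., "Value": ...}] tag list into a dict.
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                instances.append({
                    "id": inst["InstanceId"],
                    "type": inst.get("InstanceType"),
                    "name": tags.get("Name", ""),
                    "tags": tags,
                })
    return instances
```

A typical call would be discover_instances(boto3.client("ec2", region_name="us-east-1")). Because both filters are applied by the EC2 API itself, only matching instances are returned to the caller.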

2. Data Collection Workflow

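# SSM command execution (ssm_client is a boto3 SSM client)
ssm_client.send_command(
    InstanceIds=[instance_id],
    DocumentName='AWS-RunShellScript',
    Parameters={'commands': ['df -h']},
    TimeoutSeconds=30  # delivery timeout: how long SSM waits for the command to start
)

  • Executes df -h command via SSM
  • Applies a 30-second delivery timeout (TimeoutSeconds); the run time of the command itself is bounded separately by the executionTimeout parameter of AWS-RunShellScript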
  • Parses filesystem data with validation
  • Handles command failures gracefully
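
The parsing and validation step might look like the sketch below. The function name parse_df_output and the field names are assumptions made for illustration. Lines that do not have the six expected columns, or whose Use% column is not a percentage, are skipped rather than guessed at.

```python
def parse_df_output(output):
    """Parse the text printed by `df -h` into a list of per-filesystem dicts.

    Illustrative sketch: the field names are assumptions, not the
    project's actual schema.
    """
    filesystems = []
    lines = output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        # Split into at most 6 fields so a mount point with spaces stays whole.
        parts = line.split(None, 5)
        if len(parts) != 6 or not parts[4].endswith("%"):
            continue  # wrapped or malformed line: skip it
        try:
            use_percent = int(parts[4].rstrip("%"))
        except ValueError:
            continue
        filesystems.append({
            "filesystem": parts[0],
            "size": parts[1],
            "used": parts[2],
            "available": parts[3],
            "use_percent": use_percent,
            "mount": parts[5],
        })
    return filesystems
```

On systems where df wraps long device names onto a second line, running df -hP instead keeps each filesystem on a single line, which makes this kind of column-based parsing more reliable.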

3. Analysis Workflow

# Intelligent severity assessment (checked from most to least severe)
if usage >= 90:
    severity = 'CRITICAL'
elif usage >= 80:
    severity = 'HIGH'
elif usage >= 60:
    severity = 'MEDIUM'
else:
    severity = 'LOW'

  • Analyzes disk usage patterns
  • Generates severity classifications
  • Provides actionable recommendations
  • Creates structured analysis reports
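
To make the idea of a structured analysis report concrete, here is a minimal sketch that applies the severity bands above to every filesystem of one instance. The names classify and analyze, the report layout, and the recommendation texts are all hypothetical; the real project may structure its reports differently.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Hypothetical recommendation texts, for illustration only.
RECOMMENDATIONS = {
    "CRITICAL": "Free space or expand the volume immediately.",
    "HIGH": "Plan cleanup or expansion soon.",
    "MEDIUM": "Keep an eye on growth.",
    "LOW": "No action needed.",
}


def classify(usage):
    """Map a usage percentage to a severity label (90/80/60 bands)."""
    if usage >= 90:
        return "CRITICAL"
    if usage >= 80:
        return "HIGH"
    if usage >= 60:
        return "MEDIUM"
    return "LOW"


def analyze(instance_id, filesystems):
    """Build a report dict from items with 'mount' and 'use_percent' keys."""
    findings = []
    for fs in filesystems:
        severity = classify(fs["use_percent"])
        findings.append({
            "mount": fs["mount"],
            "use_percent": fs["use_percent"],
            "severity": severity,
            "recommendation": RECOMMENDATIONS[severity],
        })
    # The instance-level severity is the worst severity of any filesystem.
    worst = max(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]),
                default=None)
    return {
        "instance_id": instance_id,
        "max_usage": max((f["use_percent"] for f in findings), default=None),
        "severity": worst["severity"] if worst else None,
        "findings": findings,
    }
```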

4. Alert Workflow

# Threshold-based alerting
if max_usage > threshold:
    sns_client.publish(
        TopicArn=sns_topic_arn,
        Subject=f"⚠️ EC2 Disk Alert: {instance_id}",
        Message=formatted_alert_message
    )

  • Evaluates usage against configurable thresholds
  • Formats comprehensive alert messages
  • Delivers via SNS to email subscribers
  • Includes dashboard links and recommendations
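
The message formatting step could be sketched as follows. The function names, the message layout, and the dashboard_url parameter are assumptions for illustration. The subject is kept plain and truncated because SNS limits subjects to fewer than 100 characters.

```python
def build_alert(instance_id, filesystems, threshold, dashboard_url=None):
    """Return (subject, message) for filesystems above threshold, else None.

    `filesystems` items need 'mount', 'use_percent' and 'available' keys.
    All names here are illustrative assumptions.
    """
    breaching = [fs for fs in filesystems if fs["use_percent"] > threshold]
    if not breaching:
        return None
    subject = f"EC2 Disk Alert: {instance_id}"[:99]  # SNS subject length limit
    lines = [f"Instance {instance_id} exceeded the {threshold}% disk threshold:", ""]
    # List the fullest filesystems first.
    for fs in sorted(breaching, key=lambda f: f["use_percent"], reverse=True):
        lines.append(
            f"  {fs['mount']}: {fs['use_percent']}% used ({fs['available']} free)"
        )
    if dashboard_url:
        lines += ["", f"Dashboard: {dashboard_url}"]
    return subject, "\n".join(lines)


def send_alert(sns_client, topic_arn, alert):
    """Publish an alert built by build_alert; return True if one was sent."""
    if alert is None:
        return False
    subject, message = alert
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)
    return True
```

Keeping the formatting separate from the publish call means the message text can be unit-tested without touching AWS.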

Configuration Management

Configuration Structure (config.py)

CONFIG = {
    # AWS Authentication
    'region': 'us-east-1',
    'aws_access_key_id': 'YOUR_ACCESS_KEY',
    'aws_secret_access_key': 'YOUR_SECRET_KEY',

    # Storage Configuration
    's3_bucket': 'your-bucket-name',

    # Alerting Configuration
    'sns_topic_arn': 'arn:aws:sns:region:account:topic',

    # Monitoring Parameters
    'threshold': 80,  # Percentage threshold for alerts
}

Security Considerations

  • Credentials are stored in config.py, which is excluded from version control via .gitignore
  • Template-based configuration for safe sharing
  • IAM role usage recommended over access keys
  • S3 bucket permissions should follow least-privilege principle
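
One way to honor the recommendation to prefer IAM roles is to pass access keys to boto3 only when real values are configured. The sketch below is an assumption about how that could be done, not the project's code: session_kwargs is a hypothetical helper, and it treats the template placeholders as "no keys provided". When keys are omitted, boto3 falls back to its default credential chain, which includes the instance profile on EC2.

```python
# Placeholder values from the configuration template count as "not set".
PLACEHOLDERS = {"", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"}


def session_kwargs(config):
    """Build keyword arguments for boto3.Session from the CONFIG dict.

    Hypothetical helper: static keys are included only if both are set to
    non-placeholder values; otherwise boto3's default chain is used.
    """
    kwargs = {}
    if config.get("region"):
        kwargs["region_name"] = config["region"]
    key = config.get("aws_access_key_id") or ""
    secret = config.get("aws_secret_access_key") or ""
    if key not in PLACEHOLDERS and secret not in PLACEHOLDERS:
        kwargs["aws_access_key_id"] = key
        kwargs["aws_secret_access_key"] = secret
    return kwargs
```

Usage would be along the lines of session = boto3.Session(**session_kwargs(CONFIG)), followed by session.client("ec2") and so on.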

Security & Permissions

Required IAM Permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ssm:SendCommand",
        "ssm:GetCommandInvocation",
        "ssm:DescribeInstanceInformation",
        "s3:PutObject",
        "s3:GetObject",
        "s3:CreateBucket",
        "s3:ListBucket",
        "sns:Publish"
      ],
      "Resource": "*"
    }
  ]
}

Security Best Practices

  • Credential Rotation: Regular AWS key rotation
  • Least Privilege: Minimal required permissions only
  • Network Security: VPC-based instance isolation
  • Data Encryption: S3 server-side encryption enabled
  • Access Logging: CloudTrail integration for audit trails
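
As an illustration of the least-privilege item, the policy could be scoped per resource instead of using "Resource": "*" for every action. The sketch below generates such a policy document. It rests on several assumptions that should be checked against the AWS Service Authorization Reference: that the bucket already exists (so s3:CreateBucket is left out), that the three read-only EC2/SSM calls cannot be scoped to specific resources, and that the function name and parameters are hypothetical.

```python
import json


def build_policy(region, account_id, bucket, topic_arn):
    """Return a resource-scoped IAM policy document as a dict (a sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumption: these calls do not support resource-level scoping.
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstances",
                    "ssm:DescribeInstanceInformation",
                    "ssm:GetCommandInvocation",
                ],
                "Resource": "*",
            },
            {
                # SendCommand is limited to one document and to instances
                # in this account and region.
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": [
                    f"arn:aws:ec2:{region}:{account_id}:instance/*",
                    f"arn:aws:ssm:{region}::document/AWS-RunShellScript",
                ],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {"Effect": "Allow", "Action": "sns:Publish", "Resource": topic_arn},
        ],
    }


def policy_json(region, account_id, bucket, topic_arn):
    """Render the policy as indented JSON, ready to paste into IAM."""
    return json.dumps(build_policy(region, account_id, bucket, topic_arn), indent=2)
```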

Deployment & Operations

Installation Process

  1. Environment Setup: Python 3.8+ installation
  2. Dependency Installation: pip install -r requirements.txt
  3. Configuration: Copy config_template.py to config.py and fill in your values
  4. AWS Setup: Configure IAM permissions and tag instances
  5. Launch: Execute python launch_streamlit.py

Operational Workflows

Daily Operations

  • Monitor dashboard for threshold violations
  • Review S3-stored historical reports
  • Validate SNS alert delivery
  • Check instance tag compliance

Maintenance Tasks

  • Credential rotation (monthly)
  • Threshold adjustment based on usage patterns
  • S3 storage cleanup (automated lifecycle policies)
  • Performance optimization reviews

Troubleshooting Procedures

  • Connection Issues: Verify AWS credentials and permissions
  • Missing Data: Check SSM agent status on target instances
  • Alert Failures: Validate SNS topic configuration and subscriptions
  • Performance Issues: Review S3 access patterns and optimize queries
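
For the "Missing Data" case, the SSM agent status of the monitored instances can be checked programmatically. The sketch below is a hypothetical helper that takes a boto3 SSM client and reports every instance whose PingStatus is not Online. The label "NotRegistered" is this sketch's own marker for instances SSM does not know about at all; it is not an AWS value.

```python
def find_unreachable(ssm_client, instance_ids, chunk_size=50):
    """Return {instance_id: status} for instances SSM cannot currently reach.

    Hypothetical helper. Instances missing from the SSM inventory keep the
    local placeholder status "NotRegistered".
    """
    ids = list(instance_ids)
    statuses = {iid: "NotRegistered" for iid in ids}
    paginator = ssm_client.get_paginator("describe_instance_information")
    # Query in chunks to stay well within the filter value limit.
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        pages = paginator.paginate(
            Filters=[{"Key": "InstanceIds", "Values": chunk}]
        )
        for page in pages:
            for info in page.get("InstanceInformationList", []):
                statuses[info["InstanceId"]] = info.get("PingStatus", "Unknown")
    return {iid: status for iid, status in statuses.items() if status != "Online"}
```

An instance reported as NotRegistered usually points to a missing agent, a missing instance profile, or no network path to the SSM endpoints, while ConnectionLost suggests the agent was registered earlier but has stopped reporting.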

Monitoring & Analytics

Key Metrics Tracked

  • Disk Usage Percentage: Per filesystem monitoring
  • Available Space: Absolute values in GB/TB
  • Usage Trends: Historical growth patterns
  • Alert Frequency: Threshold violation rates
  • System Performance: Command execution times

Reporting Capabilities

  • Real-time Dashboard: Live usage visualization
  • Historical Reports: JSON-formatted detailed analysis
  • Trend Analysis: Usage pattern identification
  • Capacity Planning: Predictive insights for storage expansion
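
A very simple form of the capacity planning insight is a straight-line projection over historical samples. The sketch below fits a least-squares line to (day, usage percent) pairs and estimates how many days remain until the disk is full. It is only an illustration under the assumption of roughly linear growth; the function name and input format are hypothetical, and the project may use a different method.

```python
def days_until_full(samples, full_at=100.0):
    """Estimate days from the last sample until usage reaches `full_at`.

    `samples` is a list of (day_number, use_percent) pairs. Returns None
    when there are too few samples or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples taken on the same day
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # not growing, so no projected fill date
    intercept = mean_y - slope * mean_x
    day_full = (full_at - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, day_full - last_day)
```

For example, four daily samples of 50, 52, 54 and 56 percent give a growth rate of 2 points per day, so the projection reaches 100 percent 22 days after the last sample.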

Data Retention

  • S3 Storage: Configurable lifecycle policies
  • Local Reports: Automatic cleanup after 30 days
  • Dashboard Cache: Real-time data with 5-minute refresh

API & Integration

AI Agent Command Interface

# Natural language processing examples
# (agent is an EC2MonitoringAgent instance from ai_agent.py)
agent.process_command("list instances")
agent.process_command("show disk usage of i-1234567890abcdef0")
agent.process_command("monitor all servers")
agent.process_command("set threshold to 85%")
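
How such commands might be turned into structured intents can be illustrated with a few regular expressions. This is a hypothetical sketch, not the logic inside EC2MonitoringAgent: the intent names, the patterns, and the parse_command function are all assumptions.

```python
import re

# EC2 instance IDs are "i-" followed by 8 or 17 hex characters.
INSTANCE_ID = r"i-[0-9a-f]{8,17}"

# Patterns are tried in order; the first match wins.
PATTERNS = [
    ("set_threshold",
     re.compile(r"\bset\s+threshold\s+to\s+(?P<threshold>\d{1,3})\s*%?", re.I)),
    ("show_usage",
     re.compile(rf"\b(?:show|check)\b.*\bdisk\b.*?(?P<instance_id>{INSTANCE_ID})", re.I)),
    ("list_instances", re.compile(r"\blist\b.*\binstances\b", re.I)),
    ("monitor_all", re.compile(r"\bmonitor\b.*\ball\b", re.I)),
]

UNKNOWN = {"intent": "unknown", "params": {}}


def parse_command(text):
    """Map free text to {'intent': ..., 'params': {...}} (illustrative)."""
    for intent, pattern in PATTERNS:
        match = pattern.search(text)
        if not match:
            continue
        params = {k: v for k, v in match.groupdict().items() if v is not None}
        if "threshold" in params:
            value = int(params["threshold"])
            if not 0 < value <= 100:
                return dict(UNKNOWN)  # reject impossible percentages
            params["threshold"] = value
        return {"intent": intent, "params": params}
    return dict(UNKNOWN)
```

Unrecognized input falls through to the unknown intent, which is where an agent would typically respond with usage guidance.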

Programmatic Access

# Direct monitoring integration
from config import CONFIG  # the CONFIG dict defined in config.py
from ec2_disk_monitor import EC2DiskMonitor

monitor = EC2DiskMonitor(**CONFIG)
results = monitor.run_monitoring_cycle()

Web API Endpoints

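  • Streamlit does not expose a REST API for the dashboard; the browser communicates with the Streamlit server over a WebSocket connection
  • Real-time data updates are delivered over that WebSocket connection
  • Configuration changes are made through interactive UI components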

Testing & Quality Assurance

Testing Strategy

  • Unit Tests: Core function validation
  • Integration Tests: AWS service connectivity
  • End-to-End Tests: Complete workflow validation
  • Performance Tests: Scalability under load

Quality Metrics

  • Code Coverage: >90% for critical paths
  • Response Time: <30 seconds for SSM commands
  • Availability: 99.9% uptime target
  • Error Rate: <1% for normal operations

Validation Procedures

  • Configuration Validation: Startup credential verification
  • Data Validation: Disk usage parsing accuracy
  • Alert Validation: SNS delivery confirmation
  • UI Validation: Cross-browser compatibility testing

Performance & Scalability

Performance Characteristics

  • Instance Capacity: Supports 100+ instances per region
  • Concurrent Operations: Parallel SSM command execution
  • Data Processing: Real-time analysis with <5 second latency
  • Storage Efficiency: Compressed JSON reports in S3
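
Parallel collection across many instances could be sketched with a thread pool, since each SSM round trip is mostly waiting on the network. The function below is an assumption about how this might be organized: collect_all is a hypothetical name, and collect_one stands for whatever callable performs the per-instance SSM command and parsing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_all(instance_ids, collect_one, max_workers=10):
    """Run collect_one(instance_id) in parallel for every instance.

    Returns (results, errors): two dicts keyed by instance ID. A failure
    on one instance is recorded and does not stop the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_one, iid): iid for iid in instance_ids}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:  # keep going; report per-instance errors
                errors[iid] = str(exc)
    return results, errors
```

Low-level boto3 clients are documented as thread safe, so a single SSM client can generally be shared by the worker threads. The max_workers value is a tuning knob and should be kept modest to avoid API throttling.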

Scalability Considerations

  • Horizontal Scaling: Multi-region deployment support
  • Vertical Scaling: Configurable timeout and batch sizes
  • Auto-scaling: Dynamic threshold adjustment based on patterns
  • Load Distribution: Round-robin SSM command scheduling

Optimization Strategies

  • Caching: Dashboard data caching for improved response times
  • Batching: Grouped SSM operations for efficiency
  • Compression: S3 storage optimization with gzip
  • Indexing: Efficient S3 object naming for quick retrieval
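
The compression and object naming points can be illustrated together. In the sketch below, reports are serialized to JSON, gzip-compressed, and stored under a date-partitioned key so that one day or one instance can be listed by prefix. The key layout and function names are assumptions, not the project's actual scheme.

```python
import gzip
import json
from datetime import datetime, timezone


def report_key(instance_id, when=None, prefix="reports"):
    """Build a key like reports/2024/01/15/i-abc/093005.json.gz (assumed layout)."""
    when = when or datetime.now(timezone.utc)
    return f"{prefix}/{when:%Y/%m/%d}/{instance_id}/{when:%H%M%S}.json.gz"


def compress_report(report):
    """Serialize a report dict to gzip-compressed JSON bytes."""
    return gzip.compress(json.dumps(report, sort_keys=True).encode("utf-8"))


def decompress_report(blob):
    """Inverse of compress_report."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Uploading would then be a call such as s3_client.put_object(Bucket=bucket, Key=report_key(instance_id), Body=compress_report(report), ContentType="application/json", ContentEncoding="gzip"). Putting the date before the instance ID favors "everything from one day" listings; swapping the two would favor per-instance history instead.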

Future Enhancements

Planned Features

  • Machine Learning: Predictive capacity planning
  • Multi-Cloud: Azure and GCP integration
  • Advanced Analytics: Anomaly detection algorithms
  • Mobile Interface: Responsive design optimization
  • API Gateway: RESTful API for third-party integrations

Extensibility Points

  • Custom Analyzers: Pluggable analysis modules
  • Alert Channels: Slack, Teams, PagerDuty integration
  • Data Sources: Additional system metrics collection
  • Visualization: Custom dashboard components

Support & Maintenance

Documentation Resources

  • README.md: Quick start and basic usage
  • Inline Comments: Detailed code documentation
  • Configuration Guide: Setup and customization instructions
  • Troubleshooting Guide: Common issues and solutions

Community & Support

  • GitHub Issues: Bug reports and feature requests
  • Documentation Wiki: Extended usage examples
  • Community Forums: User discussions and best practices
  • Professional Support: Enterprise consulting available
