Architecture Documentation

System Overview

The AWS Error Detection and Ranking system is an intelligent AI-powered solution that automatically monitors, analyzes, and prioritizes CloudWatch error logs. Built for the AWS Agent Hackathon, it leverages Amazon Bedrock, AgentCore, and various AWS services to transform error management from reactive firefighting to proactive, AI-driven prioritization.

Project Background

Inspiration

Internally, our company faced a critical operational challenge: monitoring thousands of CloudWatch logs across multiple services and environments. Development teams were overwhelmed by the sheer volume of error logs, spending hours manually sifting through CloudWatch to identify which errors required immediate attention.

The core problem was the lack of context and prioritization. Not all errors are created equal:

  • Some errors indicate critical internal bugs requiring immediate developer action
  • Others represent external service issues completely outside our control
  • Many are transient issues that self-resolve without intervention

Without intelligent context and priority levels, teams wasted valuable time investigating non-actionable errors while critical issues sometimes went unnoticed in the noise. We needed a solution that could automatically separate the signal from the noise, providing developers with actionable intelligence rather than raw logs.

What It Does

Our solution creates an intelligent, automated error management pipeline:

  1. Automatic Detection: Continuously monitors CloudWatch log groups across your AWS infrastructure
  2. Intelligent Analysis: Uses Amazon Bedrock's Claude 3 Haiku via AgentCore to analyze each error with sophisticated reasoning
  3. Context Addition: Enriches errors with:
    • Severity assessment (low/medium/high)
    • Priority ranking (1-5 scale)
    • Actionability flag (fixable by developers?)
    • Category classification (API, Database, Lambda, Network, etc.)
    • Human-readable explanation of root cause
  4. Knowledge Base Learning: Builds institutional knowledge by:
    • Storing error patterns and resolutions
    • Matching new errors against historical data
    • Tracking resolution success rates
    • Continuously improving recommendations
  5. Smart Filtering: Separates critical internal bugs from external dependencies, allowing teams to focus on what they can actually fix

The result: Developers receive a prioritized, contextualized list of errors with clear guidance on what needs immediate attention versus what can be safely deprioritized.
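
To make the enrichment fields above concrete, here is a minimal sketch of how an analyzed error could be represented and ordered for display. The class name `AnalyzedError`, the function `rank_errors`, and the convention that priority 1 is the most urgent are assumptions made for illustration, not taken from the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyzedError:
    """Hypothetical container for the context fields added by the analysis step."""
    error_id: str
    severity: str       # "low", "medium", or "high"
    priority: int       # 1-5; this sketch ASSUMES 1 is the most urgent
    actionable: bool    # True if developers can fix it
    category: str       # e.g. "API", "Database", "Lambda", "Network"
    explanation: str    # human-readable root-cause summary


# Lower value sorts first; unknown severities sort last.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def rank_errors(errors):
    """Return actionable errors first, then by priority, then by severity."""
    return sorted(
        errors,
        key=lambda e: (
            not e.actionable,                      # False sorts before True
            e.priority,
            SEVERITY_ORDER.get(e.severity, len(SEVERITY_ORDER)),
        ),
    )
```

Sorting on the actionability flag first is one way to express the "smart filtering" idea: external or transient issues stay visible but fall below anything the team can actually fix.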

How We Built It

Our architecture leverages multiple AWS services in a serverless, event-driven design:

Phase 1: Infrastructure Foundation

  • Built CDK infrastructure for the complete error detection pipeline
  • Implemented Step Functions state machine to orchestrate log retrieval
  • Created Lambda functions for log processing and storage
  • Deployed DynamoDB tables for error logs, knowledge base, and session management

Phase 2: AI Agent Development

  • Developed custom MCP (Model Context Protocol) tools for error analysis
  • Integrated Strands framework for AgentCore agent orchestration
  • Connected Claude 3 Haiku via Amazon Bedrock for intelligent reasoning
  • Implemented 5 specialized tools:
    • AWS Health Checker for service status correlation
    • Knowledge Base Search for historical pattern matching
    • CloudWatch Metrics analyzer for performance trends
    • Error Knowledge Storage for learning
    • Resolution Tracking for success rate monitoring

Phase 3: Integration & Security

  • Containerized AgentCore runtime on ECS Fargate
  • Implemented Cognito JWT authentication with token caching
  • Built HTTPS integration between Lambda and AgentCore with automatic fallback
  • Added DynamoDB streams to trigger real-time error analysis

Phase 4: User Interface

  • Created React frontend for error visualization
  • Built Flask API backend for data access
  • Implemented filtering and sorting capabilities
  • Added real-time updates as errors are analyzed

Technology Stack:

  • AWS CDK for infrastructure as code
  • Python for Lambda functions and agent logic
  • Strands framework for agent orchestration
  • Amazon Bedrock (Claude 3 Haiku) for AI reasoning
  • Docker for AgentCore containerization
  • React + Flask for UI/API
  • DynamoDB for storage
  • Step Functions for workflow orchestration

Challenges We Ran Into

  1. Incomplete CDK Resources: During development, the AWS CDK constructs for AgentCore were incomplete. They have since been implemented, but when we worked on the project the resources we needed were not yet available. This is understandable since AgentCore is in preview.
  2. Authentication and Authorization: Implementing secure HTTPS communication between Lambda and AgentCore proved challenging.
  3. Knowledge Base Cost Effectiveness: The early architecture used OpenSearch for the knowledge base, but the minimum OpenSearch cluster cost was too high for a proof of concept, so we replaced it with DynamoDB. We still plan to re-introduce OpenSearch Serverless in production.

Accomplishments We're Proud Of

  1. Completion of the Solution: Despite the challenges, we delivered a complete, working solution from infrastructure to UI. The system processes CloudWatch logs automatically, analyzes errors with AI-powered reasoning, stores knowledge for continuous learning, and presents results in an intuitive interface.
  2. Multiple Customer Projects: Our team balanced multiple customer projects while building this hackathon solution. Everyone contributed their expertise, from infrastructure engineers building the CDK implementation, to AI/ML specialists designing the agent architecture, to full-stack developers creating the UI.
  3. Production Ready Solution: We didn't just build a demo. We built the system with production concerns in mind: security through Cognito JWT authentication, cost optimization, built-in monitoring and observability, and a scalable serverless architecture.
  4. Self-learning Knowledge Base: We are particularly proud of the self-learning knowledge base. It improves recommendations over time, tracks resolution success rates, builds institutional knowledge without manual curation, and provides data-driven guidance for error resolution.

What We Learned

Technical Learnings:

  1. Strands Framework Mastery: Gained deep understanding of building agents with the Strands framework, including proper tool definition and registration, system prompt engineering for consistent behavior and multi-tool orchestration strategies.
  2. MCP Tool Development: Learned best practices for creating Model Context Protocol tools.
  3. Knowledge Base Design: Understood the nuances of error signature generation and normalization, similarity scoring algorithms, pattern extraction from unstructured logs and balancing storage vs. query performance.
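
As an illustration of error signature generation, normalization, and similarity scoring, the sketch below masks volatile values (IDs, timestamps, numbers), hashes the result into a signature, and compares messages with Jaccard similarity over tokens. The specific regular expressions, the SHA-256 truncation, and the choice of Jaccard are assumptions for this example; the project's actual algorithms are not described in this document.

```python
import hashlib
import re

# Order matters: mask the most specific patterns before plain digits.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?\b"), "<ts>"),
    (re.compile(r"\b0x[0-9a-f]+\b", re.I), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def normalize(message):
    """Replace volatile values with placeholders, lowercase, collapse whitespace."""
    text = message
    for pattern, placeholder in _MASKS:
        text = pattern.sub(placeholder, text)
    return " ".join(text.lower().split())


def signature(message):
    """Stable short identifier for all messages that normalize to the same text."""
    return hashlib.sha256(normalize(message).encode("utf-8")).hexdigest()[:16]


def similarity(a, b):
    """Jaccard similarity between the token sets of two normalized messages."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

With this approach, two timeouts that differ only in duration and request ID share one signature, so they can be stored as a single pattern, while the similarity score gives a fallback match for messages that are close but not identical.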

Process Learnings:

  1. MVP First: Starting with a minimum viable product and iterating proved more effective than trying to build everything at once.
  2. Fallback Planning: Having a backup path (direct Bedrock calls) for when the primary approach (AgentCore) fails keeps the system resilient.
  3. Cost Awareness: Early cost analysis (discovering the OpenSearch expense) saved significant budget.
  4. Documentation Gaps: When official docs are incomplete, community collaboration and experimentation become essential.

What's Next for Error Detection and Context Addition

This is a fully functional proof of concept, but several enhancements will make it production-ready for enterprise use:

Short-Term Improvements (1-3 months):

  1. OpenSearch Serverless Integration: Replace the DynamoDB knowledge base with OpenSearch Serverless, which would enable advanced similarity search with vector embeddings and full-text search across error histories. We estimate this could improve pattern matching accuracy by 40-50%.
  2. Human-in-the-Loop Tagging: Add a UI for developers to confirm or correct AI categorizations, creating a feedback loop that improves accuracy.
  3. Real-Time Streaming: Replace the scheduled state machine with a CloudWatch Logs subscription filter, which would reduce detection latency from minutes to seconds.
  4. Slack Notifications: Add instant Slack/PagerDuty notifications for critical errors.
  5. Enhanced UI Features: Add error trend visualization with charts, filtering by service, severity, and time range, and a historical error comparison view.
  6. Integration with CI/CD Pipelines: Deploy the stack through CI/CD pipelines to make it easier to maintain.

Medium-Term Enhancements (3-6 months):

  1. Multi-Account Support: Cross-account CloudWatch log access via IAM roles, with a centralized error dashboard across AWS accounts and account-level filtering and permissions.
  2. Custom ML Models: Train classification models on resolved error data, with specialized models for different error categories.
  3. Advanced Analytics: Error trend prediction using time-series analysis, anomaly detection for unusual error patterns, impact analysis (errors affecting multiple services), and cost correlation (errors causing resource waste).

Long-Term Vision (6-12 months):

  1. Proactive Error Prevention: Analyze code changes for potential error patterns and automatically suggest error handling improvements.
  2. Root Cause Analysis: Trace errors across distributed systems, correlate errors with deployments and config changes, and suggest architectural improvements.
  3. Self-Healing Capabilities: Automatic remediation for known error patterns and configuration rollback on error spikes.

The foundation is solid, and the path forward is clear. This project demonstrates the power of combining AWS services, AI agents, and knowledge management to solve real developer pain points.

Core Components

1. Log Ingestion Layer

EventBridge Scheduler

  • Purpose: Triggers periodic log generation for testing and demonstration
  • Frequency: Configurable schedule (default: every 5 minutes)
  • Trigger: Invokes State Machine for orchestrated processing

CloudWatch Log Groups

  • Purpose: Source of all error logs
  • Integration: Native AWS service integration
  • Access Pattern: Read-only via AWS SDK
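
A minimal sketch of the read-only access pattern, assuming a boto3 CloudWatch Logs client is passed in and that errors are selected with the simple filter pattern "ERROR". The function name `fetch_recent_errors`, the 5-minute lookback (chosen to match the default schedule), and the filter pattern are assumptions for illustration.

```python
import time


def fetch_recent_errors(logs_client, log_group, lookback_minutes=5, pattern="ERROR"):
    """Collect matching log events from one log group using read-only calls.

    logs_client is expected to behave like boto3.client("logs").
    """
    start_ms = int((time.time() - lookback_minutes * 60) * 1000)
    paginator = logs_client.get_paginator("filter_log_events")
    errors = []
    for page in paginator.paginate(
        logGroupName=log_group,
        startTime=start_ms,        # epoch milliseconds
        filterPattern=pattern,
    ):
        for event in page.get("events", []):
            errors.append({
                "log_group": log_group,
                "log_stream": event["logStreamName"],
                "timestamp": event["timestamp"],
                "message": event["message"].rstrip(),
            })
    return errors
```

In a Lambda function this would be called as `fetch_recent_errors(boto3.client("logs"), "/aws/lambda/my-function")`, where the log group name is a placeholder. Passing the client in rather than creating it inside the function keeps the sketch testable without AWS credentials.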

2. Processing Orchestration Layer

Step Functions State Machine

Orchestrates the entire log processing workflow with the following states:

  1. Get Log Groups → Retrieves all available CloudWatch log groups
  2. Filter Log Streams → Identifies recent log streams with activity
  3. Process Logs → Extracts error entries and stores them in DynamoDB
  4. Parallel Processing → Handles multiple log groups concurrently

State Machine Flow:

Start
  ↓
GetLogGroups (Lambda)
  ↓
FilterStreams (Lambda) [Map State - Parallel]
  ↓
ProcessLogs (Lambda) [Map State - Parallel]
  ↓
Success/Failure
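
The flow above could be expressed in Amazon States Language roughly as follows. This is a simplified sketch, not the project's actual definition: it uses a single Map state that runs FilterStreams and then ProcessLogs for each log group, and the state name `PerLogGroup`, the `$.log_groups` input path, and the concurrency limit are assumptions.

```python
import json


def build_definition(get_groups_arn, filter_streams_arn, process_logs_arn):
    """Build a state machine definition as a plain dict (serializable to ASL JSON)."""

    def lambda_task(function_arn, next_state=None):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": function_arn, "Payload.$": "$"},
            "OutputPath": "$.Payload",   # pass only the Lambda result downstream
        }
        if next_state:
            state["Next"] = next_state
        else:
            state["End"] = True
        return state

    return {
        "StartAt": "GetLogGroups",
        "States": {
            "GetLogGroups": lambda_task(get_groups_arn, "PerLogGroup"),
            "PerLogGroup": {
                "Type": "Map",
                "ItemsPath": "$.log_groups",   # ASSUMED key in GetLogGroups output
                "MaxConcurrency": 5,           # ASSUMED limit
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "FilterStreams",
                    "States": {
                        "FilterStreams": lambda_task(filter_streams_arn, "ProcessLogs"),
                        "ProcessLogs": lambda_task(process_logs_arn),
                    },
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; print the JSON that would be supplied as the definition.
    print(json.dumps(build_definition("arn:get", "arn:filter", "arn:process"), indent=2))
```

The Map state is what provides the concurrent handling of multiple log groups; capping `MaxConcurrency` is one way to stay within CloudWatch Logs API rate limits.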

3. Data Storage Layer

DynamoDB Tables

Error Logs Table

  • Primary Key: error_id (timestamp-based UUID)
  • Attributes:
    • log_group: Source log group name
    • log_stream: Source log stream name
    • timestamp: Error occurrence time
    • message: Raw error message
    • severity: AI-assigned severity level
    • analysis: Bedrock-generated context
    • priority: Criticality ranking
    • category: Error classification
    • actionable: Boolean flag for developer action needed
  • Streams: Enabled for real-time error analysis triggering
  • TTL: Optional retention policy

Session Table

  • Purpose: Manages AgentCore session state
  • Primary Key: session_id
  • Attributes:
    • jwt_token: Cached Cognito JWT
    • expires_at: Token expiration timestamp
    • created_at: Session creation time

4. AI Analysis Layer

Lambda: Error Analysis Trigger

  • Trigger: DynamoDB Stream from Error Logs Table
  • Batch Size: 10 records per invocation
  • Primary Function: Orchestrates communication with AgentCore
  • Authentication: Cognito JWT token management with caching
  • Fallback: Direct Bedrock API calls if AgentCore unavailable
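
A minimal sketch of the trigger logic with the fallback path. The three callables (`call_agentcore`, `call_bedrock`, `save_analysis`) are hypothetical placeholders for the HTTPS call to AgentCore, the direct Bedrock invocation, and the write-back to DynamoDB; only the stream record layout (eventName, dynamodb.NewImage with typed attributes) follows the standard DynamoDB Streams event format.

```python
def make_handler(call_agentcore, call_bedrock, save_analysis):
    """Build a Lambda handler from injected (hypothetical) dependencies."""

    def handler(event, context=None):
        processed = 0
        for record in event.get("Records", []):
            # Only analyze newly inserted errors; skipping MODIFY events avoids
            # re-triggering when the analysis is written back to the same table.
            if record.get("eventName") != "INSERT":
                continue
            image = record["dynamodb"]["NewImage"]
            error = {
                "error_id": image["error_id"]["S"],
                "message": image["message"]["S"],
            }
            try:
                analysis = call_agentcore(error)
                source = "agentcore"
            except Exception:
                # AgentCore unavailable or failing: fall back to direct Bedrock.
                analysis = call_bedrock(error)
                source = "bedrock-fallback"
            save_analysis(error["error_id"], analysis, source)
            processed += 1
        return {"processed": processed}

    return handler
```

Recording which path produced each analysis makes it possible to see how often the fallback is used, which is useful when judging AgentCore availability.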

AgentCore Runtime (ECS Fargate)

  • Container: Custom Docker image with Strands framework
  • Model: Claude 3 Haiku via Amazon Bedrock
  • Authentication: Cognito JWT validation
  • Networking: Private subnet with ALB for HTTPS access
  • Scaling: Auto-scaling based on CPU/memory

AgentCore Architecture:

┌──────────────────────────────────────┐
│        AgentCore Container           │
├──────────────────────────────────────┤
│  Strands Framework Agent             │
│    ├─ Claude 3 Haiku Integration     │
│    ├─ Tool Registry                  │
│    └─ Session Management             │
├──────────────────────────────────────┤
│  MCP Tools (5 specialized)           │
│    ├─ aws_health_checker             │
│    ├─ search_similar_errors          │
│    ├─ cloudwatch_metrics             │
│    ├─ store_error_knowledge          │
│    └─ update_error_resolution        │
├──────────────────────────────────────┤
│  AWS SDK Integrations                │
│    ├─ DynamoDB Client                │
│    ├─ CloudWatch Client              │
│    └─ Health Client                  │
└──────────────────────────────────────┘

Architecture Diagram

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   EventBridge   │───▶│  State Machine   │───▶│   CloudWatch    │
│  (Scheduler)    │    │  (Orchestrator)  │    │   Log Groups    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │ Process Logs    │◀───│ Filter Streams  │
                       │    Lambda       │    │    Lambda       │
                       └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    DynamoDB     │───▶│ Error Analysis  │
                       │  (Error Store)  │    │    Lambda       │
                       └─────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Amazon        │◀───│   AgentCore      │◀───│  Cognito JWT    │
│   Bedrock       │    │   Runtime        │    │     Auth        │
│ (Claude 3 Haiku)│    │ (5 AI Tools)     │    └─────────────────┘
└─────────────────┘    └──────────────────┘
