
Architecture Documentation

System Overview

The AWS Error Detection and Ranking system is an intelligent AI-powered solution that automatically monitors, analyzes, and prioritizes CloudWatch error logs. Built for the AWS Agent Hackathon, it leverages Amazon Bedrock, AgentCore, and various AWS services to transform error management from reactive firefighting to proactive, AI-driven prioritization.

Project Background

Inspiration

Internally, our company faced a critical operational challenge: monitoring thousands of CloudWatch logs across multiple services and environments. Development teams were overwhelmed by the sheer volume of error logs, spending hours manually sifting through CloudWatch to identify which errors required immediate attention.

The core problem was the lack of context and prioritization. Not all errors are created equal:

Some errors indicate critical internal bugs requiring immediate developer action
Others represent external service issues completely outside our control
Many are transient issues that self-resolve without intervention

Without intelligent context and priority levels, teams wasted valuable time investigating non-actionable errors while critical issues sometimes went unnoticed in the noise. We needed a solution that could automatically separate the signal from the noise, providing developers with actionable intelligence rather than raw logs.

What It Does

Our solution creates an intelligent, automated error management pipeline:

Automatic Detection: Continuously monitors CloudWatch log groups across your AWS infrastructure
Intelligent Analysis: Uses Claude 3 Haiku on Amazon Bedrock, invoked via AgentCore, to reason about each error
Context Addition: Enriches errors with:
- Severity assessment (low/medium/high)
- Priority ranking (1-5 scale)
- Actionability flag (fixable by developers?)
- Category classification (API, Database, Lambda, Network, etc.)
- Human-readable explanation of root cause
Knowledge Base Learning: Builds institutional knowledge by:
- Storing error patterns and resolutions
- Matching new errors against historical data
- Tracking resolution success rates
- Continuously improving recommendations
Smart Filtering: Separates critical internal bugs from external dependencies, allowing teams to focus on what they can actually fix

The result: Developers receive a prioritized, contextualized list of errors with clear guidance on what needs immediate attention versus what can be safely deprioritized.
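As a rough illustration of the enrichment fields listed above, the sketch below validates a JSON reply from the model before it is stored. The field names mirror the list (severity, priority, actionability, category, explanation), but the exact JSON shape, the `ErrorAnalysis` class, and the `parse_analysis` helper are assumptions made for this example, not the project's actual code.

```python
import json
from dataclasses import dataclass

# Severity levels as described in the write-up.
ALLOWED_SEVERITIES = {"low", "medium", "high"}


@dataclass(frozen=True)
class ErrorAnalysis:
    """Hypothetical container for the context added to each error."""
    severity: str
    priority: int
    actionable: bool
    category: str
    explanation: str


def parse_analysis(raw_json: str) -> ErrorAnalysis:
    """Parse and validate a model reply (assumed to be a flat JSON object)."""
    data = json.loads(raw_json)

    severity = str(data["severity"]).lower()
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {severity!r}")

    priority = int(data["priority"])
    if not 1 <= priority <= 5:
        raise ValueError(f"priority must be in 1..5, got {priority}")

    # Reject strings such as "false", which bool() would treat as True.
    if not isinstance(data["actionable"], bool):
        raise ValueError("actionable must be a JSON boolean")

    return ErrorAnalysis(
        severity=severity,
        priority=priority,
        actionable=data["actionable"],
        category=str(data["category"]),
        explanation=str(data["explanation"]),
    )
```

Validating at this boundary means a malformed model reply raises an error early instead of writing an inconsistent record.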

How We Built It

Our architecture leverages multiple AWS services in a serverless, event-driven design:

Phase 1: Infrastructure Foundation

Built CDK infrastructure for the complete error detection pipeline
Implemented Step Functions state machine to orchestrate log retrieval
Created Lambda functions for log processing and storage
Deployed DynamoDB tables for error logs, knowledge base, and session management

Phase 2: AI Agent Development

Developed custom MCP (Model Context Protocol) tools for error analysis
Integrated Strands framework for AgentCore agent orchestration
Connected Claude 3 Haiku via Amazon Bedrock for intelligent reasoning
Implemented 5 specialized tools:
- AWS Health Checker for service status correlation
- Knowledge Base Search for historical pattern matching
- CloudWatch Metrics analyzer for performance trends
- Error Knowledge Storage for learning
- Resolution Tracking for success rate monitoring

Phase 3: Integration & Security

Containerized AgentCore runtime on ECS Fargate
Implemented Cognito JWT authentication with token caching
Built HTTPS integration between Lambda and AgentCore with automatic fallback
Added DynamoDB streams to trigger real-time error analysis

Phase 4: User Interface

Created React frontend for error visualization
Built Flask API backend for data access
Implemented filtering and sorting capabilities
Added real-time updates as errors are analyzed

Technology Stack:

AWS CDK for infrastructure as code
Python for Lambda functions and agent logic
Strands framework for agent orchestration
Amazon Bedrock (Claude 3 Haiku) for AI reasoning
Docker for AgentCore containerization
React + Flask for UI/API
DynamoDB for storage
Step Functions for workflow orchestration

Challenges We Ran Into

Incomplete CDK Resources: During development, the AWS CDK constructs for AgentCore were incomplete. They have since been implemented, but the resources we needed were not yet available when we worked on the project. This is understandable since AgentCore is in preview.
Authentication and Authorization: Implementing secure HTTPS communication between Lambda and AgentCore proved challenging.
Knowledge Base Cost Effectiveness: The early architecture used OpenSearch for the knowledge base, but the minimum OpenSearch cluster cost was too high for a proof of concept, so we replaced it with DynamoDB. We still plan to reintroduce OpenSearch Serverless in production.

Accomplishments We're Proud Of

Completion of the Solution: Despite the challenges, we delivered a complete, working solution from infrastructure to UI. The system processes CloudWatch logs automatically, analyzes errors with AI-powered reasoning, stores knowledge for continuous learning, and presents results in an intuitive interface.
Multiple Customer Projects: Our team balanced multiple customer projects while building this hackathon solution. Everyone contributed their expertise, from infrastructure engineers building the CDK implementation, to AI/ML specialists designing the agent architecture, to full-stack developers creating the UI.
Production Ready Solution: We didn't just build a demo. We created a production-ready system with security through Cognito JWT authentication, cost optimization, built-in monitoring and observability, and a scalable serverless architecture.
Self-learning Knowledge Base: We are particularly proud of the self-learning knowledge base. It improves recommendations over time, tracks resolution success rates, builds institutional knowledge without manual curation, and provides data-driven guidance for error resolution.

What We Learned

Technical Learnings:

Strands Framework Mastery: Gained deep understanding of building agents with the Strands framework, including proper tool definition and registration, system prompt engineering for consistent behavior and multi-tool orchestration strategies.
MCP Tool Development: Learned best practices for creating Model Context Protocol tools.
Knowledge Base Design: Understood the nuances of error signature generation and normalization, similarity scoring algorithms, pattern extraction from unstructured logs and balancing storage vs. query performance.
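To make the knowledge base ideas above more concrete, here is a minimal sketch of error signature generation, normalization, and similarity scoring. The write-up does not describe the actual algorithms, so the regular expressions, the SHA-256 based signature, and the Jaccard token similarity are all assumptions chosen for illustration.

```python
import hashlib
import re

# Patterns for volatile parts of a message (assumed; tune for real logs).
_UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
)
_HEX = re.compile(r"\b0x[0-9a-f]+\b", re.I)
_NUM = re.compile(r"\d+")


def normalize(message: str) -> str:
    """Replace request-specific values with placeholders and squeeze whitespace."""
    text = message.lower()
    text = _UUID.sub("<uuid>", text)
    text = _HEX.sub("<hex>", text)
    text = _NUM.sub("<num>", text)
    return " ".join(text.split())


def signature(message: str) -> str:
    """Stable short identifier: messages with the same normalized form collide."""
    digest = hashlib.sha256(normalize(message).encode("utf-8")).hexdigest()
    return digest[:16]


def similarity(a: str, b: str) -> float:
    """Jaccard similarity over normalized tokens, in the range 0.0 to 1.0."""
    tokens_a = set(normalize(a).split())
    tokens_b = set(normalize(b).split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

With this scheme, an exact signature match can be looked up by key, while the similarity score handles near matches at the cost of scanning candidates, which is one form of the storage versus query performance trade-off mentioned above.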

Process Learnings:

MVP First: Starting with a minimum viable product and iterating proved more effective than trying to build everything at once.
Fallback Planning: Having a backup plan (direct Bedrock calls) for when the primary approach (AgentCore) fails keeps the system resilient.
Cost Awareness: Early cost analysis, which revealed the OpenSearch expense, saved significant budget.
Documentation Gaps: When official docs are incomplete, community collaboration and experimentation become essential.

What's Next for Error Detection and Context Addition

This is a fully functional proof of concept, but several enhancements will make it production-ready for enterprise use:

Short-Term Improvements (1-3 months):

OpenSearch Serverless Integration: Replace the DynamoDB knowledge base with OpenSearch Serverless. This would enable similarity search with vector embeddings, add full-text search across error histories, and could improve pattern matching accuracy by an estimated 40-50%.
Human-in-the-Loop Tagging: Add a UI for developers to confirm or correct AI categorizations, creating a feedback loop that improves accuracy.
Real-Time Streaming: Replace the scheduled state machine with a CloudWatch Logs subscription, which would reduce detection latency from minutes to seconds.
Slack and PagerDuty Notifications: Send instant notifications for critical errors.
Enhanced UI Features: Add error trend visualization with charts, filtering by service, severity, and time range, and a historical error comparison view.
Integration with CI/CD Pipelines: Automating deployment would make the stack easier to maintain.
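For the Real-Time Streaming item above, a Lambda subscribed to a log group receives log events as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The sketch below decodes that payload. The function names are hypothetical, and nothing here reflects code that exists in the project today.

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def decode_subscription_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def log_events_from(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return log events, skipping control messages sent by the service."""
    payload = decode_subscription_event(event)
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [
        {
            "log_group": payload.get("logGroup"),
            "log_stream": payload.get("logStream"),
            "timestamp": entry.get("timestamp"),
            "message": entry.get("message", ""),
        }
        for entry in payload.get("logEvents", [])
    ]
```

The returned entries could then pass through the same error filtering and storage steps that the scheduled workflow uses.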

Medium-Term Enhancements (3-6 months):

Multi-Account Support: Cross-account CloudWatch log access via IAM roles with a centralized error dashboard across AWS accounts with account-level filtering and permissions.
Custom ML Models: Train classification models on resolved error data with specialized models for different error categories.
Advanced Analytics: Add error trend prediction using time-series analysis, anomaly detection for unusual error patterns, impact analysis (errors affecting multiple services), and cost correlation (errors causing resource waste).

Long-Term Vision (6-12 months):

Proactive Error Prevention: Analyze code changes for potential error patterns and suggest error handling improvements automatically.
Root Cause Analysis: Trace errors across distributed systems, correlate errors with deployments and config changes, and suggest architectural improvements.
Self-Healing Capabilities: Remediate known error patterns automatically and roll back configuration on error spikes.

The foundation is solid, and the path forward is clear. This project demonstrates the power of combining AWS services, AI agents, and knowledge management to solve real developer pain points.

Core Components

1. Log Ingestion Layer

EventBridge Scheduler

Purpose: Triggers periodic log generation for testing and demonstration
Frequency: Configurable schedule (default: every 5 minutes)
Trigger: Invokes State Machine for orchestrated processing

CloudWatch Log Groups

Purpose: Source of all error logs
Integration: Native AWS service integration
Access Pattern: Read-only via AWS SDK

2. Processing Orchestration Layer

Step Functions State Machine

Orchestrates the entire log processing workflow with the following states:

Get Log Groups → Retrieves all available CloudWatch log groups
Filter Log Streams → Identifies recent log streams with activity
Process Logs → Extracts error entries and stores in DynamoDB
Parallel Processing → Handles multiple log groups concurrently
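The Process Logs step above can be pictured as a small filter over CloudWatch log events, each of which carries a `timestamp` and a `message`. This is only a sketch: the keyword pattern and the `extract_error_entries` helper are assumptions, since the write-up does not say how error entries are recognized.

```python
import re
from typing import Any, Dict, Iterable, List

# Assumed heuristic: level keywords as whole words, plus common exception markers.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b|Exception|Traceback")


def extract_error_entries(
    log_group: str, log_stream: str, events: Iterable[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Keep only events whose message looks like an error."""
    entries = []
    for event in events:
        message = event.get("message", "")
        if ERROR_PATTERN.search(message):
            entries.append(
                {
                    "log_group": log_group,
                    "log_stream": log_stream,
                    "timestamp": event.get("timestamp"),
                    "message": message.strip(),
                }
            )
    return entries
```

A keyword filter like this is cheap but coarse. In this design the later AI analysis step is what decides whether a matched entry is actually worth attention.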

State Machine Flow:

Start
  ↓
GetLogGroups (Lambda)
  ↓
FilterStreams (Lambda) [Map State - Parallel]
  ↓
ProcessLogs (Lambda) [Map State - Parallel]
  ↓
Success/Failure
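The flow above could be expressed in Amazon States Language roughly as follows. This sketch builds the definition as a Python dictionary. The state names follow the diagram, while the Lambda ARNs, the `$.logGroups` and `$.streams` paths, and the concurrency limit are placeholders assumed for the example rather than values from the project's CDK code.

```python
import json
from typing import Any, Dict


def _map_over(items_path: str, task_name: str, lambda_arn: str) -> Dict[str, Any]:
    """Build a Map state that runs one Lambda task per item, in parallel."""
    return {
        "Type": "Map",
        "ItemsPath": items_path,
        "MaxConcurrency": 5,  # assumed limit
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": task_name,
            "States": {
                task_name: {"Type": "Task", "Resource": lambda_arn, "End": True}
            },
        },
    }


def build_definition(
    get_log_groups_arn: str, filter_streams_arn: str, process_logs_arn: str
) -> Dict[str, Any]:
    """Assemble GetLogGroups -> FilterStreams (Map) -> ProcessLogs (Map)."""
    filter_streams = _map_over("$.logGroups", "FilterStreamsTask", filter_streams_arn)
    filter_streams.update({"ResultPath": "$.streams", "Next": "ProcessLogs"})

    process_logs = _map_over("$.streams", "ProcessLogsTask", process_logs_arn)
    process_logs["End"] = True

    return {
        "Comment": "Sketch of the log processing workflow",
        "StartAt": "GetLogGroups",
        "States": {
            "GetLogGroups": {
                "Type": "Task",
                "Resource": get_log_groups_arn,
                "ResultPath": "$.logGroups",
                "Next": "FilterStreams",
            },
            "FilterStreams": filter_streams,
            "ProcessLogs": process_logs,
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:{}"
    definition = build_definition(
        arn.format("get-log-groups"),
        arn.format("filter-streams"),
        arn.format("process-logs"),
    )
    print(json.dumps(definition, indent=2))
```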

3. Data Storage Layer

DynamoDB Tables

Error Logs Table

Primary Key: error_id (timestamp-based UUID)
Attributes:
- log_group: Source log group name
- log_stream: Source log stream name
- timestamp: Error occurrence time
- message: Raw error message
- severity: AI-assigned severity level
- analysis: Bedrock-generated context
- priority: Criticality ranking
- category: Error classification
- actionable: Boolean flag for developer action needed
Streams: Enabled for real-time error analysis triggering
TTL: Optional retention policy
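As an illustration of the table layout above, the following sketch builds a new item before any AI analysis has been attached. It assumes that "timestamp-based UUID" can be approximated with a version 1 UUID and that the optional retention policy uses a DynamoDB TTL attribute holding epoch seconds. The attribute name `ttl` and the `build_error_item` helper are hypothetical.

```python
import time
import uuid
from typing import Any, Dict, Optional

SECONDS_PER_DAY = 86_400


def build_error_item(
    log_group: str,
    log_stream: str,
    timestamp_ms: int,
    message: str,
    retention_days: Optional[int] = None,
) -> Dict[str, Any]:
    """Build an Error Logs Table item; analysis fields are filled in later."""
    item: Dict[str, Any] = {
        "error_id": str(uuid.uuid1()),  # version 1 UUIDs embed a timestamp
        "log_group": log_group,
        "log_stream": log_stream,
        "timestamp": timestamp_ms,
        "message": message,
    }
    if retention_days is not None:
        # DynamoDB TTL expects a Number attribute containing epoch seconds.
        item["ttl"] = int(time.time()) + retention_days * SECONDS_PER_DAY
    return item
```

The AI-assigned attributes (severity, analysis, priority, category, actionable) would be added by a later update once the analysis completes.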

Session Table

Purpose: Manages AgentCore session state
Primary Key: session_id
Attributes:
- jwt_token: Cached Cognito JWT
- expires_at: Token expiration timestamp
- created_at: Session creation time
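The token caching implied by the `jwt_token` and `expires_at` attributes can be sketched as below. For brevity this version keeps the token in memory rather than in the Session Table, and the `TokenCache` class, the `fetch_token` callback, and the 60-second refresh margin are assumptions for illustration.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse a token until shortly before it expires, then fetch a new one."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        refresh_margin_seconds: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        # fetch_token returns (token, expires_at as epoch seconds).
        self._fetch_token = fetch_token
        self._margin = refresh_margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a cached token, refreshing it when it is missing or near expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch_token()
        return self._token
```

In a real deployment `fetch_token` would call Cognito, and persisting the result in the Session Table would let separate Lambda execution environments share one token.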

4. AI Analysis Layer

Lambda: Error Analysis Trigger

Trigger: DynamoDB Stream from Error Logs Table
Batch Size: 10 records per invocation
Primary Function: Orchestrates communication with AgentCore
Authentication: Cognito JWT token management with caching
Fallback: Direct Bedrock API calls if AgentCore unavailable
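A minimal sketch of this trigger is shown below: it pulls newly inserted errors out of a DynamoDB Streams event and tries the primary analyzer before falling back. It assumes the stream is configured to include new images and takes both analyzers as plain callables, so the real AgentCore HTTPS call and Bedrock call are not shown. All function names are hypothetical.

```python
from typing import Any, Callable, Dict, List

Analyzer = Callable[[Dict[str, str]], str]


def new_errors(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Extract error_id and message from INSERT records in a stream event."""
    errors = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore MODIFY/REMOVE, e.g. when analysis results are written back
        image = record.get("dynamodb", {}).get("NewImage", {})
        errors.append(
            {
                "error_id": image.get("error_id", {}).get("S", ""),
                "message": image.get("message", {}).get("S", ""),
            }
        )
    return errors


def analyze_with_fallback(
    error: Dict[str, str], primary: Analyzer, fallback: Analyzer
) -> Dict[str, str]:
    """Try the primary analyzer (AgentCore); use the fallback (Bedrock) on failure."""
    try:
        return {"source": "agentcore", "analysis": primary(error)}
    except Exception:
        # Broad catch is deliberate in this sketch: any primary failure triggers fallback.
        return {"source": "bedrock-fallback", "analysis": fallback(error)}


def handle(event: Dict[str, Any], primary: Analyzer, fallback: Analyzer) -> List[Dict[str, str]]:
    """Analyze every newly inserted error in the batch."""
    return [analyze_with_fallback(e, primary, fallback) for e in new_errors(event)]
```

Recording which path produced each analysis makes it possible to monitor how often the fallback is used.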

AgentCore Runtime (ECS Fargate)

Container: Custom Docker image with Strands framework
Model: Claude 3 Haiku via Amazon Bedrock
Authentication: Cognito JWT validation
Networking: Private subnet with ALB for HTTPS access
Scaling: Auto-scaling based on CPU/memory

AgentCore Architecture:

┌──────────────────────────────────────┐
│        AgentCore Container           │
├──────────────────────────────────────┤
│  Strands Framework Agent             │
│    ├─ Claude 3 Haiku Integration     │
│    ├─ Tool Registry                  │
│    └─ Session Management             │
├──────────────────────────────────────┤
│  MCP Tools (5 specialized)           │
│    ├─ aws_health_checker             │
│    ├─ search_similar_errors          │
│    ├─ cloudwatch_metrics             │
│    ├─ store_error_knowledge          │
│    └─ update_error_resolution        │
├──────────────────────────────────────┤
│  AWS SDK Integrations                │
│    ├─ DynamoDB Client                │
│    ├─ CloudWatch Client              │
│    └─ Health Client                  │
└──────────────────────────────────────┘

Architecture Diagram

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   EventBridge   │───▶│  State Machine   │───▶│   CloudWatch    │
│  (Scheduler)    │    │  (Orchestrator)  │    │   Log Groups    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │ Process Logs    │◀───│ Filter Streams  │
                       │    Lambda       │    │    Lambda       │
                       └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    DynamoDB     │───▶│ Error Analysis  │
                       │  (Error Store)  │    │    Lambda       │
                       └─────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Amazon        │◀───│   AgentCore      │◀───│  Cognito JWT    │
│   Bedrock       │    │   Runtime        │    │     Auth        │
│ (Claude 3 Haiku)│    │ (5 AI Tools)     │    └─────────────────┘
└─────────────────┘    └──────────────────┘

Built With

agentcore
amazon-web-services
bedrock
lambda
python
react
strands
typescript

Submitted to

AWS AI Agent Global Hackathon

Created by

Created the AWS CDK resources, the log retrieval and filtering Step Function, the knowledge base and DynamoDB table architecture, and the basic framework for the agent and its activation with DynamoDB Streams. Also handled project management.

ElielSkillwell Taskinen
Connected Lambda to AgentCore over HTTPS.
Implemented a fallback to Bedrock in case AgentCore is unavailable.
Cleaned up some unused code and switched the model from Claude 3.7 Sonnet to Claude 3 Haiku.

Aleksi Hakala
I made a quick UI for displaying the analysis of the errors detected by our AI agent. The UI takes a minimal approach so that we can view the analyses from DynamoDB simply and in a slightly more user-friendly way than looking at the table directly, nothing fancy :)

Rasmus Savolainen
I worked on implementing the AI agent tools, the Strands framework integration, and the AgentCore implementation. It was my first time working with AI agents in a comprehensive project. It was pretty challenging at first, but in the end I got the gist of it. I learned a lot about AI agents and how to apply these technologies in practice, which gave me a strong foundation for future AI agent projects.

MikkoSkillwell
Jari Ikävalko

Updates

ElielSkillwell Taskinen started this project — Oct 20, 2025 07:46 AM EDT
