Infrastructure from Intent - Project Story

Inspiration

A senior DevOps engineer spent 2.5 hours setting up a production VPC. Twenty-three manual steps. One misconfigured route table broke the entire environment.

When asked "Why not use Terraform?", the response: "I'd spend just as long writing 200 lines of code. Clicking seems faster."

That's when we realized: Infrastructure-as-Code didn't solve complexity—it just changed the format.

What if AWS infrastructure could be orchestrated by AI agents that understand intent, plan autonomously, and recover from errors automatically?

What it does

Infrastructure from Intent uses multi-agent AI to autonomously build AWS infrastructure from natural language requests.

Example:

uv run agentcore invoke '{
  "task": "Create production VPC with public and private subnets in 3 AZs in us-east-1. 
          Add NAT gateways for high availability. Tag all resources with 
          Environment:Production and ManagedBy:InfrastructureFromIntent"
}'

Behind the scenes:

  • 🧠 Planning Agent - Breaks request into 30+ executable steps with dependency ordering
  • ⚙️ Execution Agent - Runs AWS operations in correct sequence
  • Analysis Agent - Validates results, extracts resource IDs, provides feedback
  • 🔄 Auto-recovery - Intelligently retries failures with exponential backoff
  • 💾 State Persistence - Maintains session state via AWS AgentCore Memory

Result: Production-grade VPC in 90 seconds (vs. 45 minutes manual setup)

What gets created automatically:

  • VPC with DNS support enabled
  • 6 subnets (3 public, 3 private) across availability zones
  • Internet Gateway with proper route table configuration
  • 3 NAT Gateways with Elastic IPs for HA
  • Route tables with correct associations
  • All resources properly tagged for governance

Current MVP: Complete VPC networking automation (VPC, subnets, IGW, route tables, NAT gateways)
Vision: Full AWS service orchestration from business intent—describe what you need, not how to build it

How we built it

Architecture: Multi-agent ReAct (Reasoning + Acting) system with intelligent orchestration

Core Agents:

  • Planning Agent (qwen3-32b) - Strategic planning with dependency analysis
  • Execution Agent (AWS Gateway) - Safe, validated AWS operations
  • Analysis Agent (qwen3-32b) - Result validation and resource extraction
  • Resource Tracker - Centralized state management with AgentCore Memory

Tech Stack:

  • Python 3.11 with modern type hints and async patterns
  • Strands Agents framework for multi-agent orchestration
  • AWS Bedrock AgentCore (Runtime, Gateway, Memory)
  • AWS MCP Servers for seamless API integration
  • Claude Sonnet 4.5 for advanced reasoning

Key Implementation Details:

  1. ReAct Loop orchestrates agent coordination with clear decision boundaries
  2. Error Classification (transient vs blocking) enables intelligent recovery strategies
  3. AgentCore Memory provides durable session persistence across failures
  4. Cross-account Credential Management supports multi-account AWS organizations
  5. Structured Output with Pydantic models ensures reliable agent communication

Challenges we ran into

1. Memory API Discovery (Critical Breakthrough)

  • Initially used wrong API pattern (retrieve_memories vs list_events)
  • Namespace inconsistencies meant state was never persisted correctly
  • Deep-dived into AWS reference implementations to understand proper usage
  • Impact: Transformed state persistence from fundamentally broken to production-ready

2. Multi-Agent Coordination Reliability

  • Early prototype had agents producing inconsistent, unparseable outputs
  • Solution: Strict Pydantic data models, comprehensive JSON schemas, targeted few-shot examples
  • Improved coordination reliability from ~60% to 95%+

3. Intelligent Error Handling

  • AWS errors vary dramatically in meaning (timeout vs quota vs missing dependency)
  • Built sophisticated error taxonomy: TRANSIENT, BLOCKING, DEPENDENCY_MISSING, CONFIGURATION
  • Enables context-aware decisions: retry vs replan vs graceful failure

4. Testing Non-Deterministic Systems

  • Traditional unit testing breaks down with AI agents
  • Solution: Mock LLM responses for deterministic testing, isolate orchestration logic
  • Achieved 83% code coverage with 30 passing tests despite AI components

5. Balancing Autonomy with Safety

  • Too much autonomy → risky operations; too little → loses the point
  • Implemented approval gates for destructive operations
  • Added dry-run mode for validation without execution

Accomplishments that we're proud of

Production-grade multi-agent architecture - Three specialized AI agents working in seamless harmony
90% time reduction validated - 45 minutes manual → 90 seconds automated
Enterprise reliability - 30/30 tests passing, 83% coverage, full type safety
Intelligent auto-recovery - Handles AWS transient failures without human intervention
Cross-account orchestration - Works across AWS Organizations boundaries
Critical AWS bug discovery - Found and fixed AgentCore Memory integration issues during development
Extensible architecture - Design patterns applicable to any AWS service
Comprehensive documentation - 3000+ lines across 7 detailed documents
Real-world validation - Successfully orchestrated complex ECS + ALB + ASG deployments

What we learned

Technical Insights:

  • Multi-agent systems require strict contracts—Pydantic models and JSON schemas are non-negotiable
  • ReAct loops are perfectly suited for infrastructure orchestration patterns
  • Error recovery strategy is the difference between toy and production-grade
  • State management deserves first-class architectural consideration from day one
  • Prompt engineering is a legitimate engineering discipline requiring rigor

Product Lessons:

  • Start narrow (VPCs), architect for breadth (all AWS services)
  • Developer experience trumps feature count for adoption
  • Documentation IS the product for infrastructure tools
  • Users want to express intent, not implement procedures
  • "Simple things simple, complex things possible" is hard to achieve but worth it

Meta-learnings:

  • Hackathon projects can achieve production-grade quality with architectural discipline
  • The AI infrastructure orchestration space is wide open for innovation
  • AWS developer tools (AgentCore, MCP servers) are powerful when properly understood
  • Multi-agent systems are ready for real-world infrastructure automation

What's next for Infrastructure from Intent

Phase 1: Foundation (Next 30 Days)

  • Complete AgentCore Memory integration - Full session persistence across restarts
  • Integration testing suite - Validate against real AWS Memory resources
  • Open source release - GitHub repository with Apache 2.0 license
  • CLI enhancements - Interactive mode, progress visualization

Phase 2: Service Expansion (Q2 2025)

Example: Database Orchestration

uv run agentcore invoke '{
  "task": "Create RDS PostgreSQL 16 database with Multi-AZ deployment, 
          automated backups with 7-day retention, and read replica in us-west-2. 
          Use db.r6g.xlarge instances. Tag with Project:PaymentsAPI"
}'

Example: Complete Application Stack

uv run agentcore invoke '{
  "task": "Deploy containerized API service on ECS Fargate with ALB, 
          auto-scaling 2-10 tasks based on CPU, CloudWatch logs, 
          and X-Ray tracing enabled. Use Production VPC created earlier."
}'

Planned Services:

  • RDS Orchestration - Automated database provisioning with backups, read replicas, parameter tuning
  • S3 Management - Intelligent bucket lifecycle, cross-region replication, versioning policies
  • IAM Automation - Least-privilege policy generation from workload requirements
  • Compute Services - ECS/EKS cluster orchestration, auto-scaling configuration
  • Load Balancing - ALB/NLB with target groups, health checks, SSL termination

Phase 3: Complete Application Stacks (Q3 2025)

Single Command, Complete Infrastructure:

uv run agentcore invoke '{
  "task": "Create production microservice infrastructure for payments API:
          - Multi-AZ VPC with private subnets
          - RDS PostgreSQL with encryption and daily backups
          - ElastiCache Redis cluster for session management
          - ECS Fargate cluster with auto-scaling (2-20 tasks)
          - Application Load Balancer with SSL/TLS
          - CloudFront CDN for API responses
          - CloudWatch alarms for latency >200ms and errors >1%
          - All resources compliant with PCI-DSS tagging requirements
          - Estimated monthly budget: $800-1200"
}'

Advanced Features:

  • Dependency graph visualization
  • Cost estimation before deployment
  • Compliance policy enforcement
  • Drift detection and remediation

Phase 4: Multi-Cloud Intelligence (2026)

Cloud-Agnostic Orchestration:

uv run agentcore invoke '{
  "task": "Deploy globally distributed application:
          - Primary region: AWS us-east-1
          - DR region: Azure eastus
          - CDN: CloudFlare
          - Route53 health checks with automatic failover
          - Budget optimization: prefer AWS for compute, Azure for storage"
}'

Capabilities:

  • Azure and GCP support - Unified orchestration across cloud providers
  • Cross-cloud workload placement - Intelligent service distribution based on cost/performance
  • Cloud-agnostic abstractions - Same intent, different implementations per provider
  • Multi-cloud disaster recovery - Automated failover orchestration

Long-term Vision: The Infrastructure Operating System

Infrastructure from Intent becomes the intelligent layer between business requirements and cloud infrastructure.

Future Capability - Business-Level Requests:

uv run agentcore invoke '{
  "task": "Create infrastructure for e-commerce checkout service:
          - SLA: 99.95% uptime
          - Latency: p99 < 200ms globally
          - Scale: Handle 1000 req/sec with bursts to 5000
          - Compliance: PCI-DSS, SOC2, GDPR
          - Budget: $2000/month maximum
          - Security: Zero-trust architecture with WAF

          Optimize for cost while meeting all requirements."
}'

The system:

  1. Analyzes requirements and constraints
  2. Selects optimal AWS services and configurations
  3. Generates infrastructure with built-in observability
  4. Continuously optimizes for cost and performance
  5. Auto-remediates issues to maintain SLA

Developers describe business outcomes. AI agents handle implementation.

Infrastructure from intent, not instructions.


Why This Matters

Traditional IaC tools (Terraform, CloudFormation, Pulumi) require deep expertise in both the tool and cloud provider. They've lowered the barrier from clicking consoles to writing code, but the cognitive load remains high.

Infrastructure from Intent represents the next paradigm:

  • Natural language → Running infrastructure
  • Business intent → Technical implementation
  • Self-healing by default
  • Continuous optimization without manual tuning

This is infrastructure for the AI era—where systems understand what you need and figure out how to build it.


Infrastructure from Intent
Intelligent infrastructure orchestration, from natural language.

Built on AWS Bedrock AgentCore • 90% faster deployment • Production-ready • Open source

Built With

Share this project:

Updates