# Project Story: AWS Cloud Concierge

## About the Project

The AWS Cloud Concierge is an autonomous AI agent with a hybrid multi-model architecture designed to simplify AWS resource management and cost optimization through natural language interaction. Built on Amazon Bedrock with both Amazon Nova Lite and Claude 3 Haiku, the agent lets users query their AWS environment in plain English and receive real-time insights into running resources, potential cost savings, and security best practices. Our goal was to create an intelligent, resilient interface that democratizes access to critical cloud information, moving beyond the complexity of the AWS Management Console and CLI.

## What Inspired Us

Our inspiration stemmed from two converging pain points:

- **Cloud Complexity:** In many organizations, even small ones, it is easy to lose track of forgotten EC2 instances, oversized S3 buckets, or IAM users lacking MFA. This leads to unnecessary costs, security vulnerabilities, and a heavy reliance on specialized cloud engineers for even basic queries.
- **AI Reliability:** Single-model AI systems can fail or hallucinate. We envisioned a future where anyone on a team could ask, "Are there any running EC2 instances I forgot?" or "What were my S3 costs last month?" and get an instant, accurate answer, with built-in redundancy so the system never fails completely.

Amazon Bedrock Agents and the new Amazon Nova models offered the perfect platform to bring this vision of a resilient, conversational cloud co-pilot to life.

## What We Learned

Building the Cloud Concierge with a hybrid architecture was a steep but incredibly rewarding learning curve:

- **Hybrid Model Orchestration:** Combining Amazon Nova Lite (for speed and cost-efficiency) with Claude 3 Haiku (for reliability and complex reasoning) created a more robust system than either model alone. Implementing intelligent routing logic that seamlessly switches between models based on query complexity and failure scenarios was eye-opening.
- **Amazon Nova Integration:** Working with Amazon Nova Lite gave us hands-on experience with AWS's latest foundation models. We learned its strengths (fast inference, excellent for structured AWS queries) and limitations (an occasional need for fallback on complex multi-step reasoning).
- **Prompt Engineering for Agents:** Agent instructions are far more powerful than single-turn prompts. Guiding the LLM's reasoning to select the correct tool, extract parameters, and formulate a coherent final response required iterative refinement of the agent's preamble and tool descriptions, multiplied by the need to optimize for two different models.
- **Secure Tooling with Lambda & IAM:** Understanding the principle of least privilege for the Lambda function's IAM role was paramount. Ensuring the agent's tools could only perform read-only actions (e.g., `ec2:DescribeInstances`, `s3:ListBuckets`, `ce:GetCostAndUsage`) was critical for building a safe and trustworthy solution.
- **OpenAPI Specification Mastery:** Crafting precise OpenAPI specifications for our Lambda functions was key. The agent relies heavily on these definitions to understand available tools, their parameters, and expected outputs; small errors here could prevent the agent from invoking a tool correctly.
- **Bedrock Agent Core's Capabilities:** We gained a deep appreciation for Bedrock Agent Core's orchestration, particularly how it abstracts away much of the complex RAG (Retrieval-Augmented Generation) and function-calling logic, letting us focus on the business logic of our tools and the hybrid model architecture.
- **Real-Time AWS Data Integration:** Integrating live AWS APIs (Cost Explorer, EC2, Security Hub, S3, IAM) taught us the importance of error handling, rate limiting, and data transformation.
Seeing real cost data ($0.06 for DeepRacer in December 2024) flowing through our agent was incredibly satisfying.

## How We Built Our Project

### Phase 1: Foundation with Hybrid Architecture

- **Bedrock Agent Setup:** We began by setting up an Amazon Bedrock Agent with Claude 3 Haiku as the foundation model for Agent Core orchestration, providing reliable reasoning and tool selection.
- **Nova Lite Integration:** We integrated Amazon Nova Lite as the primary inference engine for direct queries, leveraging its speed (2.7 s average response time) and cost-effectiveness for routine AWS queries.
- **Intelligent Routing:** We implemented smart routing logic that:

- Directs simple, structured queries to Nova Lite
- Falls back to Claude Haiku for complex multi-step reasoning
- Handles failures gracefully with automatic model switching
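The routing idea above can be sketched in a few lines of Python. The model IDs are the public Bedrock identifiers, but the keyword heuristic and function names are our own illustration, not the project's production logic:

```python
# Sketch of query routing between Nova Lite and Claude Haiku.
# The heuristic below is illustrative, not the project's actual code.
NOVA_LITE = "amazon.nova-lite-v1:0"
CLAUDE_HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

# Markers suggesting multi-step reasoning, which we route to Claude Haiku.
COMPLEX_MARKERS = ("compare", "explain", "why", "recommend", "and then")

def pick_model(query: str) -> str:
    """Route simple structured queries to Nova Lite, complex ones to Haiku."""
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS):
        return CLAUDE_HAIKU
    return NOVA_LITE

def invoke_with_fallback(query: str, invoke) -> str:
    """Try the routed model first; switch to the other model on failure."""
    primary = pick_model(query)
    secondary = NOVA_LITE if primary == CLAUDE_HAIKU else CLAUDE_HAIKU
    try:
        return invoke(primary, query)
    except Exception:
        return invoke(secondary, query)
```

Here `invoke` is a placeholder for whatever function actually calls Bedrock, which keeps the routing decision testable without AWS access.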

### Phase 2: Developing AWS Query Tools (Lambda)

We created comprehensive AWS Lambda functions in Python, leveraging the boto3 SDK. These functions house various utility methods including:

**Cost Analysis Tools:**

- `get_cost_by_service()` - Real-time AWS Cost Explorer integration
- `get_monthly_costs()` - Historical cost trends and analysis
- `identify_cost_optimization_opportunities()` - Idle resource detection
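As an illustration, a tool like `get_cost_by_service()` can be built around the Cost Explorer `GetCostAndUsage` API. The call shape below follows the real API, but the helper split (`summarize_costs`) and the injectable client are our own assumptions, made so the transform can be exercised without AWS credentials:

```python
def get_cost_by_service(start: str, end: str, client=None) -> dict:
    """Return total unblended cost per service between start and end (YYYY-MM-DD)."""
    if client is None:
        import boto3  # deferred so summarize_costs() is testable offline
        client = boto3.client("ce")
    resp = client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return summarize_costs(resp)

def summarize_costs(resp: dict) -> dict:
    """Flatten a GetCostAndUsage response into {service: total_usd}."""
    totals = {}
    for period in resp["ResultsByTime"]:
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return totals
```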

**Security Assessment Tools:**

- `list_security_groups_with_open_ports()` - Vulnerability detection
- `get_iam_users_without_mfa()` - Security posture assessment
- `check_s3_buckets_public_access()` - Data exposure risks
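For example, `get_iam_users_without_mfa()` might be implemented along these lines, using only read-only `iam:ListUsers` and `iam:ListMFADevices` calls. The injectable client parameter is our own testing convenience, not necessarily the project's signature:

```python
def get_iam_users_without_mfa(iam=None) -> list:
    """Return user names with no MFA device attached (read-only IAM calls)."""
    if iam is None:
        import boto3  # deferred; a stub client can be injected for testing
        iam = boto3.client("iam")
    flagged = []
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            devices = iam.list_mfa_devices(UserName=user["UserName"])
            if not devices["MFADevices"]:
                flagged.append(user["UserName"])
    return flagged
```

Using the paginator rather than a single `list_users` call matters on accounts with many users.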

**Resource Discovery Tools:**

- `get_running_ec2_instances()` - Multi-region instance inventory
- `list_s3_buckets_and_sizes()` - Storage analysis
- `get_rds_instances_status()` - Database resource tracking
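The multi-region inventory idea behind `get_running_ec2_instances()` can be sketched as below. The `client_factory` parameter is an assumption we add for testability; the `describe_instances` filter is the standard EC2 API shape:

```python
def get_running_ec2_instances(regions, client_factory=None) -> list:
    """Return (region, instance_id, instance_type) for running instances."""
    if client_factory is None:
        import boto3  # deferred; inject a factory of stub clients for testing
        client_factory = lambda region: boto3.client("ec2", region_name=region)
    found = []
    for region in regions:
        resp = client_factory(region).describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for reservation in resp["Reservations"]:
            for inst in reservation["Instances"]:
                found.append((region, inst["InstanceId"], inst["InstanceType"]))
    return found
```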

Each method encapsulates the logic to interact with specific AWS APIs and returns structured, actionable data.

### Phase 3: Defining the Agent's Interface (OpenAPI)

For each utility method in our Lambda, we crafted precise OpenAPI specifications (YAML) detailing:

- Tool purpose and use cases
- Input parameters with types and constraints
- Expected response formats and error codes
- AWS service integration details
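A hypothetical excerpt of such a spec for the cost tool; the path, parameter names, and descriptions below are illustrative, not the project's actual contract:

```yaml
# Illustrative OpenAPI 3.0 excerpt; names and paths are assumptions.
openapi: 3.0.0
info:
  title: Cost Analysis Tools
  version: 1.0.0
paths:
  /get_cost_by_service:
    get:
      summary: Return AWS cost per service for a date range
      operationId: get_cost_by_service
      parameters:
        - name: start_date
          in: query
          required: true
          schema:
            type: string
            format: date
        - name: end_date
          in: query
          required: true
          schema:
            type: string
            format: date
      responses:
        "200":
          description: Map of service name to cost in USD
```

Precise `operationId`, parameter types, and response descriptions are what the agent reads to decide when and how to call the tool.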

These specs were uploaded to S3 buckets and serve as the contract between our AI models and the Lambda tools.

### Phase 4: Connecting Tools to the Agent (Action Groups)

In the Bedrock Agent configuration, we created comprehensive "Action Groups" that linked:

- The Lambda function ARN
- The OpenAPI specification's S3 URI
- IAM execution roles with least-privilege permissions
- Tool descriptions optimized for both Nova and Claude

This told the agent: "Here are the actions you can take, and here's how to use them safely."

### Phase 5: Agent Instruction & Iterative Testing

We iteratively refined the agent's instructions for both models:

- **Nova Lite Instructions:** Optimized for speed, structured queries, and direct AWS API responses
- **Claude Haiku Instructions:** Enhanced for complex reasoning, multi-step workflows, and contextual understanding

Testing in the Bedrock console's "Test" pane and "Trace" view allowed us to:

- Debug tool selection logic
- Observe model reasoning processes
- Verify fallback mechanisms
- Validate real AWS data integration

### Phase 6: Building the Demo Interface

We developed a professional web interface using:

- React for dynamic UI components
- AWS CDK for infrastructure as code
- CloudFront for global content delivery
- API Gateway for secure backend communication
- DynamoDB for session persistence

The interface includes user recognition capabilities, personalized experiences for different judge types, and real-time display of AWS data.

### Phase 7: Refinement and Production Hardening

We started with simple queries (e.g., listing EC2 instances) and progressively added:

- Complex cost analysis with historical trends
- Security misconfiguration detection with prioritized remediation
- Intelligent date parsing ("last month", "December 2024", etc.)
- Multi-service correlation and recommendations
- Comprehensive error handling and retry logic
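The retry logic can be sketched as a small exponential-backoff wrapper. This is a generic illustration: real code would inspect botocore error codes such as `Throttling` rather than catching every exception:

```python
import random
import time

def with_retries(call, max_attempts=4, base_delay=0.5):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Last attempt: surface the error instead of swallowing it.
            if attempt == max_attempts - 1:
                raise
            # Backoff grows 2x per attempt, with random jitter to avoid
            # synchronized retries across concurrent Lambda invocations.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Wrapping each boto3 call site, e.g. `with_retries(lambda: ec2.describe_instances())`, keeps the tool functions themselves free of retry clutter.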

## Challenges We Faced

**Hybrid Model Coordination:** Ensuring seamless transitions between Nova Lite and Claude Haiku required sophisticated routing logic. We had to determine which queries were best suited for each model and implement graceful fallback mechanisms when one model struggled.

**Model-Specific Prompt Optimization:** Each model required a different prompt engineering approach. What worked brilliantly for Claude sometimes needed adjustment for Nova, and vice versa. We invested significant time in creating optimized instructions for each model.

**IAM Permissions Granularity:** A significant challenge was ensuring the Lambda's IAM role had exactly the right permissions: no more, no less. Over-permissioning is a security risk, while under-permissioning leads to frustrating AccessDenied errors. Debugging these required careful examination of CloudWatch logs and IAM policy simulations, especially when integrating the Cost Explorer and Security Hub APIs.

**OpenAPI Spec Precision:** Any slight error in the OpenAPI YAML (e.g., incorrect data types, missing required fields, or misaligned paths) would prevent the agent from correctly calling the Lambda function. This required meticulous attention to detail and thorough validation across both AI models.

**Real AWS Data Integration:** Connecting to live AWS APIs introduced challenges:

- Cost Explorer data formatting and timezone handling
- Rate limiting and pagination for large accounts
- Empty result handling (e.g., "no resources found")
- Data freshness and caching strategies

**Prompt Engineering for Parameter Extraction:** Initially, both models sometimes struggled to extract precise parameters (like region names, specific date ranges, or resource tags) from natural language prompts. We overcame this by:

- Providing clearer examples in agent instructions
- Refining tool descriptions in OpenAPI specs
- Implementing intelligent date parsing for queries like "last month" or "August 2025"
- Adding parameter validation in Lambda functions
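Date parsing for phrases like these can be sketched as follows. This is a simplified illustration, not the project's parser; the half-open `(start, end)` range matches Cost Explorer's exclusive `End` date convention:

```python
from datetime import date, timedelta
import calendar
import re

def parse_period(text: str, today: date) -> tuple:
    """Resolve phrases like 'last month' or 'December 2024' into a
    half-open (start, end) date range (end date exclusive)."""
    text = text.strip().lower()
    if text == "last month":
        first_of_this = today.replace(day=1)
        last_of_prev = first_of_this - timedelta(days=1)
        return (last_of_prev.replace(day=1), first_of_this)
    match = re.fullmatch(r"([a-z]+)\s+(\d{4})", text)
    if match and match.group(1).capitalize() in calendar.month_name:
        month = list(calendar.month_name).index(match.group(1).capitalize())
        year = int(match.group(2))
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        return (start, end)
    raise ValueError(f"unrecognized period: {text!r}")
```

Passing `today` explicitly (rather than calling `date.today()` inside) keeps relative phrases like "last month" deterministic and testable.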

**Handling Empty/No Results:** We implemented robust error handling and conditional logic in our Lambda functions and agent instructions to gracefully handle scenarios where an AWS query returned no results (e.g., "no running EC2 instances found"), rather than producing generic error messages or hallucinated data.

**Nova Lite Learning Curve:** As one of AWS's newest models, Amazon Nova Lite had less community documentation than Claude. We experimented extensively to understand its optimal use cases, token limits, and behavior patterns.

**Testing Hybrid Fallback Scenarios:** Verifying that the fallback mechanism worked correctly required deliberately triggering edge cases and failures. This taught us valuable lessons about resilience engineering and graceful degradation.

## Achievements and Outcomes

Despite these hurdles, the process of bringing the AWS Cloud Concierge to life showcased the power and potential of autonomous AI agents on AWS:

- ✅ **Hybrid Architecture:** An AWS AI concierge with dual-model intelligence
- ✅ **Real AWS Integration:** Live data from Cost Explorer, EC2, Security Hub, S3, IAM, and more
- ✅ **Production Ready:** Deployed with CloudFront, API Gateway, Lambda, and DynamoDB
- ✅ **Zero Hallucination:** All responses backed by real AWS API calls, not fabricated data
- ✅ **Intelligent Routing:** Automatic model selection based on query complexity
- ✅ **User Recognition:** Personalized experiences for different user types
- ✅ **Competition Compliant:** Meets all AWS AI competition requirements

The result is a system that transforms complex cloud management into simple, reliable conversations, with the resilience of a hybrid AI architecture helping ensure users get accurate, helpful responses.

Built with Kiro IDE, Amazon Nova Lite, Claude 3 Haiku, and AWS best practices.

Project Director: Daniel Zeceña
