Cloudops agent

architectural diagram

Inspiration

We saw teams struggling with fragmented AWS tooling, rising costs, misconfigurations, and constant firefighting. The dream: a single intelligent agent that understands your cloud environment, reasons about it, and takes action — turning reactive operations into proactive, autonomous management.

What It Does

The CloudOps Agent continuously monitors AWS services (metrics, logs, configurations, costs). It uses reasoning to detect inefficiencies, security risks or performance issues, then executes workflows (via AgentCore) or suggests fixes. It optimizes cost, enforces compliance, and automates operations — all with minimal human intervention.

How We Built It

Data ingestion: AWS EventBridge + Lambda capture CloudWatch, CloudTrail, AWS Config and Cost Explorer data.

Reasoning layer: We leveraged Amazon Bedrock’s LLM to interpret cloud telemetry and generate insights.

Execution engine: AgentCore drives workflows and calls AWS APIs (EC2, IAM, S3) to remediate, scale or fix.

Optional ML: Amazon SageMaker models detect anomalies and predict cost spikes.

UI/Interaction: API Gateway + Slack/Console chat interface allow users to query the agent and view recommendations.

Persistence: DynamoDB and S3 store config states, logs and audit trails.

Infrastructure: Deployed in a modular, event-driven architecture, fully on AWS.

Challenges We Ran Into

Aligning LLM responses with AWS context: mapping telemetry into meaningful prompts for the reasoning engine.

Ensuring safe automation: designing workflows with guardrails so the agent can act autonomously without risking live systems.

Multi-account and multi-region complexity: handling different AWS accounts, permissions, and data aggregation.

Cost vs. value: balancing the compute/AI cost of the agent itself versus the savings it provides.

User trust: building an interface where users understand “why” the agent is taking actions, not just “what”.

Accomplishments That We’re Proud Of

Built a full working demo agent that detects an EC2 cost anomaly, recommends a right-sizing action, and executes the change in minutes.

Achieved a unified dashboard and chat interface where you can ask “What’s my highest cost service this week?” and get a reasoned answer with actionable steps.

Automated remediation that reduced drift in test environments by X% (or identify a metric).

Successfully integrated the reasoning engine with walking through logs, metrics and cost data to surface root causes (not just symptoms).

What We Learned

Autonomous agents must be transparent: users need visibility into decisions, otherwise they won’t trust automated actions.

Event-driven architectures scale better for this type of system than periodic scanning.

LLMs are powerful for reasoning but still need structured prompts, context enrichment and fine-tuning to work reliably in cloud operations.

Automation is only as good as the policies you embed: defining meaningful workflows, safe default actions, and rollback plans are critical.

Real value comes when you combine cost, performance and security insights into one coherent system — not separate tools.

What’s Next for CloudOps Agent

Add predictive capabilities: using historical data and SageMaker to forecast budget overruns or scaling events.

Expand multi-cloud support: beyond AWS, integrate with Azure or GCP to offer a true unified cloud ops agent.

Improve explainability: provide deeper reason-why explanations and decision-paths for audit/compliance.

Offer a marketplace of workflow templates: let users deploy pre-built automations for common tasks like patching, scaling, S3 lifecycle.

Introduce self-learning workflows: allow the agent to observe operator approvals and refine its decision-making over time.

Built With

Updates

Onyango Nyakiti started this project — Oct 22, 2025 05:59 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.