Inspiration
Modern software development moves at lightning speed, yet debugging and triaging issues still consume a significant portion of engineering time.
We set out to build a cloud-native AI DevOps assistant: a system that automatically detects, analyzes, and fixes runtime issues before developers even notice them.
The inspiration came from:
- AWS’s advancements in Bedrock and CloudWatch anomaly detection
- Frustration with late-night debugging sessions
- The realization that AI agents could autonomously analyze logs and suggest fixes
We asked ourselves:
“What if incidents could fix themselves?”
From that question, the AutoTriage & AutoFix Agent was born.
What it does
AutoTriage & AutoFix Agent is a cloud-based DevOps agent that continuously:
- Monitors AWS logs, metrics, and API responses using CloudWatch and OpenTelemetry
- Triages issues using an LLM via Amazon Bedrock
- Generates root cause analysis (RCA) reports for developers
- Proposes auto-fixes by suggesting code changes or pull requests via GitHub integration
- Optionally verifies deployments in sandbox environments like AWS Lambda
Additionally, the project includes a frontend dashboard built with HTML, CSS, and JavaScript, providing a user-friendly interface to interact with the system.
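As a concrete sketch of the monitor-and-triage loop described above: the snippet below stands in a simple error-rate heuristic for CloudWatch anomaly detection and shows how a prompt can be grounded in the actual logs. The function names, threshold, and log format are illustrative assumptions, not the project's actual code.

```python
def detect_anomaly(log_lines, error_threshold=0.2):
    """Flag a batch of logs when the share of ERROR lines exceeds a threshold.

    Stand-in for CloudWatch/OpenTelemetry anomaly detection (assumption).
    """
    if not log_lines:
        return False
    errors = sum(1 for line in log_lines if "ERROR" in line)
    return errors / len(log_lines) > error_threshold


def build_triage_prompt(log_lines, repo_context):
    """Ground the LLM prompt in the observed logs and repository context."""
    return (
        "You are a DevOps triage assistant. Using ONLY the evidence below,\n"
        "produce a root cause analysis and a suggested fix.\n\n"
        "=== Logs ===\n" + "\n".join(log_lines) + "\n\n"
        "=== Repository context ===\n" + repo_context
    )


# Example: one batch of logs flowing through the loop.
logs = ["INFO start", "ERROR NullPointerException in OrderService", "INFO done"]
if detect_anomaly(logs):
    prompt = build_triage_prompt(logs, "service: orders, last deploy: abc123")
```

In the real pipeline the prompt would then be sent to Bedrock rather than assembled locally; the point here is that grounding the prompt in raw evidence is what keeps the triage output anchored to reality.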
How we built it
We built the agent on an AWS-native, backend-focused stack for scalability and integration:
- Backend:
- AWS Lambda functions triggered by GitHub webhooks or events
- Amazon Bedrock for LLM reasoning and suggestion generation
- AWS CloudWatch + OpenTelemetry for monitoring logs and metrics
- GitHub Actions for applying fixes or suggesting pull requests
- Frontend:
- User interface built with HTML, CSS, and JavaScript, providing a dashboard to monitor and manage the agent
- Data handling:
- Logs and context are temporarily processed in Lambda; no long-term storage is included yet
- Automation:
- The agent detects issues → queries Bedrock → generates suggestions → optionally posts GitHub comments
Mathematical abstraction:
f_autoFix(x) = LLM_Bedrock(logs(x) + repo_context)

where f_autoFix is the function mapping observed logs and repository context to actionable code fixes.
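In code, f_autoFix could look like the sketch below, assuming a Python Lambda calling Bedrock's Anthropic messages API via boto3. The model ID, helper names, and prompt shape are assumptions for illustration; the call itself requires AWS credentials with bedrock:InvokeModel permission.

```python
import json


def build_fix_request(logs: str, repo_context: str) -> dict:
    """Assemble the Bedrock request body, grounding the prompt in real evidence."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": (
                f"Logs:\n{logs}\n\nRepository context:\n{repo_context}\n\n"
                "Identify the root cause and suggest a concrete code fix."
            ),
        }],
    }


def auto_fix(logs: str, repo_context: str) -> str:
    """f_autoFix: map observed logs plus repo context to a suggested fix."""
    import boto3  # local import so the sketch loads without AWS dependencies

    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID
        body=json.dumps(build_fix_request(logs, repo_context)),
    )
    # Anthropic-on-Bedrock responses carry the text in content[0].text.
    return json.loads(response["body"].read())["content"][0]["text"]
```

Splitting the request construction from the API call keeps the prompt-building logic testable without AWS access.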
Challenges we ran into
- Prompt accuracy: LLMs could hallucinate fixes; mitigated by grounding prompts in actual logs and context.
- Security: Giving an AI model access to repositories required careful IAM role isolation and GitHub token management.
- Trigger reliability: Ensuring GitHub webhooks consistently reach Lambda functions required retries and logging.
- Cost management: running LLM queries through Bedrock required disciplined usage to avoid unnecessary spend.
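On the security and trigger-reliability points: before a webhook-triggered Lambda touches any repository, it should verify that the delivery actually came from GitHub. A minimal sketch, using GitHub's standard X-Hub-Signature-256 HMAC scheme (the function name is ours; the scheme is GitHub's documented one):

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Reject webhook deliveries whose HMAC does not match the shared secret.

    GitHub signs each payload with the webhook secret and sends the digest
    in the X-Hub-Signature-256 header as "sha256=<hexdigest>".
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature_header)
```

Pairing this check with delivery logging and retries is what made the GitHub-to-Lambda trigger path dependable.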