Inspiration

Modern software development moves at lightning speed, yet debugging and triaging issues still consume a significant portion of engineering time.

We aimed to create a system that automatically detects, analyzes, and fixes runtime issues before developers even notice them like a cloud-native AI DevOps assistant.

The inspiration came from:

  • AWS’s advancements in Bedrock and CloudWatch anomaly detection
  • Frustration with late-night debugging sessions
  • The realization that AI agents could autonomously analyze logs and suggest fixes

We asked ourselves:

“What if incidents could fix themselves?”

From that question, the AutoTriage & AutoFix Agent was born.


What it does

AutoTriage & AutoFix Agent is a cloud-based DevOps agent that continuously:

  1. Monitors AWS logs, metrics, and API responses using CloudWatch and OpenTelemetry
  2. Triages issues using an LLM via Amazon Bedrock
  3. Generates root cause analysis (RCA) reports for developers
  4. Proposes auto-fixes by suggesting code changes or pull requests via GitHub integration
  5. Optionally verifies deployments in sandbox environments like AWS Lambda

Additionally, the project includes a frontend dashboard built with HTML, CSS, and JavaScript, providing a user-friendly interface to interact with the system.


How we built it

We built the agent using an AWS-native backend-focused stack for scalability and integration:

  • Backend:
    • AWS Lambda functions triggered by GitHub webhooks or events
    • Amazon Bedrock for LLM reasoning and suggestion generation
    • AWS CloudWatch + OpenTelemetry for monitoring logs and metrics
    • GitHub Actions for applying fixes or suggesting pull requests
  • Frontend:
    • User interface built with HTML, CSS, and JavaScript, providing a dashboard to monitor and manage the agent
  • Data handling:
    • Logs and context are temporarily processed in Lambda; no long-term storage is included yet
  • Automation:
    • The agent detects issues → queries Bedrock → generates suggestions → optionally posts GitHub comments

Mathematical abstraction:

[ f_{\text{autoFix}}(x) = \text{LLM}_{\text{Bedrock}}(\text{logs}(x) + \text{repo context}) ]

Where ( f_{\text{autoFix}} ) is the function mapping observed logs and repository context to actionable code fixes.


Challenges we ran into

  • Prompt accuracy: LLMs could hallucinate fixes; mitigated by grounding prompts in actual logs and context.
  • Security: Giving an AI model access to repositories required careful IAM role isolation and GitHub token management.
  • Trigger reliability: Ensuring GitHub webhooks consistently reach Lambda functions required retries and logging.
  • Cost management: Running LLM queries through Bedrock needed careful usage to avoid unnecessary costs.
Share this project:

Updates