Inspiration
Fixing a production API at 2 AM is a special kind of pain I wouldn't wish on my worst enemy. But surviving the on-call fire and bringing the servers back to life while the world sleeps? That is our inspiration. We built EpiTrace to automate the midnight grind and let the servers heal themselves.
What it does
EpiTrace is an autonomous API monitoring and self-healing pipeline.
- Continuous Monitoring: Tracks API endpoints and logs uptime and downtime.
- Incident Queuing: Detects failures (HTTP errors or timeouts) and pushes incident jobs to a dedicated queue.
- Automated Root-Cause Analysis: Uses the Cline CLI to autonomously analyze the failure and generate a detailed diagnostic report.
- Smart Alerting: Pushes the incident analysis directly to your team's webhooks (Slack/Discord).
- One-Click Auto-Fix: Provides a trigger link in the alert to start the automated code-fixing process.
- Test-Aware Healing Loop: After generating a code fix, the agent runs the unit tests to check for errors. If the tests pass, the fix moves forward; if errors remain, the agent re-evaluates and modifies the code until they are resolved.
- Seamless Deployment: Once validated, it creates a new Git branch, commits the changes, pushes to GitHub, and opens a Pull Request automatically.
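The incident-queuing step above can be sketched as a small producer: shape a job payload from a failed health check, then push it onto a BullMQ queue. This is a minimal illustration, not EpiTrace's actual code; the queue name `incident_queue` and the payload fields are assumptions.

```javascript
// Hypothetical sketch of the incident producer. Field names are illustrative.
const buildIncidentJob = (monitor, failure) => ({
  monitorId: monitor.id,
  url: monitor.url,
  // Failures are either timeouts or HTTP error responses.
  kind: failure.timeout ? 'timeout' : 'http_error',
  status: failure.status ?? null,
  detectedAt: new Date().toISOString(),
});

// Producer side (requires a running Redis instance, omitted here):
//   const { Queue } = require('bullmq');
//   const incidents = new Queue('incident_queue', { connection: redisOpts });
//   await incidents.add('incident', buildIncidentJob(monitor, failure));
```

Keeping the payload-building pure makes it easy to unit-test without a live Redis connection.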
How we built it
We built the system using a decoupled architecture consisting of an API Server and VM Worker Services, orchestrated by Bash scripts for reliable execution.
- The Backend (Node.js + Express + PostgreSQL): Handles the core API routes, monitor management, alerts, user data, GitHub token mapping, and queue producers.
- The Queue Backbone (Upstash Redis + BullMQ): Connects the detection, analysis, and auto-fix stages asynchronously.
- The Worker Pipeline (EC2 + PM2):
- Analysis Worker: Continuously checks monitor endpoints. If an endpoint goes down, it enqueues a job.
- Down Worker: Picks up the incident job and runs a deterministic Bash script (`run-cline-job.sh`). This clones the target repo, prompts Cline for an analysis, and routes the findings back to the server's alert endpoint.
- Code Worker: Consumes jobs from the `code_queue` when triggered by a user. It clones the repo via GitHub token auth, creates a fix branch, prompts Cline to write the fix, and strictly runs the test-aware loop. Once validated, it pushes the code and uses the `gh` CLI to open a PR.
Challenges we ran into
- Execution Overhead: Our initial Docker approach was far too slow, with pipelines taking 20+ minutes. We pivoted to running directly on EC2 for bare-metal process control, which drastically improved iteration speed.
- Environment & State Drift: Syncing environment variables across local and remote EC2 setups caused broken webhook URLs and mismatched DB connections.
- PM2 Caching Quirks: Updating `.env` files didn't immediately change the running worker processes. Workers held onto stale URLs and IPs until we flushed and completely recreated the PM2 environments.
- Queue Desync: A temporary mismatch between the Server and Worker Redis instances resulted in jobs being enqueued but floating in the void, completely ignored by the workers.
- Parsing Terminal Noise: Standard `git` console output (like cloning or branching) was bleeding into our `stderr` streams, triggering false-positive errors. Additionally, slight string mismatches in our Bash output parsers caused the workers to fail even when a PR was successfully created.
- Operational Friction: Tuning EC2 security groups, fixing intermittent SSH issues, bypassing global `npm` permission errors for the Cline CLI, and handling graceful worker shutdowns took several iterations to stabilize.
- Discord Webhook Failures: During final testing, our Discord webhooks unexpectedly stopped delivering alerts (likely due to strict payload requirements or rate limits). To keep our momentum and ensure a smooth presentation, we rapidly pivoted to webhook.site to demonstrate the real-time alerting for our demo.
Accomplishments that we're proud of
- Successfully automated a true production-style incident flow: from detection to analysis, right through to a validated PR.
- Transformed an AI coding assistant (Cline) into an autonomous infrastructure component embedded directly in a worker queue.
- Engineered a reliable, self-healing workflow with a strict test-aware feedback loop to ensure AI-generated code is actually valid before promotion.
- Shipped a resilient, multi-worker architecture that runs smoothly and reliably on EC2.
What we learned
- How to architect complex, decoupled queue-based workflows using BullMQ and Redis.
- The critical importance of separating analysis tasks from execution tasks into dedicated, isolated workers.
- Advanced DevOps debugging: tracing state changes across APIs, queues, bash scripts, and remote PM2 instances.
- The necessity of building test-aware guardrails when letting AI write and push code to a repository.
What's next for EpiTrace
- Transition to a SaaS platform: We plan to rewrite and expand the infrastructure to support a multi-tenant architecture, allowing different organizations to onboard and use the service.
- Expanded Integrations: Adding deeper support for varied CI/CD pipelines and bringing in more advanced telemetry metrics to feed the AI analysis engine.
Built With
- amazon-web-services
- bash-script
- bullmq
- nextjs
- node.js
- redis