Inspiration

E-commerce and logistics systems frequently experience "limbo" states—where database discrepancies or stock outrages leave orders hanging in internal systems, risking immediate SLA breaches and costing companies massive revenue. Resolving these incidents traditionally requires manual intervention from database administrators (DBAs), leading to long delays. We wanted to build an autonomous agent that handles these incident diagnoses instantly right from the company's communication hub, Slack, without sacrificing security.

What it does

SQL Ops Agent (interacting in Slack as SQL Inventory Agent) is a self-healing operational assistant that monitors, diagnoses, and safely resolves database issues.

Autonomous Diagnosis: When tagged about a stuck order, the agent uses LangGraph to query the database and assess the inventory bottleneck.

MCP Integration: It leverages the Model Context Protocol (MCP) to dynamically calculate critical SLA deadlines, giving human admins instant visibility into problem prioritization.

Smart Remediation: If stock is exhausted locally, it searches for alternative warehouses and drafts the exact SQL mutation needed to fix the issue.

Human-in-the-Loop Guardrails: Instead of running wild queries directly on production datasets, it pauses and posts interactive Approve or Reject cards into Slack. Once approved, it executes the change and automatically balances the warehouse inventory logs.

How we built it

We engineered a stateful, event-driven orchestration system built entirely on Python:

Core Engine: Driven by LangGraph to manage states, transition rules, and native interrupt workflows for human oversight.

API & Integration: Powered by FastAPI and Slack Bolt Async to process real-time events, handle slash payloads, and maintain zero-latency interactivity paths.

Data Integration: Utilized SQLite to replicate a multi-warehouse e-commerce database, coupled with a customized MCP (Model Context Protocol) server acting as an independent protocol tunnel for SLA evaluation.

Deployment: Hosted securely on Render Cloud Architecture, utilizing optimized memory paths (/tmp sandboxing) to achieve lightweight execution on free-tier containers.

Challenges we ran into

Stateful Interrupt Routing: Ensuring that clicking a button in Slack would seamlessly resume a specific, paused execution path inside LangGraph required meticulous session-thread tracking using Slack timestamps (thread_ts).

Environment & Permission Obstacles: Running SQLite on modern cloud container architectures presented strict read-only permission issues at root-level structures. We overcame this by implementing dynamic lifespan-hooks that auto-generate and seed a sandboxed mock dataset into ephemeral storage (/tmp/company.db) on startup.

Interactivity Webhooks: Taming the Slack API to ensure event subscriptions and button actions were routed smoothly to the same web instance without generating 404 or payload encoding errors required complex URL-encoded parser middleware inside FastAPI.

Accomplishments that we're proud of

Production-Grade Security Architecture: Designing a reliable Human-in-the-Loop (HITL) framework. The agent cannot modify a single row of data without direct administrative sign-off via Slack.

Successful MCP Handshake: Seamlessly integrating the brand-new Model Context Protocol (MCP) to decouple LLM reasoning from specialized enterprise business logic (SLA calculations).

Fully Automated Recovery Cycles: The agent doesn't just run an updated order status query; it fully reconciles stock quantities across warehouse nodes sequentially.

What we learned

State Management Mastery: We learned how to gracefully control multi-step state machines using LangGraph, handling pauses, data resumes, and branch conditions effectively.

Modern LLM Orchestration: We discovered the immense potential of the Model Context Protocol (MCP) for safely connecting autonomous entities to internal legacy systems.

Resilient Infrastructure Design: We gained hands-on experience navigating cloud container architecture constraints and setting up reliable webhooks under pressure.

What's next for SQL Ops Agent

Multi-Incident Batches: Expanding the graph architecture to parse and resolve multiple stuck orders across multiple database tables simultaneously.

Advanced LLM Generation: Moving from hardcoded recovery statements to safely-vetted, dynamically-generated LLM SQL mutations for complex database architectures.

Expanded Channel Collaboration: Implementing conversational threads where multiple team members can comment, query current inventory metrics, or override values interactively prior to final database sync.

Built With

Share this project:

Updates