Inspiration
I've always been frustrated by how fragmented the software development lifecycle is. You use one tool to manage projects, another environment to write code, a separate pipeline for CI/CD, and a completely different set of dashboards for cloud infrastructure. Developers and teams are stuck context-switching between countless tabs and workflows just to ship a single feature.
I wanted to build something that actually understands the entire software delivery process from end to end. Not just a code autocomplete tool, not just an infrastructure script generator, but a complete ecosystem of specialized AI agents that collaborate to plan, write, test, review, and provision software autonomously.
What it does
AutoStack is a fully automated software delivery and cloud provisioning platform powered by a multi-agent AI system:
Intelligent Project Management - The PM agent breaks down your requirements into actionable tasks and technical specifications using Tavily-powered live web research to ensure recommendations reflect current best practices. It supports both new projects from scratch and importing existing GitHub repositories to implement new features via Pull Requests.
Autonomous Software Delivery - A specialized Developer agent generates production-ready code in batches while a QA agent performs a full automated code review — checking for security vulnerabilities, code smells, and performance issues — then posts structured feedback directly to the PR. The QA agent also generates comprehensive test suites and commits them to the same branch, with GitHub Actions CI automatically triggered on every commit.
Cloud Provisioning on Azure - Instead of manually clicking through the Azure portal, the Infra Architect agent analyzes your project's needs and designs a complete cloud architecture. The DevOps agent generates clean, modular Terraform code, and the SecOps agent validates every resource against industry standards using Checkov and estimates monthly costs via Infracost before you approve a single deployment.
Generative Architecture Maps - Using a custom-built RepoMap service backed by tree-sitter AST parsing and PageRank-based semantic ranking, the system maintains a context-aware understanding of the entire codebase. Combined with ChromaDB vector memory, agents retrieve architecture plans, interface contracts, and research context across the entire workflow — ensuring new features integrate cleanly with existing code.
Human-in-the-Loop Control - AutoStack supports both fully autonomous mode and a manual step-by-step mode where the workflow pauses before each phase for your review and feedback. You can approve, request changes, or redirect any agent at any point. The LangGraph workflow rewinds to the appropriate phase and re-executes with your input incorporated.
Real-time Notifications - Connect your Slack or Discord and AutoStack keeps you in the loop throughout the entire workflow. You get notified when a workflow starts, when a PR is created (with a direct link), when QA completes its review, and when the workflow finishes or fails — so you're never left wondering what the agents are doing.
How we built it
This was a massive undertaking integrating cutting-edge AI orchestration with a full-stack web architecture:
The frontend is React with TypeScript, built on Next.js 16. I went with TailwindCSS v4 and Radix UI for a modern, accessible component system. For data fetching and server state, I used TanStack React Query v5 with an Axios client that handles JWT auth automatically via interceptors.
The backend is powered by FastAPI and Python, interacting with PostgreSQL via SQLAlchemy and Alembic for robust schema migrations and state persistence. The database supports both local PostgreSQL and Neon.tech for cloud deployments, with SSL and connection pooling configured out of the box.
At the core of AutoStack is LangGraph, which handles the complex orchestration of our multi-agent AI framework across two separate workflow graphs — graph.py for software delivery and cloud_graph.py for infrastructure provisioning. By modeling agents as stateful graphs with conditional edges and PostgreSQL-backed checkpointing, I created deterministic, resumable pipelines where 7 specialized agents collaborate seamlessly.
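To make that concrete, here's a minimal sketch of the pattern, assuming toy node names and a simplified state schema rather than AutoStack's real ones: one node per agent, a conditional edge that loops QA feedback back to the Developer, and a Postgres checkpointer so any run can resume from its last checkpoint.

```python
# A rough sketch of the orchestration pattern (not AutoStack's real schema):
# the node names, DeliveryState fields, and routing rule are illustrative.
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

class DeliveryState(TypedDict):
    tasks: list[str]   # produced by the PM agent
    code_ok: bool      # set by the QA node after review
    feedback: str      # QA feedback routed back to the Developer

def plan(state: DeliveryState) -> dict:
    return {"tasks": ["scaffold api", "add auth"]}   # stand-in for the PM agent

def develop(state: DeliveryState) -> dict:
    return {}                                        # stand-in for the Developer agent

def review(state: DeliveryState) -> dict:
    return {"code_ok": True, "feedback": ""}         # stand-in for the QA agent

def route_after_review(state: DeliveryState) -> str:
    # Conditional edge: loop back to the Developer until QA passes.
    return "done" if state["code_ok"] else "develop"

builder = StateGraph(DeliveryState)
builder.add_node("plan", plan)
builder.add_node("develop", develop)
builder.add_node("review", review)
builder.set_entry_point("plan")
builder.add_edge("plan", "develop")
builder.add_edge("develop", "review")
builder.add_conditional_edges("review", route_after_review,
                              {"develop": "develop", "done": END})

# Postgres-backed checkpointing makes every run resumable by thread_id.
with PostgresSaver.from_conn_string("postgresql://user:pass@localhost/autostack") as saver:
    saver.setup()   # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=saver)
    graph.invoke({"tasks": [], "code_ok": False, "feedback": ""},
                 config={"configurable": {"thread_id": "project-42"}})
```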
For the LLMs, I use Groq with Qwen3-32b for code generation tasks (Developer and QA agents) and Llama 3.3-70b for non-code tasks like planning and documentation. OpenRouter provides a fallback path. The split model strategy keeps costs low while maintaining quality where it matters most.
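In code this just means the model is per-agent configuration. Here's a minimal sketch of the idea; the `AGENT_MODELS` mapping and agent names are my illustrative assumptions, and it assumes `GROQ_API_KEY` is set in the environment:

```python
# Illustrative per-agent model routing; the mapping below is an assumption,
# not AutoStack's actual configuration. Requires GROQ_API_KEY in the env.
from langchain_groq import ChatGroq

AGENT_MODELS = {
    "developer": "qwen/qwen3-32b",           # code generation
    "qa":        "qwen/qwen3-32b",           # code review + test generation
    "pm":        "llama-3.3-70b-versatile",  # planning and specs
    "docs":      "llama-3.3-70b-versatile",  # documentation
}

def llm_for(agent: str) -> ChatGroq:
    # Each agent resolves its own model; an OpenRouter fallback would only
    # change this factory, not the agents themselves.
    return ChatGroq(model=AGENT_MODELS[agent], temperature=0)

print(llm_for("pm").invoke("One sentence: what is a spike task?").content)
```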
The RepoMap service is a custom-built code analysis engine using tree-sitter for multi-language AST parsing and NetworkX PageRank for ranking files by cross-reference importance. When agents need codebase context, they get a semantically ranked, token-budgeted map of the most relevant files — not a raw file dump.
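The extraction half is tree-sitter; the ranking half is plain NetworkX. Here's a minimal sketch of the ranking step, with a toy `references` dict standing in for what the AST tag extraction would produce:

```python
# Rank files by cross-reference importance with PageRank. The `references`
# data is a stand-in for definition/reference tags extracted by tree-sitter.
import networkx as nx

# file -> files it references (via imported or called definitions)
references = {
    "app/api.py":      {"app/models.py", "app/auth.py"},
    "app/auth.py":     {"app/models.py"},
    "app/models.py":   set(),
    "scripts/seed.py": {"app/models.py", "app/api.py"},
}

g = nx.DiGraph()
for src, targets in references.items():
    g.add_node(src)
    for dst in targets:
        g.add_edge(src, dst)   # src depends on dst

# Heavily referenced files accumulate rank and lead the repo map.
ranks = nx.pagerank(g)
for path, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {path}")
# The token budget is then filled greedily from the top of this list.
```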
ChromaDB Cloud serves as the vector store for long-term agent memory, using Azure OpenAI embeddings. Agents store and retrieve architecture plans, interface contracts, research context, and repo maps using both exact-key lookups and semantic search — so every agent in the pipeline has access to decisions made by earlier agents.
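Here's a minimal sketch of the memory pattern, using an in-memory Chroma client and its default embedder instead of Chroma Cloud plus Azure OpenAI embeddings; the key scheme and metadata fields are illustrative assumptions:

```python
# Exact-key get() plus semantic query() over one shared memory collection.
import chromadb

client = chromadb.Client()   # in-memory stand-in for ChromaDB Cloud
memory = client.get_or_create_collection("agent_memory")

# An earlier agent stores an interface contract under a deterministic key.
memory.add(
    ids=["project-42:contract:auth"],
    documents=["POST /login accepts {email, password} and returns {jwt}"],
    metadatas=[{"kind": "interface_contract", "project": "42"}],
)

# Exact-key lookup when a later agent knows precisely what it needs...
print(memory.get(ids=["project-42:contract:auth"])["documents"])

# ...and semantic search when it only knows what it is about to build.
hits = memory.query(query_texts=["how do users authenticate?"], n_results=1)
print(hits["documents"][0])
```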
Credential security was a first-class concern. All sensitive values (GitHub tokens, Azure secrets, API keys) are encrypted at rest using Fernet symmetric encryption. Projects can use per-project credentials or fall back to system-level settings, and the credential manager handles the resolution transparently.
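A minimal sketch of that resolution idea, with in-memory dicts standing in for the real credential tables and a hypothetical `CREDENTIAL_KEY` environment variable holding the Fernet key:

```python
# Fernet-encrypted storage with per-project -> system-level fallback.
import os
from cryptography.fernet import Fernet

# In production the key comes from configuration, never from generate_key().
fernet = Fernet(os.environ.get("CREDENTIAL_KEY", Fernet.generate_key()))

project_creds = {("42", "GITHUB_TOKEN"): fernet.encrypt(b"ghp_project_scoped")}
system_creds = {"GITHUB_TOKEN": fernet.encrypt(b"ghp_system_default")}

def resolve(project_id: str, name: str) -> str:
    # Per-project credentials win; system settings are the fallback.
    token = project_creds.get((project_id, name)) or system_creds.get(name)
    if token is None:
        raise KeyError(f"no credential configured for {name}")
    return fernet.decrypt(token).decode()   # decrypted only at point of use

print(resolve("42", "GITHUB_TOKEN"))   # project-scoped value
print(resolve("7", "GITHUB_TOKEN"))    # system default
```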
Challenges we ran into
Agent Coordination & State Management: Getting 7 agents to cooperate across two separate workflow graphs without getting stuck in infinite loops was genuinely hard. LangGraph is powerful, but designing the conditional edges, feedback loop routing, and interrupt/resume mechanics required careful state schema design with Pydantic TypedDicts throughout.
Context Window Limitations: Entire codebases easily exceed token limits. I had to build the RepoMap service from scratch — tree-sitter parses every file into AST tags (definitions and references), NetworkX builds a cross-reference graph, and PageRank ranks files by importance. Agents get a token-budgeted, semantically ranked map instead of raw file contents.
Human-in-the-Loop Feedback Routing: Implementing feedback loops that actually rewind the graph to the right phase was tricky. LangGraph's interrupt_before handles the pause, but routing feedback back to the correct node (plan → develop → test → document) required careful use of as_node and resetting downstream task statuses to PENDING so they re-execute cleanly.
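Here's a minimal sketch of those mechanics on a toy two-node graph; the node names and state schema are illustrative, not AutoStack's:

```python
# Pause with interrupt_before, inject feedback with as_node, then resume.
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class FeedbackState(TypedDict):
    feedback: str

def plan(state: FeedbackState) -> dict:
    return {}

def develop(state: FeedbackState) -> dict:
    print("developing with feedback:", state["feedback"])
    return {}

builder = StateGraph(FeedbackState)
builder.add_node("plan", plan)
builder.add_node("develop", develop)
builder.set_entry_point("plan")
builder.add_edge("plan", "develop")
builder.add_edge("develop", END)

# interrupt_before checkpoints the run just before "develop" executes.
graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["develop"])
cfg = {"configurable": {"thread_id": "project-42"}}
graph.invoke({"feedback": ""}, config=cfg)   # runs "plan", then pauses

# Writing state *as* the plan node makes the graph behave as if planning
# had just finished with this output, so the resumed run re-enters
# "develop" with the human feedback incorporated.
graph.update_state(cfg, {"feedback": "use OAuth, not passwords"}, as_node="plan")
graph.invoke(None, config=cfg)               # resume from the checkpoint
```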
Batched Code Generation: Generating an entire project in one LLM call is unreliable and hits token limits. I settled on a batched approach where the Developer agent generates 2-5 files per task, stores interface contracts in ChromaDB after each batch, and subsequent batches retrieve those contracts to stay consistent with what was already built.
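A minimal sketch of that loop, with `generate_batch` standing in for the Developer agent's LLM call and plain strings standing in for the interface contracts AutoStack keeps in ChromaDB:

```python
# Generate a few files per call; carry interface contracts between batches.
BATCH_SIZE = 3   # the real agent works in batches of 2-5 files

def generate_batch(task: str, files: list[str], contracts: list[str]) -> dict[str, str]:
    # Stand-in for one LLM call: the prompt would include the task, the
    # target files, and every contract produced by earlier batches.
    return {path: f"# implementation of {path} for {task!r}" for path in files}

planned_files = ["app/models.py", "app/auth.py", "app/api.py",
                 "app/routes/users.py", "tests/test_auth.py"]
contracts: list[str] = []

for i in range(0, len(planned_files), BATCH_SIZE):
    batch = planned_files[i:i + BATCH_SIZE]
    generated = generate_batch("add JWT auth", batch, contracts)
    # After each batch, record the public interfaces just created so the
    # next batch generates code consistent with what already exists.
    contracts += [f"{path}: exported functions and classes" for path in generated]
```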
Security and Sandboxing: Managing secrets across projects, users, and system settings while ensuring agents can't leak credentials into generated code or logs required building a layered credential system with Fernet encryption and careful separation between per-project and system-level credentials.
Accomplishments that we're proud of
Honestly, getting specialized AI agents to successfully hand off tasks just like a real engineering team feels like magic.
The LangGraph pipeline architecture is something I'm really proud of. Seeing the PM agent hand off a spec to the Developer agent, which then gets its code reviewed by the QA agent — which posts structured feedback directly to the GitHub PR, commits test files, and triggers GitHub Actions CI — all without human intervention, validates the core vision of the project.
The RepoMap + ChromaDB memory system is genuinely useful. The combination of PageRank-ranked AST maps for current codebase context and vector memory for cross-session architectural decisions means agents actually understand what they're building into, not just what they're building.
The SecOps validation pipeline is something I didn't expect to be as satisfying as it is. Watching Checkov scan generated Terraform, surface real security findings, and Infracost estimate the monthly bill before a single resource is provisioned — that's a genuinely enterprise-grade workflow running autonomously.
The dual-mode flexibility — seamlessly switching between building feature logic (Software Delivery) and spinning up cloud environments (Cloud Provisioning) — makes this a truly end-to-end platform. One tool, two completely different workflows, same agent-first philosophy.
What we learned
LangGraph is a game-changer for building agent systems. Before this, manual LLM chaining was a nightmare of error-prone glue code. Modeling agents as stateful graphs with proper nodes, conditional edges, and checkpointing makes complex multi-agent workflows actually maintainable and debuggable.
PageRank on code graphs is underrated. Building a reference graph from AST tags and running PageRank to rank files by importance is a surprisingly effective way to solve the context window problem. Files that are heavily referenced by others naturally bubble to the top — exactly what you want when you need to summarize a codebase in 6000 tokens.
Structured outputs from LLMs are non-negotiable for agent pipelines. Every agent uses Pydantic schemas with invoke_llm_structured. Without this, one malformed response from any agent collapses the entire pipeline. With it, failures are predictable and recoverable.
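A minimal sketch of that contract, modeling `invoke_llm_structured` with LangChain's `with_structured_output`; the `ReviewReport` schema is an illustrative assumption, and `GROQ_API_KEY` is assumed to be set:

```python
# The LLM either returns a validated ReviewReport or raises: no free-text
# blobs for the next agent to parse and hope for the best.
from pydantic import BaseModel, Field
from langchain_groq import ChatGroq

class ReviewFinding(BaseModel):
    file: str = Field(description="Path of the offending file")
    severity: str = Field(description="low | medium | high")
    summary: str = Field(description="One-line description of the issue")

class ReviewReport(BaseModel):
    findings: list[ReviewFinding]

llm = ChatGroq(model="llama-3.3-70b-versatile")
structured_llm = llm.with_structured_output(ReviewReport)

report = structured_llm.invoke("Review this diff: eval(user_input) added in app/api.py")
for f in report.findings:
    print(f.severity, f.file, f.summary)
```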
Split model strategies matter. Using Qwen3-32b for code and Llama 3.3-70b for planning/docs isn't just a cost optimization — the models genuinely perform better on their respective task types. Treating LLM selection as a per-agent configuration rather than a global setting was the right call.
Credential architecture needs to be designed upfront. Bolting on encryption and multi-tenant credential management after the fact is painful. Building the CredentialManager with Fernet encryption and a clear resolution hierarchy (per-project → system settings → environment) from the start saved a lot of headaches.
What's next for AutoStack
Richer Observability - LangSmith tracing is already wired in, but I want a proper in-app execution timeline that shows exactly what each agent did, which files it read, what it retrieved from memory, and how long each step took. Full transparency into the agent's reasoning.
Multi-repo Projects - Right now each project maps to one repository. Supporting monorepos and multi-service architectures where the Developer agent coordinates changes across multiple repos would unlock a whole new class of use cases.
Built With
- azure
- fastapi
- groq
- langchain
- langgraph
- nextjs
- python
- react
- repomap
- tailwindcss
- terraform
- typescript
