Inspiration
Most AI projects don't fail because the code broke. They fail because the thinking broke three steps earlier — a vague requirement, a wrong architectural assumption, a customer decision that overruled engineering judgment. By the time the output looks wrong, the root cause is invisible.
I've spent 20+ years in enterprise consulting watching this pattern repeat — at IBM, at Deloitte, and now at Salesforce deploying AI agents. Teams shrug and rebuild, or worse, ask the same AI that failed to explain why it failed. Neither works.
AI Autopsy started as a question: what if you could run a structured post-mortem on a failed AI build the same way a consulting team runs a root cause analysis on a failed enterprise implementation?
What it does
AI Autopsy is a diagnostic tool that analyzes failed or underperforming AI builds and produces a structured post-mortem report.
You paste one of three input types:
- A failed prompt and AI output pair
- A vague spec, PRD, or requirements description
- A code snippet with a description of what went wrong
The tool runs it through a Judge LLM against a proprietary 17-mode failure taxonomy across 5 categories — Prompt & Framing, Spec & Planning, Architecture & Design, Context & Execution, and Decision & Governance.
The output is a structured Autopsy Report containing:
- Primary and secondary failure mode identification
- Evidence — direct quotes from your input that prove each failure mode, never just a label
- A causal chain showing how one failure created conditions for the next
- A reconstructed spec showing what it should have said
- A prioritized remediation plan
- A verdict: Inconclusive, Recoverable, or Critical
The Judge never labels without proof. If the input lacks sufficient diagnostic signal, it returns Inconclusive rather than hallucinating a failure mode.
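The report structure above can be sketched as a schema the backend might validate against. This is a minimal sketch only; the class and field names are assumptions, not the project's actual code:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    INCONCLUSIVE = "Inconclusive"
    RECOVERABLE = "Recoverable"
    CRITICAL = "Critical"

@dataclass
class FailureFinding:
    mode_id: str          # e.g. "FM-15"
    mode_name: str
    evidence: list[str]   # direct quotes from the user's input

@dataclass
class AutopsyReport:
    primary: FailureFinding
    secondary: list[FailureFinding]
    causal_chain: list[str]       # ordered: each failure enabling the next
    reconstructed_spec: str       # what the spec should have said
    remediation_plan: list[str]   # prioritized
    verdict: Verdict
```

Making `evidence` a required field on every finding is what enforces the "never labels without proof" rule at the data level, not just in the prompt.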
How we built it
The build followed strict spec-driven development: 4 planning artifacts — a scope document, PRD, technical spec, and build checklist — were written before a single line of code. The planning artifacts are included in the docs/ folder as required.
Architecture decisions made upfront:
- Single-page frontend (HTML/CSS/JS) with a FastAPI proxy backend to keep the API key server-side and out of the browser entirely
- The failure taxonomy lives in a local taxonomy.json file — pluggable and version-controlled, editable without touching application code
- The Judge LLM receives only id, name, and definition from each failure mode for token efficiency — remediation text is pulled locally by FM-ID after classification and appended to the report
- marked.js handles Markdown rendering of the report
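The taxonomy handling described above can be sketched in a few lines. This is a hedged illustration, not the project's code: the field names, the inline FM-15 entry, and the function names are assumptions standing in for the real taxonomy.json and backend logic.

```python
import json

# Inline stand-in for the local taxonomy.json file; in the real project this
# would be read from disk and version-controlled alongside the app.
TAXONOMY_JSON = """[
  {"id": "FM-15",
   "name": "No Decision Ownership",
   "definition": "No single accountable owner for key decisions.",
   "remediation": "Assign a named decision owner and log decisions."}
]"""

TAXONOMY = {m["id"]: m for m in json.loads(TAXONOMY_JSON)}

def judge_view(taxonomy: dict) -> list[dict]:
    """Trim each mode to id/name/definition before sending it to the
    Judge LLM, keeping the prompt small; remediation text stays local."""
    return [{k: m[k] for k in ("id", "name", "definition")}
            for m in taxonomy.values()]

def attach_remediation(classification: dict, taxonomy: dict) -> dict:
    """After the Judge returns an FM-ID, look up remediation locally
    and append it to the report."""
    mode = taxonomy[classification["primary_mode"]]
    return {**classification, "remediation": mode["remediation"]}

payload = judge_view(TAXONOMY)
report = attach_remediation({"primary_mode": "FM-15"}, TAXONOMY)
```

The same split is why the pattern is token-efficient: the Judge only ever sees the minimum it needs to classify, and everything else is a local dictionary lookup keyed by FM-ID.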
The taxonomy itself is the core IP — 17 failure modes built from real enterprise consulting experience, not textbook definitions.
Challenges we ran into
The hardest problem was the taxonomy design. A failure taxonomy is only useful if the Judge LLM can match evidence to modes without hallucinating a classification. Early versions would force a failure mode even when the input was too vague to diagnose. The FM-00 (insufficient_data) null state was the fix — explicitly telling the Judge that returning Inconclusive is a high-accuracy result, not a failure.
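The FM-00 null state amounts to a guard between the Judge's raw output and the final report. A minimal sketch, assuming hypothetical field names for the Judge's response:

```python
FM_INSUFFICIENT_DATA = "FM-00"

def finalize_verdict(judge_result: dict) -> dict:
    """Treat Inconclusive as a high-accuracy outcome: if the Judge could
    not tie a failure mode to concrete evidence quotes, fall back to
    FM-00 rather than forcing a classification."""
    mode = judge_result.get("primary_mode")
    has_evidence = bool(judge_result.get("evidence"))
    if not mode or not has_evidence or mode == FM_INSUFFICIENT_DATA:
        return {"primary_mode": FM_INSUFFICIENT_DATA,
                "verdict": "Inconclusive"}
    return judge_result
```

Checking for evidence here, after classification, means the null state is enforced in code as well as in the Judge's instructions.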
The second challenge was test case design. The anchor case study — a real enterprise AI agent for candidate pre-screening that was eventually abandoned — initially produced the wrong primary failure mode because the input description didn't capture the governance failure that caused the project to collapse. The input had to be rewritten to include the decision ownership failure explicitly. This became a key insight: the quality of the autopsy is directly proportional to the quality of the failure description.
Accomplishments that we're proud of
The evidence-matching requirement. The Judge doesn't just classify — it cites specific language from your input to prove each failure mode. This was a deliberate architectural decision that separates AI Autopsy from a generic LLM wrapper.
The anchor case study passes with FM-15 (No Decision Ownership) as the primary failure mode — exactly what it was in real life. That validation matters.
All 5 of 5 test cases passed on primary failure mode identification, each with the correct causal chain.
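A regression check of this kind can be as simple as pinning each case's expected primary FM-ID to its known real-world outcome. The harness below is a hypothetical sketch; the case text and function names are assumptions:

```python
def check_case(run_autopsy, case: dict) -> bool:
    """Pass only if the diagnosed primary failure mode matches the
    outcome observed in the real project."""
    report = run_autopsy(case["input_text"])
    return report["primary_mode"] == case["expected_primary"]

# Hypothetical anchor case: the abandoned candidate pre-screening agent.
anchor_case = {
    "input_text": "Candidate pre-screening agent; no one owned the "
                  "decision to override engineering's judgment...",
    "expected_primary": "FM-15",  # No Decision Ownership
}
```

Pinning expected FM-IDs this way is also what surfaced the test-design lesson above: when the input description omits the governance failure, no harness can rescue the diagnosis.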
What we learned
Spec-driven development works — but only if the spec is grounded in real experience. The failure taxonomy took longer to design than the application itself. That's exactly as it should be.
The most important lesson from the test suite: an AI diagnostic tool is only as good as the failure vocabulary it reasons against. Generic taxonomies produce generic diagnoses. The 17-mode taxonomy built from enterprise consulting experience produced diagnoses that matched real outcomes.
What's next for AI Autopsy
- Batch analysis — run multiple failed builds through the same taxonomy and surface patterns
- Taxonomy versioning — let teams extend the failure modes with their own organizational patterns
- Integration with GitHub — analyze PRs, commit histories, and issue threads as input types
- A public failure library — anonymized case studies mapped to the taxonomy, searchable by failure mode
Built With
- claude-api-(anthropic)
- css
- fastapi
- html
- javascript
- marked.js
- python
- uvicorn