Inspiration
The idea for Atlas came from a deeply uncomfortable realization.
In 1977, a communication breakdown between pilots and air traffic control caused two Boeing 747s to collide on a runway in Tenerife — the deadliest aviation accident in history. In 1986, NASA engineers who knew the O-rings were vulnerable in cold weather were overruled by management pressure, and Challenger broke apart 73 seconds after launch. In 2003, the same pattern repeated itself when foam strike concerns were dismissed before Columbia disintegrated on re-entry.
In 2001, Enron collapsed when a toxic leadership culture systematically crushed internal dissent, hiding billions in losses behind fraudulent accounting. In 2018 and 2019, two Boeing 737 MAX aircraft crashed because a safety-critical flight control system was wired to a single angle-of-attack sensor with no redundancy — and the system's existence was hidden from pilots to avoid expensive retraining.
Five disasters. Four different industries. Different root cause categories — communication breakdown, ignored warning signs, leadership failure, single point of failure — but one shared pattern: organizations that failed to learn from failures outside their own domain.
And yet, none of these organizations learned from each other. The aviation industry did not stop the financial sector from repeating the same failure pattern. The financial sector did not stop the aerospace industry from doing it again. Every post-mortem was written, filed, and forgotten within its own silo.
This is what inspired Atlas. Not just the failures themselves, but the horrifying realization that the knowledge to prevent them existed — it just wasn't connected.
The standard approach to failure prevention is prediction: try to anticipate what might go wrong. Atlas takes the opposite approach. Instead of predicting failure, it makes the cost of ignorance impossible to justify: nearly every failure a team is about to make has already been made by someone else, documented, and filed away. Atlas connects them.
What It Does
Atlas is the world's first cross-domain failure knowledge base: a platform that ingests real post-mortems, incident reports, and failure analyses from software engineering, aviation, finance, healthcare, government, infrastructure, space, cybersecurity, manufacturing, and military domains. It then uses AI to extract structured failure data and connect failures across domains based on their underlying root cause patterns.
It has three core features:
1. Semantic Failure Search
A user types a symptom, a situation, or a concern — in natural language — and Atlas returns the most historically analogous failures from across all domains, ranked by semantic similarity.
Search: "team keeps ignoring safety concerns before launch"
Atlas returns: Challenger (1986), Boeing 737 MAX (2019), Deepwater Horizon (2010) — along with a cross-domain insight explaining the shared pattern.
2. Cross-Domain Connection Engine
Every failure in Atlas is linked to the most semantically similar failures across the entire knowledge base, regardless of domain. A software engineering post-mortem about a cascading failure is automatically connected to the Tenerife air disaster if their underlying failure descriptions, root causes, and lessons are semantically similar. These connections are computed using vector similarity search over Gemini-generated embeddings.
3. "Before You Build" Project Analyzer
A user describes what they are building. Atlas analyzes the description against 55+ historical failures, identifies the most analogous past failures, and generates a personalized risk profile: most likely root causes, specific warning signs to watch for, and recommended mitigations — all grounded in real historical evidence.
How We Built It
Atlas is a full-stack application with four distinct layers that work together as a pipeline.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ ATLAS SYSTEM │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ Ingestion │───▶│ Extraction │───▶│ Storage │ │
│ │ Layer │ │ Layer │ │ Layer │ │
│ │ │ │ (Gemini API) │ │ PostgreSQL │ │
│ │ Post-mortems │ │ │ │ + pgvector │ │
│ │ Incident logs│ │ Structured │ │ │ │
│ │ Public data │ │ FailureRecord│ │ │ │
│ └──────────────┘ └───────────────┘ └──────┬──────┘ │
│ │ │
│ ┌────────────────────────────────────────────────▼──────┐ │
│ │ Query Layer │ │
│ │ │ │
│ │ Semantic Search │ Root Cause Filter │ Analyzer │ │
│ │ (pgvector ANN) │ (SQL WHERE clause) │ (Gemini) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ React + Vite + Tailwind CSS Frontend │ │
│ │ HomePage │ SearchPage │ FailurePage │ AnalyzePage │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Layer 1: Data Ingestion
We seeded Atlas with 55+ real, documented failures spanning 10 domains. Each failure record contains:
- what_failed: What physically or operationally broke
- root_cause: The underlying cause beneath the symptom
- root_cause_category: One of 12 universal categories (see below)
- warning_signs: Signals that were present but ignored
- lesson: The single most transferable takeaway
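In code, each record maps naturally onto a typed model. A minimal sketch using Pydantic — the extra title, domain, year, and severity fields follow the filtering and embedding descriptions later in this section; the exact model in Atlas may differ:

```python
from pydantic import BaseModel

class FailureRecord(BaseModel):
    """One documented real-world failure, extracted from a post-mortem."""
    title: str                   # e.g. "Tenerife Airport Disaster"
    domain: str                  # one of the 10 domains (aviation, finance, ...)
    year: int
    severity: str                # used by the structured SQL filters
    what_failed: str             # what physically or operationally broke
    root_cause: str              # the underlying cause beneath the symptom
    root_cause_category: str     # one of the 12 universal categories
    warning_signs: list[str]     # signals that were present but ignored
    lesson: str                  # the single most transferable takeaway
```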
The 12 universal root cause categories were designed to transcend domain boundaries:
| Category | Example |
|---|---|
| Communication Breakdown | Tenerife (1977), Mars Climate Orbiter (1999) |
| Ignored Warning Signs | Challenger (1986), Columbia (2003) |
| Single Point of Failure | Facebook outage (2021), Boeing 737 MAX (2019) |
| Incentive Misalignment | Lehman Brothers (2008), Ford Pinto (1970s) |
| Over-Complexity | Therac-25 (1985-87), Cloudflare outage (2019) |
| Human Error | Amazon S3 outage (2017), Air France 447 (2009) |
| Process Failure | Knight Capital (2012), Healthcare.gov (2013) |
| Technical Debt | Ariane 5 (1996), Heartbleed (2014) |
| Scaling Failure | Twitter fail whale (2008-09) |
| Security Negligence | Equifax (2017), SolarWinds (2020) |
| Leadership Failure | Enron (2001), Kodak (2012) |
| External Dependency Failure | Log4Shell (2021), NotPetya (2017) |
Layer 2: LLM Extraction
Raw failure texts are processed by Google Gemini using a structured extraction prompt that returns a typed JSON FailureRecord. The model is instructed to identify not just what happened, but what the underlying cause was — distinguishing the symptom from the root.
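A minimal sketch of that extraction step, reusing the FailureRecord model above. The google-generativeai call shape and JSON-mode config are assumptions, and the production prompt is longer than shown:

```python
import json
import google.generativeai as genai

EXTRACTION_PROMPT = """Extract structured failure data from the post-mortem below.
Distinguish the immediate trigger (the symptom) from the underlying systemic
root cause. Return JSON with keys: title, domain, year, severity, what_failed,
root_cause, root_cause_category, warning_signs, lesson.

POST-MORTEM:
"""

def extract_failure(raw_text: str) -> FailureRecord:
    model = genai.GenerativeModel("gemini-2.5-flash")
    response = model.generate_content(
        EXTRACTION_PROMPT + raw_text,
        # JSON mode: the reply parses directly into the typed record.
        generation_config={"response_mime_type": "application/json"},
    )
    return FailureRecord.model_validate(json.loads(response.text))
```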
Layer 3: Vector Storage and Embedding
Each FailureRecord is converted into a dense vector representation using the Google Gemini Embedding API (gemini-embedding-2-preview, truncated to 384 dimensions). The embedding is computed over a rich text representation that includes the title, domain, what failed, root cause category, and lesson:
$$\vec{v}_i = \text{Embed}\left(\text{title}_i \oplus \text{domain}_i \oplus \text{what\_failed}_i \oplus \text{root\_cause\_category}_i \oplus \text{lesson}_i\right)$$

where $\oplus$ denotes text concatenation.
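Concretely, the composite text and embedding call can look like this. The embed_content call shape and output_dimensionality truncation are assumptions based on the public SDK:

```python
import google.generativeai as genai

def composite_text(rec: FailureRecord) -> str:
    # Concatenate the fields most likely to be semantically similar
    # across domains; this is the concatenation in the formula above.
    return " | ".join([rec.title, rec.domain, rec.what_failed,
                       rec.root_cause_category, rec.lesson])

def embed_text(text: str) -> list[float]:
    result = genai.embed_content(
        model="models/gemini-embedding-2-preview",  # model named in the stack below
        content=text,
        task_type="retrieval_document",
        output_dimensionality=384,  # truncate to match the vector(384) column
    )
    return result["embedding"]
```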
Vectors are stored directly in PostgreSQL using the pgvector extension as a vector(384) column alongside all relational fields. This unified approach means both structured filtering (SQL WHERE clauses on domain, severity, year) and semantic search (approximate nearest neighbour via pgvector) run in a single database — eliminating the operational complexity of maintaining a separate vector store.
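A sketch of the unified table and insert path, assuming psycopg 3 with the pgvector adapter; table and column names are illustrative:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def init(conn: psycopg.Connection) -> None:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)  # adapt numpy arrays to the Postgres vector type
    conn.execute("""
        CREATE TABLE IF NOT EXISTS failures (
            id                  serial PRIMARY KEY,
            title               text NOT NULL,
            domain              text NOT NULL,
            year                int,
            severity            text,
            root_cause_category text NOT NULL,
            lesson              text NOT NULL,
            related_failure_ids int[] DEFAULT '{}',
            embedding           vector(384)   -- dense Gemini embedding
        )""")

def store(conn: psycopg.Connection, rec: FailureRecord) -> None:
    vec = embed_text(composite_text(rec))  # helpers from the sketch above
    conn.execute(
        "INSERT INTO failures (title, domain, year, severity,"
        " root_cause_category, lesson, embedding)"
        " VALUES (%s, %s, %s, %s, %s, %s, %s)",
        (rec.title, rec.domain, rec.year, rec.severity,
         rec.root_cause_category, rec.lesson, np.array(vec)),
    )
```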
Layer 4: Semantic Search and Ranking
When a user submits a query $q$, it is embedded into the same vector space as $\vec{q}$ and compared against every stored failure vector using cosine similarity:
$$\text{score}(\vec{q}, \vec{v}_i) = \frac{\vec{q} \cdot \vec{v}_i}{\lVert \vec{q} \rVert \, \lVert \vec{v}_i \rVert}$$
The top-$k$ results are returned, ranked by similarity score. When results span more than one domain, Gemini generates a cross_domain_insight that names the pattern connecting them.
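In pgvector, cosine distance is exposed as the `<=>` operator, so the ranked top-$k$ search reduces to a single ORDER BY ... LIMIT query. A sketch against the table above, reusing embed_text from the embedding sketch:

```python
import numpy as np

def semantic_search(conn, query: str, k: int = 5):
    qvec = np.array(embed_text(query))  # embed the query into the same space
    # <=> is pgvector's cosine *distance*; similarity = 1 - distance.
    return conn.execute(
        """
        SELECT id, title, domain, root_cause_category,
               1 - (embedding <=> %s) AS score
        FROM failures
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (qvec, qvec, k),
    ).fetchall()
```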
The Relationship Engine
After all failures are ingested, a post-processing step runs to compute cross-domain connections. For each failure $f_i$, we query PostgreSQL via pgvector for the top-5 most similar failures (excluding $f_i$ itself):
$$\text{Related}(f_i) = \underset{j \neq i}{\text{top-5}} \; \text{score}(\vec{v}_i, \vec{v}_j)$$
These IDs are stored in the related_failure_ids field, forming a graph of cross-domain connections.
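A sketch of that post-processing pass, using the related_failure_ids column from the schema above. With only ~55 records the quadratic loop is cheap:

```python
def build_relationships(conn) -> None:
    ids = [row[0] for row in conn.execute("SELECT id FROM failures")]
    for fid in ids:
        rows = conn.execute(
            """
            SELECT f.id
            FROM failures f,
                 (SELECT embedding FROM failures WHERE id = %s) AS me
            WHERE f.id <> %s
            ORDER BY f.embedding <=> me.embedding  -- cosine distance, ascending
            LIMIT 5
            """,
            (fid, fid),
        ).fetchall()
        conn.execute(
            "UPDATE failures SET related_failure_ids = %s WHERE id = %s",
            ([r[0] for r in rows], fid),
        )
```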
The "Before You Build" Analyzer
Given a project description $p$, the analyzer:
- Embeds $p$ into the vector space using the Gemini Embedding API
- Retrieves the top-8 most analogous historical failures via pgvector ANN search
- Formats them into a Gemini prompt that asks for a structured risk profile
- Returns: risk summary, most likely root causes, personalized warning signs, and recommended mitigations
The risk level assignment (Low, Medium, High, or Critical) is determined by Gemini's analysis of the retrieved analogous failures in context. The LLM evaluates the severity and relevance of the historical patterns to the described project and produces a holistic risk assessment grounded in the evidence provided.
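Put together, the analyzer is retrieval-augmented generation in miniature. A hedged sketch reusing the helpers above — the prompt is abbreviated and the JSON keys are illustrative:

```python
import json
import google.generativeai as genai

def analyze_project(conn, description: str) -> dict:
    # Retrieve the 8 most analogous historical failures for grounding.
    analogs = semantic_search(conn, description, k=8)
    evidence = "\n".join(
        f"- {title} ({domain}): {category}"
        for _id, title, domain, category, _score in analogs
    )
    prompt = (
        "Historically analogous failures:\n" + evidence +
        "\n\nProject description:\n" + description +
        "\n\nReturn JSON with keys: risk_level (Low/Medium/High/Critical), "
        "risk_summary, likely_root_causes, warning_signs, mitigations."
    )
    model = genai.GenerativeModel("gemini-2.5-flash")
    response = model.generate_content(
        prompt, generation_config={"response_mime_type": "application/json"}
    )
    return json.loads(response.text)
```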
Tech Stack
| Layer | Technology |
|---|---|
| Backend | Python 3.11, FastAPI, Uvicorn |
| AI / LLM | Google Gemini (2.5 Flash) |
| Vector + Relational DB | PostgreSQL with pgvector extension |
| Embeddings | Google Gemini Embedding API (gemini-embedding-2-preview, 384d) |
| Frontend | React 18, Vite, Tailwind CSS (dark glassmorphism) |
| HTTP | Axios |
Challenges We Ran Into
Challenge 1: The Root Cause Abstraction Problem
The hardest design decision was defining root cause categories that are genuinely universal — that apply equally to a software deployment failure and an aviation accident. Domain-specific taxonomies are useless for cross-domain connection. We iterated through several candidate taxonomies before settling on 12 categories that are specific enough to be actionable yet abstract enough to transcend domains.
The key insight was that root causes must describe organizational and systemic patterns, not technical symptoms. "Race condition" is a symptom. "Over-Complexity" is a root cause. "Memory leak" is a symptom. "Technical Debt" is a root cause.
Challenge 2: Embedding Quality for Cross-Domain Retrieval
Early versions of the embedding produced poor cross-domain results because we embedded only the title and domain. A semantic search for "ignored engineering warnings" would retrieve other software failures but miss Challenger, because the embedding was dominated by domain vocabulary.
The fix was to embed a richer composite text that emphasizes what failed, the root cause category, and the lesson — the parts most likely to be semantically similar across domains. This dramatically improved cross-domain retrieval quality.
Challenge 3: Distinguishing Symptom from Root Cause in LLM Extraction
Gemini would sometimes conflate the symptom with the root cause. For the GitLab database deletion incident, an early extraction returned root_cause: "engineer ran wrong command" — which is the symptom. The true root cause is Process Failure (no safeguards on destructive operations, no staging environment validation).
We fixed this by adding an explicit instruction to distinguish between the immediate trigger and the underlying systemic cause, and by providing examples of the distinction in the extraction prompt.
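For illustration, the added guidance can be a few lines spliced into the extraction prompt — this is a paraphrase, not the exact production wording:

```python
# Hypothetical prompt fragment appended to EXTRACTION_PROMPT above.
SYMPTOM_VS_ROOT_CAUSE_GUIDANCE = """
Distinguish the immediate trigger from the underlying systemic cause.
Example:
  Trigger (NOT the root cause): "engineer ran the wrong command"
  Root cause: "Process Failure: no safeguards on destructive operations,
  no staging-environment validation"
Report the trigger under what_failed and the systemic cause under root_cause.
"""
```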
Challenge 4: Building a Meaningful Graph Without Manual Curation
The relationship engine uses pure vector similarity to link failures. Because the composite embedding text emphasizes root cause categories, lessons, and what failed — rather than domain-specific vocabulary — the similarity scores naturally tend to surface structurally meaningful connections across domains. However, this approach can occasionally link failures that are textually similar but not deeply connected in root cause. The quality of the composite embedding text is the primary lever for ensuring cross-domain connections are causally meaningful, not just superficially similar.
Accomplishments That We're Proud Of
The cross-domain connections actually work. Searching "team ignored safety warnings before launch" correctly surfaces Challenger, Boeing 737 MAX, Deepwater Horizon, and Enron — spanning aviation, aerospace, energy, and finance — with a generated insight that names the shared pattern.
The "Before You Build" analyzer produces genuinely useful output. Describing a two-sided marketplace returns a risk profile grounded in Knight Capital, Barings Bank, eBay's early scaling failures, and MySpace's decline — with specific, actionable warning signs tailored to the project description.
The data is real. Every failure in Atlas is a documented, real-world incident with accurate historical information. This is not a demo with placeholder content — the knowledge base is genuine.
The design is distinctive. The dark glassmorphism aesthetic — frosted glass cards, layered translucency, and a deep dark background — was designed to communicate the seriousness of the subject matter while remaining immediately navigable.
What We Learned
Failure knowledge is the most undervalued asset in any organization. Post-mortems are written, shared internally, and then forgotten. The return on investment of connecting them across industries and time periods is enormous — and nobody had built the infrastructure to do it.
Cross-domain thinking is a muscle. The most valuable insights in Atlas come not from within-domain retrieval but from connections across domains. An engineering team benefits more from reading about Tenerife than from reading about another software outage — because the cross-domain connection forces them to see the pattern rather than the technical details.
The right level of abstraction is everything. Too specific (race condition, buffer overflow) and the knowledge stays domain-locked. Too general (things went wrong) and there is nothing actionable. The 12-category root cause taxonomy sits at the right level — specific enough to guide action, abstract enough to travel across domains.
LLMs are exceptional at structured knowledge extraction from unstructured text. The extraction pipeline — turning a raw post-mortem narrative into a typed FailureRecord with root cause category, warning signs, and lesson — works remarkably well with a carefully engineered prompt. This is one of the highest-value applications of LLMs that we have seen.
What's Next for Atlas
Short Term
- Browser extension: A Chrome extension that detects when a user is reading an engineering blog post or incident report and offers to add it to Atlas with one click.
- Community submissions: Allow users to submit post-mortems for review and inclusion.
- API access: A public REST API so teams can query Atlas programmatically from their own tools — e.g., querying Atlas before a deployment decision.
Medium Term
- Slack / Jira integration: A bot that watches project management tools for warning sign keywords (e.g., "skip the test", "we'll fix it later", "management says it's fine") and proactively surfaces historically analogous failure patterns.
- Timeline visualization: A visual timeline of failures grouped by root cause category, showing how the same patterns have repeated across decades and industries.
- Team risk profiles: Organizations can tag which root cause categories appear most frequently in their own past failures, and Atlas will weight its recommendations accordingly.
Long Term
- Active learning from organizational data: Organizations can connect Atlas to their private incident management systems (PagerDuty, Jira, etc.), allowing Atlas to learn from their proprietary failure history and provide personalized risk assessments that blend public and private knowledge.
- Predictive integration into CI/CD: An Atlas plugin for GitHub Actions that runs the "Before You Build" analyzer on pull requests that introduce significant architectural changes, surfacing relevant historical failures before code is merged.
The ultimate vision for Atlas is a world where no organization ever fails for a reason that another organization has already documented. The knowledge exists. Atlas connects it.
Built With
- axios
- beautiful-soup
- docker
- fastapi
- framer-motion
- gemini
- httpx
- javascript
- pgvector
- postgresql
- pydantic
- python
- react
- render
- sentence-transformer
- sql
- tailwindcss
- vite