🧠 Organizational Memory
Inspiration
We've all experienced it. A senior engineer leaves, and suddenly nobody knows why the system was architected a certain way. A manager retires, and the institutional knowledge about a key client relationship walks out the door. A compliance officer departs, and the reasoning behind a critical policy decision is lost to time.
Every organization has a collective memory—years of decisions, debates, trade-offs, and lessons learned—but it lives scattered across thousands of emails, buried in inboxes that nobody will ever search again. When people leave, that memory doesn't get transferred. It just disappears.
We asked ourselves: what if an organization could remember everything it ever discussed, and anyone could simply ask it a question?
That question became Organizational Memory.
What It Does
Organizational Memory transforms an organization's email archive into a living, queryable knowledge base. Instead of requiring people to write documentation (they won't), we work with what already exists — the emails they've sent.
Users ask plain-English questions like:
- "Why did we use special purpose entities?"
- "What concerns did employees raise about accounting practices?"
- "Who was involved in the California energy trading decisions?"
- "What did executives know about the Raptor transactions?"
The system retrieves the most relevant emails from the archive, then generates a grounded answer that explains not just what happened, but why — citing specific emails with sender, date, and subject line so every claim is traceable back to its source.
How We Built It
We built a RAG (Retrieval-Augmented Generation) pipeline entirely on AWS:
Data Ingestion — We started with the Enron email dataset: 517,000 raw messages. We parsed RFC 2822 headers, extracted plain-text bodies, and deduplicated by content hash, producing ~248,000 unique emails stored as structured text files on S3.
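The parsing and dedup step can be sketched with Python's standard library alone. This is a minimal illustration, not the production pipeline: it parses RFC 2822 headers, extracts the plain-text body, and collapses duplicates by hashing the body content, so forwards and sent-item mirrors reduce to one record. The sample addresses are illustrative.

```python
import email
import hashlib
from email import policy

def parse_email(raw: str) -> dict:
    """Parse an RFC 2822 message into headers plus a plain-text body."""
    msg = email.message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"] or ""),
        "date": str(msg["Date"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": body.get_content().strip() if body else "",
    }

def content_hash(parsed: dict) -> str:
    """Hash the body text so identical content maps to one key."""
    return hashlib.sha256(parsed["body"].encode("utf-8")).hexdigest()

def deduplicate(raw_messages):
    """Keep the first occurrence of each unique body."""
    seen, unique = set(), []
    for raw in raw_messages:
        parsed = parse_email(raw)
        digest = content_hash(parsed)
        if digest not in seen:
            seen.add(digest)
            unique.append(parsed)
    return unique
```

Hashing only the body (rather than the full message) is what lets a forward with a rewritten subject line still collapse onto the original.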
Embedding & Indexing — Amazon Bedrock Knowledge Bases handles the heavy lifting. Each email is chunked and embedded using Titan Embeddings v2, then stored in an OpenSearch Serverless vector index for fast semantic retrieval.
Retrieval & Generation — When a question comes in, the system retrieves the most relevant email excerpts from the Knowledge Base, then passes them to Amazon Nova Pro via the Converse API. The model synthesizes an answer grounded in the retrieved emails, focusing on surfacing the reasoning and context behind decisions.
API & Frontend — A Lambda function orchestrates the retrieve-then-generate flow, exposed through an API Gateway HTTP API. A Streamlit frontend provides the demo interface with example questions, source citations, and error handling.
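The Lambda orchestration can be sketched as below. This is a simplified stand-in, not our deployed handler: `answer_question` is stubbed here (in the real function it wraps the Bedrock retrieve-and-converse flow), and the route name and error shapes are assumptions.

```python
import json

def lambda_handler(event, context):
    """Handler for the API Gateway HTTP API route, e.g. POST /ask."""
    try:
        body = json.loads(event.get("body") or "{}")
        question = (body.get("question") or "").strip()
        if not question:
            return _response(400, {"error": "Missing 'question' field."})
        return _response(200, {"answer": answer_question(question)})
    except Exception as exc:
        return _response(500, {"error": str(exc)})

def _response(status, payload):
    return {
        "statusCode": status,
        "headers": {
            "Content-Type": "application/json",
            # CORS so the Streamlit frontend can call the API directly
            "Access-Control-Allow-Origin": "*",
        },
        "body": json.dumps(payload),
    }

def answer_question(question: str) -> str:
    # Stub: the real handler calls the retrieve-then-generate pipeline.
    return f"(stubbed answer for: {question})"
```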
Why Enron?
The Enron email dataset is one of the largest publicly available corporate email corpora—real internal communications spanning years of business decisions, executive discussions, and organizational change. It's messy, it's real, and it's exactly the kind of unstructured knowledge that organizations lose every day.
It also tells a cautionary story. The emails contain early warnings, internal concerns raised by employees, and decision trails that, had they surfaced at the right time, might have changed the outcome. Organizational Memory is about making sure those signals don't stay buried.
The same pipeline can be pointed at any organization's email archive. Enron is our proof of concept; the problem is universal.
What We Learned
- Deduplication matters more than you'd think. Nearly half of the Enron corpus (~270k emails) were duplicates — sent items mirroring received items, forwards of forwards. Without deduplication, the vector index would be polluted with redundant content, degrading retrieval quality.
- The two-step RAG pattern (retrieve, then generate) gives more control than retrieve-and-generate. Separating retrieval from generation lets us tune each step independently—how many sources to pull, how much context to pass, and which model to use for synthesis.
- OpenSearch Serverless provisioning is the long pole. Collection creation takes 10–20 minutes, and a full KB sync over 248k documents takes hours. Starting these early and working on other tasks in parallel was critical.
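Because a full sync runs for hours, we kicked it off and polled rather than waiting interactively. A hedged sketch of that pattern: the status-fetching callable is injected so the loop is testable without AWS; in production it would wrap boto3's bedrock-agent `get_ingestion_job` call, and the status strings mirror the ingestion-job states.

```python
import time

def wait_for_sync(get_status, poll_seconds=60, timeout_seconds=6 * 3600):
    """Poll a Knowledge Base ingestion job until it finishes.

    `get_status` is any zero-argument callable returning the job's
    current status string (e.g. wrapping get_ingestion_job).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status == "COMPLETE":
            return status
        if status == "FAILED":
            raise RuntimeError("Knowledge Base sync failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Knowledge Base sync did not finish in time")
```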
Built With
- Amazon S3 — Parsed email storage
- Amazon Bedrock Knowledge Bases — RAG retrieval with Titan Embeddings v2
- Amazon OpenSearch Serverless — Vector store
- Amazon Bedrock — Answer generation via Nova Pro (Converse API)
- AWS Lambda — API handler
- Amazon API Gateway — HTTP API with CORS
- Amazon EC2 — Hosting the demo
- Streamlit — Frontend demo interface
- Python — All pipeline and backend code