Inspiration
The 2008 financial crisis is taught as a macroeconomic event: GDP contractions, bank failures, Fed interventions. But underneath every headline number is a human transaction: a family that applied for a mortgage and was told no. We wanted to find those transactions in the data.
The Home Mortgage Disclosure Act dataset is one of the most granular public records of the crisis that exists. Every loan application filed with a U.S. financial institution (originated, denied, or withdrawn) is in there, geocoded to the state level and tagged with loan type, borrower income, and purpose. Eleven years of it, 2007 through 2017. We were inspired by the question: what does the crisis look like when you read it as a data story instead of an economics textbook?
The answer, it turned out, was stark. Not a smooth curve down and back up, but a freeze, a structural substitution, and a recovery that was geographically uneven in ways that conventional narratives completely missed.
What it does
Fallout is a data dashboard on the U.S. mortgage market from 2007 to 2017, structured in four chapters:
Chapter 1: The Fallout shows how credit froze using a gap area chart: total mortgage applications on top, actual originations on the bottom, and every denied or abandoned loan living in the red space between them. A second visualization shows the loan type composition shifting over the same years: conventional lending collapsing from 88% to 57% as FHA and VA government-backed programs tripled their market share.
Chapter 2: The Recovery extends the gap chart across the full decade to reveal the 2012 inflection point, then shows a refinancing wave area chart proving that the early recovery was policy-driven: when the Fed cut rates to near-zero, homeowners refinanced en masse. The volume came not from new buyers entering the market but from existing owners escaping their old loans.
Chapter 3: The Behavior Shift lets judges toggle between 2007 and 2017 across three synchronized visualizations: a Borrower ID Card showing how the median approved borrower's income, approval odds, and loan type changed between the two eras; a dual-state US choropleth that flips between origination volume density in 2007 and a recovery index in 2017, so that the geographic story completely reorganizes; and a ranked leaderboard of top and bottom states that shows entirely different winners before vs. after.
Chapter 4: The Summary distills the decade into six headline numbers and an annotated timeline for an executive reading this in 2018.
How we built it
The dataset spans 11 years of nationwide HMDA records — hundreds of millions of rows total. Loading all years simultaneously was not feasible on local hardware, so we designed an incremental aggregation pipeline in PySpark. Each year's raw CSV is loaded, reduced to five summary aggregations, merged into a cumulative JSON state file, and then deleted from disk. The raw file never needs to coexist with any other year in memory — peak memory usage is always bounded to a single year plus the small state file.
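A minimal sketch of the aggregate, merge, delete loop described above, written in plain Python with the standard library rather than PySpark so that it runs anywhere. The function names, the state-file layout, and the choice to show only the application and origination counts are assumptions made for illustration, not the pipeline's actual code.

```python
import csv
import json
import os


def aggregate_year(csv_path):
    """Reduce one year's raw CSV to a small summary dict.

    Only the total/originated counts are shown; the real pipeline
    is described as producing five aggregations per year.
    """
    total = 0
    originated = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row["action_taken"] == "1":  # 1 = originated
                originated += 1
    return {"total": total, "originated": originated}


def load_state(state_path):
    """Read the cumulative JSON state file, or start empty."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            return json.load(f)
    return {}


def process_year(year, csv_path, state_path):
    """Aggregate one year, merge it into the state file, delete the raw CSV."""
    state = load_state(state_path)
    key = str(year)
    if key not in state:  # resumable: years already merged are skipped
        state[key] = aggregate_year(csv_path)
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_path, state_path)  # swap in the new state in one step
    if os.path.exists(csv_path):
        os.remove(csv_path)  # raw file is deleted only after the state is saved
    return state
```

Writing the state to a temporary file and renaming it means an interrupted run leaves either the old state or the new one on disk, never a half-written file, which is what makes re-running the same year safe.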
The five aggregations that power every chart:
Q1 — Credit freeze: For each year, count all applications and count only action_taken = 1 (originated). Origination rate = originated / total. The gap between the two counts is the visual story.
Q2 — Loan type shift: Among originated loans only, compute the share of each loan_type code (1=conventional, 2=FHA, 3=VA, 4=FSA) as a percentage of total originations that year.
Q3 — Refinancing wave: Among originated loans, compute the share where loan_purpose = 3 (refinancing). A single percentage per year from 2010–2017.
Q4 — Borrower profile: For 2007 and 2017 only — median income of approved applicants (excluding nulls and the HMDA sentinel value 9999), approval rate, denial rate, and the modal loan purpose and loan type among originated loans.
Q5 — State recovery index: Count originations per state for 2007 and 2017. Recovery Index per state = 2017 volume divided by 2007 volume. Above 1.0 means fully recovered; below 1.0 means a permanent scar.
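The arithmetic behind Q1, Q2, Q3, and Q5 can be sketched in a few lines. This is a plain-Python illustration rather than the PySpark code; the function names and the in-memory row format (dicts with integer codes) are assumptions, and shares are returned as fractions between 0 and 1 rather than percentages.

```python
from collections import Counter

# Loan type codes as listed in Q2.
LOAN_TYPES = {1: "conventional", 2: "FHA", 3: "VA", 4: "FSA"}


def summarize_year(rows):
    """Compute the Q1-Q3 figures plus per-state origination counts for one year.

    `rows` is an iterable of dicts with integer `action_taken`, `loan_type`,
    and `loan_purpose` fields and a `state_code` field (hypothetical format).
    """
    total = 0
    originated = 0
    refinances = 0
    by_type = Counter()
    by_state = Counter()
    for row in rows:
        total += 1
        if row["action_taken"] != 1:  # everything below counts originations only
            continue
        originated += 1
        by_type[row["loan_type"]] += 1
        by_state[row["state_code"]] += 1
        if row["loan_purpose"] == 3:  # 3 = refinancing
            refinances += 1
    return {
        "total": total,
        "originated": originated,
        "origination_rate": originated / total if total else None,
        "loan_type_share": {
            LOAN_TYPES.get(code, str(code)): count / originated
            for code, count in by_type.items()
        },
        "refinance_share": refinances / originated if originated else None,
        "originations_by_state": dict(by_state),
    }


def recovery_index(by_state_2007, by_state_2017):
    """Q5: 2017 originations divided by 2007 originations, per state."""
    return {
        state: by_state_2017.get(state, 0) / count_2007
        for state, count_2007 in by_state_2007.items()
        if count_2007 > 0  # a state with no 2007 baseline has no defined index
    }
```

For example, a state with 4 originations in 2007 and 2 in 2017 gets an index of 0.5, while one that went from 2 to 3 gets 1.5.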
The frontend is a single self-contained HTML file — no framework, no build step — using D3.js for the choropleth with TopoJSON topology, Chart.js for the time-series visualizations, and vanilla JS for the borrower ID card transitions and era toggle. Typography is set in Libre Baskerville and DM Mono, giving the interface an editorial, documentary feel rather than a dashboard aesthetic. The sticky nav uses IntersectionObserver to track scroll position across the four chapters.
Challenges we ran into
Scale without infrastructure. With every record included, the full HMDA dataset runs to tens of gigabytes per year. We had no cluster, just local hardware. The incremental pipeline solved this architecturally, but the implementation required careful memory lifecycle management: caching the DataFrame in Spark memory before running five passes over it, then explicitly unpersisting before deleting the raw file. Without the cache, Spark would re-read and re-parse the CSV from disk five separate times per year.
The 9999 income sentinel. HMDA encodes "not applicable" income as the literal value 9999, representing $9,999,000, rather than null. Any median income calculation that doesn't strip this produces wildly inflated results. It's not prominently documented; we found it by noticing median approved incomes in the hundreds of thousands and tracing backwards to the raw field. The fix is a single filter condition, but finding the problem cost real time.
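As an illustration of that filter, here is a plain-Python sketch of a median that skips both nulls and the sentinel. The field name `income_thousands` and the row format are hypothetical; the point is only that the exclusion happens before the median is taken.

```python
from statistics import median

INCOME_SENTINEL = 9999  # the sentinel value described above, in thousands of dollars


def median_approved_income(rows):
    """Median income of originated loans, ignoring nulls and the sentinel.

    `rows` are dicts with an integer `action_taken` and an
    `income_thousands` field that may be None (hypothetical names).
    """
    incomes = [
        row["income_thousands"]
        for row in rows
        if row["action_taken"] == 1
        and row["income_thousands"] is not None
        and row["income_thousands"] != INCOME_SENTINEL
    ]
    return median(incomes) if incomes else None
```

Returning None when no usable incomes remain avoids the exception `statistics.median` raises on an empty list.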
FIPS codes across three systems. HMDA's state_code field is a numeric FIPS code. TopoJSON state geometry identifiers are also FIPS. But our display layer needed two-letter abbreviations for labels and the leaderboard. Getting the three-way alignment between HMDA data, D3 topology, and display strings required a careful lookup table with explicit zero-padding — state code 1 must become "01" to match Alabama's FIPS, otherwise the join silently drops it.
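A small sketch of the zero-padding step, in Python for consistency with the pipeline. The helper names are hypothetical and the lookup table is deliberately partial; a complete table would list every state.

```python
# Partial FIPS-to-abbreviation table, for illustration only.
FIPS_TO_ABBR = {
    "01": "AL",
    "02": "AK",
    "04": "AZ",
    "06": "CA",
    "48": "TX",
}


def normalize_fips(state_code):
    """Turn a numeric or string state code into a two-character FIPS string."""
    return str(int(state_code)).zfill(2)


def state_label(state_code):
    """Two-letter abbreviation for a state code, or None if it is not in the table."""
    return FIPS_TO_ABBR.get(normalize_fips(state_code))
```

Normalizing once, at the point where the data is loaded, means the topology join and the label lookup both see the same key, so `1`, `"1"`, and `"01"` all resolve to Alabama.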
The dual-state choropleth. The 2007 map uses a sequential blue scale on raw origination volume. The 2017 map uses a diverging red-to-green scale centered at a recovery index of 1.0. Two completely different color encoding semantics on the same 50 shapes, toggled instantly. Getting D3's color transitions to animate smoothly while simultaneously swapping legends, updating tooltip logic, and re-sorting the leaderboard — all triggered by a single toggle — required coordinating four separate DOM updates without any layout shift or flash.
Accomplishments that we're proud of
The incremental pipeline architecture is something we're genuinely proud of. It's a practical solution to a real constraint that any data team working with large public datasets on limited hardware will face. The pattern of aggregate → merge → delete → repeat is clean, resumable, and produces frontend-ready JSON as a byproduct of each run.
The dual-state choropleth is the visualization we're most satisfied with. The transition from the 2007 volume map, where coastal states dominate and interior states barely register, to the 2017 recovery index map, where Texas is dark green and Nevada is still red, communicates the geographic reorganization of the mortgage market in a way that no table or bar chart could. It's the same 50 shapes encoding two completely different stories.
The four-chapter narrative structure. Every design decision in Fallout was made to serve the story, not to showcase the data. Charts are placed where they are because that's where the reader needs to see them. The gap chart appears twice, once truncated at 2010 to show the collapse, once extended to 2017 to show the recovery, because the same visual read differently at different scales is itself informative.
What we learned
The recovery was not a reversal; it was a reorganization. Nationally, origination volumes approached 2007 levels by 2016. But the states that led the recovery (Texas, Colorado, Washington, North Dakota) were not the states that led in 2007 (California, Florida, New York). The pre-crisis mortgage market was a story about coastal population centers. The post-crisis market was a story about economic diversification.
Government-backed lending never fully retreated. FHA and VA loans went from 12% of originations in 2007 to a peak of 42% in 2010 and settled at approximately 26% by 2017, roughly double their pre-crisis share. The federal government did not temporarily backstop the mortgage market; it permanently restructured it.
The 2012 recovery was a refinancing wave, not a homebuying recovery. At peak, 54% of all originations in 2012 were refinances. The American Dream of homeownership did not meaningfully return until 2014–2015, when purchase loan share finally climbed back above 50%. Policy instruments that lower rates rescue existing homeowners before they create new ones.
What's next for Fallout
The most immediate extension is MSA-level granularity. HMDA includes an msa_md field (Metropolitan Statistical Area codes) that would let us tell the story at the city level rather than the state level. The difference between Miami and Tampa, or between Detroit and Grand Rapids, is completely invisible in the current state-level choropleth. City-level data would make the geographic story dramatically more precise and more actionable for a lender.
A natural second direction is demographic disaggregation. HMDA includes applicant race, ethnicity, and sex. We deliberately excluded these fields from this version to keep the scope focused on market structure, but the denial rate story told by race is a critical part of what the crisis revealed and what the recovery obscured. A future chapter on equity in the recovery would be both analytically important and directly relevant to Lendarch's lending decisions.
Finally, the pipeline currently processes one year at a time by design. The logical next step is connecting it to the CFPB's API for the post-2017 HMDA data, making Fallout a live dashboard rather than a retrospective, one that could flag a new "trust gap" opening in real time.