AI Incident Database: Comprehensive Reproducible Research Framework
This documentation details the technical architecture and implementation of the solution for Problem Statement 5: AI Incident Database - Reproducible Research Notebooks. The framework provides a transparent, replicable, and highly performant system for analyzing AI failures and harms.
1. Problem Statement & User Experience
The AI Incident Database (AIID) catalogs documented failures of AI systems. While the data is public, accessible and reproducible analytical tools for it are lacking, which raises the barrier to policy engagement. Our solution addresses this through:
- Automated Data Lifecycle: Transitions from raw web-based snapshots to structured dataframes without manual intervention.
- Integrated Analysis: Merges complex relational files including reports, incident classifications, and duplicate mappings.
- High-Performance Exploration: A custom-built UI that remains responsive even when handling massive datasets by shifting the computational load from the Python kernel to the browser.
2. Technical Solution Architecture
A. Automated Data Ingestion (FastAPI Service)
We implemented a local FastAPI service to manage the data pipeline programmatically.
- Latest Snapshot Discovery: Uses a regular expression, backup-(\d{8})\d*.*\.(tar\.bz2|zip), to identify the most recent database backup by date.
- Chunked Streaming: Downloads large files in 4 MB chunks to maintain memory efficiency and prevent timeout failures.
- Automated Extraction: Integrates with the tarfile library to automatically decompress .tar.bz2 files into a raw data directory for immediate processing.
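The discovery, streaming, and extraction steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual service code: the function names (pick_latest_snapshot, stream_to_file, extract_archive) are hypothetical, and stream_to_file accepts any iterable of byte chunks so that it can be fed from whatever HTTP client performs the download.

```python
import re
import tarfile
from pathlib import Path
from typing import Iterable, Optional

# Pattern quoted in the text; the first capture group is the 8-digit date stamp.
SNAPSHOT_RE = re.compile(r"backup-(\d{8})\d*.*\.(tar\.bz2|zip)")
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as described above


def pick_latest_snapshot(names: Iterable[str]) -> Optional[str]:
    """Return the name with the greatest date stamp, or None if nothing matches."""
    dated = []
    for name in names:
        match = SNAPSHOT_RE.search(name)
        if match:
            dated.append((match.group(1), name))
    # Assumes YYYYMMDD stamps, which sort chronologically when compared as text.
    return max(dated)[1] if dated else None


def stream_to_file(chunks: Iterable[bytes], dest: Path) -> int:
    """Write chunks to dest one at a time and return the total bytes written."""
    total = 0
    with open(dest, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                total += len(chunk)
    return total


def extract_archive(archive: Path, raw_dir: Path) -> None:
    """Unpack a .tar.bz2 archive into raw_dir (only use on archives you trust)."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(raw_dir)
```

With the requests library, for example, response.iter_content(chunk_size=CHUNK_SIZE) on a streamed response yields chunks of the kind stream_to_file expects, so only one chunk is held in memory at a time.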
B. Data Engineering & Normalization Pipeline
The raw MongoDB export is processed through a multi-stage pandas pipeline to ensure data integrity.
- Taxonomy Merging: Consolidates various classification standards (CSET v0, CSET v1, GMF, MIT) into a unified dataset for comparative research.
- Entity Resolution: Normalizes incident_id and report_number using string stripping and type conversion to ensure clean join keys.
- Duplicate Handling: Applies a duplicate_map derived from duplicates.csv to map duplicate incidents back to their "true" incident number, preventing over-counting.
- Relational Joins: Executes a "many-to-one" merge between incident summaries and individual report data.
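A rough pandas sketch of the key normalization, duplicate mapping, and many-to-one join described above is given here. It rests on several assumptions that should be checked against the real export: the column names duplicate_incident_number and true_incident_number, the presence of an incident_id column on the report table, and the helper names normalize_key and build_master are all illustrative.

```python
import pandas as pd


def normalize_key(series: pd.Series) -> pd.Series:
    """Strip whitespace and convert an ID column to a nullable integer dtype."""
    cleaned = series.astype(str).str.strip()
    # Values that cannot be parsed become missing instead of raising.
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")


def build_master(incidents: pd.DataFrame, reports: pd.DataFrame,
                 duplicates: pd.DataFrame) -> pd.DataFrame:
    incidents = incidents.copy()
    reports = reports.copy()
    incidents["incident_id"] = normalize_key(incidents["incident_id"])
    reports["incident_id"] = normalize_key(reports["incident_id"])
    reports["report_number"] = normalize_key(reports["report_number"])

    # duplicate incident id -> "true" incident id (column names are assumed).
    duplicate_map = dict(zip(
        normalize_key(duplicates["duplicate_incident_number"]),
        normalize_key(duplicates["true_incident_number"]),
    ))

    # Point reports filed under a duplicate at the true incident instead.
    reports["incident_id"] = (
        reports["incident_id"]
        .map(lambda i: duplicate_map.get(i, i))
        .astype("Int64")
    )
    # Drop the duplicate incident rows so each incident is counted once.
    incidents = incidents[~incidents["incident_id"].isin(list(duplicate_map))]

    # validate= makes pandas raise if incident_id is not unique on the right.
    return reports.merge(incidents, on="incident_id", how="left",
                         validate="many_to_one")
```

Passing validate="many_to_one" is a cheap guard: if a duplicate incident row slips through, the merge fails loudly rather than silently multiplying report rows.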
C. High-Performance Browser-Based Explorer
The core of the user experience is a custom framework built using an HTML/CSS/JS template.
- Kernel Independence: Standard Python table libraries in Google Colab route every search and filter interaction back through the Python kernel. With large datasets, this causes the kernel to become unresponsive.
- JSON Serialization: Our custom framework serializes the entire dataset into JSON at render time.
- Client-Side Processing: By handling all filtering, sorting, and row selection directly in the browser using JavaScript, we achieve near-instantaneous UI responses for any dataset that fits in browser memory.
- Dynamic Filtering: Supports real-time filtering across Harm Domain, Sector, Severity, and Risk Domain.
- Research Export: Includes a built-in CSV download helper that allows users to export filtered or selected results directly from the browser.
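The serialize-once, filter-in-the-browser idea above can be illustrated with a stripped-down sketch. It is not the project's render_incident_explorer function: the name build_explorer_html, the page template, and the single free-text filter are hypothetical stand-ins for the real template with its domain-specific filters.

```python
import json
from string import Template

# Minimal page: the dataset is embedded once and filtered entirely in JavaScript.
PAGE = Template("""<input id="q" placeholder="Filter...">
<pre id="out"></pre>
<script>
  const ROWS = $payload;
  const q = document.getElementById("q");
  const out = document.getElementById("out");
  function render() {
    const needle = q.value.toLowerCase();
    const hits = ROWS.filter(r => JSON.stringify(r).toLowerCase().includes(needle));
    out.textContent = hits.length + " matching rows";
  }
  q.addEventListener("input", render);
  render();
</script>""")


def build_explorer_html(records: list[dict]) -> str:
    """Return an HTML fragment with the records embedded as a JSON literal."""
    payload = json.dumps(records)
    # Escape "</" so a value such as "</script>" cannot close the script tag early.
    payload = payload.replace("</", "<\\/")
    return PAGE.substitute(payload=payload)
```

In a notebook, the returned string could be shown with IPython.display.HTML; after that, every keystroke in the filter box is handled by the browser and the Python kernel is not involved.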
3. Operational Guidelines
- Initialize API: Run the code blocks to start the uvicorn-managed FastAPI server at http://0.0.0.0:8000.
- Fetch Data: Execute the download_latest request to automatically pull and extract the dataset from the AIID website.
- Data Processing: Run the normalization cells to merge classifications and resolve incident duplicates.
- Interactive Exploration: Use the render_incident_explorer function to launch the JS-powered UI and begin data analysis.
4. Limitations & Assumptions
A. Data Limitations
1. Reporting Bias
The AI Incident Database relies on publicly reported incidents. As a result:
- Incidents from highly regulated sectors or English-language media may be overrepresented.
- Underreported regions, private deployments, or minor incidents may not be captured.
- The dataset reflects visibility, not necessarily true global incidence rates.
2. Incomplete or Sparse Fields
Several classification attributes (e.g., harm domain, severity, sector) may contain missing or partially filled values.
- Analytical outputs should be interpreted with awareness of column-level missingness.
- Absence of classification does not imply absence of harm.
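A quick way to make column-level missingness visible before interpreting any chart is a per-column summary like the one below. The helper name missingness_report is illustrative, and the sketch only counts values pandas treats as missing (None/NaN); empty strings or placeholder labels would need separate handling.

```python
import pandas as pd


def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and share of missing values, sparsest column first."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),           # number of missing cells
        "share": df.isna().mean().round(3),   # fraction of rows missing
    })
    return report.sort_values("share", ascending=False)
```

Reporting the share alongside any aggregate (for example, "severity is unclassified for this fraction of incidents") helps readers avoid treating an unclassified incident as a harmless one.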
3. Taxonomy Evolution Over Time
Classification standards (CSET v0, CSET v1, GMF, MIT) evolved across snapshots.
- Harmonization into a unified schema may introduce minor conceptual compression.
- Certain historical fields may not map perfectly into newer standards.
4. Duplicate Resolution Assumptions
The duplicate_map derived from duplicates.csv assumes:
- The "true" incident ID identified in the source file is correct.
- All duplicates are properly declared in the snapshot metadata.
Duplicates that are not declared in duplicates.csv go undetected and are counted as separate incidents.
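One inexpensive sanity check on these assumptions is to confirm that every "true" ID is not itself listed as a duplicate of something else. The sketch below is an illustrative addition rather than part of the described pipeline; whether the real file ever contains such chains is not known here, and resolve_chains is a hypothetical helper.

```python
def resolve_chains(duplicate_map: dict[int, int]) -> dict[int, int]:
    """Follow duplicate -> true links until reaching an ID that is not a duplicate."""
    resolved = {}
    for start in duplicate_map:
        seen = {start}
        current = duplicate_map[start]
        # Keep following while the target is itself declared a duplicate.
        while current in duplicate_map:
            if current in seen:
                raise ValueError(f"cycle in duplicate map involving incident {current}")
            seen.add(current)
            current = duplicate_map[current]
        resolved[start] = current
    return resolved
```

If the resolved mapping differs from the input, at least one declared "true" incident was itself a duplicate, which is worth reviewing by hand before counting incidents.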
B. Technical Assumptions
1. Snapshot Stability
The ingestion pipeline assumes:
- Snapshot file naming follows the regex structure: backup-(\d{8})\d*.*\.(tar\.bz2|zip)
- The schema within snapshots remains structurally consistent.
Major upstream schema changes may require pipeline updates.
2. Client-Side Processing Constraints
The browser-based explorer serializes the dataset into JSON at render time.
- Extremely large future snapshots may hit browser memory constraints.
- Performance depends on the user's local machine resources.
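Because the whole dataset travels to the browser as one JSON payload, its size can be measured before rendering. The sketch below assumes the records are already a list of plain dictionaries; json_payload_mb and the WARN_ABOVE_MB budget are hypothetical and not values taken from the project.

```python
import json

WARN_ABOVE_MB = 50  # hypothetical budget; tune for the machines your users have


def json_payload_mb(records: list[dict]) -> float:
    """Size in megabytes of the UTF-8 JSON that would be embedded in the page."""
    return len(json.dumps(records).encode("utf-8")) / (1024 * 1024)


def check_payload(records: list[dict]) -> bool:
    """Return True if the payload is within the budget, printing a note otherwise."""
    size = json_payload_mb(records)
    if size > WARN_ABOVE_MB:
        print(f"Payload is {size:.1f} MB; consider dropping columns or paginating.")
        return False
    return True
```

The in-browser memory footprint is typically larger than the serialized size, so this figure is best read as a lower bound.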
3. Local API Environment
The FastAPI service assumes:
- Local execution permissions
- Port 8000 availability
- Stable internet connection during snapshot download
Port conflicts or restricted environments may require reconfiguration.
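Whether port 8000 is available can be checked before starting the server. The following is a small standard-library sketch; is_port_free and first_free_port are hypothetical helper names, and the check is inherently racy because another process could claim the port between the check and the server start.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8000, attempts: int = 20) -> int:
    """Return the first bindable port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

The chosen port can then be passed to the server at startup instead of the hard-coded 8000.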
C. Analytical Assumptions
1. Incident-Level Aggregation
The master table enforces one row = one unique incident. When merging multiple reports per incident, summary-level joins may obscure report-level nuance.
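One way to enforce the one-row-per-incident rule while keeping some report-level information is to aggregate explicitly instead of dropping rows. This is a sketch with assumed column names (incident_id, report_number, title) and a hypothetical helper name, not the project's actual master-table code.

```python
import pandas as pd


def collapse_to_incidents(report_level: pd.DataFrame) -> pd.DataFrame:
    """Collapse a report-level table to one row per incident."""
    grouped = report_level.groupby("incident_id", as_index=False)
    return grouped.agg(
        title=("title", "first"),                   # incident-level field
        report_count=("report_number", "nunique"),  # how many reports back it
        report_numbers=("report_number", list),     # keep the links to reports
    )
```

Keeping report_count and report_numbers on each row lets a researcher see how much coverage sits behind an incident and drill back into the individual reports when the summary is not enough.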
2. Neutral Interpretation
The framework is designed for exploratory and descriptive analysis. It does not:
- Infer causality
- Assign legal responsibility
- Predict future AI risk
Interpretation remains the responsibility of the researcher.
5. AI Usage Disclosure
We used ChatGPT and Claude as coding assistants throughout this project to help with debugging, feature engineering, and general code generation. All AI-generated suggestions were reviewed and tested by our team.
Built With
- css
- fastapi
- html
- javascript
- python