AI Incident Database: Comprehensive Reproducible Research Framework

This documentation details the technical architecture and implementation of the solution for Problem Statement 5: AI Incident Database - Reproducible Research Notebooks. The framework provides a transparent, replicable, and highly performant system for analyzing AI failures and harms.


1. Problem Statement & User Experience

The AI Incident Database (AIID) catalogs documented failures of AI systems. The data is public, but accessible and reproducible tools for analyzing it are scarce, which raises the barrier to policy engagement. Our solution addresses this through:

  • Automated Data Lifecycle: Transitions from raw web-based snapshots to structured dataframes without manual intervention.
  • Integrated Analysis: Merges complex relational files including reports, incident classifications, and duplicate mappings.
  • High-Performance Exploration: A custom-built UI that remains responsive even when handling massive datasets by shifting the computational load from the Python kernel to the browser.

2. Technical Solution Architecture

A. Automated Data Ingestion (FastAPI Service)

We implemented a local FastAPI service to manage the data pipeline programmatically.

  • Latest Snapshot Discovery: Uses regular expressions (backup-(\d{8})\d*.*\.(tar\.bz2|zip)) to identify the most recent database backup by date.
  • Chunked Streaming: Downloads large files in 4MB chunks to maintain memory efficiency and prevent timeout failures.
  • Automated Extraction: Integrates with the tarfile library to automatically decompress .tar.bz2 files into a raw data directory for immediate processing.
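
The three ingestion steps above can be sketched with the standard library alone. The regular expression is the one quoted in the text; everything else (the function names, the generic file-like source standing in for an HTTP response body, and the destination paths) is an assumption made for illustration and is not taken from the project's code.

```python
import re
import tarfile
from pathlib import Path

# Pattern quoted in the text; group 1 captures the YYYYMMDD date stamp.
SNAPSHOT_RE = re.compile(r"backup-(\d{8})\d*.*\.(tar\.bz2|zip)")
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks, as described above


def pick_latest_snapshot(names):
    """Return the file name with the newest date stamp, or None if none match."""
    dated = []
    for name in names:
        match = SNAPSHOT_RE.search(name)
        if match:
            dated.append((match.group(1), name))
    # YYYYMMDD strings sort chronologically, so max() on the stamp is enough.
    return max(dated, key=lambda pair: pair[0])[1] if dated else None


def stream_to_file(source, dest_path, chunk_size=CHUNK_SIZE):
    """Copy a binary file-like object to disk one chunk at a time."""
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


def extract_snapshot(archive_path, raw_dir):
    """Unpack a .tar.bz2 archive into raw_dir and return the member names."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can refuse unsafe member paths.
            tar.extractall(raw_dir, filter="data")
        else:
            tar.extractall(raw_dir)
    return names
```

Only one chunk is held in memory at a time, so peak memory use stays near the chunk size regardless of how large the archive is.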

B. Data Engineering & Normalization Pipeline

The raw MongoDB export is processed through a multi-stage pandas pipeline to ensure data integrity.

  • Taxonomy Merging: Consolidates various classification standards (CSET v0, CSET v1, GMF, MIT) into a unified dataset for comparative research.
  • Entity Resolution: Normalizes incident_id and report_number using string stripping and type conversion to ensure clean join keys.
  • Duplicate Handling: Applies a duplicate_map derived from duplicates.csv to map duplicate incidents back to their "true" incident number, preventing over-counting.
  • Relational Joins: Executes a "many-to-one" merge that attaches each individual report to the summary of the incident it belongs to (many reports per incident).
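
A minimal pandas sketch of the key normalization, duplicate mapping, and merge described above follows. Only incident_id, report_number, and duplicate_map come from the text; the function names, the title column, and the choice to represent duplicate_map as a plain dict of duplicate ID to true ID are assumptions for illustration.

```python
import pandas as pd


def clean_key(df, column):
    """Strip whitespace, coerce the key to integer, and drop unusable rows."""
    as_text = df[column].astype(str).str.strip()
    numeric = pd.to_numeric(as_text, errors="coerce")
    keep = numeric.notna()
    out = df.loc[keep].copy()
    out[column] = numeric[keep].astype("int64")
    return out


def build_master(incidents, reports, duplicate_map):
    """Join reports to incidents after re-pointing duplicate incident IDs."""
    incidents = clean_key(incidents, "incident_id")
    reports = clean_key(clean_key(reports, "incident_id"), "report_number")

    # Drop incident rows flagged as duplicates; their reports are
    # re-pointed at the "true" incident instead.
    incidents = incidents[~incidents["incident_id"].isin(list(duplicate_map))]
    reports["incident_id"] = reports["incident_id"].map(
        lambda i: duplicate_map.get(i, i)
    )

    # validate= makes pandas raise if the incident side is not unique
    # on the key, which would silently multiply rows otherwise.
    return reports.merge(
        incidents, on="incident_id", how="left", validate="many_to_one"
    )
```

Passing validate="many_to_one" turns a silent over-count into an immediate error, which is useful when the upstream snapshot changes.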

C. High-Performance Browser-Based Explorer

The core of the user experience is a custom framework built using an HTML/CSS/JS template.

  • Kernel Independence: Standard Python table libraries in Google Colab route every search and filter interaction back through the Python kernel. With large datasets, this causes the kernel to become unresponsive.
  • JSON Serialization: Our custom framework serializes the entire dataset into JSON at render time.
  • Client-Side Processing: By handling all filtering, sorting, and row selection directly in the browser using JavaScript, we achieve near-instantaneous UI responses for any dataset that fits in the browser's memory.
  • Dynamic Filtering: Supports real-time filtering across Harm Domain, Sector, Severity, and Risk Domain.
  • Research Export: Includes a built-in CSV download helper that allows users to export filtered or selected results directly from the browser.
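
The serialize-once, filter-in-browser idea can be sketched as follows. This is not the project's template: the function names and the markup are hypothetical, and a single free-text filter stands in for the per-column filters listed above. In a notebook the returned string could be displayed with IPython.display.HTML.

```python
import json

_TEMPLATE = """<input id="q" placeholder="Filter rows">
<table id="t"></table>
<script>
const ROWS = __ROWS__;
const COLS = __COLS__;
function draw() {
  const q = document.getElementById("q").value.toLowerCase();
  // Filtering runs entirely in the browser; Python is not called again.
  const hits = ROWS.filter(r =>
    COLS.some(c => String(r[c] ?? "").toLowerCase().includes(q)));
  const table = document.getElementById("t");
  table.innerHTML = "";
  const header = Object.fromEntries(COLS.map(c => [c, c]));
  for (const r of [header, ...hits]) {
    const tr = table.insertRow();
    for (const c of COLS) tr.insertCell().textContent = r[c] ?? "";
  }
}
document.getElementById("q").addEventListener("input", draw);
draw();
</script>"""


def to_script_json(value):
    """Serialize to JSON that is safe to embed inside a <script> element."""
    # Escaping "</" keeps a literal "</script>" in the data from closing
    # the tag early; the escaped form decodes to the same string.
    return json.dumps(value, ensure_ascii=False).replace("</", "<\\/")


def build_explorer_html(records, columns):
    """Return a self-contained HTML fragment with the data embedded once."""
    html = _TEMPLATE.replace("__COLS__", to_script_json(columns))
    return html.replace("__ROWS__", to_script_json(records))
```

Because the data is embedded when the fragment is rendered, later keystrokes in the filter box never reach the Python kernel.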

3. Operational Guidelines

  1. Initialize API: Run the code blocks to start the uvicorn-managed FastAPI server. It binds to 0.0.0.0:8000 and is reachable locally at http://localhost:8000.
  2. Fetch Data: Execute the download_latest request to automatically pull and extract the dataset from the AIID website.
  3. Data Processing: Run the normalization cells to merge classifications and resolve incident duplicates.
  4. Interactive Exploration: Use the render_incident_explorer function to launch the JS-powered UI and begin data analysis.

4. Limitations & Assumptions

A. Data Limitations

1. Reporting Bias

The AI Incident Database relies on publicly reported incidents. As a result:

  • Incidents from highly regulated sectors or English-language media may be overrepresented.
  • Underreported regions, private deployments, or minor incidents may not be captured.
  • The dataset reflects visibility, not necessarily true global incidence rates.

2. Incomplete or Sparse Fields

Several classification attributes (e.g., harm domain, severity, sector) may contain missing or partially filled values.

  • Analytical outputs should be interpreted with awareness of column-level missingness.
  • Absence of classification does not imply absence of harm.
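
Column-level missingness can be checked before interpreting any chart or count. The sketch below assumes the merged data is a pandas DataFrame; the function name is hypothetical, and treating whitespace-only strings as missing is an assumption about how blanks appear in the export.

```python
import pandas as pd


def missingness_report(df, columns=None):
    """Count missing values per column and report them as a share of rows."""
    cols = list(columns) if columns is not None else list(df.columns)
    # Treat empty or whitespace-only strings as missing, as well as NaN/None.
    blank = df[cols].apply(lambda s: s.astype(str).str.strip().eq(""))
    missing = df[cols].isna() | blank
    report = pd.DataFrame({
        "missing": missing.sum(),
        "share": missing.mean().round(3),
    })
    return report.sort_values("share", ascending=False)
```

A column with a high share of missing values is a signal to report counts alongside the number of classified incidents rather than the total.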

3. Taxonomy Evolution Over Time

Classification standards (CSET v0, CSET v1, GMF, MIT) evolved across snapshots.

  • Harmonization into a unified schema may collapse distinctions that the individual taxonomies keep separate.
  • Certain historical fields may not map perfectly into newer standards.

4. Duplicate Resolution Assumptions

The duplicate_map derived from duplicates.csv assumes:

  • The "true" incident ID identified in the source file is correct.
  • All duplicates are properly declared in the snapshot metadata.

Undetected duplicates may still exist in the dataset.
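
One optional sanity check on these assumptions, not described in the original pipeline, is to make sure the map has no chains or cycles before applying it. The sketch below assumes duplicate_map is a dict from duplicate incident ID to true incident ID; the function name is hypothetical.

```python
def flatten_duplicate_map(duplicate_map):
    """Follow chains such as 7 -> 5 -> 2 so each duplicate maps to its final ID."""
    flat = {}
    for start in duplicate_map:
        seen = {start}
        current = duplicate_map[start]
        # Keep following the map while the target is itself a duplicate.
        while current in duplicate_map:
            if current in seen:
                raise ValueError(
                    f"cycle in duplicate map involving incident {current}"
                )
            seen.add(current)
            current = duplicate_map[current]
        flat[start] = current
    return flat
```

If the flattened map differs from the original, at least one "true" incident in the source file is itself declared a duplicate of another.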

B. Technical Assumptions

1. Snapshot Stability

The ingestion pipeline assumes:

  • Snapshot file naming follows the regex structure: backup-(\d{8})\d*.*\.(tar\.bz2|zip)
  • The schema within snapshots remains structurally consistent.

Major upstream schema changes may require pipeline updates.
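
A cheap guard against silent schema drift is to compare each extracted file's header with the columns the pipeline relies on. This sketch uses only the standard library; the function name and any list of required columns passed to it are assumptions, since the snapshot schema is not spelled out here.

```python
import csv


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return sorted(set(required) - present)
```

Failing early with the list of missing columns is easier to diagnose than a KeyError raised deep inside a merge.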

2. Client-Side Processing Constraints

The browser-based explorer serializes the dataset into JSON at render time.

  • Extremely large future snapshots may hit browser memory constraints.
  • Performance depends on the user's local machine resources.
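
The size of the serialized payload can be measured before it is handed to the browser. In this sketch the function names are hypothetical and the 50 MB default is an arbitrary placeholder, not a measured browser limit.

```python
import json


def json_payload_mb(records):
    """Size in MiB of the UTF-8 JSON that would be embedded in the page."""
    encoded = json.dumps(records, ensure_ascii=False).encode("utf-8")
    return len(encoded) / (1024 * 1024)


def check_payload(records, limit_mb=50.0):
    """Return (fits, size_mb) so the caller can warn or trim columns first."""
    size = json_payload_mb(records)
    return size <= limit_mb, size
```

When the check fails, dropping long free-text columns before rendering is usually the simplest way to shrink the payload.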

3. Local API Environment

The FastAPI service assumes:

  • Local execution permissions
  • Port 8000 availability
  • Stable internet connection during snapshot download

Port conflicts or restricted environments may require reconfiguration.
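
Port availability can be tested before the server is started. The sketch below uses the standard socket module; the function names are hypothetical, and the approach (try to bind, then release) only shows that the port was free at the moment of the check.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start=8000, attempts=20, host="127.0.0.1"):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port > 65535:
            break
        if is_port_free(port, host):
            return port
    return None
```

If port 8000 is taken, the value returned by first_free_port can be passed to the server in its place.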

C. Analytical Assumptions

1. Incident-Level Aggregation

The master table enforces one row = one unique incident. When merging multiple reports per incident, summary-level joins may obscure report-level nuance.
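
The trade-off can be seen in a small pandas sketch that collapses report rows to one row per incident. The function name and the date_published column are assumptions for illustration; only incident_id and report_number are named in this document.

```python
import pandas as pd


def collapse_to_incidents(merged):
    """Reduce report-level rows to one summary row per incident."""
    return (
        merged.groupby("incident_id", as_index=False)
        .agg(
            report_count=("report_number", "nunique"),
            first_report=("date_published", "min"),
            last_report=("date_published", "max"),
        )
    )
```

Everything that varies between reports of the same incident, such as source or wording, is reduced to these summaries, so report-level questions should be answered from the table before it is collapsed.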

2. Neutral Interpretation

The framework is designed for exploratory and descriptive analysis. It does not:

  • Infer causality
  • Assign legal responsibility
  • Predict future AI risk

Interpretation remains the responsibility of the researcher.


5. AI Usage Disclosure

We used ChatGPT and Claude as coding assistants throughout this project to help with debugging, feature engineering, and general code generation. All AI-generated suggestions were reviewed and tested by our team.
