JusticeLens

Inspiration

Boston has made police data publicly available through the Boston Police Index — but raw data alone doesn't tell a story. We were inspired by the gap between data availability and community understanding. Parents don't know which districts have the highest youth arrest rates. Advocates can't easily see whether misconduct complaints are being upheld. We wanted to close that gap using AI and data visualization to turn public records into actionable insight.

What it does

JusticeLens is a multi-page interactive dashboard that analyzes youth-police interactions in Boston across three datasets — arrests, misconduct complaints, and incidents. It surfaces where juvenile arrests are concentrated, reveals racial disparities in who gets arrested, tracks whether the accountability system responds to complaints, and uses Gemini AI to classify unstructured complaint narratives for youth involvement. Users can filter by district and year, toggle between juvenile and all-arrest views, and ask plain-English questions about the data.

How we built it

We downloaded CSV exports directly from the Boston Police Index and the Analyze Boston data portal. Each dataset was cleaned and processed with Python and pandas: standardizing district codes, parsing dates, and creating derived fields such as juvenile flags, school-hours booleans, and charge severity categories.

The most technically complex component was the misconduct complaint analysis. IAD complaint narratives are stored as unstructured PDF documents. We used the Gemini API to extract and classify these narratives, identifying youth-involved complaints with structured output that includes confidence scores and key evidence phrases. This replaced unreliable keyword matching with semantic understanding.

Four Streamlit pages were built in parallel, one per dataset, with shared CSS styling, a common color system, and filters synchronized through Streamlit session state. Charts are built with Plotly Express. The app is deployed on a Vultr Ubuntu instance, running inside a screen session on port 8501.
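The cleaning step described above (standardized district codes, parsed dates, juvenile and school-hours flags) can be sketched roughly as follows. This is a minimal illustration in plain Python rather than our pandas code. The column names, the age cutoff of 17, the 8:00 to 15:00 weekday window, and the date format are all assumptions made for the example, and the school-hours rule ignores holidays and summer break.

```python
from datetime import datetime

# Assumptions (illustrative only): column names, age cutoff,
# school-hours window, and the timestamp format.
JUVENILE_MAX_AGE = 17
SCHOOL_START_HOUR = 8
SCHOOL_END_HOUR = 15  # exclusive upper bound


def standardize_district(raw):
    """Trim and upper-case a district code; return None when it is blank."""
    code = (raw or "").strip().upper()
    return code or None


def derive_fields(row):
    """Return a copy of one arrest row with cleaned and derived fields added."""
    arrested_at = datetime.strptime(row["arrest_datetime"], "%Y-%m-%d %H:%M")
    is_weekday = arrested_at.weekday() < 5  # Monday=0 ... Friday=4
    in_window = SCHOOL_START_HOUR <= arrested_at.hour < SCHOOL_END_HOUR
    return {
        **row,
        "district": standardize_district(row.get("district")),
        "arrest_year": arrested_at.year,
        "is_juvenile": int(row["age"]) <= JUVENILE_MAX_AGE,
        # Weekday daytime only; holidays and summer are not excluded here.
        "school_hours": is_weekday and in_window,
    }


# Hand-written sample row, not a real record.
sample = {"arrest_datetime": "2024-03-05 10:30", "age": "16", "district": " b2 "}
cleaned = derive_fields(sample)
```

In pandas the same rules would typically become vectorized column operations, but the per-row version makes each derived field easy to inspect.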

Challenges we ran into

The biggest challenge was the lack of shared keys between datasets. Arrest records contain civilian information only, with no officer identifiers, and the officer name field in the misconduct data is entirely null. This made direct row-level joins impossible, so we reframed our analysis around parallel storytelling rather than direct linkage. We also discovered that 9.4% of arrest records have no district value, and that the gap grew from 263 records in 2020 to 2,637 in 2024, a data quality issue we turned into a transparency finding.
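A check like the missing-district finding above can be sketched as a per-year count of rows with a blank district. The function name, the column names, and the toy rows are hypothetical and exist only to show the shape of the calculation; they are not real arrest records.

```python
from collections import Counter


def missing_district_by_year(rows):
    """Map each year to (missing_count, total_count, missing_share)."""
    totals = Counter()
    missing = Counter()
    for row in rows:
        year = row["arrest_year"]
        totals[year] += 1
        # Treat None, empty strings, and whitespace-only values as missing.
        if not (row.get("district") or "").strip():
            missing[year] += 1
    return {
        year: (missing[year], totals[year], missing[year] / totals[year])
        for year in sorted(totals)
    }


# Toy rows for illustration only; these are not real arrest records.
toy_rows = [
    {"arrest_year": 2020, "district": "B2"},
    {"arrest_year": 2020, "district": ""},
    {"arrest_year": 2024, "district": None},
    {"arrest_year": 2024, "district": "  "},
    {"arrest_year": 2024, "district": "C11"},
    {"arrest_year": 2024, "district": "A1"},
]
summary = missing_district_by_year(toy_rows)
```

Reporting both the raw count and the share per year matters here, because a growing count can reflect either worse data quality or simply more records.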

Accomplishments that we're proud of

We're proud of the PDF extraction pipeline, which uses the Gemini API to read unstructured IAD complaint documents and classify youth-involved cases with structured output: confidence scores, evidence phrases, and finding categories. This turned previously inaccessible narrative text into analyzable data.

We're proud of surfacing racial disparities that exist in the raw data but were never clearly visible, showing that juvenile arrest patterns differ significantly from adult patterns across districts, charges, and time.

We're proud of connecting four separate datasets into one coherent story about youth-police interaction, from first contact through accountability, despite the datasets having no shared keys and being collected across different time periods.

And we're proud of building a tool that is honest about what it doesn't know. Data gaps, partial years, and missing fields are disclosed openly rather than hidden, because transparency about limitations is part of what makes a civic accountability tool trustworthy.
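To make the structured output from the complaint classifier concrete, here is a minimal sketch of how one model reply might be validated before it reaches the dashboard. It does not call the Gemini API. The field names (`youth_involved`, `confidence`, `evidence_phrases`, `finding`), the `YouthClassification` type, and the sample reply are assumptions for illustration, not our exact schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class YouthClassification:
    """Hypothetical schema for one classified complaint narrative."""
    youth_involved: bool
    confidence: float
    evidence_phrases: tuple
    finding: str


def parse_classification(raw_json):
    """Parse one model reply into a YouthClassification, or raise ValueError."""
    try:
        data = json.loads(raw_json)  # invalid JSON raises a ValueError subclass
        youth_involved = data["youth_involved"]
        confidence = float(data["confidence"])
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed classification: {exc!r}") from exc
    if not isinstance(youth_involved, bool):
        raise ValueError("youth_involved must be true or false")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return YouthClassification(
        youth_involved=youth_involved,
        confidence=confidence,
        evidence_phrases=tuple(str(p) for p in data.get("evidence_phrases", [])),
        finding=str(data.get("finding", "unknown")),
    )


# Hypothetical reply text, written by hand for this example.
reply = (
    '{"youth_involved": true, "confidence": 0.9, '
    '"evidence_phrases": ["her 15-year-old son"], "finding": "sustained"}'
)
result = parse_classification(reply)
```

Rejecting malformed replies up front keeps low-quality classifications out of the charts, and storing the evidence phrases alongside the label lets a reader check why a complaint was flagged.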

What we learned

Real civic data is messy and incomplete by design. The absence of shared keys between datasets isn't accidental. It's a structural gap that makes accountability harder to trace. Working around it taught us as much about the system as the data itself did.