About the Project — FIRSight
Inspiration
Police stations and law-enforcement agencies accumulate thousands of First Information Reports (FIRs) every year. These reports carry rich signals about modus operandi, victims, locations, and timelines, but they are usually locked in free text and rarely analyzed at scale. FIRSight is motivated by the idea that AI and NLP can turn those narratives into actionable intelligence, helping investigators spot links, predict hotspots, and prioritize cases.
What we will learn (goals)
By building FIRSight we aim to learn and demonstrate:
- How to preprocess and anonymize sensitive, multilingual legal text.
- How to extract entities and events with Named Entity Recognition (NER) and relation extraction.
- How to represent FIRs as dense embeddings and cluster similar incidents.
- How to combine spatial-temporal modeling and ML to predict likely hotspots.
- Best practices for privacy, bias mitigation, and explainability in public-safety AI.
Planned system overview — how we will build it
The project is structured in phases (design → data → models → interface → evaluation). Below is a practical plan and the concrete components we will implement.
1. Data & privacy
- Data sources (planned): anonymized FIR excerpts from partner agencies or public crime datasets; synthetic data to bootstrap models.
- Anonymization pipeline: remove/replace personal identifiers (names, addresses, phone numbers) using regex + NER-based redaction.
- Multilingual handling: use multilingual tokenizers and models (or language detection → per-language pipeline).
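As a sketch of the regex layer of this anonymization pipeline (NER-based redaction for names and addresses would run on top of it), the patterns and placeholder labels below are illustrative assumptions, not the final rule set:

```python
import re

# Illustrative redaction patterns; real deployments would need a broader,
# locale-aware rule set plus NER-based redaction for names and addresses.
PATTERNS = {
    "PHONE": re.compile(r"(?:\+91[-\s]?)?\b\d{10}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace each matched identifier with a typed placeholder like [PHONE]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Complainant reachable at 9876543210 or a.b@example.com"))
```

Typed placeholders (rather than blanking) keep the redacted text usable for downstream NER and embedding, since the model still sees that *some* phone number or email was present.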
2. Text processing & feature extraction
- Preprocessing: normalize punctuation, expand abbreviations, sentence-splitting, transliteration where needed.
- NER & event extraction: fine-tune an NER model to tag roles (suspect, victim), locations, dates, and crime types.
- Embedding: convert each FIR to a vector embedding via a sentence transformer or other embedding model.
- Similarity & clustering: compute pairwise similarity and cluster related FIRs. We’ll use cosine similarity:
\[ \text{Similarity}(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|} \]
where \(A,B\in\mathbb{R}^d\) are embeddings. For TF–IDF baselines the score follows the standard term weighting:
\[ \text{tfidf}(t,d)=\text{tf}(t,d)\cdot\log\frac{N}{\text{df}(t)} \]
where \(N\) is the number of documents in the corpus and \(\text{df}(t)\) is the number of documents containing term \(t\).
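The two scoring formulas above translate directly to code. The helpers below are illustrative stand-ins for the optimized library implementations we would actually use in the pipeline, not the production code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between embeddings a and b (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def tfidf(tf, df, n_docs):
    """tf-idf weight: term frequency times log(N / document frequency)."""
    return tf * math.log(n_docs / df)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```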
3. Pattern detection & hotspot prediction
- Case linking: for each new FIR, compute similarity to historical clusters and flag probable links above a threshold.
- Hotspot modeling: aggregate geolocated incidents into spatial grid cells; use spatial smoothing or a simple regression / time-series model to score hotspot probability for cells.
- Explainability: present top tokens/entities and exemplar FIRs that led to a link or hotspot prediction.
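A minimal sketch of the grid-aggregation step above, assuming a fixed cell size (`CELL_DEG` here is an illustrative value, not a tuned parameter); the real model would add spatial smoothing and a temporal component on top of these raw counts:

```python
from collections import Counter

CELL_DEG = 0.01  # assumed cell size in degrees, roughly 1 km at the equator

def cell_of(lat, lon):
    """Map a coordinate to its (row, col) grid cell index."""
    return (round(lat // CELL_DEG), round(lon // CELL_DEG))

def hotspot_counts(incidents):
    """incidents: iterable of (lat, lon) pairs -> Counter keyed by cell."""
    return Counter(cell_of(lat, lon) for lat, lon in incidents)

# Two nearby incidents fall in the same cell; the third lands elsewhere.
counts = hotspot_counts([(28.6140, 77.2090), (28.6141, 77.2091), (28.70, 77.10)])
print(counts.most_common(1))
```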
4. Interface & deliverables
- Dashboard: searchable case list, cluster explorer, hotspot map, investigation checklist generator, and exportable summaries.
- APIs: endpoints to submit new FIRs and return linked cases / hotspot risk.
- Report generator: LLM-assisted draft checklists and investigation leads (templates + citations to source FIRs).
5. Evaluation & metrics
- Linking accuracy: precision / recall on known linked case pairs.
- Hotspot quality: area under the ROC curve (AUC) or precision@k for predicted high-risk cells vs. withheld incidents.
- Human-in-the-loop validation: investigator feedback on usefulness and false positives.
- Performance: latency for embedding + retrieval; storage/indexing efficiency.
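The precision@k metric for hotspot quality can be sketched as follows; the cell IDs and held-out incident set are hypothetical:

```python
def precision_at_k(ranked_cells, hot_cells, k):
    """Fraction of the top-k predicted cells that held a withheld incident."""
    top_k = ranked_cells[:k]
    return sum(1 for c in top_k if c in hot_cells) / k

# Hypothetical ranking of grid cells vs. cells where held-out incidents fell.
ranked = ["c7", "c2", "c9", "c1", "c5"]
actual = {"c2", "c5", "c8"}
print(precision_at_k(ranked, actual, 3))  # only c2 of the top 3 is a true hot cell
```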
Challenges we expect & mitigation strategies
- Data quality & heterogeneity: FIRs differ greatly in style. Mitigation: robust preprocessing, synthetic augmentation, and human-in-the-loop corrections.
- Multilingual text and code-mixing: many reports mix languages or local transliteration. Mitigation: language detection + transliteration + multilingual models.
- Privacy & legal constraints: handling PII and legal sensitivity. Mitigation: strong anonymization, local processing (no raw uploads to third parties), access controls, and audit logs.
- Model bias & misuse: risk of biased suggestions or wrongful links. Mitigation: transparency, conservative thresholds, human review requirement, and regular bias audits.
- Scalability: indexing and searching thousands of long reports. Mitigation: use vector indexes (e.g., FAISS/Annoy) + inverted index hybrid approach.
Example technical stack (suggested)
- Data & infra: PostgreSQL (metadata) + object store for raw text; vector index (FAISS / Milvus).
- NLP & models: Hugging Face transformers (sentence-transformers), spaCy / Stanza for NER, lightweight LLM (for checklist drafts).
- Backend & APIs: Python (FastAPI) for microservices.
- Frontend: React + Map library (Leaflet/Mapbox) for hotspot visualization.
- Deployment: Docker + Kubernetes or a simple VPS depending on scale.
Milestones (ordered steps — no durations)
- Gather & anonymize sample FIR data (or synthesize realistic samples).
- Build preprocessing & NER pipeline; validate entity extraction.
- Create embedding + similarity retrieval; test linking on held-out examples.
- Implement hotspot aggregation & baseline prediction.
- Prototype dashboard showing clusters and map.
- Integrate LLM for checklist/report drafts and run human evaluation.
- Audit, harden privacy, and prepare documentation for pilot deployment.
Success criteria / deliverables
- Working prototype that ingests an FIR and returns: (a) top linked cases, (b) cluster membership, (c) hotspot score, and (d) an investigator checklist draft citing source text.
- Evaluation report with metrics (precision/recall, hotspot evaluation) and a privacy / bias mitigation summary.
- Demo dashboard that non-technical stakeholders can use to explore linked cases and map hotspots.
Future scope (beyond initial build)
- Integrate structured data sources (CCTV logs, call records) to enrich linking.
- Temporal event-sequence models to infer probable offender movement.
- Live alerting for emergent crime patterns and mobile investigator app.