🕵️‍♂️ SHERLOCK INVESTIGATOR: Project Story

Inspiration

Modern multimodal AI is incredibly powerful, but most applications still treat vision as passive.

You upload an image → the AI describes it.

That's where we saw the gap.

In real investigative, restoration, archival, and forensic workflows, experts don't just look; they:

  • Zoom into details
  • Enhance degraded visuals
  • Cross-reference archives
  • Verify identifiers
  • Build evidence chains

We asked a simple question:

What if AI could investigate visual evidence the way a human detective does?

This idea became Sherlock Investigator: an agentic forensic analyst that transforms passive perception into active investigation.


What it does

Sherlock Investigator is a multimodal AI system that analyzes images and video frames like a digital detective.

Instead of answering "What is this?", it answers:

"What is this, how do we know, and what evidence supports it?"

Core capabilities include:

  • 📹 Video & image forensic analysis
  • 🔍 Region-of-interest detection (plates, decals, features)
  • 🧠 Agentic reasoning with visible thought logs
  • 🌐 Search grounding for verification
  • 📊 Confidence scoring across evidence factors
  • 🗂️ Structured forensic verdict reports

Example workflow:

  1. The user uploads archival footage.
  2. The AI extracts key frames.
  3. It enhances details (contrast, OCR, zoom).
  4. It identifies vehicles, objects, or artifacts.
  5. It cross-references historical databases.
  6. It produces a grounded verdict with sources.
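Step 2 of this workflow reduces to choosing which frames to analyze. A minimal sketch of that sampling (a real build would hand the chosen indices to a decoder such as ffmpeg or OpenCV, which this deliberately leaves out):

```python
def extract_key_frames(n_frames, fps, every_seconds=2.0):
    """Pick evenly spaced key-frame indices (workflow step 2).

    Only selects *which* frames to analyze; actual decoding is out of scope.
    """
    step = max(1, int(fps * every_seconds))
    return list(range(0, n_frames, step))

# A 4-second clip at 25 fps, sampled every 2 seconds -> frames 0 and 50.
indices = extract_key_frames(100, 25)
```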

Mathematically, the confidence model aggregates evidence weights:

[ C_{final} = \sum_{i=1}^{n} w_i \cdot e_i ]

Where:

  • ( e_i ) = Evidence factor score
  • ( w_i ) = Reliability weight
  • ( C_{final} ) = Final identification confidence
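In code, the weighted sum looks like the sketch below. The factor names and weights are illustrative, not our production values, and unlike the bare formula we normalize by the total weight so the score stays in [0, 1] (the formula assumes the weights already sum to 1):

```python
def aggregate_confidence(evidence):
    """C_final = sum(w_i * e_i) over evidence factors, normalized by sum(w_i).

    `evidence` maps factor name -> (score e_i in [0, 1], reliability weight w_i).
    """
    total_weight = sum(w for _, w in evidence.values())
    return sum(e * w for e, w in evidence.values()) / total_weight

# Illustrative factors for a vehicle identification (not production weights).
evidence = {
    "plate_ocr":      (0.9, 3.0),   # strong identifier, high reliability
    "body_shape":     (0.7, 1.5),
    "decal_match":    (0.6, 1.0),
    "archive_record": (0.8, 2.0),
}
c_final = aggregate_confidence(evidence)
```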

How we built it

We built Sherlock Investigator using the Gemini 3 family via Google AI Studio, leveraging its multimodal and reasoning capabilities.

AI Stack

  • Gemini 3 Flash → Fast visual scanning & agent loops
  • Gemini 3 Pro → Deep reasoning & long-context analysis

Key API features used

  • Multimodal vision understanding
  • Code execution for image enhancement
  • Search grounding for verification
  • Structured output reasoning traces

System Architecture

Frontend

  • Investigation dashboard UI
  • Evidence viewport with overlays
  • Verdict cards & confidence panels

Backend orchestration

  • Frame extraction pipeline
  • Enhancement filters (contrast, zoom)
  • OCR & feature detection
  • Grounded search verification

Agent loop

  1. Detect features
  2. Enhance evidence
  3. Extract identifiers
  4. Search & verify
  5. Resolve discrepancies
  6. Produce verdict
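A minimal version of that loop, with the model calls replaced by injectable callables (detection, enhancement, extraction, and grounded verification are all stand-ins here, not our actual Gemini calls):

```python
from dataclasses import dataclass, field

@dataclass
class Investigation:
    frame: str                                  # frame identifier (stand-in for pixel data)
    log: list = field(default_factory=list)     # visible thought log
    identifiers: list = field(default_factory=list)
    verified: bool = False

def run_agent_loop(inv, detect, enhance, extract, verify, max_passes=3):
    """One investigation cycle per pass: detect -> enhance -> extract -> verify."""
    for step in range(max_passes):
        regions = detect(inv.frame)
        inv.log.append(f"pass {step}: {len(regions)} region(s) of interest")
        for region in regions:
            inv.identifiers += extract(enhance(region))
        if verify(inv.identifiers):
            inv.verified = True
            inv.log.append("verdict: identifiers grounded in sources")
            break
    return inv

# Stub callables: "grounded" once two independent reads agree.
inv = run_agent_loop(
    Investigation("frame_0042"),
    detect=lambda frame: ["plate_region"],
    enhance=str.upper,
    extract=lambda region: [region],
    verify=lambda ids: len(ids) >= 2,
)
```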

Challenges we ran into

1. Low-quality archival footage

Many test videos were:

  • Grainy
  • Motion blurred
  • Low resolution

Solution:

We implemented iterative enhancement:

  • Contrast boosting
  • Region zoom
  • Multi-pass OCR
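The iterative enhancement can be sketched as a contrast stretch feeding a multi-pass OCR wrapper. The OCR engine is passed in as a callable, and pixels here are plain grayscale ints rather than a real image format:

```python
def boost_contrast(pixels, factor=1.5):
    """Linear contrast stretch around mid-gray (128), clamped to [0, 255]."""
    return [[max(0, min(255, int(128 + (p - 128) * factor))) for p in row]
            for p_row in [None] for row in pixels]

def multi_pass_ocr(pixels, ocr, factors=(1.0, 1.5, 2.0)):
    """Run the OCR callable at several contrast levels; keep the longest read."""
    best = ""
    for factor in factors:
        text = ocr(boost_contrast(pixels, factor))
        if len(text) > len(best):
            best = text
    return best

# Stub OCR that only "reads" a plate once the pixel is bright enough;
# this low-contrast frame succeeds only on the third pass (factor 2.0).
fake_ocr = lambda px: "ABC123" if px[0][0] > 150 else ""
plate = multi_pass_ocr([[140]], fake_ocr)
```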

2. Hallucination risk in identification

Historical identification must be verifiable.

Solution:

We enforced grounding:

  • Claims require source matches
  • Plates & identifiers cross-checked
  • Verdict confidence tied to evidence
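A sketch of that enforcement: claims arrive as (statement, sources, score) tuples, ungrounded claims are dropped, and confidence is derived only from what survives. The tuple shape is illustrative, not our production schema:

```python
def grounded_verdict(claims):
    """Keep only claims with at least one source; confidence comes from those.

    `claims`: list of (statement, sources, score) tuples (illustrative shape).
    """
    grounded = [(stmt, srcs, score) for stmt, srcs, score in claims if srcs]
    if not grounded:
        return {"verdict": "inconclusive", "confidence": 0.0, "sources": []}
    confidence = sum(score for _, _, score in grounded) / len(grounded)
    return {
        "verdict": grounded[0][0],
        "confidence": round(confidence, 2),
        "sources": sorted({s for _, srcs, _ in grounded for s in srcs}),
    }

verdict = grounded_verdict([
    ("1965 Mustang fastback", ["registry/ford/1965"], 0.85),
    ("original factory paint", [], 0.9),   # no source match -> dropped
])
```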

3. Latency vs reasoning depth

Deep analysis slowed demos.

Solution:

Two-tier processing:

  • Flash → fast visual loops
  • Pro → deep archival reasoning
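The routing itself is a small lookup. The task names and model identifier strings below are placeholders for illustration, not official API model IDs:

```python
# Placeholder task names and model IDs; the real mapping lives in our config.
DEEP_TASKS = {"archival_reasoning", "long_context_crosscheck", "final_verdict"}

def route(task):
    """Pick a tier per task; unknown tasks default to the low-latency tier."""
    if task in DEEP_TASKS:
        return "gemini-3-pro"     # deep archival reasoning
    return "gemini-3-flash"       # fast visual loops (and the safe default)
```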

4. Making reasoning understandable

Raw model reasoning is unreadable to users.

Solution:

We designed a Thought Log UI that translates reasoning into:

  • Planning steps
  • Hypotheses
  • Cross-references
  • Conclusions
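Internally this amounts to bucketing raw reasoning entries into those four sections. A minimal sketch, where `ThoughtEntry` is a hypothetical type and the phase names simply mirror the list above:

```python
from dataclasses import dataclass

PHASES = ("planning", "hypothesis", "cross-reference", "conclusion")

@dataclass
class ThoughtEntry:
    phase: str   # one of PHASES
    text: str

def render_thought_log(entries):
    """Bucket raw reasoning entries into the four Thought Log sections."""
    sections = {phase: [] for phase in PHASES}
    for entry in entries:
        bucket = entry.phase if entry.phase in PHASES else "planning"
        sections[bucket].append(entry.text)
    return sections

log = render_thought_log([
    ThoughtEntry("planning", "zoom into rear plate region"),
    ThoughtEntry("hypothesis", "vehicle is a 1965 Mustang"),
    ThoughtEntry("cross-reference", "plate prefix matches 1965 registry"),
    ThoughtEntry("conclusion", "identification grounded, confidence 0.85"),
])
```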

Accomplishments that we're proud of

  • Built a true agentic vision investigator, not a chatbot
  • Enabled visible reasoning transparency
  • Implemented grounded verification pipelines
  • Designed a forensic evidence UX system
  • Achieved high-confidence identification from degraded media

Most importantly:

We demonstrated that multimodal AI can investigate, not just describe.


What we learned

This project taught us:

Technical

  • Agent loops dramatically improve vision accuracy
  • Grounding reduces hallucinations significantly
  • Enhancement pipelines are critical for OCR

Product

  • Users trust AI more when reasoning is visible
  • Evidence presentation matters as much as accuracy
  • Confidence scoring improves decision usability

Research insight

Passive vision is insufficient for expert workflows.

Agentic investigation is the future of multimodal AI.


What's next for SHERLOCK INVESTIGATOR

We see Sherlock evolving into a full forensic intelligence platform.

Planned expansions

📚 Deep Archive Mode

  • Upload manuals, films, registries
  • Long-context cross-verification

🔊 Audio Forensics

  • Engine sound diagnostics
  • Mechanical anomaly detection

🛰️ Geospatial Investigation

  • Location inference from footage
  • Historical map grounding

🧾 Chain-of-custody reporting

  • Court-ready forensic documentation

🛠️ Restoration assistant

  • Identify parts
  • Locate replacements
  • Verify authenticity

Vision

Sherlock Investigator represents a shift:

[ \text{Passive Vision} \rightarrow \text{Active Investigation} ]

We believe the next generation of AI systems won't just see the world…

Theyโ€™ll investigate it.

