DeltaMind - Visual Change Detection Engine
Hackathon Project | Real-Time Visual Monitoring Pipeline
A general-purpose visual comparison engine that detects, segments, and tracks visual changes across time-series images and video streams. DeltaMind is designed to handle any visual comparison task, from manufacturing inspections to infrastructure monitoring.
Inspiration
Most modern visual inspection systems are brittle and siloed. A model trained to detect scratches on a factory gear cannot detect cracks in a bridge, and a system that can check brand logos cannot monitor infrastructure.
We were inspired by the need for a single, general-purpose engine that could handle any visual comparison task, regardless of the domain. Our goal was to build a universal translator for visual change.
What It Does
DeltaMind is a modular, three-stage pipeline that analyzes video streams or image sequences to find and isolate changes as they evolve over time. Instead of comparing only two static images, it continuously monitors a process, comparing each new frame against a "normal" reference state.
Stage 1: Align (The Foundation)
It first performs digital registration of every new frame in the sequence. In real-world monitoring, this step is necessary to remove camera jitter, slight variations in camera angle or zoom, and even changes in illumination. Geometric registration (using SIFT/ORB) followed by photometric normalization ensures that every frame is compared "apples-to-apples" against the reference, whether that is the previous frame or an initial "golden" state.
Result: Two images that are perfectly lined up, as if taken from the exact same spot at the same time.
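For readers who want the gist in code, here is a minimal sketch of the geometric half of this step using OpenCV's ORB detector and a RANSAC homography; the function and parameter names are illustrative, not our exact implementation:

```python
import cv2
import numpy as np

def align_frame(frame, reference, max_features=1000, keep_ratio=0.8):
    """Warp `frame` onto `reference` using ORB keypoints and a homography."""
    gray_f = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)

    orb = cv2.ORB_create(max_features)
    kp_f, des_f = orb.detectAndCompute(gray_f, None)
    kp_r, des_r = orb.detectAndCompute(gray_r, None)

    # Hamming distance for ORB's binary descriptors; keep only the best matches
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_f, des_r), key=lambda m: m.distance)
    matches = matches[: int(len(matches) * keep_ratio)]

    src = np.float32([kp_f[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects outlier correspondences (e.g., on moving objects)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = reference.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```

Photometric normalization (histogram matching against the reference) runs after this warp, so both geometry and exposure are consistent before any comparison.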
Stage 2: Detect (The "Dual-Check Brain")
It performs a "Dual-Check" on each aligned frame to detect anomalies with high confidence.
- Fast Check (Temporal-Level): Uses `SSIM` plus absolute difference to immediately flag all possible areas of motion or pixel-level change between successive frames.
- Deep Check (Feature-Level): Uses a `PatchCore` deep learning model. The magic is that this model is trained only on 'normal' video sequences of the process. It learns the "texture and motion of normal" and flags any new, developing defect (a crack forming, a component wearing down, a misprint) that it may never have seen before, simply as an "abnormal" deviation.
By fusing these two checks, we get high-confidence anomalies, filtering out "noise" such as shadows or benign, expected movements.
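Roughly, the fusion looks like the sketch below. It assumes grayscale `uint8` frames and a PatchCore anomaly map already resized to frame resolution and normalized to [0, 1]; the thresholds are placeholders, not our tuned values:

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def fast_check(prev_gray, curr_gray, ssim_thresh=0.6, diff_thresh=30):
    """Temporal check: flag pixels where SSIM drops and absolute difference spikes."""
    _, ssim_map = structural_similarity(prev_gray, curr_gray, full=True, data_range=255)
    diff_map = cv2.absdiff(prev_gray, curr_gray)
    return (ssim_map < ssim_thresh) & (diff_map > diff_thresh)

def fuse(fast_mask, deep_anomaly_map, deep_thresh=0.5):
    """Keep only regions flagged by BOTH the temporal and the feature-level check."""
    deep_mask = deep_anomaly_map > deep_thresh
    fused = fast_mask & deep_mask
    # Morphological opening removes the tiny speckles that survive the AND
    kernel = np.ones((5, 5), np.uint8)
    opened = cv2.morphologyEx(fused.astype(np.uint8), cv2.MORPH_OPEN, kernel)
    return opened.astype(bool)
```

The logical AND is what kills the noise: a shadow moves pixels (fast check fires) but still looks "normal" to PatchCore, while a frozen frame of a pre-existing stain looks abnormal but doesn't change, so neither survives fusion on its own.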
Stage 3: Segment (The "Precise Cut")
The bounding box locations of high-confidence anomalies in each frame are fed as prompts into a Segment Anything Model 2 (SAM 2). This model, in turn, generates a very precise, pixel-level mask of the shape of the change in that particular frame.
By stringing these masks together, DeltaMind doesn't just find a change; it tracks its evolution, enabling analysis of size, shape, and rate of progression over time.
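The prompting pattern itself is simple. The sketch below loads the original SAM checkpoint through the Hugging Face transformers library purely for illustration, since we are certain of that API; the same box-prompt idea is what we apply with SAM 2:

```python
import torch
from transformers import SamModel, SamProcessor

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

def segment_anomaly(frame_rgb, box_xyxy):
    """Turn a detection box [x0, y0, x1, y1] into a pixel-level mask."""
    inputs = processor(frame_rgb, input_boxes=[[box_xyxy]], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Upscale the low-resolution mask logits back to the original frame size
    masks = processor.image_processor.post_process_masks(
        outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
    )
    return masks[0][0, 0].numpy()  # boolean mask at frame resolution
```

Running this per frame with the current detection box as the prompt is what produces the evolving mask sequence.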
How We Built It
We designed DeltaMind as a Python-based system and prototyped the core pipeline to handle video streams.
- Alignment: We employed `OpenCV` to implement `SIFT`/`ORB` keypoint matching and `cv2.warpPerspective`. This aligns each new frame to a reference, compensating for camera vibration or drift. Histogram matching provided photometric normalization against changing light.
- Detection: We implemented the temporal check (`SSIM` + absolute difference) using `OpenCV`. For the deep check, we used the `Anomalib` library to implement and train a `PatchCore` model. Most importantly, this was trained on video sequences of 'normal' operation, so it learned the "texture and motion of normal."
- Segmentation & Tracking: We used the Hugging Face `transformers` library to load a pre-trained `SAM 2` model. Fed prompts from our detection stage, it generates a precise pixel mask for the anomaly in each frame, tracking its progress.
- Captioning (Optional): The sequence of anomaly masks and frames can also be sent to a pre-trained Vision-Language Model (VLM), which translates the visual input into a text description of the event, such as "A linear crack is appearing on the left strut."
- Demo Interface: We wrapped the entire pipeline in a `Streamlit` application. Our interactive demo lets a user upload a video file or connect a live webcam feed for real-time analysis, with a toggle to enable or disable the optional captioning (a simplified sketch of this wrapper follows the list).
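The demo wrapper has roughly this shape; `run_pipeline` below is a stand-in for the Align, Detect, and Segment chain, not a real function in any library:

```python
import cv2
import streamlit as st

st.title("DeltaMind - Live Change Detection")

source = st.radio("Input source", ["Upload video", "Webcam"])
enable_captions = st.checkbox("Enable captioning (VLM)")

cap = None
if source == "Upload video":
    upload = st.file_uploader("Video file", type=["mp4", "avi", "mov"])
    if upload is not None:
        with open("uploaded.mp4", "wb") as f:  # OpenCV needs a file path
            f.write(upload.read())
        cap = cv2.VideoCapture("uploaded.mp4")
else:
    cap = cv2.VideoCapture(0)  # default webcam

frame_slot = st.empty()
if cap is not None and cap.isOpened():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # run_pipeline(frame) would apply Align -> Detect -> Segment and return
        # the frame with the anomaly mask (and optional caption) overlaid
        overlay = frame  # placeholder: hook the real pipeline in here
        frame_slot.image(cv2.cvtColor(overlay, cv2.COLOR_BGR2RGB), channels="RGB")
    cap.release()
```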
Challenges We Ran Into
The real world is messy. Real-time video is even messier. By far the most significant challenge was a flood of False Positives. An approach based on frame-to-frame difference (SSIM) alone was a disaster, flagging everything from shadows to light flicker to camera jitter and even the normal, expected movements of machinery.
We solved this with our "Dual-Check" fusion. Because an anomaly needed to be flagged by both the fast temporal check via SSIM and the deep feature check via PatchCore, the vast majority of environmental and operational noise was filtered out. This fusion forms the crux of the reliability of our system.
Proving the "general-purpose" claim on video was another challenge. The answer was the "normal-only" training paradigm: we don't need to hunt for rare "defect" videos, only easy-to-get footage of "normal operation" for each new use case. The model learns the "texture and motion of normal," so adapting to a new domain is fast.
Accomplishments We Are Proud Of
- We are incredibly proud of the modularity of our 3-stage pipeline. The "Detect" stage is plug-and-play: we built it with `PatchCore`, but we designed the architecture so it could be swapped out for a `Video Autoencoder (VAE)` or even a foundation model such as `DINOv3`, without re-architecting the whole system (see the sketch after this list).
- We also pride ourselves on the real-time robustness of the Align stage. Our demo handles a live webcam feed successfully, correcting for jitter and movement, showing that the core concept is sound for real-world scenarios where cameras are never still.
- Finally, we are proud that we don't just find change; we track it. By passing prompts to `SAM 2` on every frame, we generate a live, evolving mask of the anomaly, showing its precise shape as it grows or moves.
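To make the plug-and-play claim concrete, a hypothetical detector interface could look like the following (the class and method names are illustrative, not our actual code):

```python
from abc import ABC, abstractmethod
import numpy as np

class Detector(ABC):
    """Any 'Detect'-stage backend only has to map a frame to an anomaly heat map."""

    @abstractmethod
    def fit(self, normal_frames: list[np.ndarray]) -> None:
        """Learn the 'texture and motion of normal' from normal-only footage."""

    @abstractmethod
    def anomaly_map(self, frame: np.ndarray) -> np.ndarray:
        """Return a per-pixel anomaly score in [0, 1] at frame resolution."""

# A PatchCoreDetector, a VideoAutoencoderDetector, or a DINOFeatureDetector would
# all implement this interface, so the Align and Segment stages never change.
```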
What We Learned
We realized that such a "general-purpose" monitoring system is not a single magic model, but an intelligent pipeline of specialized modules. Our Alignment module handles the physics of the video feed (jitter and light), whereas our Detection module takes over with the perception of normal versus abnormal motion and texture. This division of labor is the key to making it work reliably in real time.
We also learned that SOTA models, such as SAM 2, are game-changers. Getting precise, pixel-level masks "for free" on each frame makes the output 10x more valuable and actionable than a simple bounding box. You're not just getting an alert; you're watching the precise geometry of a fault evolve.
What's Next for DeltaMind
This prototype, built during a hackathon, proved the core of our real-time monitoring engine. The next steps will involve scaling this into a production-grade MLOps platform.
- Microservices API: Rewrite the processing pipeline using `FastAPI` with `Celery`/`Redis` as the task queue (a sketch follows this list). This is essential for asynchronously ingesting and processing hundreds of concurrent video streams from enterprise clients.
- Multi-Tenancy: Develop a full multi-tenancy backend to manage and serve the "normal-only" models for different clients and assets (e.g., one model for 'Machine A', another for 'Bridge Section 4') from a single unified platform.
- Predictive Analytics: Currently, we detect change. The next step is feeding the time series of the anomaly's growth (e.g., mask size, shape complexity) into a forecasting model such as an `LSTM`. This will take DeltaMind from detecting failure to predicting it, e.g., "At the current growth rate, this crack will reach a critical threshold in 72 hours."
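As a sketch of the planned ingestion path (module names, broker URLs, and the task body are hypothetical), the FastAPI endpoint would only enqueue work, and a Celery worker backed by Redis would run the heavy pipeline:

```python
# tasks.py - Celery worker side
from celery import Celery

app = Celery("deltamind",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def process_stream(stream_url: str, model_id: str) -> dict:
    # Would run Align -> Detect -> Segment on the stream using the
    # tenant-specific "normal-only" model identified by model_id.
    return {"stream_url": stream_url, "model_id": model_id, "status": "processed"}

# api.py - FastAPI ingestion endpoint
from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()

class StreamRequest(BaseModel):
    stream_url: str
    model_id: str

@api.post("/streams")
def ingest(req: StreamRequest):
    # Enqueue and return immediately; the worker processes asynchronously
    job = process_stream.delay(req.stream_url, req.model_id)
    return {"task_id": job.id}
```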
Applications
- Manufacturing Inspections - Detect defects and deformations in production lines
- Brand Compliance - Verify packaging and product consistency
- Infrastructure Monitoring - Identify structural changes and damage
- Automated Audits - Quality control and compliance verification
🔬 Experimental Journey
Before arriving at our current video-based pipeline, we explored multiple methodologies to find the most robust and practical solution.
Approaches We Tested
We implemented and compared four distinct detection strategies:
1. Hybrid Difference + SAM
- Combined pixel-level difference (SSIM + Absolute Difference) with SAM for segmentation
- Result: Excellent accuracy for static image pairs, simple and fast
2. Multi-Fusion Ensemble + SAM
- Blended difference maps at multiple weights and thresholds for robustness
- Result: Accurate but unnecessarily complex
3. Ensemble with K-Means Clustering + SAM2
- Used multiple fused maps with K-Means clustering to group anomaly candidates
- Result: Accurate but slow and required significant parameter tuning
4. Semantic Feature Difference (DINOv3 + SAM2)
- Attempted to use deep vision transformer embeddings to detect semantic changes
- Result: Failed to detect small-scale, pixel-level anomalies
Key Insights
Why Static Approaches Were Insufficient:
- Lacked temporal context to track anomaly evolution
- Required manual frame selection for comparison
- Struggled to distinguish environmental noise from true defects
- No ability to monitor continuous processes
Why DINOv3 Semantic Approach Failed: While DINOv3 excels at semantic understanding, it proved unsuitable for detecting subtle texture changes and missing parts. Pre-trained encoders alone, without task-specific decoders, cannot effectively identify pixel-level anomalies.
The Ideal (But Impractical) Solution: An encoder-decoder architecture trained end-to-end for anomaly detection would theoretically be superior—using a vision encoder (DINOv3/CLIP) paired with a custom decoder trained on normal examples. However, this requires large-scale domain-specific datasets, significant computational resources, and training time unavailable in a hackathon setting.
Our Evolution to DeltaMind
The current video-based pipeline emerged from these learnings:
- Continuous monitoring instead of static snapshots
- Temporal + feature-level fusion to reduce false positives dramatically
- "Normal-only" training paradigm for rapid domain adaptation
- Real-time tracking of anomaly evolution and progression
This modular, practical approach delivers production-ready performance without requiring extensive training data or compute resources.
Built With
- celery
- dinov3
- docker
- fastapi
- huggingface
- mlflow
- opencv
- patch-core
- postgresql
- pytorch
- redis
- sam-2
- scikit-learn
- sift
- ssim
- timm
- transformers