Butterfly

Inspiration

Equipment inspections are slow and daunting. With a shortage of technicians, and with inspections relying on paper checklists and visual judgment, the process is time-consuming and hard to standardize. We wanted to see whether vision-language models could turn something as simple as a 360-degree walk-around video into an inspection report that helps technicians do their job faster.

What it does

The user selects a specific vehicle model, such as a 982M wheel loader
The user uploads a 360° video of that machine

The system:

Knows the exact components for that model
Scans the video every 10 frames for visible defects
Generates a structured inspection report
Recommends replacement parts tied directly to the model

Each detected component is classified as Pass, Monitor, or Fail, with reasoning and a part recommendation so technicians can act on the information immediately.

How we built it

Model and Parts:

The user first selects a vehicle model
The backend uses that model to retrieve the correct inspection structure and part catalog
Since no public API exists, we scraped model-specific parts data from Caterpillar's official parts site (parts.cat.com)
Inspection reference images were taken from Caterpillar inspection PDFs so the model knows the standards Caterpillar inspects against

Video Processing:

The video is split into frames at 10-frame increments
Each frame is analyzed independently to avoid motion blur and missed defects
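The sampling step can be sketched as below. This is a minimal illustration, not our actual pipeline code: the function names are hypothetical, and the stride of 10 is taken from the "every 10 frames" description above. Decoding the frames themselves (for example with a video library) is left out so the sketch stays dependency-free.

```python
def sample_frame_indices(total_frames: int, stride: int = 10) -> list[int]:
    """Return the indices of the frames to analyze: 0, stride, 2*stride, ..."""
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(0, total_frames, stride))


def frame_timestamps(indices: list[int], fps: float) -> list[float]:
    """Map frame indices to seconds so a finding can be traced back to the video."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i / fps for i in indices]
```

Keeping the timestamp next to each sampled frame makes it easier for a technician to jump to the moment in the walk-around video where a defect was flagged.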

VLM:

We fine-tuned Qwen3-VL-8B using inspection imagery to improve accuracy on industrial components
Training used reinforcement learning with Volcano Engine
Training and inference ran on 4 NVIDIA H200 GPUs

Inference Workflow: Each frame is passed to the VLM along with:

A list of model-specific parts
Allowed replacement components for that vehicle

The model outputs structured results in a strict format (component, status, reasoning), for example:

Glass / Mirrors
Fail
Cracked mirror housing visible on right side
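A rough sketch of how the prompt and the output check could fit together is shown here. Everything in it is an assumption made for illustration: the prompt wording, the pipe-delimited `Component | Status | Reasoning` line format, and the names `build_prompt`, `parse_findings`, and `Finding` are hypothetical, since this writeup does not record the exact format. The point it shows is that any line naming a component outside the model-specific list, or a status other than Pass, Monitor, or Fail, is discarded.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {"Pass", "Monitor", "Fail"}


@dataclass
class Finding:
    component: str
    status: str
    reasoning: str


def build_prompt(components: list[str], replacement_parts: list[str]) -> str:
    """Assemble the text sent alongside each frame (wording is illustrative)."""
    return (
        "Inspect the image. Report only on these components:\n"
        + "\n".join(f"- {c}" for c in components)
        + "\nRecommend only these replacement parts:\n"
        + "\n".join(f"- {p}" for p in replacement_parts)
        + "\nAnswer with one line per component: Component | Status | Reasoning"
    )


def parse_findings(raw: str, components: list[str]) -> list[Finding]:
    """Keep only lines that name a known component and a valid status."""
    known = set(components)
    findings = []
    for line in raw.splitlines():
        # Split on the first two pipes only, so the reasoning may contain "|".
        fields = [f.strip() for f in line.split("|", 2)]
        if len(fields) != 3:
            continue  # malformed line, ignore it
        component, status, reasoning = fields
        if component in known and status in ALLOWED_STATUSES:
            findings.append(Finding(component, status, reasoning))
    return findings
```

Rejecting unknown components at parse time is a second line of defense: even if the model ignores the prompt and invents a part, the invented entry never reaches the report.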

Confidence Scoring:

We aggregated token-level probabilities into a single confidence score per finding
This gives technicians a quantitative sense of reliability rather than a binary result.
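One common way to collapse token-level probabilities into a single number is the geometric mean of the token probabilities, which is the exponential of the mean log-probability. The sketch below uses that approach as a stand-in; it is an assumption for illustration and not necessarily the heuristic we shipped, and the function name is hypothetical.

```python
import math


def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric mean of token probabilities, in the range (0, 1].

    token_logprobs holds the natural-log probability of each generated
    token in one finding (assumed to be available from the inference engine).
    """
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)
```

Averaging in log space keeps long findings from being penalized simply for containing more tokens, which a plain product of probabilities would do.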

Editable Reporting:

The final report is editable
Technicians can override findings and adjust statuses before submission
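The override behavior can be modeled roughly as follows. This is a hypothetical data structure, not our actual report schema: the names `ReportEntry`, `technician_status`, and `override` are assumptions. It shows one way to keep the model's original status alongside the technician's correction so that neither is lost.

```python
from dataclasses import dataclass
from typing import Optional

VALID_STATUSES = ("Pass", "Monitor", "Fail")


@dataclass
class ReportEntry:
    component: str
    model_status: str                       # status produced by the VLM
    technician_status: Optional[str] = None  # set only when overridden

    @property
    def final_status(self) -> str:
        """The technician's status wins whenever one has been recorded."""
        return self.technician_status or self.model_status


def override(entry: ReportEntry, new_status: str) -> ReportEntry:
    """Record a technician override without discarding the model's status."""
    if new_status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    entry.technician_status = new_status
    return entry
```

Storing both values also leaves a record of where the model and the technician disagreed, which could later be used to evaluate or retrain the model.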

Challenges we ran into

No official parts API:

We had to scrape and normalize parts data while keeping it accurate for each model

Model Hallucination Risk:

Constraining the model to a fixed, model-specific parts list was extremely important

Confidence Estimation:

Raw logits are hard to interpret, so we designed a heuristic that produces a usable confidence score.

Accomplishments that we're proud of

Built a model-aware inspection system
Integrated OEM parts data directly into the inspection output
Achieved structured, machine-readable reports with human-editable output
Successfully fine-tuned and deployed a large VLM on Caterpillar industrial inspection data