Inspiration
Equipment inspections are slow and daunting. Technicians are in short supply, and those on the job rely on paper checklists and visual judgment, which makes inspections time-consuming and hard to standardize. We wanted to see whether modern tools such as vision-language models could turn something as simple as a 360-degree walk-around video into an inspection report that helps technicians do their job faster.
What it does
- The user selects a specific vehicle model, such as a 982M wheel loader.
- The user uploads a 360° video of that machine.
The system:
- Knows the exact components for that model
- Scans the video every 10 frames for visible defects
- Generates a structured inspection report
- Recommends replacement parts tied directly to the model
Each detected component is classified as Pass, Monitor, or Fail, with reasoning and a part recommendation, so technicians can act on the information immediately.
How we built it
Model and Parts:
- The user first selects a vehicle model
- The backend uses that model to retrieve the correct inspection structure and part catalog from parts.cat.com
- Since no public API exists, we scraped model-specific parts data from Caterpillar's official parts site (parts.cat.com)
- Inspection reference images were taken from Caterpillar inspection PDFs so the model can learn the standards Caterpillar inspects against
Video Processing:
- The video is split into frames at a fixed interval (every 10 frames)
- Each frame is analyzed independently to avoid motion blur and missed defects
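The sampling step can be sketched in a few lines. This is a minimal illustration, not our production code: `sample_every_nth` is a hypothetical helper, and the list of strings stands in for decoded frames that a real pipeline would pull from a video decoder.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def sample_every_nth(frames: Iterable[T], step: int = 10) -> Iterator[Tuple[int, T]]:
    """Yield (index, frame) for every `step`-th frame, starting with frame 0."""
    if step < 1:
        raise ValueError("step must be >= 1")
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


# Stand-in for decoded frames (assumption: real frames would be image arrays).
fake_frames = [f"frame-{i}" for i in range(35)]
sampled = list(sample_every_nth(fake_frames, step=10))
sampled_indices = [index for index, _ in sampled]
```

Keeping the original frame index alongside each sampled frame makes it possible to point a technician back to the right moment in the video.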
VLM:
- We fine-tuned Qwen3-VL-8B using inspection imagery to improve accuracy on industrial components
- Training used reinforcement learning with Volcano Engine (verl)
- Training and inference ran on 4 NVIDIA H200 GPUs
Inference Workflow:
Each frame is passed to the VLM along with:
- A list of model-specific parts
- Allowed replacement components for that vehicle
The model outputs structured results in a strict format (component, status, reasoning):
Glass / Mirrors Fail Cracked mirror housing visible on right side
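To illustrate how a strict output format and a fixed parts list work together, here is a hedged sketch. The pipe-delimited line format, the function names `build_prompt` and `parse_finding`, and the component and part names are all assumptions made for this example; the actual prompt and delimiter we used may differ.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_STATUSES = {"Pass", "Monitor", "Fail"}


@dataclass
class Finding:
    component: str
    status: str
    reasoning: str


def build_prompt(components: List[str], replacement_parts: List[str]) -> str:
    """Assemble a text prompt that restricts the model to known components and parts."""
    lines = [
        "Inspect the frame. Report only on these components:",
        *[f"- {name}" for name in components],
        "Recommend only these replacement parts:",
        *[f"- {part}" for part in replacement_parts],
        "Answer with one line per component: Component | Status | Reasoning",
        "Status must be one of: " + ", ".join(sorted(ALLOWED_STATUSES)),
    ]
    return "\n".join(lines)


def parse_finding(line: str, components: List[str]) -> Finding:
    """Parse one output line and reject anything outside the allowed lists."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    component, status, reasoning = fields
    if component not in components:
        raise ValueError(f"unknown component: {component!r}")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return Finding(component, status, reasoning)


# Hypothetical component and part names, for illustration only.
components = ["Glass / Mirrors", "Tires / Rims"]
parts = ["Mirror housing (hypothetical part)"]
prompt = build_prompt(components, parts)
finding = parse_finding(
    "Glass / Mirrors | Fail | Cracked mirror housing visible on right side",
    components,
)
```

Rejecting any line that names an unlisted component or status is what keeps a hallucinated part from reaching the report.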
Confidence Scoring:
- We aggregated token-level probabilities into a single confidence score per finding
- This gives technicians a quantitative sense of reliability rather than a binary result.
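One common way to collapse token-level probabilities into a single number is the geometric mean of the token probabilities, which equals the exponential of the mean log-probability. The sketch below uses that approach as an assumption; our actual heuristic is not spelled out here, and `confidence_from_logprobs` is a hypothetical name.

```python
import math
from typing import Sequence


def confidence_from_logprobs(token_logprobs: Sequence[float]) -> float:
    """Return the geometric mean of token probabilities, a value in (0, 1]."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


# Made-up log-probabilities for the tokens of one finding.
example_logprobs = [math.log(0.9), math.log(0.8), math.log(0.95)]
score = confidence_from_logprobs(example_logprobs)
```

Averaging in log space keeps long findings from being penalized simply for containing more tokens, which a plain product of probabilities would do.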
Editable Reporting:
- The final report is editable
- Technicians can override findings and adjust statuses before submission
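A technician override can be modeled as producing a new report entry rather than mutating the model's original finding, so both versions remain available. This is only a sketch under that assumption; `ReportEntry`, `override_status`, and the `overridden_by` field are hypothetical names, not our actual schema.

```python
from dataclasses import dataclass, replace
from typing import Optional

ALLOWED_STATUSES = {"Pass", "Monitor", "Fail"}


@dataclass(frozen=True)
class ReportEntry:
    component: str
    status: str
    reasoning: str
    overridden_by: Optional[str] = None  # None means the model's finding stands


def override_status(
    entry: ReportEntry, new_status: str, technician: str, note: Optional[str] = None
) -> ReportEntry:
    """Return a copy of `entry` with the technician's status and optional note."""
    if new_status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {new_status!r}")
    return replace(
        entry,
        status=new_status,
        reasoning=note if note is not None else entry.reasoning,
        overridden_by=technician,
    )


original = ReportEntry(
    "Glass / Mirrors", "Fail", "Cracked mirror housing visible on right side"
)
edited = override_status(original, "Monitor", technician="tech-01", note="Hairline crack only")
```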
Challenges we ran into
No official parts API:
- We had to scrape and normalize parts data while keeping the data accurate for each vehicle model
Model Hallucination Risk:
- Constraining the model to a fixed, model-specific parts list was extremely important
Confidence Estimation:
- Raw logits are hard to interpret, so we designed a heuristic that produces a usable confidence score.
Accomplishments that we're proud of
- Built a model-aware inspection system
- Integrated OEM parts data directly into the inspection output
- Achieved structured, machine-readable reports with human-editable output
- Successfully fine-tuned and deployed a large VLM on Caterpillar industrial inspection data
What we learned
- VLMs perform much better when constrained by supplied context, such as a model-specific parts list
- Industrial use cases require explainability and confidence
- Even without an official API, careful scraping can still produce an accurate system
What's next for Butterfly
- Expand to more Caterpillar models and additional equipment categories
- Add temporal reasoning to track defect progression across frames and over time
- Integrate maintenance history to prioritize recurring failures
- Deploy on mobile so technicians can upload videos directly from their phones
Built With
- python
- qwen3-vl-8b
- verl