CATscan

Inspiration

Caterpillar performs over 5 million inspections per year across 50,000 field technicians. Today, that process involves paper checklists, manual photo uploads, and hours of technician time per machine. A single missed defect on a mining truck that runs 24/7 can cost hundreds of thousands of dollars in downtime. We asked: what if an AI could walk alongside the technician, see what they see, hear what they say, and generate the entire report automatically?

We built an end-to-end system that handles the full lifecycle: from a technician pointing their phone camera at the machine, through AI-powered defect detection, all the way to executive fleet analytics dashboards.

What it does

CATscan is an AI-powered inspection assistant for CAT heavy equipment. A technician opens the web app on their phone, starts a guided walkaround, and the system:

Guides them step-by-step through the inspection with voice prompts (ElevenLabs TTS), telling them exactly what to photograph and check at each station
Analyzes each photo in real time using GPT-4o Vision plus retrieval (RAG) over the 158-page CAT operation manual, which is embedded in Actian VectorAI DB
Listens to the technician via OpenAI Realtime API (WebRTC) — they can speak hands-free while wearing gloves in extreme conditions
Assigns continuous risk scores (0-100) for every component with cited evidence from the manual
Generates a structured inspection report matching Caterpillar's actual report format
Produces executive summaries via Snowflake Cortex, translates reports for international dealers, and analyzes inspector urgency from voice sentiment
Learns from corrections — a 3-layer continual learning system (RAG memory, Bayesian calibration, LoRA fine-tuning) improves accuracy over time
Identifies parts visually — snap a photo of a worn component, get ranked CAT part numbers with fitment certainty
Tracks fleet health over time — Snowflake warehouse stores every inspection for fleet-wide risk trends and predictive maintenance
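
As a rough illustration of the per-component output described in the list above, one finding could be modeled like this. The class name, the field names, and the severity cut-offs are all assumptions made for the sketch, not CATscan's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One inspected component (hypothetical schema)."""
    component: str
    risk_score: float      # continuous score on the 0-100 scale
    manual_citation: str   # manual passage that justifies the score

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0-100 range.
        if not 0.0 <= self.risk_score <= 100.0:
            raise ValueError("risk_score must be within 0-100")

    def severity(self) -> str:
        # Illustrative cut-offs only; the real thresholds are not stated.
        if self.risk_score >= 70:
            return "critical"
        if self.risk_score >= 40:
            return "monitor"
        return "ok"
```

Keeping the citation on the same record as the score means a report row can always show why a number was assigned.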

How we built it

Frontend: Next.js 16 with React 19, TypeScript, and Tailwind CSS. Mobile-first design optimized for one-handed operation in the field.

Backend: FastAPI (Python 3.12) handling video processing (FFmpeg frame extraction), audio transcription (Whisper), and the inspection pipeline.
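
A minimal sketch of the frame-extraction step, assuming the `ffmpeg` binary is on the PATH. The function names, the one-frame-per-second default, and the output naming pattern are illustrative choices rather than the project's actual settings.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video: Path, out_dir: Path, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg argv that writes `fps` JPEG frames per second of video."""
    return [
        "ffmpeg",
        "-i", str(video),           # input walkaround video
        "-vf", f"fps={fps}",        # sample frames at a fixed rate
        "-q:v", "2",                # high JPEG quality
        str(out_dir / "frame_%04d.jpg"),
    ]


def extract_frames(video: Path, out_dir: Path, fps: float = 1.0) -> list[Path]:
    """Run ffmpeg and return the extracted frame paths in order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video, out_dir, fps), check=True)
    return sorted(out_dir.glob("frame_*.jpg"))
```

Separating command construction from execution keeps the argument list easy to unit test without invoking FFmpeg.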

AI Pipeline: GPT-4o Vision analyzes each frame with context retrieved via RAG. The operation manual is chunked, embedded with all-MiniLM-L6-v2, and stored in Actian VectorAI DB. Each finding cites the specific manual reference that justifies the risk score.
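
The chunk, embed, and retrieve flow can be sketched with the standard library alone. To stay self-contained, this example swaps in a toy bag-of-words vector for the sentence-embedding model and an in-memory list for the vector database. The function names, chunk sizes, and prompt wording are assumptions.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(step: str, chunks: list[str]) -> str:
    """Number the retrieved excerpts so the model can cite them."""
    hits = retrieve(step, chunks)
    context = "\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(hits))
    return f"Manual excerpts:\n{context}\n\nAssess the photo for: {step}. Cite excerpts by number."
```

Numbering the excerpts in the prompt is one simple way to let each finding point back to the passage that supports its score.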

Voice: ElevenLabs for guided TTS prompts. OpenAI Realtime API (WebRTC) for hands-free conversational inspection. On-device Gemma 3n (via OpenRouter) for offline-capable knowledge queries.

Snowflake Cortex AI (5 functions):

COMPLETE() — executive summaries and natural language fleet queries ("Which machine has the highest failure rate?")
SUMMARIZE() — condense long inspection audio transcripts
TRANSLATE() — multi-language reports for CAT's global dealer network (Spanish, French, Portuguese, German, Japanese, Korean, Chinese)
EXTRACT_ANSWER() — Q&A over historical inspection data
SENTIMENT() — analyze inspector voice transcripts for urgency and concern levels
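
For illustration, the five functions listed above could be called from Python as parameterized SQL. Only the `SNOWFLAKE.CORTEX.*` function names come from this write-up. The `inspections` table, its columns, the model name passed to `COMPLETE`, and the `run_cortex` helper are hypothetical.

```python
# Hypothetical table and column names; the model name is a placeholder.
CORTEX_QUERIES: dict[str, str] = {
    "executive_summary": (
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large2', "
        "CONCAT('Write an executive summary of this inspection: ', report_text)) "
        "FROM inspections WHERE inspection_id = %s"
    ),
    "transcript_summary": (
        "SELECT SNOWFLAKE.CORTEX.SUMMARIZE(transcript) "
        "FROM inspections WHERE inspection_id = %s"
    ),
    "translated_report": (
        "SELECT SNOWFLAKE.CORTEX.TRANSLATE(report_text, 'en', %s) "
        "FROM inspections WHERE inspection_id = %s"
    ),
    "history_answer": (
        "SELECT SNOWFLAKE.CORTEX.EXTRACT_ANSWER(report_text, %s) "
        "FROM inspections WHERE inspection_id = %s"
    ),
    "urgency": (
        "SELECT SNOWFLAKE.CORTEX.SENTIMENT(transcript) "
        "FROM inspections WHERE inspection_id = %s"
    ),
}


def run_cortex(cursor, name: str, *params):
    """Run one named query on a DB-API cursor and return the first column."""
    cursor.execute(CORTEX_QUERIES[name], params)
    row = cursor.fetchone()
    return row[0] if row else None
```

With a cursor from the Snowflake Python connector, `run_cortex(cur, "translated_report", "es", 17)` would request a Spanish translation of inspection 17. `SENTIMENT` returns a score between -1 and 1, so mapping it to an urgency level would need a separate step.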

DigitalOcean Infrastructure:

3 Droplets (frontend, backend, AI worker) on a private VPC in nyc3
Managed PostgreSQL for persistent inspection history
Gradient™ AI Inference Cloud for indexing CAT documentation
Automated deployment via deploy.sh with Docker

Continual Learning: Inspector corrections feed back into three layers — immediate RAG memory updates, statistical Bayesian calibration that reduces false positives, and LoRA fine-tuning data export for deep model adaptation on Modal.
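
One way the Bayesian calibration layer could work is a Beta-Binomial estimate of how often a flagged defect is confirmed by an inspector. This is a sketch under assumptions: the class name, the prior of 4 confirmations to 1 rejection, and the choice to scale the raw score by the posterior mean are illustrative, not the project's documented method.

```python
from dataclasses import dataclass


@dataclass
class BetaCalibrator:
    """Tracks confirmed vs. rejected flags for one component type."""
    alpha: float = 4.0   # prior pseudo-count of confirmed flags (assumed)
    beta: float = 1.0    # prior pseudo-count of rejected flags (assumed)

    def update(self, confirmed: bool) -> None:
        """Record one inspector correction."""
        if confirmed:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def precision(self) -> float:
        """Posterior mean probability that a flag is a true defect."""
        return self.alpha / (self.alpha + self.beta)

    def calibrate(self, raw_score: float) -> float:
        """Scale a 0-100 model score by the confirmation rate, clamped to 0-100."""
        return max(0.0, min(100.0, raw_score * self.precision))
```

Each rejected flag lowers the confirmation rate, so component types that the model repeatedly overcalls see their scores shrink as feedback accumulates.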

Challenges we ran into

Risk score calibration was the hardest problem. GPT-4o tends to overcall defects on heavy equipment because surface rust, cosmetic wear, and dirt are normal in field conditions. We had to build an extensive calibration system with detailed scoring guidelines, a heuristic backstop that catches irrelevant images, and Bayesian adjustment from inspector feedback.

Voice in noisy environments. Construction sites are loud. The Realtime API would sometimes interpret background machinery noise as speech in another language. We had to add explicit English-only constraints and tune semantic voice activity detection (VAD).
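
A sketch of what such a configuration might look like as a `session.update` event. The field layout follows the beta Realtime session shape as we understand it, and newer API versions may nest the audio settings differently, so check it against the current reference. The instruction wording, the transcription model, and the `low` eagerness value are assumptions.

```python
import json


def english_only_session_update(eagerness: str = "low") -> str:
    """Build a session.update event that pins English and uses semantic VAD."""
    if eagerness not in {"low", "medium", "high", "auto"}:
        raise ValueError("eagerness must be low, medium, high, or auto")
    event = {
        "type": "session.update",
        "session": {
            # Prompt-level constraint (wording is illustrative).
            "instructions": (
                "Respond only in English. Treat non-English or unintelligible "
                "audio as background noise and ignore it."
            ),
            # Hint the transcriber that input speech is English.
            "input_audio_transcription": {"model": "gpt-4o-transcribe", "language": "en"},
            # Semantic VAD; lower eagerness waits longer before ending a turn.
            "turn_detection": {"type": "semantic_vad", "eagerness": eagerness},
        },
    }
    return json.dumps(event)
```

The resulting JSON string would be sent over the session's data channel once the WebRTC connection is open.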

Image relevance. When technicians accidentally submitted the wrong photo for a step, the model would still try to analyze it. We built a two-layer relevance check (a relevance flag returned by GPT, plus heuristic phrase detection on its response) to catch this.

What we learned

How Caterpillar's actual inspection workflow operates (from the 158-page operation manual and the Tom Zadek presentation)
How to build RAG systems that prioritize learned corrections over static knowledge
The power of Snowflake Cortex for running multiple AI functions directly inside the data warehouse
Deploying distributed applications across multiple DigitalOcean Droplets with private networking
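
As one illustration of prioritizing learned corrections over static knowledge, retrieved passages can be re-ranked with a fixed bonus for anything that came from an inspector correction. The class name, the `source` labels, and the boost value are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    text: str
    similarity: float   # similarity to the query, higher is closer
    source: str         # "correction" or "manual" (hypothetical labels)


def rank(hits: list[Retrieved], correction_boost: float = 0.15, k: int = 3) -> list[Retrieved]:
    """Order hits by similarity, adding a bonus to inspector corrections."""
    def score(hit: Retrieved) -> float:
        bonus = correction_boost if hit.source == "correction" else 0.0
        return hit.similarity + bonus

    return sorted(hits, key=score, reverse=True)[:k]
```

With the default boost, a correction at similarity 0.72 outranks a manual passage at 0.80, while a barely related correction still loses to a strong manual match.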

What's next for CATscan

Predictive maintenance — use the Snowflake fleet data to predict component failures before they happen
Multi-model consensus — combine GPT-4o, Gemma, and Snowflake Cortex assessments for higher confidence scores
CAT marketplace integration — automatically generate parts orders when defects are detected, closing the loop from inspection to repair

Built With

digitalocean
fastapi
ffmpeg
gemma
nextjs
openai
postgresql
python
snowflake

Updates

Aryan Keluskar started this project — Mar 01, 2026 07:14 AM EST
