Inspiration

Caterpillar's 50,000 field technicians perform over 5 million inspections per year. Today, that process involves paper checklists, manual photo uploads, and hours of technician time per machine. A single missed defect on a mining truck that runs 24/7 can cost hundreds of thousands of dollars in downtime. We asked: what if an AI could walk alongside the technician, see what they see, hear what they say, and generate the entire report automatically?

We built an end-to-end system that handles the full lifecycle: from a technician pointing their phone camera at the machine, through AI-powered defect detection, all the way to executive fleet analytics dashboards.

What it does

CATscan is an AI-powered inspection assistant for CAT heavy equipment. A technician opens the web app on their phone, starts a guided walkaround, and the system:

  1. Guides them step-by-step through the inspection with voice prompts (ElevenLabs TTS), telling them exactly what to photograph and check at each station
  2. Analyzes each photo in real time using GPT-4o Vision plus RAG over the 158-page CAT operation manual (embedded in Actian VectorAI DB)
  3. Listens to the technician via OpenAI Realtime API (WebRTC) — they can speak hands-free while wearing gloves in extreme conditions
  4. Assigns continuous risk scores (0-100) for every component with cited evidence from the manual
  5. Generates a structured inspection report matching Caterpillar's actual report format
  6. Produces executive summaries via Snowflake Cortex, translates reports for international dealers, and analyzes inspector urgency from voice sentiment
  7. Learns from corrections — a 3-layer continual learning system (RAG memory, Bayesian calibration, LoRA fine-tuning) improves accuracy over time
  8. Identifies parts visually — snap a photo of a worn component, get ranked CAT part numbers with fitment certainty
  9. Tracks fleet health over time — Snowflake warehouse stores every inspection for fleet-wide risk trends and predictive maintenance

How we built it

Frontend: Next.js 16 with React 19, TypeScript, and Tailwind CSS. Mobile-first design optimized for one-handed operation in the field.

Backend: FastAPI (Python 3.12) handling video processing (FFmpeg frame extraction), audio transcription (Whisper), and the inspection pipeline.

AI Pipeline: GPT-4o Vision analyzes each frame with context retrieved via RAG. The operation manual is chunked, embedded with all-MiniLM-L6-v2, and stored in Actian VectorAI DB. Each finding cites the specific manual reference that justifies the risk score.
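
A minimal sketch of the chunk, embed, retrieve, and cite flow described above. It is not our production code: the token-count `embed` function stands in for all-MiniLM-L6-v2, an in-memory list stands in for Actian VectorAI DB, and the function names, chunk sizes, and prompt wording are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(step: str, references: list[str]) -> str:
    """Number the retrieved passages so each finding can cite one."""
    cited = "\n".join(f"[{i + 1}] {ref}" for i, ref in enumerate(references))
    return (
        f"Inspection step: {step}\n"
        f"Manual references:\n{cited}\n"
        "Cite a reference number for each finding."
    )
```

Numbering the retrieved passages in the prompt is one simple way to let the vision model point back at the manual text that justifies a score.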

Voice: ElevenLabs for guided TTS prompts. OpenAI Realtime API (WebRTC) for hands-free conversational inspection. On-device Gemma 3n (via OpenRouter) for offline-capable knowledge queries.

Snowflake Cortex AI (5 functions):

  • COMPLETE() — executive summaries and natural language fleet queries ("Which machine has the highest failure rate?")
  • SUMMARIZE() — condense long inspection audio transcripts
  • TRANSLATE() — multi-language reports for CAT's global dealer network (Spanish, French, Portuguese, German, Japanese, Korean, Chinese)
  • EXTRACT_ANSWER() — Q&A over historical inspection data
  • SENTIMENT() — analyze inspector voice transcripts for urgency and concern levels
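
A hedged sketch of how two of these functions might be called from the Python backend. The helper names, column aliases, and language-code mapping are hypothetical; only the `SNOWFLAKE.CORTEX.TRANSLATE(text, source, target)` and `SNOWFLAKE.CORTEX.SENTIMENT(text)` call shapes come from Snowflake itself. The helpers only build parameterized SQL, so they run without a warehouse connection.

```python
# Assumed mapping from the report languages listed above to language codes.
LANGUAGE_CODES = {
    "Spanish": "es",
    "French": "fr",
    "Portuguese": "pt",
    "German": "de",
    "Japanese": "ja",
    "Korean": "ko",
    "Chinese": "zh",
}


def build_translate_query(report_text: str, target_language: str) -> tuple[str, tuple[str, str, str]]:
    """Return SQL and bind parameters to translate an English report."""
    code = LANGUAGE_CODES[target_language]  # KeyError for unsupported languages
    sql = "SELECT SNOWFLAKE.CORTEX.TRANSLATE(%s, %s, %s) AS translated_report"
    return sql, (report_text, "en", code)


def build_sentiment_query(transcript: str) -> tuple[str, tuple[str]]:
    """Return SQL and bind parameters to score a voice transcript."""
    sql = "SELECT SNOWFLAKE.CORTEX.SENTIMENT(%s) AS sentiment_score"
    return sql, (transcript,)
```

Each pair can be passed to a Snowflake connector cursor as `cursor.execute(sql, params)`. Binding the text as a parameter rather than formatting it into the SQL string keeps technician speech from being interpreted as SQL.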

DigitalOcean Infrastructure:

  • 3 Droplets (frontend, backend, AI worker) on a private VPC in nyc3
  • Managed PostgreSQL for persistent inspection history
  • Gradient™ AI Inference Cloud for indexing CAT documentation
  • Automated deployment via deploy.sh with Docker

Continual Learning: Inspector corrections feed back into three layers — immediate RAG memory updates, statistical Bayesian calibration that reduces false positives, and LoRA fine-tuning data export for deep model adaptation on Modal.
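
To illustrate the Bayesian calibration layer, here is a minimal Beta-Binomial sketch. It assumes one calibrator per component type, treats each inspector correction as a confirmed or rejected flag, and scales the raw score by the posterior mean precision. The class name, the default prior, and the scaling rule are assumptions for illustration, not the exact method in the project.

```python
from dataclasses import dataclass


@dataclass
class BetaCalibrator:
    """Tracks how often the model's defect flags are confirmed by inspectors."""

    alpha: float = 9.0  # prior pseudo-count of confirmed flags (assumed prior)
    beta: float = 1.0   # prior pseudo-count of rejected flags (false positives)

    def update(self, confirmed: bool) -> None:
        """Fold one inspector correction into the posterior."""
        if confirmed:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def precision(self) -> float:
        """Posterior mean probability that a flagged defect is real."""
        return self.alpha / (self.alpha + self.beta)

    def calibrate(self, raw_score: float) -> float:
        """Scale a raw 0-100 risk score by the posterior precision."""
        return max(0.0, min(100.0, raw_score * self.precision))
```

For example, a calibrator started at `BetaCalibrator(alpha=1, beta=1)` that sees two rejected flags has a posterior precision of 0.25, so a raw score of 80 is reduced to 20. Repeated false positives on cosmetic wear pull scores down gradually instead of through a hard-coded rule.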

Challenges we ran into

Risk score calibration was the hardest problem. GPT-4o tends to overcall defects on heavy equipment because surface rust, cosmetic wear, and dirt are normal in field conditions. We had to build an extensive calibration system with detailed scoring guidelines, a heuristic backstop that catches irrelevant images, and Bayesian adjustment from inspector feedback.

Voice in noisy environments. Construction sites are loud, and the Realtime API would sometimes interpret background machinery noise as speech in another language. We added explicit English-only constraints and tuned semantic VAD (voice activity detection).

Image relevance. When technicians accidentally submitted the wrong photo for a step, the model would still try to analyze it. We built a two-layer relevance check (GPT flag + heuristic phrase detection) to catch this.
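
A minimal sketch of the two-layer check, assuming the vision model returns a JSON object with a boolean `relevant` flag and a free-text `description`. Both field names and the phrase list are hypothetical.

```python
# Phrases that suggest the model could not find the expected component
# (illustrative list, not the one used in the project).
IRRELEVANT_PHRASES = (
    "not visible",
    "unable to identify",
    "does not show",
    "unrelated",
    "no equipment",
)


def is_relevant(analysis: dict) -> bool:
    """Return False if either layer says the photo does not match the step."""
    # Layer 1: explicit flag from the model, if it provided one.
    if analysis.get("relevant") is False:
        return False
    # Layer 2: heuristic backstop on the model's own description.
    text = str(analysis.get("description", "")).lower()
    return not any(phrase in text for phrase in IRRELEVANT_PHRASES)
```

The heuristic layer exists for cases where the model marks a photo as relevant but its description admits it cannot see the component. A photo that fails either layer can be re-requested instead of scored.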

What we learned

  • How Caterpillar's actual inspection workflow operates (from the 158-page operation manual and the Tom Zadek presentation)
  • How to build RAG systems that prioritize learned corrections over static knowledge
  • The power of Snowflake Cortex for running multiple AI functions directly inside the data warehouse
  • Deploying distributed applications across multiple DigitalOcean Droplets with private networking

What's next for CATscan

  • Predictive maintenance — use the Snowflake fleet data to predict component failures before they happen
  • Multi-model consensus — combine GPT-4o, Gemma, and Snowflake Cortex assessments for higher confidence scores
  • CAT marketplace integration — automatically generate parts orders when defects are detected, closing the loop from inspection to repair
