Inspiration
One broken hose on a CAT excavator can cost $2.5 million in downtime. And right now, the way the industry catches these problems is a technician in heavy gloves trying to fill out a paper form on an active construction site.
They can't type easily. Reports come back vague, critical issues get missed, and machines go out when they really shouldn't.
We built CatSense to fix that — a tool that meets technicians where they are, hands full and time-pressured, and turns what they see and say into a professional report automatically.
What it does
CatSense replaces paper inspections with a simple voice-and-photo workflow that takes under 10 minutes.
The technician picks their machine — serial number, engine hours, and checklist load automatically. The inspection follows the real CAT walkaround: ground level, engine compartment, cab. At each checkpoint, snap a photo and talk: "Bucket teeth are broken, need replacement." Gloves stay on. 30 seconds per item.
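The enforced walkaround order can be sketched as a simple gate: a checkpoint only unlocks once the previous one has photo evidence. This is an illustrative sketch, not our production code — the `Checkpoint` type and `nextRequiredCheckpoint` helper are hypothetical names.

```typescript
// Hypothetical sketch of how the walkaround sequence could be enforced:
// a checkpoint only unlocks once the previous one has a photo attached.
type Checkpoint = {
  id: string;
  zone: "ground_level" | "engine_compartment" | "cab";
  photoTaken: boolean;
  voiceNote?: string;
};

// Returns the index of the first checkpoint still missing a photo;
// the UI blocks navigation past this point.
function nextRequiredCheckpoint(checklist: Checkpoint[]): number {
  const i = checklist.findIndex((c) => !c.photoTaken);
  return i === -1 ? checklist.length : i;
}

const demo: Checkpoint[] = [
  { id: "bucket_teeth", zone: "ground_level", photoTaken: true },
  { id: "hydraulic_hoses", zone: "ground_level", photoTaken: false },
  { id: "engine_oil", zone: "engine_compartment", photoTaken: false },
];
console.log(nextRequiredCheckpoint(demo)); // 1 — hoses still need a photo
```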
On submission, the AI analyzes every photo and audio note against the equipment's service manual and generates a full structured report. It spotted cracked hydraulic hoses with exposed reinforcement → flagged Critical. Worn bucket teeth → Needs Attention. Every finding includes the evidence, severity, recommended action, and follow-up questions for anything the photo couldn't confirm.
What used to take 30–40 minutes now takes under 10 — and the reports are more detailed than anything written by hand.
How we built it
- Frontend — React + TypeScript PWA (Vite). Enforces the real CAT inspection sequence, requires a photo per item, captures voice notes with the browser MediaRecorder API, and syncs evidence in real time.
- Backend — Cloudflare Worker (TypeScript) managing session state in R2, evidence uploads, and AI orchestration.
- AI Pipeline — For each checklist item: retrieve relevant service manual excerpts from Actian VectorAI via cosine similarity search → build a grounded prompt with checklist context + manual constraints → call Gemini 2.5 Flash with images and audio in a single multimodal request → validate the JSON output with both Gemini's `responseSchema` and Zod.
- RAG Service — Python FastAPI service. Manuals are chunked, embedded with Gemini `text-embedding-004` (768-dim), and indexed in Actian VectorAI with IVFFlat cosine search.
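The grounded request per checklist item can be sketched as a plain request-body builder. Field names follow the Gemini `generateContent` REST body (`contents`, `parts`, `inlineData`, `generationConfig.responseSchema`); the prompt wording and schema contents here are illustrative, not our exact production values.

```typescript
// Sketch of the per-item request to Gemini 2.5 Flash: manual excerpts
// ground the prompt, and responseSchema constrains the JSON output.
type InlinePart = { inlineData: { mimeType: string; data: string } };
type TextPart = { text: string };

function buildRequest(
  item: string,
  manualExcerpts: string[],
  photoB64: string,
  audioB64: string,
) {
  const prompt =
    `Inspect checklist item "${item}". ` +
    `Use ONLY what is visible or audible in the evidence. ` +
    `Relevant manual excerpts:\n${manualExcerpts.join("\n---\n")}`;
  const parts: (TextPart | InlinePart)[] = [
    { text: prompt },
    { inlineData: { mimeType: "image/jpeg", data: photoB64 } },
    { inlineData: { mimeType: "audio/webm", data: audioB64 } },
  ];
  return {
    contents: [{ role: "user", parts }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "object",
        properties: {
          status: {
            type: "string",
            enum: ["ok", "needs_attention", "critical"],
          },
          confidence: { type: "number" },
          evidence: { type: "string" },
        },
        required: ["status", "confidence", "evidence"],
      },
    },
  };
}
```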
The output schema enforces status ∈ {ok, needs_attention, critical} and confidence ∈ [0, 1], and requires an evidence citation for every finding — never a fabricated value.
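Those constraints amount to a small type guard. Our actual post-validation uses Zod; this dependency-free sketch shows the same checks for illustration.

```typescript
// Dependency-free stand-in for the Zod post-validation layer:
// same constraints as the output schema, as a plain type guard.
type Status = "ok" | "needs_attention" | "critical";
interface Finding {
  status: Status;
  confidence: number; // must lie in [0, 1]
  evidence: string;   // citation of what the model actually observed
}

function isValidFinding(x: unknown): x is Finding {
  if (typeof x !== "object" || x === null) return false;
  const f = x as Record<string, unknown>;
  return (
    (f.status === "ok" ||
      f.status === "needs_attention" ||
      f.status === "critical") &&
    typeof f.confidence === "number" &&
    f.confidence >= 0 &&
    f.confidence <= 1 &&
    typeof f.evidence === "string" &&
    f.evidence.length > 0
  );
}
```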
Challenges we ran into
Hallucination control. Keeping the model within the bounds of observable evidence required three layers: explicit prompt directives, Gemini's responseSchema for constrained generation, and Zod post-validation. Any single layer alone wasn't sufficient.
Audio + image in one API call. Sending both as inline base64 required chunked encoding to avoid call stack overflows inside the Cloudflare Worker's V8 isolate — a non-obvious constraint we hit mid-build.
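The overflow comes from spreading a multi-megabyte `Uint8Array` into `String.fromCharCode(...)`, which passes every byte as a separate argument. Encoding in bounded chunks keeps the argument count safe — a minimal sketch of the fix:

```typescript
// Spreading a large Uint8Array into String.fromCharCode(...) passes every
// byte as a separate argument and blows the V8 call stack. Encoding in
// bounded chunks keeps the argument count safe.
function toBase64Chunked(bytes: Uint8Array, chunkSize = 0x8000): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    // At most chunkSize arguments per call — never overflows.
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary); // btoa is available in Workers (and Node 16+)
}
```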
Stateless Worker architecture. No persistent memory between requests meant all session state had to be externalized to R2 and designed for safe concurrent partial updates.
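The externalized-state pattern can be sketched with a minimal storage interface whose `get`/`put` shape mirrors an R2 binding; an in-memory map stands in for R2 here, and the session type and function names are illustrative. Each request reads the session, merges its own partial update, and writes it back; genuinely concurrent writers to one session would still need per-item keys or conditional writes.

```typescript
// Simplified sketch of externalizing session state: the Worker holds nothing
// in memory, so each request reads the session from storage, merges its
// partial update, and writes it back. MemoryStore stands in for R2.
interface SessionStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

class MemoryStore implements SessionStore {
  private data = new Map<string, string>();
  async get(key: string) { return this.data.get(key) ?? null; }
  async put(key: string, value: string) { this.data.set(key, value); }
}

type Session = {
  items: Record<string, { photoKey?: string; audioKey?: string }>;
};

// Read-merge-write: merging preserves fields written by earlier requests,
// so a photo upload and a later audio upload for the same item both survive.
async function recordEvidence(
  store: SessionStore,
  sessionId: string,
  itemId: string,
  patch: { photoKey?: string; audioKey?: string },
): Promise<Session> {
  const raw = await store.get(sessionId);
  const session: Session = raw ? JSON.parse(raw) : { items: {} };
  session.items[itemId] = { ...session.items[itemId], ...patch };
  await store.put(sessionId, JSON.stringify(session));
  return session;
}
```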
Accomplishments that we're proud of
- A fully working end-to-end multimodal AI pipeline built in a single hackathon weekend
- Triple-layer hallucination control that makes the output genuinely trustworthy, not just plausible-looking
- The inspection flow mirrors the real CAT walkaround — we automated the workflow technicians already know, not a new one
- The AI cites evidence, recommends actions, and asks follow-up questions — turning any technician into an expert-level inspector
What we learned
Multimodal grounding is harder than multimodal generation. Getting coherent output is easy — getting the model to stay within observable evidence, cite the right component, and recommend the manufacturer's correct action requires real infrastructure: a retrieval pipeline, structured checklists, and multiple validation layers.
We also learned that schema-constrained generation changes the integration contract fundamentally. When the model's output is structurally guaranteed before it arrives, you treat it as a typed data structure — not a string to parse — which eliminates an entire class of bugs.
What's next for CatSense
- Fleet analytics dashboard — surface recurring findings by component across machines using the persisted JSONB report history in Actian
- Offline-first PWA — queue evidence locally in IndexedDB, sync and analyze when connectivity is restored
- Audio transcription — Whisper-based layer to make voice notes searchable and indexable
- CAT asset management integration — dynamically resolve registered fleet machines and manual versions instead of static serial registration
Built With
- actian-vectorai
- cloudflare-r2
- cloudflare-workers
- fastapi
- gemini-embedding-api
- google-gemini-2.5-flash
- pnpm
- postgresql
- psycopg3
- pydantic
- python
- react
- typescript
- vite
- zod