ScribeMesh

Inspiration

Hardware documentation is fundamentally broken. Technicians, IT support, and engineers waste hundreds of hours writing static, flat manuals that immediately become outdated. While building hardware, we realized that video is the most natural way to document physical objects, but video alone is unsearchable and hard to reference. We wanted to bridge this gap: what if you could just film a motherboard or server rack with your phone, and let AI instantly convert it into an interactive, structured manual?

What it does

ScribeMesh turns any hardware video into interactive AI documentation in under 60 seconds.

You drag and drop a smartphone video (e.g., a pan over a motherboard or server rack). ScribeMesh sends it to the Gemini 2.5 Flash API, which identifies the components, chips, and ports visible in the footage. It then generates a minimal, Apple-style dual-pane UI in which every component gets a clickable card.

When you click a card, the video jumps to the exact timestamp where that component is visible. As the video plays, the relevant component cards highlight automatically. It also provides maintenance tips and exports directly to JSON for enterprise integration.
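The click-to-seek and auto-highlight behaviour can be sketched as two small pure functions. This is only an illustration: the `ComponentCard` shape and its field names are assumptions, not ScribeMesh's actual schema.

```typescript
// Hypothetical shape for one detected component (assumed for illustration).
interface ComponentCard {
  id: string;
  label: string;
  startSec: number; // first second the component is visible
  endSec: number; // last second the component is visible
}

// Ids of every card whose visibility window contains `currentTime`.
function activeCardIds(cards: ComponentCard[], currentTime: number): string[] {
  return cards
    .filter((c) => currentTime >= c.startSec && currentTime <= c.endSec)
    .map((c) => c.id);
}

// Time to seek to when a card is clicked, or null if the id is unknown.
function seekTimeFor(cards: ComponentCard[], id: string): number | null {
  const card = cards.find((c) => c.id === id);
  return card ? card.startSec : null;
}

// Made-up sample data.
const cards: ComponentCard[] = [
  { id: "cpu-socket", label: "CPU socket", startSec: 2, endSec: 6 },
  { id: "dimm-slots", label: "DIMM slots", startSec: 5, endSec: 9 },
];
```

In the browser, a card's click handler would assign the result of `seekTimeFor` to the video element's `currentTime`, and a `timeupdate` listener would call `activeCardIds` to decide which cards to highlight.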

How we built it

We built ScribeMesh with a heavy focus on speed, aesthetics, and a serverless architecture with no dedicated backend.

Frontend: Next.js 14 (App Router) with Tailwind CSS v4, utilizing a strict Apple-minimal design system with seamless light/dark mode transitions and micro-animations.
AI Engine: Google's Gemini 2.5 Flash API handles the multimodal reasoning. We encode the uploaded video as base64 and send it to Gemini with a tightly scoped system prompt that asks for structured JSON listing each component and its timestamps.
Video Editor (for Demo): We built an automated Python pipeline using edge-tts, moviepy, and HackClub AI models (NVIDIA Nemotron) to analyze our screen recording and generate the final voiceover and captions.
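The AI Engine step (base64-encode the video, then send it with a system prompt that asks for JSON) might look roughly like the sketch below. The prompt text and helper names are hypothetical. The request field names follow our understanding of the public Gemini `generateContent` request format and should be checked against the current API reference. Nothing is sent over the network here; the function only builds the request body.

```typescript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode raw bytes as standard base64 with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i] ?? 0;
    const b1 = bytes[i + 1] ?? 0;
    const b2 = bytes[i + 2] ?? 0;
    out += B64.charAt(b0 >> 2);
    out += B64.charAt(((b0 & 3) << 4) | (b1 >> 4));
    out += i + 1 < bytes.length ? B64.charAt(((b1 & 15) << 2) | (b2 >> 6)) : "=";
    out += i + 2 < bytes.length ? B64.charAt(b2 & 63) : "=";
  }
  return out;
}

interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string };
}

// Hypothetical system prompt; the real ScribeMesh prompt is not shown in this write-up.
const SYSTEM_PROMPT =
  "Return only a JSON array of objects with the component name, " +
  "the timestamp in seconds where it is visible, and a maintenance tip.";

// Build a request body that carries the video inline and asks for a JSON response.
function buildGeminiRequest(videoBytes: Uint8Array, mimeType: string) {
  const videoPart: Part = { inlineData: { mimeType, data: toBase64(videoBytes) } };
  const textPart: Part = { text: "List every hardware component visible in this video." };
  return {
    systemInstruction: { parts: [{ text: SYSTEM_PROMPT }] },
    contents: [{ role: "user", parts: [videoPart, textPart] }],
    generationConfig: { responseMimeType: "application/json" },
  };
}
```

The returned object would be serialized with `JSON.stringify` and POSTed to the model's `generateContent` endpoint from a server-side route, so the API key never reaches the browser.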

Challenges we ran into

Multimodal API Limitations: Next.js caps request body size by default (1MB for Server Actions), far below the size of a typical phone video. We had to configure a custom serverActions.bodySizeLimit in next.config.ts to accept base64 video payloads of up to 100MB.
Video-to-UI Sync: Synchronizing the HTML5 <video> player's currentTime with a dynamically generated list of React components required precise state management to ensure the active card always matched what was on screen.
Aspect Ratio Agnosticism: We wanted the UI to look perfect whether the user uploaded a vertical phone video or a landscape desktop capture. We solved this using a strict flexbox layout with h-screen and object-fit: contain.
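For the payload limit mentioned in the first challenge, the configuration change is small. A minimal sketch, assuming Next.js 14, where the Server Actions options are read from the `experimental` block (other versions may place them differently):

```typescript
// next.config.ts (sketch; the exact placement of this option is an
// assumption based on Next.js 14 and may differ in other versions).
const nextConfig = {
  experimental: {
    serverActions: {
      // Raise the request body cap so large base64 video uploads are accepted.
      bodySizeLimit: "100mb",
    },
  },
};

export default nextConfig;
```

Because base64 encodes every 3 bytes as 4 characters, a 100MB cap on the encoded payload corresponds to roughly 75MB of raw video.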

Accomplishments that we're proud of

Building a fully functional, end-to-end prototype from scratch in under 60 minutes.
Achieving a stunning, premium "Apple-minimal" UI that feels like a native app.
Getting Gemini 2.5 Flash to reliably return strictly formatted JSON arrays that the frontend can parse directly, with no backend database.
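Even with a strict prompt, it is prudent to validate the model's output before rendering it. The sketch below shows one way to do that on the frontend. The `DetectedComponent` fields are hypothetical and stand in for whatever schema the prompt actually requests.

```typescript
// Hypothetical schema for one item in the model's JSON array.
interface DetectedComponent {
  name: string;
  timestampSec: number;
  tip: string;
}

// Parse the raw model output and reject anything that is not an array
// of well-formed component objects.
function parseComponents(raw: string): DetectedComponent[] {
  const data: unknown = JSON.parse(raw); // throws on invalid JSON
  if (!Array.isArray(data)) {
    throw new Error("expected a JSON array");
  }
  return data.map((item, i) => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`item ${i} is not an object`);
    }
    const { name, timestampSec, tip } = item as Record<string, unknown>;
    if (
      typeof name !== "string" ||
      typeof timestampSec !== "number" ||
      !Number.isFinite(timestampSec) ||
      typeof tip !== "string"
    ) {
      throw new Error(`item ${i} does not match the expected shape`);
    }
    return { name, timestampSec, tip };
  });
}
```

Failing loudly on a malformed response lets the UI show a retry prompt instead of rendering cards with missing timestamps.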

What we learned

We learned that Gemini 2.5 Flash's long-context video understanding is remarkably accurate, even for dense, complex objects like PCBs. We also learned how to orchestrate multiple AI models (Gemini for the app, Nemotron + edge-tts for the demo video) to ship a complete product fast.

What's next for ScribeMesh

Live AR Mode: Point your phone camera at hardware and get live, overlaid documentation.
PDF Export: Generate print-ready manuals directly from the video.
Enterprise Integrations: Webhooks to sync ScribeMesh JSON directly to Jira, Notion, or internal wikis for MRO (Maintenance, Repair, and Operations) teams.

Built With

edge-tts
gemini
moviepy
next.js
python
tailwind-css
typescript

Updates

Atharv Mantri started this project — Jun 06, 2026 08:07 AM EDT
