Inspiration
Hardware documentation is fundamentally broken. Technicians, IT support, and engineers waste hundreds of hours writing static, flat manuals that immediately become outdated. While building hardware, we realized that video is the most natural way to document physical objects, but video alone is unsearchable and hard to reference. We wanted to bridge this gap: what if you could just film a motherboard or server rack with your phone, and let AI instantly convert it into an interactive, structured manual?
What it does
ScribeMesh turns any hardware video into interactive AI documentation in under 60 seconds.
You simply drag and drop a smartphone video (e.g., panning over a motherboard or server rack). ScribeMesh uses the Gemini 2.5 Flash API to analyze the footage, identifying components, chips, and ports. It then generates a sleek, Apple-minimal, dual-pane UI where every component gets a clickable card.
When you click a card, the video jumps to the exact timestamp where that component is visible. As the video plays, the relevant component cards highlight automatically. It also provides maintenance tips and exports directly to JSON for enterprise integration.
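To make the card-to-timestamp link concrete, here is a minimal TypeScript sketch. The ComponentCard shape, its field names, and both helper functions are hypothetical illustrations, not ScribeMesh's actual code.

```typescript
// Hypothetical shape of one detected component (all field names are assumptions).
interface ComponentCard {
  id: string;
  name: string;
  startSec: number; // first second at which the component is visible
  endSec: number;   // second at which it leaves the frame
  tip: string;      // maintenance tip shown on the card
}

// Where the player should jump when a card is clicked,
// clamped to the playable range of the video.
function seekTargetFor(card: ComponentCard, durationSec: number): number {
  return Math.min(Math.max(card.startSec, 0), durationSec);
}

// Serialize the manual for export, ordered by first appearance.
// slice() copies the array so the caller's card order is left untouched.
function exportManual(title: string, cards: ComponentCard[]): string {
  const sorted = cards.slice().sort((a, b) => a.startSec - b.startSec);
  return JSON.stringify({ title: title, components: sorted }, null, 2);
}
```

In a browser, a click handler could assign the result to the video element's currentTime property, for example video.currentTime = seekTargetFor(card, video.duration).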
How we built it
We built ScribeMesh with a heavy focus on speed, aesthetics, and pure client/serverless architecture.
- Frontend: Next.js 14 (App Router) with Tailwind CSS v4, utilizing a strict Apple-minimal design system with seamless light/dark mode transitions and micro-animations.
- AI Engine: Google's Gemini 2.5 Flash API handles the complex multimodal reasoning. We convert the uploaded video to base64 and stream it to Gemini with a highly specific system prompt that asks for structured JSON listing each component and its timestamps.
- Video Editor (for Demo): We built an automated Python pipeline using edge-tts, moviepy, and HackClub AI models (NVIDIA Nemotron) to analyze our screen recording and generate the final voiceover and captions.
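The AI Engine step above can be sketched roughly as follows. This is a hedged illustration, not our production code: the function names and the user prompt text are hypothetical, and the request field names follow the public Gemini REST examples as we understand them, so verify them against the current API reference before relying on them.

```typescript
// Base64 emits 4 characters for every 3 input bytes (with padding),
// so an encoded video is roughly a third larger than the raw file.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Assumed request shape for a generateContent call with inline video data.
type GeminiPart =
  | { text: string }
  | { inline_data: { mime_type: string; data: string } };

interface GeminiRequestBody {
  system_instruction: { parts: { text: string }[] };
  contents: { parts: GeminiPart[] }[];
  generationConfig: { responseMimeType: string };
}

// Build the JSON body: system prompt, the base64 video, and a short user prompt.
// responseMimeType asks the model to answer with JSON rather than free text.
function buildGeminiRequest(
  base64Video: string,
  mimeType: string,
  systemPrompt: string
): GeminiRequestBody {
  return {
    system_instruction: { parts: [{ text: systemPrompt }] },
    contents: [
      {
        parts: [
          { inline_data: { mime_type: mimeType, data: base64Video } },
          { text: "List every visible component with its timestamps." },
        ],
      },
    ],
    generationConfig: { responseMimeType: "application/json" },
  };
}
```

The size helper shows why payload limits matter here: a 75 MiB video becomes exactly 100 MiB once encoded.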
Challenges we ran into
- Multimodal API Limitations: Next.js API routes default to a 4MB payload limit. We had to configure a custom serverActions.bodySizeLimit in next.config.ts to support up to 100MB base64 video payloads seamlessly.
- Video-to-UI Sync: Synchronizing the HTML5 <video> player's currentTime with a dynamically generated list of React components required precise state management to ensure the active card always matched what was on screen.
- Aspect Ratio Agnosticism: We wanted the UI to look perfect whether the user uploaded a vertical phone video or a landscape desktop capture. We solved this using a strict flexbox layout with h-screen and object-fit: contain.
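The video-to-UI sync challenge boils down to mapping a playhead position to a set of active cards. Below is a minimal sketch of that mapping, assuming each card carries a visibility interval; the TimedCard type and both function names are hypothetical, not the code we shipped.

```typescript
// Hypothetical minimal card: an id plus the interval in which it is on screen.
interface TimedCard {
  id: string;
  startSec: number; // inclusive
  endSec: number;   // exclusive
}

// Ids of every card whose interval contains the current playhead position.
function activeCardIds(cards: TimedCard[], currentTime: number): string[] {
  return cards
    .filter((c) => currentTime >= c.startSec && currentTime < c.endSec)
    .map((c) => c.id);
}

// True when two id lists match element by element, so the UI can skip
// a state update (and a re-render) while the active set is unchanged.
function sameIds(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((id, i) => id === b[i]);
}
```

One way to use this is inside a timeupdate listener on the video element: compute activeCardIds from the element's currentTime and update React state only when sameIds returns false.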
Accomplishments that we're proud of
- Building a fully functional, end-to-end prototype from scratch in under 60 minutes.
- Achieving a stunning, premium "Apple-minimal" UI that feels like a native app.
- Successfully forcing Gemini 2.5 Flash to return strictly formatted JSON arrays for reliable frontend parsing without a heavy backend database.
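Even with a strict prompt, it is safer for the frontend to validate what the model returns before rendering it. The sketch below shows one way to do that; the DetectedComponent fields (name, timestampSec) are assumed for illustration and are not our actual schema.

```typescript
// Assumed schema for one entry in the model's JSON array.
interface DetectedComponent {
  name: string;
  timestampSec: number;
}

// Type guard: accepts only objects with a string name and a
// finite, non-negative numeric timestamp.
function isDetectedComponent(value: unknown): value is DetectedComponent {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const name = record["name"];
  const ts = record["timestampSec"];
  return typeof name === "string" && typeof ts === "number" && isFinite(ts) && ts >= 0;
}

// Parse the raw model output and reject anything that is not a
// well-formed array of components. Throws on malformed JSON as well.
function parseComponents(raw: string): DetectedComponent[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of components");
  }
  const items: unknown[] = parsed;
  const valid = items.filter(isDetectedComponent);
  if (valid.length !== items.length) {
    throw new Error("One or more components failed validation");
  }
  return valid;
}
```

Failing loudly on a malformed entry keeps a bad model response from silently producing cards that point at the wrong moment in the video.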
What we learned
We learned that Gemini 2.5 Flash's long-context video understanding is remarkably accurate, even for dense, complex objects like PCBs. We also learned how to orchestrate multiple AI models (Gemini for the app, Nemotron + edge-tts for the demo video) to ship a complete product fast.
What's next for ScribeMesh
- Live AR Mode: Point your phone camera at hardware and get live, overlaid documentation.
- PDF Export: Generate print-ready manuals directly from the video.
- Enterprise Integrations: Webhooks to sync ScribeMesh JSON directly to Jira, Notion, or internal wikis for MRO (Maintenance, Repair, and Operations) teams.
Built With
- edge-tts
- gemini
- moviepy
- next.js
- python
- tailwind-css
- typescript