Inspiration
We live in a visual world, but most knowledge is still locked in text or buried inside long, unsearchable videos. We’ve all been there: standing in the garage with greasy hands, trying to change a tire or fix a leaky pipe, frantically scrubbing through a 20-minute YouTube video just to find the 30 seconds of actual instruction.
We built FixIt Buddy to solve this: a tool that could look at the world with us, analyzing a video of a specific problem and instantly converting it into a safe, step-by-step guide. But we had a secondary inspiration: The Speed of Creation. We wanted to prove that with the right AI tooling, a complex, full-stack, multimodal application doesn't need to take weeks to build. Could we go from "Idea" to "Working Prototype" in under 20 minutes?
What it does
FixIt Buddy is a "Visual-to-Text" translator for the physical world.
- Input: Users upload a video (MP4, MOV, WebM) of a task they need to perform—like a flat tire, a broken appliance, or a furniture assembly kit.
- Analysis: The app utilizes Google’s Gemini 3 Flash API to perform frame-by-frame multimodal analysis. It doesn't just look for keywords; it watches the video to understand the physics and context of the scene.
- Output: It generates a clean, Markdown-formatted guide containing:
  - Crucial Safety Warnings (e.g., detecting if a car is on a slope before you jack it up).
  - Tool List: Automatically identified from the video.
  - Step-by-Step Instructions: Concise actions derived from the visual workflow.
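The three-part guide described above can be sketched as a small renderer. The interface and field names here are our own illustration of the output shape, not FixIt Buddy's actual schema:

```typescript
// Hypothetical shape the analysis result is mapped into before display.
interface RepairGuide {
  safetyWarnings: string[]; // hazards spotted before any step begins
  tools: string[];          // tools identified in the footage
  steps: string[];          // ordered actions derived from the video
}

// Render the guide as the Markdown the app would display.
function renderGuideMarkdown(guide: RepairGuide): string {
  const lines: string[] = ["## Safety Warnings"];
  for (const w of guide.safetyWarnings) lines.push(`- ⚠️ ${w}`);
  lines.push("", "## Tools");
  for (const t of guide.tools) lines.push(`- ${t}`);
  lines.push("", "## Steps");
  guide.steps.forEach((s, i) => lines.push(`${i + 1}. ${s}`));
  return lines.join("\n");
}
```

Keeping warnings first in the rendered output mirrors the "Safety First" priority described below.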
How we built it
We built FixIt Buddy using Priset, our own AI-enabled software development tool that champions the "Glass Box" philosophy.
- The Engine: We used Gemini 3 Flash for the core logic because of its massive context window and speed in processing video tokens.
- The Stack: A React (Vite) frontend styled with Tailwind CSS, connected to a Node.js/Express backend written in TypeScript.
- The Workflow: Instead of manually typing boilerplate, we used Priset to scaffold the entire application architecture via natural language prompts.
- The "Director" Mode: When we encountered UI glitches or runtime errors, we didn't just paste logs. We took screenshots of the broken UI, fed them back into the IDE via Priset, and used visual prompting to style the output and fix logic errors in real-time.
Challenges we ran into
- Gemini prompt refinement: Early on, we struggled with extracting specific steps from continuous video footage. The model would sometimes summarize too broadly. We had to refine our system prompts to force Gemini to think like a "Safety Instructor" rather than a "Movie Critic."
- Real-time Debugging: During the build (as captured in our demo video), we hit a server-side warning and a frontend runtime error regarding the video player state.
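The "Safety Instructor, not Movie Critic" reframing can be illustrated with a prompt builder. This wording is a hedged reconstruction of the idea, not our exact production prompt:

```typescript
// Illustrative system prompt: steer the model toward discrete,
// safety-checked steps instead of broad summaries.
function buildSystemPrompt(task: string): string {
  return [
    "You are a Safety Instructor, not a Movie Critic.",
    "Do NOT summarize the video. Instead:",
    "1. List every safety hazard visible before any step begins.",
    "2. List every tool that appears in the footage.",
    `3. Break "${task}" into short, numbered physical actions.`,
    "Each step must be a single concrete action a person can perform.",
  ].join("\n");
}
```

The explicit "Do NOT summarize" instruction is what pulled the model away from movie-critic-style recaps of the footage.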
Accomplishments that we're proud of
- The "Safety First" Feature: We were thrilled when the app correctly identified a safety hazard (changing a tire on a slope) without being explicitly told to look for it. It proved the genuine intelligence of Gemini's multimodal capabilities.
- Speed to Deploy: We went from a blank repository to a fully functional app with error handling in 18 minutes. The UI styling & logo creation took another day.
- The "Glass Box" Fix: Successfully using a screenshot of a runtime error to guide the AI into writing a surgical code fix. It was a perfect example of human-AI collaboration.
What we learned
- Multimodal is the new Standard: Text-only context is insufficient for real-world tasks. The ability of Gemini 3 Flash to "see" the car jack and the slope changed the utility of the app entirely.
- Syntax is no longer the bottleneck: Building this app reinforced our belief that the future of engineering isn't about typing code; it's about architectural direction and creative problem-solving.
- Video Token Efficiency: We learned a lot about optimizing video file sizes (limiting to <50MB) to balance upload speed with API token usage.
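The ~50 MB guideline above lends itself to a simple pre-upload check. This is a minimal sketch with assumed limits and MIME types, not FixIt Buddy's actual validation code:

```typescript
// Hypothetical pre-upload check mirroring the <50 MB guideline
// and the supported formats (MP4, MOV, WebM).
const ALLOWED_TYPES = new Set(["video/mp4", "video/quicktime", "video/webm"]);
const MAX_BYTES = 50 * 1024 * 1024;

// Returns an error message, or null if the file looks uploadable.
function validateVideo(mimeType: string, sizeBytes: number): string | null {
  if (!ALLOWED_TYPES.has(mimeType)) {
    return `Unsupported format: ${mimeType}. Use MP4, MOV, or WebM.`;
  }
  if (sizeBytes > MAX_BYTES) {
    return "Video exceeds the 50 MB limit; trim or compress it first.";
  }
  return null;
}
```

Rejecting oversized files on the client keeps both upload time and API token spend predictable.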
What's next for FixIt Buddy
- Voice Interaction: Adding a "Talk Back" feature so users can ask, "Buddy, repeat step 3," while their hands are busy working.
- AR Overlay: Moving from a static list to an Augmented Reality view where the instructions are pinned to the object in the video.
- Priset Integration: We plan to add FixIt Buddy as a canonical example to the Priset documentation to teach developers how to use visual prompting to build multimodal apps (and how not to panic when a runtime error occurs ;-).
Built With
- code-generation
- express.js
- gemini
- multer
- node.js
- priset
- react
- tailwind-css
- typescript
- vite
- visual-debugging