Inspiration

We have all rage-quit an IKEA box at 11pm. Tiny diagrams, ambiguous arrows, part numbers that mean nothing. We wanted manuals you can actually see, narrated step by step, with a voice agent to ask when you get stuck.

What it does

Scan the barcode on a product box, or drop in a PDF. Assembli finds the manufacturer's manual, generates a real 3D model for every piece, figures out how each part moves at every step, and walks you through assembly in a 3D workspace with spoken narration. Hold the space bar to ask the voice agent a question or jump to any step.

How we built it

A Python backend pulls images and text out of the PDF, uses a Google vision model to detect steps and parts, removes backgrounds, deduplicates repeated pieces, generates textured 3D models with Tripo3D, and records a narration per step with ElevenLabs.

The frontend is a Next.js app with a Three.js workspace, a side panel that syncs the original PDF to the current step, and an open-source scanner that reads every common barcode and QR format from the phone camera.

For barcode lookup we use a Tinyfish web agent that searches the internet for the manufacturer's official manual in real time, streaming its live browser view back to the UI.

The voice agent captures speech in the browser, sends it to Gemini with the current step context, and speaks the reply back through ElevenLabs while auto-muting the step narration.

Challenges we ran into

Barcode databases cost a fortune. Every commercial lookup API wanted real money per request. We paired a free scanner with a web agent that searches the open internet, so instead of paying for one database, the entire internet became our database.

Every box uses a different code format. UPC, EAN, Code 128, QR, and often several at once. One scanner library handles them all. Fun fact: IKEA does not publish UPC codes, but their internal article numbers sit inside the QR codes on every box, so we can still pull the correct manual.
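A cheap guard that fits in front of the web-agent search is to reject misread numeric codes early. UPC-A, EAN-8, and EAN-13 all end in a standard GS1 check digit, which the sketch below verifies. This is an illustration, not Assembli's actual code: the function name is hypothetical, and it says nothing about QR payloads or Code 128, which carry no such digit by default.

```python
def gtin_check_digit_ok(code: str) -> bool:
    """Return True if a UPC-A / EAN-8 / EAN-13 / GTIN-14 string has a valid check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # GS1 rule: weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

A scan that fails this check is most likely a camera misread, so it can be retried locally instead of spending an agent search on a code that cannot exist.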

Dynamic 3D animation. Every manual has different parts in different configurations, so nothing could be hand-authored. Our 3D service generates a model for each unique piece, and a vision model computes a position, rotation, and movement path per component per step.
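To make the "position, rotation, and movement path per component per step" idea concrete, here is a minimal sketch of what such a record could look like and how a renderer might sample it. The `PartMotion` type, its field names, and the Euler-angle convention are assumptions for illustration, not the schema Assembli uses.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class PartMotion:
    """One component's motion within a single assembly step (hypothetical schema)."""
    part_id: str
    path: list[Vec3]     # waypoints from the start position to the resting position
    rotation_deg: Vec3   # final orientation as Euler angles (assumed convention)


def position_at(motion: PartMotion, t: float) -> Vec3:
    """Linearly interpolate along the waypoint path for t in [0, 1].

    Each segment gets an equal share of t, regardless of its length.
    """
    pts = motion.path
    if len(pts) == 1:
        return pts[0]
    t = min(max(t, 0.0), 1.0)
    scaled = t * (len(pts) - 1)
    i = min(int(scaled), len(pts) - 2)   # index of the segment's first waypoint
    f = scaled - i                       # fraction travelled within that segment
    a, b = pts[i], pts[i + 1]
    return tuple(a[k] + (b[k] - a[k]) * f for k in range(3))
```

A step would then be a list of `PartMotion` records, and the workspace would call `position_at` once per frame for each part while the step plays.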

Fitting the camera to every step. One step shows a screw, the next shows a half-built frame. Naively fitting the camera to the whole scene let far-flung parts inflate the bounds and push the camera back, making the assembly look like a speck. We drop tiny outliers from the zoom calculation, fit to a bounding sphere of what remains (same zoom from every angle), then extend the view distances so the outliers still render without clipping.
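The three stages (drop outliers, fit a bounding sphere, widen the clip range) can be sketched in a few lines of plain math. The sketch is written in Python for readability, although the real workspace runs in Three.js. The `Part` and `CameraFit` types, the "radius below 5% of the largest part" outlier rule, the centroid-based sphere, and the margin are all assumptions, not the project's actual values.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Part:
    center: Vec3
    radius: float  # radius of the part's own bounding sphere


@dataclass
class CameraFit:
    target: Vec3     # point the camera looks at
    distance: float  # camera distance from the target
    near: float
    far: float


def fit_camera(parts: list[Part], fov_deg: float = 50.0,
               outlier_ratio: float = 0.05, margin: float = 1.2,
               min_near: float = 0.01) -> CameraFit:
    # 1. Ignore tiny parts when computing zoom (assumed rule: relative size).
    largest = max(p.radius for p in parts)
    core = [p for p in parts if p.radius >= outlier_ratio * largest]

    # 2. Bounding sphere of what remains: centroid plus farthest extent.
    n = len(core)
    target = (sum(p.center[0] for p in core) / n,
              sum(p.center[1] for p in core) / n,
              sum(p.center[2] for p in core) / n)
    radius = max(math.dist(target, p.center) + p.radius for p in core)

    # A sphere fits a vertical field of view at r / sin(fov / 2),
    # independent of the viewing direction.
    distance = margin * radius / math.sin(math.radians(fov_deg) / 2)

    # 3. Widen the clip range so dropped outliers still render.
    reach = max(math.dist(target, p.center) + p.radius for p in parts)
    near = max(min_near, distance - reach)
    far = distance + reach
    return CameraFit(target, distance, near, far)
```

Because the distance depends only on the sphere radius, orbiting the camera never changes the zoom. On a viewport narrower than it is tall, the horizontal field of view would have to be used instead.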

Voice latency. Our first voice-agent model thought too hard for a real-time loop. Swapping to a faster model with reasoning disabled and a tight output cap made the conversation feel instant.

Accomplishments that we're proud of

  • A generic pipeline that turns any product manual into a narrated 3D guide, with no per-product tuning.
  • A scan-to-3D-build flow that works on manuals we have never seen before.
  • A camera-fit that makes the 3D workspace usable across wildly different scales of parts.
  • A voice agent that can actually drive the UI.
  • Replacing a paid API layer with a web agent, cleanly.

What we learned

  • For real-time voice, shaving latency beats raw reasoning quality.
  • Web-scraping agents can replace paid APIs when the data is already public.
  • Bounding-sphere fits are underrated. They stay correct from every angle.
  • Cheap perceptual matching with a vision-model fallback is a great pattern for deduplication.
  • Modern vision models are surprisingly good at reading cluttered assembly diagrams.
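The deduplication pattern from the list above (cheap perceptual matching first, vision model only for the unclear cases) might look roughly like this. It is a sketch under assumptions: the average-hash variant, both thresholds, and the `ask_vision` callback are hypothetical stand-ins, and the images are assumed to be already reduced to small grayscale thumbnails.

```python
from typing import Callable, Sequence

Gray = Sequence[Sequence[int]]  # small grayscale thumbnail, values 0..255


def average_hash(img: Gray) -> int:
    """One bit per pixel: set when the pixel is brighter than the image mean."""
    pixels = [v for row in img for v in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for v in pixels:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def dedupe(images: Sequence[Gray], same_max: int = 2, different_min: int = 10,
           ask_vision: Callable[[Gray, Gray], bool] = lambda a, b: False) -> list[int]:
    """Map each image to the index of its canonical representative."""
    hashes = [average_hash(img) for img in images]
    reps: list[int] = []    # indices of images kept as unique parts
    canon: list[int] = []
    for i, h in enumerate(hashes):
        match = None
        for r in reps:
            d = hamming(h, hashes[r])
            # Clear match: accept. Clear mismatch: skip. In between: ask the model.
            if d <= same_max or (d < different_min and ask_vision(images[r], images[i])):
                match = r
                break
        if match is None:
            reps.append(i)
            canon.append(i)
        else:
            canon.append(match)
    return canon
```

The expensive model call only runs for pairs whose hash distance falls between the two thresholds, which keeps the cost low when most parts are obvious duplicates or obviously different.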

What's next for Assembli

  • AR overlay. Project the current step onto the parts on your table through your phone camera.
  • Multi-language narration. Pipe a language selector through to the voice service.
  • Community manuals. Cache every uploaded guide for the next person who scans that barcode.
  • Parts-check on startup. Scan your pile of parts and get warned if a piece is missing before you open the bag.
  • Error recovery. If you put something on backwards, the voice agent notices and walks you back.
