Inspiration

Everyone who has ever assembled flat-pack furniture knows the pain — confusing arrows, unclear diagrams, and the eternal question: which screw is this?
We wanted to turn that frustration into clarity. Inspired by IKEA manuals, we asked: What if AI could actually read an instruction manual and bring it to life in 3D?
That idea became the foundation for our project — transforming static manuals into interactive, visual, and conversational guides anyone can understand.


What It Does

Our system transforms furniture manuals (PDFs or images) into interactive 3D visualizations with voice-powered assistance.

Upload a manual and the system automatically:

  • Extracts parts, step numbers, and text using Gemini 2.5 Pro for structured understanding
  • Generates Scene JSON and step code via Gemini 2.5 Flash
  • Renders step-by-step animations with labeled components
  • Lets users talk to their manual through voice commands and spoken responses

The result: a blueprint-style 3D visualization showing each assembly step clearly, with part callouts and thumbnails for quick reference.


Some Perspectives From A First Time Hacker (Senan's Writings and Ramblings)

This was my first hackathon, and before we get to the technical part of it all, I’d love to share some thoughts.

I’ve never attended a hackathon before, hence the first-time hacker badge, which I wear with pride (metaphorically; no badges were given out, unfortunately). This isn’t to say I haven’t built software: over the past 4 years of my university career, I’ve built a great many software projects, both on my own and in class, some of which I even launched and got real users for. I’ve also had the privilege of 6 internships where I worked on building digital products at scale, with thousands to millions of DAU.

You may think this has prepared me for a hackathon. If you do, you’re wrong. A hackathon is a different beast entirely. You meet strangers over Discord (in my case at least), you spend an hour thinking of an idea, and then you get to building. No clear product requirements or sprint plans. No such thing as a QA team or in-depth code reviews. Most of the time, you run it and pray to whichever higher power you do or don’t believe in that it’ll work.

Now, despite having spent the past 8 years of my life coding in some capacity, this happens to be the first time I’ve done so while I’m on 3 hours of sleep and an uncountable number of Red Bulls — I lost track after 6. It’s weird that in a weekend I’ve learned more about myself and building than at almost any other point in my professional and academic career thus far. Learning to delegate tasks with people I’ve known for an hour, trying to parse through pages of documentation, and stitching together what I can with the wonders of vibe coding due to some unbelievable time constraints — it’s certainly an experience, and one I’m glad to have had.

As I write this, it’s 3:21 a.m., and I’ve been awake for some 22 hours. And yet this small building in a big city is roaring alive with kids from all over the world. All working away toward the same goal. The energy of the people around me is one I don't think I have the words to describe. It's reignited a passion for building and developing that I almost forgot I had. It’s a surreal feeling and a reminder of just how great a weekend this has been. From meeting amazing people in industry and academia (my glorious king David Malan), getting to see a campus that’s 230 years older than the country I’m from (Leafs fan for life, by the way), to scouring an entire building to find a desk comfy enough to sleep on — and most importantly, learning more than I ever thought I could in the span of 36 hours. I’m lucky to be leaving this hackathon a better developer and a more ambitious person than when I came in.

Thanks, HackHarvard, and all the amazing folks who made this weekend possible :)


How We Built It

Frontend

  • Next.js + React + TypeScript for a clean, minimal interface centered on 3D visualization
  • Three.js renders procedural step animations from Gemini’s generated Scene JSON
  • Web Speech API handles real-time voice input and text-to-speech playback
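
As a rough illustration of the voice side, the sketch below maps a recognized transcript to a navigation command. It assumes the transcript string has already come back from the browser's speech recognition; the `VoiceCommand` type and both function names are hypothetical and not taken from the project.

```typescript
// Hypothetical command shape, for illustration only.
type VoiceCommand =
  | { kind: "next" }
  | { kind: "previous" }
  | { kind: "repeat" }
  | { kind: "goto"; step: number }
  | { kind: "question"; text: string };

// Map a speech-recognition transcript to a command.
// Anything that is not a navigation phrase is treated as a free-form question.
function parseVoiceCommand(transcript: string): VoiceCommand {
  const text = transcript.trim().toLowerCase();
  const goto = text.match(/\b(?:go to|show|open)\s+step\s+(\d+)\b/);
  if (goto) return { kind: "goto", step: Number(goto[1]) };
  if (/\b(next|continue)\b/.test(text)) return { kind: "next" };
  if (/\b(previous|back)\b/.test(text)) return { kind: "previous" };
  if (/\b(repeat|again)\b/.test(text)) return { kind: "repeat" };
  return { kind: "question", text: transcript.trim() };
}

// Apply a command to the current step index (0-based), clamped to the valid range.
function applyCommand(current: number, total: number, cmd: VoiceCommand): number {
  const clamp = (n: number) => Math.min(Math.max(n, 0), total - 1);
  switch (cmd.kind) {
    case "next":
      return clamp(current + 1);
    case "previous":
      return clamp(current - 1);
    case "goto":
      return clamp(cmd.step - 1); // spoken step numbers are 1-based
    default:
      return current; // "repeat" and questions do not move the step
  }
}
```

Keeping the parser a pure function means it can be exercised without a microphone; the recognition and speech-synthesis calls would sit in a thin layer around it.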

Backend

  • Next.js API routes orchestrate uploads and extraction (see frontend/app/api/extract-steps/route.ts)
  • An Express + TypeScript service powers Gemini interactions
  • Google Generative AI SDK bridges backend calls to Gemini 2.5 Pro (for extraction) and Gemini 2.5 Flash (for code generation)
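
Model replies do not always arrive as clean JSON, so a backend like this typically needs a parsing layer. The sketch below is one possible shape for it, not the project's actual code: `GenerateFn` is a stand-in for whatever SDK call the service makes, and `extractJson` and `generateJson` are hypothetical helper names.

```typescript
// Stand-in for the real SDK call (an assumption): prompt in, raw reply text out.
type GenerateFn = (prompt: string) => Promise<string>;

// Three backticks, built at runtime so this listing stays readable when embedded.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Pull a JSON value out of a reply that may be wrapped in a Markdown code fence.
function extractJson(reply: string): unknown {
  const fenced = reply.match(FENCED_BLOCK);
  const body = fenced?.[1] ?? reply;
  return JSON.parse(body.trim());
}

// Ask the model, parse the reply, and retry with the parse error appended on failure.
async function generateJson(
  generate: GenerateFn,
  prompt: string,
  maxAttempts = 3,
): Promise<unknown> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const fullPrompt = lastError
      ? `${prompt}\n\nYour previous reply was not valid JSON (${lastError}). Reply with JSON only.`
      : prompt;
    const reply = await generate(fullPrompt);
    try {
      return extractJson(reply);
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts: ${lastError}`);
}
```

Passing the model call in as a function keeps the retry logic testable with a fake generator and independent of any particular SDK version.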

Pipeline

  1. PDFs are converted to page images via PDF.js
  2. Text and diagram data are extracted from each page image
  3. Gemini 2.5 Pro identifies tools, parts, and instructions
  4. Gemini 2.5 Flash generates Scene JSON describing 3D structures and assembly sequences
  5. The frontend renders dynamic Three.js visualizations for each step
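
To make steps 4 and 5 more concrete, here is a minimal sketch of what a Scene JSON document and a consistency check might look like. The schema is an assumption for illustration (the real one is not shown in this write-up), as are the names `SceneJson`, `validateScene`, and `partsVisibleAt`.

```typescript
// Hypothetical Scene JSON shape; the project's real schema may differ.
interface ScenePart {
  id: string;
  label: string;
  size: [number, number, number]; // box dimensions in arbitrary units
}

interface SceneStep {
  number: number; // 1-based step number from the manual
  instruction: string;
  addParts: string[]; // ids of parts introduced in this step
}

interface SceneJson {
  parts: ScenePart[];
  steps: SceneStep[];
}

// Return a list of problems; an empty list means the scene is internally consistent.
function validateScene(scene: SceneJson): string[] {
  const problems: string[] = [];
  const known = new Set(scene.parts.map((p) => p.id));
  if (known.size !== scene.parts.length) problems.push("duplicate part id");
  const placed = new Set<string>();
  scene.steps.forEach((step, i) => {
    if (step.number !== i + 1) {
      problems.push(`step at position ${i + 1} is numbered ${step.number}`);
    }
    for (const id of step.addParts) {
      if (!known.has(id)) {
        problems.push(`step ${step.number} references unknown part "${id}"`);
      } else if (placed.has(id)) {
        problems.push(`step ${step.number} adds part "${id}" a second time`);
      } else {
        placed.add(id);
      }
    }
  });
  return problems;
}

// Ids of the parts that should be on screen once the given step has played.
function partsVisibleAt(scene: SceneJson, stepNumber: number): string[] {
  const visible: string[] = [];
  for (const step of scene.steps) {
    if (step.number <= stepNumber) visible.push(...step.addParts);
  }
  return visible;
}
```

A renderer could call `partsVisibleAt` for the current step and create or show one mesh per returned id, which keeps the Three.js layer free of any parsing logic.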

Challenges We Faced

  • Parsing diverse and inconsistent manual layouts
  • Getting Gemini to produce consistent and valid Scene JSON
  • Designing minimal 3D representations that still convey assembly detail
  • Implementing a pseudo-sandboxed 3D runner that safely executes generated code inside the browser
  • Making voice interactions feel responsive despite latency from text processing
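
The pseudo-sandboxed runner could look roughly like the sketch below. This is an assumed design, not the project's implementation: generated code receives a single frozen `api` object, common globals are shadowed, and suspicious tokens are rejected up front. `StepApi`, `runStepCode`, and the token list are all hypothetical, and this kind of shadowing is a guardrail, not a real security boundary; a sandboxed iframe or a Web Worker would give stronger isolation.

```typescript
// Hypothetical drawing API: the only surface generated step code is meant to touch.
interface StepApi {
  addBox(id: string, size: [number, number, number]): void;
  moveTo(id: string, position: [number, number, number]): void;
}

type RunResult = { ok: true } | { ok: false; error: string };

// Globals that are shadowed with `undefined` inside the generated code's scope.
const SHADOWED = [
  "window", "document", "globalThis", "self",
  "fetch", "XMLHttpRequest", "localStorage", "Function",
];

// Tokens that commonly appear in escape attempts; rejected before anything runs.
const FORBIDDEN = /\b(import|eval|constructor|prototype|__proto__)\b/;

function runStepCode(code: string, api: StepApi): RunResult {
  if (FORBIDDEN.test(code)) {
    return { ok: false, error: "generated code uses a forbidden token" };
  }
  try {
    // Each shadowed name becomes a parameter that is never passed, so it is undefined.
    const run = new Function("api", ...SHADOWED, `"use strict";\n${code}`);
    // Freeze a copy so generated code cannot replace the API methods.
    run(Object.freeze({ ...api }));
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a result object instead of throwing lets the caller fall back to a static view of the step when generated code fails.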

Accomplishments We’re Proud Of

  • Automatically structuring assembly data from unmodified PDF manuals
  • Rendering clean, animated 3D sequences that mirror instruction steps
  • Prototyping a voice interface that enables natural spoken queries
  • Delivering a cohesive full-stack system that runs end-to-end on real IKEA manuals

What We Learned

  • Gemini 2.5 Pro handles spatial reasoning surprisingly well for 2D-to-3D mapping
  • In-browser code execution needs strict guardrails even when sandboxed by convention
  • Minimalism in technical drawing-style visualization improves clarity over photorealism

What’s Next

  • Wire the voice interface to live Gemini Q&A responses for real conversational help
  • Integrate ElevenLabs AI voices for natural, expressive speech
  • Add multilingual support and simplified steps for accessibility
  • Introduce AR and VR modes, letting users project assembly instructions directly onto their workspace or view full-scale models in immersive 3D
  • Export sequences to GIF or MP4 for retailer product pages
  • Expand beyond furniture to cover any type of technical manual

Our vision: make every manual interactive, understandable, and immersive.
