Inspiration
Watching a grandmother struggle to understand her granddaughter's birthday video because it was in English, not Spanish, broke our hearts. That moment exposed a painful truth: millions of precious memories are trapped behind language barriers, and traditional video playback keeps us as passive observers rather than active participants in our own memories.
We asked ourselves: what if you could literally step into your videos and explore them like a 3D space? What if your voice could transcend language barriers while keeping its emotional soul intact?
Rewind was born to transform how we experience memories—making them explorable in 3D and accessible in any language, all while keeping you at the center through your own cloned voice.
What it does
Rewind transforms ordinary videos into extraordinary experiences through three innovations:
3D Spatial Exploration - Converts 2D videos into navigable 3D point clouds. Step inside your memories, explore frozen moments from any angle, and click on objects to hear what's happening. It's like being inside a photograph.
AI Scene Understanding - TwelveLabs detects objects, people, and actions automatically. Google Gemini generates natural descriptions of each scene. Search your memories like "show me all moments with Emma smiling."
VoiceBridge Technology - ElevenLabs clones your voice from just 30 seconds of audio. Translate scene descriptions into 29+ languages while preserving your unique vocal signature. Now your grandmother in Mexico can hear your voice narrating in perfect Spanish.
The result? Universal accessibility without losing authenticity. Your memories become explorable 3D spaces that anyone, anywhere, can understand—in your voice.
How we built it
Frontend - React 18 with Three.js for WebGL-powered 3D rendering. Tailwind CSS for our cosmic-themed glassmorphic UI. Canvas API for animated star fields and particle systems.
Backend - FastAPI handling async operations. FFmpeg extracting video frames. Firebase managing auth, storage, and database.
AI Pipeline - TwelveLabs for video analysis. Google Gemini for scene descriptions and translation. ElevenLabs for voice cloning and synthesis. MiDaS for depth estimation from 2D frames.
Key Flow:
- User uploads video → Firebase Storage
- FFmpeg extracts frames → MiDaS generates depth maps
- TwelveLabs analyzes scenes → Gemini creates descriptions
- User records 30-second voice sample → ElevenLabs clones voice
- User selects language → Gemini translates → ElevenLabs synthesizes in cloned voice
- Three.js renders 3D memory space with interactive narration
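The flow above can be sketched end-to-end. Every function here is a stub standing in for the real service call (FFmpeg, MiDaS, TwelveLabs, Gemini, ElevenLabs); the names, signatures, and return shapes are our own illustration, not those vendors' APIs.

```python
# Hypothetical sketch of the upload-to-narration pipeline. Each stage is a
# stub; in the real system these wrap FFmpeg, MiDaS, TwelveLabs, Gemini,
# and ElevenLabs respectively.

def extract_frames(video_path):
    # FFmpeg would decode real frames; we fake three frame handles.
    return [f"{video_path}#frame{i}" for i in range(3)]

def estimate_depth(frame):
    # MiDaS would return a per-pixel depth map for the frame.
    return {"frame": frame, "depth": "map"}

def describe_scene(frame):
    # TwelveLabs + Gemini would produce a natural-language description.
    return f"description of {frame}"

def clone_voice(voice_sample):
    # ElevenLabs voice cloning from a ~30-second sample.
    return "voice-id-123"

def synthesize(text, voice_id, language):
    # Gemini translation + ElevenLabs synthesis in the cloned voice.
    return f"[{language}:{voice_id}] {text}"

def build_memory(video_path, voice_sample, language):
    frames = extract_frames(video_path)
    depths = [estimate_depth(f) for f in frames]
    descriptions = [describe_scene(f) for f in frames]
    voice_id = clone_voice(voice_sample)
    narrations = [synthesize(d, voice_id, language) for d in descriptions]
    return {"depths": depths, "narrations": narrations}
```

The real pipeline runs these stages asynchronously behind FastAPI, but the data dependencies are exactly as shown: depth and descriptions fan out from frames, and narration needs both a description and a cloned voice.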
Team Division:
- Peace Enesi: 3D rendering and depth processing pipeline
- Ohinoyi Moiza: Frontend UI and VoiceBridge interface
- Joanna Chimalilo: Backend API and AI service integration
Challenges we ran into
Depth Estimation Inconsistencies - Monocular depth from 2D videos produced flickering artifacts with moving objects. We implemented temporal smoothing algorithms and hybrid depth calibration to stabilize the 3D reconstruction.
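The temporal smoothing can be illustrated as an exponential moving average over successive depth maps, which damps frame-to-frame flicker at the cost of some lag. This is a minimal sketch under that assumption; real MiDaS output is a 2D float array, flattened here to plain lists for brevity.

```python
# Exponential moving average over a sequence of depth maps.
# Lower alpha = heavier smoothing (more weight on the running average).

def smooth_depth_sequence(depth_maps, alpha=0.3):
    """depth_maps: list of equal-length lists of per-pixel depth values."""
    smoothed = [depth_maps[0][:]]  # first frame passes through unchanged
    for frame in depth_maps[1:]:
        prev = smoothed[-1]
        # Blend the new frame with the running estimate, pixel by pixel.
        smoothed.append([alpha * d + (1 - alpha) * p for d, p in zip(frame, prev)])
    return smoothed
```

A depth value that spikes from 1.0 to 9.0 and back is pulled toward the running estimate instead of jumping, which is what stabilizes the point cloud between frames.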
3D Performance on Lower-End Devices - Rendering thousands of point cloud vertices caused frame drops. We added Level-of-Detail systems, instanced rendering, frustum culling, and Web Workers for background processing to maintain 60fps.
Voice Quality Across Languages - ElevenLabs worked perfectly for English but lost emotional nuance in tonal languages like Mandarin. We fine-tuned parameters per language family and adjusted prosody preservation techniques.
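The per-language-family tuning amounts to a lookup from language code to synthesis parameters. The parameter names below mirror ElevenLabs-style voice settings (stability, similarity boost), but the family buckets and the specific values are our own heuristics, not documented API guidance.

```python
# Hypothetical per-language-family synthesis settings. Tonal languages get
# lower stability so pitch can move more freely; values are illustrative.

LANGUAGE_FAMILY = {
    "es": "romance", "fr": "romance", "pt": "romance",
    "zh": "tonal", "vi": "tonal", "th": "tonal",
    "en": "germanic", "de": "germanic",
}

FAMILY_SETTINGS = {
    "tonal":    {"stability": 0.35, "similarity_boost": 0.90},
    "romance":  {"stability": 0.55, "similarity_boost": 0.80},
    "germanic": {"stability": 0.60, "similarity_boost": 0.75},
}

def voice_settings_for(language_code):
    # Unknown codes fall back to the germanic defaults.
    family = LANGUAGE_FAMILY.get(language_code, "germanic")
    return FAMILY_SETTINGS[family]
```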
Narration Generation Latency - The full pipeline (translate → synthesize → upload) took 8-12 seconds. We implemented aggressive caching by (scene_id, language, voice_id), pre-generated demo narrations, and added animated loading states to improve perceived speed.
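The caching strategy can be sketched with a dictionary keyed on the composite tuple; the `synthesize` stub stands in for the full translate → synthesize → upload round trip.

```python
# Cache narrations by (scene_id, language, voice_id) so the 8-12 second
# pipeline only runs once per unique combination.

_narration_cache = {}
calls = {"synthesize": 0}  # instrumentation for the example

def synthesize(scene_id, language, voice_id):
    # Stand-in for translate -> synthesize -> upload (the slow path).
    calls["synthesize"] += 1
    return f"audio://{scene_id}/{language}/{voice_id}"

def get_narration(scene_id, language, voice_id):
    key = (scene_id, language, voice_id)
    if key not in _narration_cache:
        _narration_cache[key] = synthesize(scene_id, language, voice_id)
    return _narration_cache[key]
```

The same idea works with any shared store (e.g. Firebase) as long as the key includes all three dimensions: switching language or voice must produce a cache miss, while replaying a scene must not.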
Dependency Conflicts - React 19, Three.js 0.180, and @react-three/fiber had breaking peer dependency issues. We systematically downgraded to stable versions (React 18.3, Three.js 0.160, @react-three/fiber 8.15) and used npm's --legacy-peer-deps flag.
Disk Space Crisis - During development, we hit 100% disk usage, which blocked npm installs. We had to aggressively clear caches, delete old node_modules folders, and manage storage throughout the hackathon.
Accomplishments that we're proud of
It Actually Works - We built a functional demo that chains three complex AI services (TwelveLabs → Gemini → ElevenLabs) with real 3D rendering. You can upload a video, clone your voice, and hear yourself speaking French.
VoiceBridge Technology - The emotional impact of hearing your own voice speaking a language you don't know is magical. We created something that preserves human connection across language barriers.
Award-Winning Design - Our cosmic-themed landing page with glassmorphic UI, animated star fields, and smooth Three.js orb animations looks production-ready. The wormhole effect in the final CTA is mesmerizing.
Real-Time 3D Performance - Optimizing Three.js to render complex point clouds at 60fps on various devices taught us advanced graphics programming we'll use forever.
24-Hour Full-Stack Build - We went from concept to deployable demo in one hackathon, with working frontend, backend, AI pipeline, and infrastructure. The team collaboration was seamless despite working on different continents.
Solving Real Problems - This isn't just cool tech—it solves actual pain points. Families separated by language, educators reaching global students, content creators accessing international audiences. We built something meaningful.
What we learned
Technical Deep Dives
- Advanced Three.js optimization: instanced rendering, LOD systems, shader programming
- AI service orchestration with proper error handling and retry logic
- Web Audio API for real-time waveform visualization
- Monocular depth estimation and 3D reconstruction techniques
- Firebase architecture for real-time collaborative applications
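The error handling and retry logic mentioned above can be sketched as a generic wrapper around each AI service call; the attempt count and backoff schedule are illustrative, not what any vendor prescribes.

```python
# Retry a flaky callable with exponential backoff. Used around each
# external AI service call (TwelveLabs, Gemini, ElevenLabs) so a single
# transient failure doesn't break the whole pipeline.
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

In practice you would catch only the exception types the client library raises for transient failures (timeouts, rate limits) rather than bare `Exception`.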
Product Insights
- The most powerful technology preserves human emotion—voice cloning resonates because it keeps you in the narration
- Accessibility doesn't mean compromise—you can make content universal without losing authenticity
- Demo-driven development works—we built for the pitch first, ensuring every feature tells a story
Hackathon Strategy
- Clearly defined roles prevent overlap and enable parallel development
- Documenting decisions in README files saved hours of repeated explanations
- Distinguishing between "vision" and "24-hour demo" kept us focused
- Sometimes downgrading packages is smarter than fighting cutting-edge bugs
Team Collaboration
- Async communication across time zones requires crystal-clear documentation
- Trust your teammates' domain expertise—micromanaging kills velocity
- Celebrate small wins during the grind—they keep morale high at 3am
What's next for Rewind
Immediate Features
- Complete the MiDaS depth pipeline for any uploaded video (currently limited to demo scenes)
- Mobile AR app - explore memories in your living room with phone camera
- Social sharing with embedded narrations
- Batch processing for multiple videos at once
Advanced Capabilities
- VR integration for fully immersive memory exploration (Oculus, Vision Pro)
- Real-time collaboration—multiple users exploring the same memory space together
- AI-powered memory search: "Show me all moments with grandma smiling" or "Find the part where we opened gifts"
- Emotion detection to adjust narration tone based on facial expressions
- Voice aging - hear how your childhood voice would sound narrating recent memories
Enterprise Applications
- Sports analysis platforms for coaches reviewing plays in 3D
- Medical training for surgical procedure exploration from any angle
- Real estate virtual tours with multilingual narration
- Corporate training videos accessible in employees' native languages
- Documentary filmmaking with interactive 3D exploration
Community Impact
Our vision is simple: language should never be a barrier to sharing life's precious moments. Every grandmother should hear her grandchild's laughter in her native tongue. Every family separated by borders should feel connected through shared memories. Every memory deserves to be explored, not just watched.
Rewind is just the beginning of universal memory accessibility.



