Inspiration
Have you ever watched Messi and wondered what the game looks like from his point of view?
Project Horizon explores identity through perspective. In sports, a player’s identity is shaped not only by their actions, but by how they perceive the game: positioning, timing, awareness, and decision-making.
By reconstructing a first-person POV from third-person footage with machine learning, we let viewers momentarily inhabit another identity.
What It Does
Project Horizon is an AI system that converts third-person videos into realistic first-person POV footage.
Users can select any player in a soccer video and experience the game exactly as that player would see it.
How We Built It
Input Video (Third-Person View)
We start with short third-person videos capturing target actions across diverse scenes.

Scene Understanding (backboard.io)
Video frames are sent to backboard.io, which routes them to Gemini and TwelveLabs to generate structured scene descriptions and to identify the target subject and intended first-person viewpoint.

Depth-Based Prior Rendering
Each video is converted into depth maps using a monocular depth estimation model, producing a coarse 3D scene representation.
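To make the depth-based prior concrete, here is a minimal NumPy sketch, our own simplified illustration rather than the project's actual rendering code, assuming a pinhole camera model: it lifts a depth map to 3D points and re-projects them from a new (first-person) camera pose.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) into 3D camera-space points, shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def render_prior(points, pose, fx, fy, cx, cy, h, w):
    """Project 3D points through a new camera pose into a coarse depth sketch."""
    # pose: 4x4 world-to-camera transform for the first-person viewpoint
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pose @ pts_h.T).T[:, :3]
    front = cam[:, 2] > 1e-6                      # keep points in front of the camera
    cam = cam[front]
    u = (fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = (fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sketch = np.full((h, w), np.inf)
    # z-buffer: keep the nearest depth per pixel
    np.minimum.at(sketch, (v[inside], u[inside]), cam[inside, 2])
    sketch[np.isinf(sketch)] = 0.0                # pixels with no points stay empty
    return sketch
```

A video of such re-rendered frames encodes the camera motion and spatial layout that the diffusion model is later conditioned on.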
Using the detected subject location and camera geometry, we render a rough first-person sketch video that encodes camera motion, spatial layout, and scene constraints.

Model Training
We trained our model by fine-tuning the Wan 2.1 Image-to-Video (14B) diffusion model with LoRA on a curated dataset of paired third-person and first-person videos.
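The core idea of LoRA is to freeze the pretrained weights and train only a small low-rank update, which keeps fine-tuning a 14B model tractable. A toy NumPy sketch of the mechanism (our illustration, not the actual Wan 2.1 training code):

```python
import numpy as np

class LoRALinear:
    """Toy LoRA layer: frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, w, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                           # frozen base weight, shape (out, in)
        self.a = rng.normal(0.0, 0.02, (rank, w.shape[1]))   # trainable down-projection
        self.b = np.zeros((w.shape[0], rank))                # trainable up-projection, zero-init
        self.scale = alpha / rank                            # standard LoRA scaling

    def forward(self, x):
        # Base path plus scaled low-rank path; because B starts at zero, the
        # adapter is a no-op at initialization, so fine-tuning begins exactly
        # from the pretrained model's behavior.
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T
```

During training, only `a` and `b` receive gradients; the adapter weights are what we later load on top of the cached base model at inference time.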
This training explicitly teaches the model the transformation from third-person inputs to realistic first-person camera trajectories and visuals.

Inference on H100 (Modal)
At inference time, the model is conditioned on the third-person video, the depth-based first-person prior, and the generated text prompt.
Inference runs on NVIDIA H100 GPUs via Modal, with cached base weights and LoRA adapters.
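A deployment along these lines can be sketched with Modal's Python SDK. This is a simplified illustration: `load_wan_pipeline` is a hypothetical placeholder for project-specific loading code, and the image contents and paths are assumptions.

```python
import modal

app = modal.App("project-horizon")

# Persistent volume so the 14B base weights and LoRA adapters are downloaded
# once and reused across calls instead of on every cold start.
weights = modal.Volume.from_name("horizon-weights", create_if_missing=True)

image = modal.Image.debian_slim().pip_install("torch", "diffusers", "peft")

@app.function(gpu="H100", image=image, volumes={"/weights": weights}, timeout=1800)
def generate_pov(third_person_video: bytes, prior_video: bytes, prompt: str) -> bytes:
    # load_wan_pipeline is a placeholder for the project-specific code that
    # applies the LoRA adapter on top of the cached Wan 2.1 base weights.
    pipeline = load_wan_pipeline(
        base="/weights/wan2.1-i2v-14b",
        lora="/weights/horizon-lora",
    )
    # Condition on the third-person clip, the depth-based first-person prior,
    # and the generated text prompt, matching the conditioning described above.
    return pipeline(video=third_person_video, prior=prior_video, prompt=prompt)
```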
Challenges We Ran Into
- Securing access to H100 GPUs for both training and inference
- Extremely high compute costs and long-running jobs during experimentation
Accomplishments We’re Proud Of
- Successfully trained and ran inference on large-scale video diffusion models for the first time
- Built an end-to-end pipeline combining vision models, depth estimation, and generative video
What We Learned
- How to be resourceful under constraints
- After AWS rejected our compute requests, we discovered and leveraged Modal as a practical way to access H100 GPUs and unblock our project
What’s Next for Project Horizon
- Further fine-tuning for additional domains such as robotics and embodied AI
- Generating high-quality synthetic first-person video data for AI research labs
- Exploring partnerships and commercialization opportunities, including applications in data generation for humanoid robotics companies like Figure AI and Neo
Built With
- backboard
- gemini
- openai
- python
- react
- typescript
