Inspiration

Have you ever bought the perfect sweater online, only to find it doesn't fit the way you wanted? That's the core problem with online shopping today: you can see the item, but you can't see it on yourself. We were inspired by the idea of making online try-on feel more immediate and personal, allowing users to stand in front of a camera, describe the look they want, and see that style direction on themselves right away.

What it does

Model Studio lets a user record a short video of themselves, enter a prompt describing the style or clothing change they want, and receive back a transformed version of that clip. The goal is to let someone explore new outfits, silhouettes, and aesthetics before buying anything or changing clothes physically.

On the product side, the app gives users a clean flow for:

  • recording a short video
  • entering a styling prompt
  • sending the video to the backend for processing
  • reviewing the resulting transformed clip
  • tweaking the clothing based on that clip

On the model side, the system takes the current prompt plus the user's past prompts for context, queries a web-scraped database covering 30+ retail stores to find the closest matching garment, and processes a 24 fps video frame by frame through a multi-stage pipeline. Each frame is analyzed and prepared with clothing-focused computer vision steps such as detection, masking, and garment captioning, then passed to purpose-built generation models designed for dressing a subject from an image. The frames are processed individually while preserving useful context for visual consistency, then stitched back together into a final video output.
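As a rough illustration of the garment lookup step, the sketch below matches a prompt against a tiny in-memory catalog. It is an assumption-heavy stand-in, not our actual retrieval code: it uses plain bag-of-words cosine similarity, and the `closest_garment` function, the `history_weight` parameter, and the catalog entries are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase a string and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_garment(prompt, past_prompts, catalog, history_weight=0.5):
    """Return the catalog entry whose description best matches the prompt.

    Past prompts contribute context, but with a lower weight (hypothetical
    value) than the current prompt.
    """
    query = Counter()
    for token, count in tokenize(prompt).items():
        query[token] += count
    for past in past_prompts:
        for token, count in tokenize(past).items():
            query[token] += history_weight * count
    return max(catalog, key=lambda item: cosine(query, tokenize(item["description"])))


# Made-up catalog rows standing in for the scraped retail database.
catalog = [
    {"store": "store-a", "description": "oversized grey wool sweater"},
    {"store": "store-b", "description": "slim black denim jacket"},
    {"store": "store-c", "description": "red floral summer dress"},
]

best = closest_garment("a black jacket", ["something in denim"], catalog)
print(best["store"], "->", best["description"])
```

A production version would more likely compare learned text or image embeddings, but the shape is the same: build a query from the current and past prompts, score every garment, and keep the best match.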

How we built it

We built the user-facing app with React on the frontend and FastAPI on the backend, aiming for an experience that feels simple and direct. The frontend handles camera capture, prompting, state transitions, and review. The backend accepts the uploaded video, prepares jobs, extracts frames, runs the generation pipeline, and returns the processed output.
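One way to picture the backend flow is as a job that moves through a fixed sequence of stages. The sketch below is only an illustration under that assumption: the `Job` class, the `JobStatus` names, and the example file path are hypothetical and are not taken from our codebase.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobStatus(Enum):
    UPLOADED = "uploaded"
    EXTRACTING_FRAMES = "extracting_frames"
    GENERATING = "generating"
    STITCHING = "stitching"
    DONE = "done"


# Enum members iterate in definition order, which is the processing order.
_ORDER = list(JobStatus)


@dataclass
class Job:
    prompt: str
    video_path: str
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: JobStatus = JobStatus.UPLOADED

    def advance(self) -> JobStatus:
        """Move to the next stage; a finished job stays DONE."""
        index = _ORDER.index(self.status)
        if index < len(_ORDER) - 1:
            self.status = _ORDER[index + 1]
        return self.status


job = Job(prompt="black denim jacket", video_path="uploads/clip.webm")
while job.status is not JobStatus.DONE:
    print(job.job_id[:8], job.status.value)
    job.advance()
print(job.job_id[:8], job.status.value)
```

Keeping the status explicit like this lets a frontend poll for progress while the slow per-frame work runs in the background.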

The deeper technical work is in the image-to-video transformation pipeline. We designed an 8-part backend process that takes each frame from a 24 fps video and runs it through several stages, including:

  • subject and clothing detection
  • masking and segmentation
  • garment-aware captioning and prompt enrichment
  • generation models designed specifically for putting clothing onto a person from a photo
  • consistency-aware per-frame processing
  • video reconstruction from processed frames
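The stages listed above can be sketched as a chain of functions applied to each frame. This is a minimal outline under stated assumptions, not the real pipeline: every stage body is a placeholder, the function names and the `state` dictionary are hypothetical, and frames are represented by plain strings so the example runs without any model.

```python
from typing import Callable

FrameState = dict
Stage = Callable[[FrameState], FrameState]


def detect(state: FrameState) -> FrameState:
    # Placeholder: a real detector would return boxes for the person and garments.
    return {**state, "detections": ["person", "upper_body_garment"]}


def segment(state: FrameState) -> FrameState:
    # Placeholder: a real model would produce a pixel mask per detection.
    return {**state, "masks": {name: None for name in state["detections"]}}


def caption(state: FrameState) -> FrameState:
    # Enrich the user prompt with a placeholder description of the current garment.
    enriched = f"{state['prompt']} (replacing: {state['detections'][-1]})"
    return {**state, "enriched_prompt": enriched}


def generate(state: FrameState) -> FrameState:
    # Placeholder for the try-on model; the frame passes through unchanged here.
    return {**state, "output": state["frame"]}


STAGES: list[Stage] = [detect, segment, caption, generate]


def process_frame(frame, prompt, previous_output=None) -> FrameState:
    """Run one frame through every stage, carrying the previous output as context."""
    state = {"frame": frame, "prompt": prompt, "previous_output": previous_output}
    for stage in STAGES:
        state = stage(state)
    return state


def process_video(frames, prompt) -> list:
    """Process frames in order so each one can see the frame generated before it."""
    outputs, previous = [], None
    for frame in frames:
        state = process_frame(frame, prompt, previous)
        previous = state["output"]
        outputs.append(previous)
    return outputs


print(process_video(["frame_0", "frame_1", "frame_2"], "red floral dress"))
```

Passing `previous_output` into each frame is one simple way to give the generation step some temporal context; the reconstruction step then only needs the ordered list of outputs.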

That meant solving both application engineering and model orchestration problems at the same time. We were not just building a UI around an API. We were building a full system that handles capture, preprocessing, inference, consistency, and video output end to end.

Challenges we ran into

One of the biggest challenges was our original ambition: live replacement. We wanted to support something much closer to real-time clothing transformation, where the user could see outfit changes happen almost instantly. In practice, that turned out to be extremely difficult.

The computational burden was too high. Each frame requires heavy vision and generation steps, and when you multiply that across an entire video, the cost becomes significant very quickly. Running that pipeline live would require much more powerful infrastructure, much tighter optimization, and a much larger budget than was realistic for this project.
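A quick back-of-the-envelope calculation shows the gap. At 24 fps, a live pipeline has about 41.7 ms to finish each frame. The per-frame cost used below (5 seconds) is a hypothetical figure chosen only to illustrate the arithmetic, not a measurement from our system.

```python
def realtime_budget_ms(fps: float) -> float:
    """Time available per frame if processing must keep up with playback."""
    return 1000.0 / fps


def offline_processing_seconds(duration_s: float, fps: float, seconds_per_frame: float) -> float:
    """Total processing time when every frame is generated one after another."""
    frame_count = round(duration_s * fps)
    return frame_count * seconds_per_frame


fps = 24
print(f"{realtime_budget_ms(fps):.1f} ms per frame for live playback")  # 41.7

# Hypothetical: a 10-second clip at an assumed 5 s of compute per frame.
total = offline_processing_seconds(duration_s=10, fps=fps, seconds_per_frame=5.0)
print(f"{total / 60:.0f} minutes to process a 10-second clip")  # 20
```

Under that assumed cost, each frame takes over a hundred times longer than the live budget allows, which is why delayed processing was the realistic option.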

It was also technically hard for reasons beyond raw speed. Clothing try-on is not just a style transfer problem. The system has to reason about body shape, pose, garment boundaries, texture, overlap, and consistency from frame to frame. Small errors become very visible once the frames are played back as video. A result that looks acceptable on one still image can fall apart in motion.

We also ran into practical engineering issues:

  • keeping generated clothing stable across adjacent frames
  • making stitched video playback look correct at the original speed
  • balancing quality against processing time
  • getting multiple models and preprocessing steps to work together reliably
  • running meaningful experiments on limited hardware

In the end, we had to make a realistic product decision: instead of true live replacement, we focused on making delayed but high-quality per-frame generation actually work.
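The playback-speed issue comes down to simple arithmetic: the stitched clip only keeps its original duration if it is reassembled at the same frame rate it was extracted at, with no frames dropped. The sketch below shows that check along with an example reassembly command. Using ffmpeg here is an assumption for illustration, and the helper names and file paths are hypothetical.

```python
def playback_seconds(frame_count: int, fps: float) -> float:
    """Duration of a clip made of `frame_count` frames shown at `fps`."""
    return frame_count / fps


def speed_factor(source_fps: float, output_fps: float) -> float:
    """How much faster (>1) or slower (<1) the stitched clip plays than the source."""
    return output_fps / source_fps


def stitch_command(frames_pattern: str, output_path: str, fps: int = 24) -> list[str]:
    """Build an ffmpeg command that encodes numbered frames at a fixed frame rate."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input rate: must match the extraction rate
        "-i", frames_pattern,
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",    # widely compatible pixel format for playback
        output_path,
    ]


frames = 240  # a 10-second clip extracted at 24 fps
print(playback_seconds(frames, 24))   # 10.0 seconds, matches the source
print(playback_seconds(frames, 30))   # 8.0 seconds, plays too fast
print(speed_factor(24, 30))           # 1.25
print(" ".join(stitch_command("frames/frame_%06d.png", "out.mp4")))
```

The command is only built and printed, not executed, so the example runs without ffmpeg installed.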

Accomplishments that we're proud of

We are especially proud that the per-image processing pipeline actually worked. We were able to get the image-generation side running locally on a laptop, even if it was slow, and that alone was a big win given how demanding these models are.

We are also proud that we got more out of the models and supporting functions than we initially expected. What started as an ambitious idea became a working pipeline that could both process real recorded video and preserve enough continuity to stitch the output back into motion.

Beyond the model work, we are proud that Model Studio became a usable experience rather than just a technical demo. We built the surrounding product layer needed to make the idea understandable and interactive.

What we learned

We learned that fashion AI in video is much harder than fashion AI in still images. Once motion is involved, consistency becomes one of the most important and difficult parts of the entire problem.

We also learned that building around AI is just as important as building the AI itself. The user experience, the backend pipeline, the frame extraction, the prompt handling, and the reconstruction process all matter if you want the final system to feel coherent.

Most importantly, we learned how to scope ambition without losing the idea. Real-time virtual try-on sounded exciting, but moving to a slower frame-by-frame pipeline let us build something real, testable, and impressive within our constraints.

What's next for Model Studio

The biggest change we want to make to Model Studio is to fulfill the original vision of live clothing replacement. We want a live video stream with a very small delay, where users can see clothing changes as they happen and use their voice (via ElevenLabs) to change the outfit during the stream. Shipping this as a Chrome extension would let users try on clothes while they shop.

Additionally, integrating with real e-commerce websites such as H&M would let users try on clothing that actually exists. Ideally, we would also return a link to the final item the user tried on, so they could buy it immediately.
