Inspiration

I'm a startup founder with three B2C brands, and if you're in that world, you know the equation: you need content to sell your product. The videos you create capture attention, and you convert a percentage of that attention into sales. The more content you produce, the more attention you get, the more conversions you make.

So I became a content creator ~ not by choice, but because the businesses demanded it. And I quickly discovered how brutal the process actually is. I spent hundreds of hours learning CapCut, Adobe Premiere Pro, Canva. Masking, timelines, keyframing, color grading. I got decent at it. But the ROI never showed up, because content is a volume game and I was spending three to four hours editing a single clip that would get 200 views and die in 24 hours.

If you want to post three times a day, that's a full-time job. And if you already have a full-time job ~ running your brand and building your product ~ it's just not feasible. The only other option is hiring a production team, which costs thousands per month and adds a whole layer of management overhead you don't have bandwidth for.

I kept thinking: there has to be another way.

I code and build using Cursor, Windsurf, and Claude Code. I love the UX. This abstraction layer has changed software engineering, and I wanted to bring the same experience to content creation. Instead of an IDE, I envisioned a CDE (Creative Development Environment) ~ where instead of generating markdown, it generates images; where instead of diffs, it edits videos; where instead of bundling packages, it compiles videos.

Personally, I admire editors and appreciate their art, just as I appreciate developers and their code as an art. But I have no ambitions of being an editor or a coder. What I strive for is the ability to bring my visions to life ~ whether it's a SaaS product, a piece of writing, or a piece of video content ~ to me, it's all a form of creative expression.

I built Supanova to eliminate all the middlemen processes, tools, and skills that stand between a creator's intent and a finished video.

What it does

Supanova is an omni-modal agent specialised in video production. You describe what you want, and an autonomous agent creates it for you.

The agent orchestrates 29 tools across the entire production pipeline.

It can:

  1. create subjects and characters, generate physical DNA for consistent appearances
  2. build model sheets, generate reference images, edit and regenerate images
  3. create video plans, manage scenes, generate scene images
  4. animate scenes into video clips, edit animations
  5. produce video specs, edit videos through conversation
  6. render final MP4s, check delivery status
  7. fetch web content for research
  8. load specialised skills on demand
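
To make that concrete, here's a minimal sketch of how one of these tools might be declared for Gemini function calling ~ the tool name and parameter shape are illustrative, not Supanova's actual schema:

```typescript
// Sketch of declaring one tool for Gemini function calling.
// The tool name and parameters are illustrative, not Supanova's real schema.
import { GoogleGenAI, Type } from "@google/genai";

const createSubjectTool = {
  name: "create_subject",
  description:
    "Create a character with a persistent physical description (its DNA) " +
    "so it can be rendered consistently across scenes.",
  parameters: {
    type: Type.OBJECT,
    properties: {
      name: { type: Type.STRING, description: "The subject's display name" },
      physicalDna: {
        type: Type.STRING,
        description: "Canonical physical description reused in every image prompt",
      },
    },
    required: ["name", "physicalDna"],
  },
};

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Create an AI influencer named Sofia",
  config: { tools: [{ functionDeclarations: [createSubjectTool] }] },
});
console.log(response.functionCalls); // the tool invocations the agent requested
```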

The core idea is that you can start anywhere on the spectrum ~ from "I have a complete creative brief with assets and shot lists" all the way down to "I just have a vague idea" ~ and end up with a rendered video. The agent figures out what's missing and fills in the gaps.

Here's what happens when you chat with Supanova:

  1. You describe your vision ~ something like "Create a 30-second lifestyle vlog of an AI influencer named Sofia exploring Paris."
  2. The agent creates Sofia as a subject with consistent physical appearance ~ her DNA ~ so she looks the same across every scene.
  3. It generates a reference image and model sheet, locking in her visual identity.
  4. It designs a multi-scene video plan with narrative structure, pacing, and shot descriptions.
  5. Gemini generates the scene images ~ Sofia at the Eiffel Tower, at a café, by the Seine ~ all visually consistent.
  6. Veo animates those images into video clips with camera movements and natural motion.
  7. The system assembles everything ~ transitions, text overlays, timing ~ into a VideoSpec.

You preview it live in your browser. Make edits through conversation: "make the opening punchier" or "slow down the middle section."

When you're happy, one click renders your final MP4. No timeline. No editing skills. Just conversation.
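
For a sense of what that assembled spec might contain, here's a hypothetical TypeScript shape ~ the field names are my illustration, not Supanova's internal format:

```typescript
// Hypothetical shape of a VideoSpec; all field names are illustrative.
interface SceneClip {
  id: string;
  imageUrl: string;       // the generated still the clip was animated from
  videoUrl: string;       // the Veo-animated clip
  durationInFrames: number;
  textOverlay?: { text: string; position: "top" | "center" | "bottom" };
}

interface VideoSpec {
  title: string;
  fps: number;
  width: number;
  height: number;
  scenes: SceneClip[];
  transitions: Array<{ afterSceneId: string; type: "cut" | "fade" | "slide" }>;
}

// A conversational edit like "make the opening punchier" becomes a
// structured patch against this spec, e.g. trimming scenes[0].
```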

How we built it

Core design principles:

  1. The platform should feel alive ~ your creations should feel like they're coming to life as you work. I wanted to give users the experience of bringing their visions to life.
  2. Agent-human synchronicity ~ both agent and human should have total and equal autonomy, and their interactions should be in sync. If the user wants to take creative control, they can do so. If they want the agent to do all the work, this should be possible as well.
  3. Omni-modality ~ multi-modality + multi-provider + multi-skilled. Creators should have total freedom to express themselves. They should not be locked into any set process or provider.

Building on these principles:

The technical architecture is a multi-model Gemini system. The core insight was that different parts of the video production pipeline need different capabilities ~ so instead of forcing one model to do everything, we orchestrate five Gemini models as specialists. Some examples:

  1. Gemini 3 Flash/Pro powers the orchestrator agent ~ the brain that understands user intent, plans multi-step workflows, and coordinates the entire production pipeline.
  2. Gemini 2.5 Flash + Imagen 4 handle image generation ~ scene images, reference photos, and model sheets for character consistency.
  3. Veo 3.1 transforms static images into animated video clips with camera movements and natural motion.
  4. Gemini 2.0 Flash serves as the translator ~ converting conceptual creative direction into technical rendering instructions.
  5. Gemini 2.0/2.5 Flash provide vision analysis for multimodal understanding and visual consistency.
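
In code, that orchestrator-specialist split can be as simple as a role-to-model registry ~ a sketch, with illustrative model IDs and a hypothetical helper:

```typescript
// Role-to-model routing for the specialist pipeline. Model IDs are
// illustrative; check Google's current model catalogue for exact names.
type Role = "orchestrator" | "image" | "animation" | "translator" | "vision";

const MODELS: Record<Role, string> = {
  orchestrator: "gemini-3-pro-preview",   // intent, planning, coordination
  image: "imagen-4.0-generate-001",       // scene stills, reference images
  animation: "veo-3.1-generate-preview",  // image-to-video clips
  translator: "gemini-2.0-flash",         // creative direction -> render instructions
  vision: "gemini-2.5-flash",             // visual-consistency checks
};

// e.g. the orchestrator plans, then delegates scene stills to modelFor("image")
const modelFor = (role: Role): string => MODELS[role];
```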

The frontend is Next.js 15 with React 19. The backend follows a service-oriented architecture with PostgreSQL. Video rendering uses Remotion with serverless export via AWS Lambda.
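
The serverless export step looks roughly like this with Remotion's Lambda client ~ the function name, serve URL, and composition ID below are placeholders for deployment-specific values:

```typescript
import { renderMediaOnLambda, getRenderProgress } from "@remotion/lambda/client";

// Sketch of the serverless export step; all identifiers are placeholders.
const videoSpec = { scenes: [] }; // see the VideoSpec sketch above

const { renderId, bucketName } = await renderMediaOnLambda({
  region: "us-east-1",
  functionName: "remotion-render",               // deployed Remotion Lambda function
  serveUrl: "https://example.com/remotion-site", // bundled Remotion project
  composition: "SupanovaVideo",                  // <Composition id="..."> in the project
  inputProps: { spec: videoSpec },               // the spec drives what gets rendered
  codec: "h264",
});

// Poll until the MP4 lands in S3.
const progress = await getRenderProgress({
  renderId,
  bucketName,
  functionName: "remotion-render",
  region: "us-east-1",
});
console.log(progress.done ? progress.outputFile : `${progress.overallProgress * 100}%`);
```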

I designed the system around three key patterns:

  1. an orchestrator-specialist model where the agent delegates to purpose-built Gemini instances
  2. a two-stage generation pipeline where images are created first, reviewed, then animated ~ giving creative control at each stage and managing costs (sketched below)
  3. loading skills at key points in the creative process ~ injecting specialised context to improve model outputs
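
Pattern 2 is worth sketching: stills are cheap to regenerate, Veo clips are not, so the expensive animation call only happens after a still is approved. The helper names are hypothetical stand-ins for Supanova's actual tools:

```typescript
// Two-stage generation: iterate at image cost, animate only once approved.
// All three helpers are hypothetical stand-ins for Supanova's tools.
declare function generateSceneImage(prompt: string): Promise<string>; // -> image URL
declare function reviewImage(imageUrl: string): Promise<boolean>;     // human or agent check
declare function animateWithVeo(imageUrl: string): Promise<string>;   // -> clip URL

async function produceScene(prompt: string): Promise<string> {
  // Stage 1: regenerate the still until it passes review.
  let image = await generateSceneImage(prompt);
  while (!(await reviewImage(image))) {
    image = await generateSceneImage(prompt);
  }
  // Stage 2: animate only the approved still.
  return animateWithVeo(image);
}
```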

Challenges we ran into

1) Character consistency ~ Early attempts had AI characters looking like completely different people in every scene. I solved this with the Subject DNA system ~ extracting and storing physical characteristics, then injecting them into every generation prompt. Model sheets with multiple angles made this reliable.
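
The idea in miniature ~ a sketch with an invented schema, not the production one:

```typescript
// Sketch of Subject DNA injection; the fields are illustrative.
// The point: one canonical description, reused verbatim in every prompt.
interface SubjectDna {
  name: string;
  face: string;
  hair: string;
  wardrobe: string;
}

const sofia: SubjectDna = {
  name: "Sofia",
  face: "oval face, light freckles, hazel eyes",
  hair: "shoulder-length auburn hair, centre part",
  wardrobe: "cream trench coat, white sneakers",
};

// Prepending the same DNA to every scene prompt is what keeps the
// character recognisable across generations.
function scenePrompt(dna: SubjectDna, scene: string): string {
  return `${dna.name}: ${dna.face}; ${dna.hair}; ${dna.wardrobe}.\nScene: ${scene}`;
}

scenePrompt(sofia, "golden-hour medium close-up at the Eiffel Tower");
```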

2) Coordinating 29 tools ~ Understanding what tools needed to be created, how they should interact, and orchestrating them all in a symphony ~ that was difficult. Context management alone was a challenge. At one point, I sent "hi" in the chat and the response cost 17,000 tokens. Managing all these tools, all these skills, being able to orchestrate them coherently required constant iteration.

3) Testing agents ~ You can't really unit test an agent the way you test normal code. It requires a lot of manual testing, putting them in different scenarios, seeing how they respond. That was incredibly time-consuming.

4) First time with Remotion ~ I was building in the dark for a lot of it. I didn't know if what I was building was actually going to work until I tried it. There was a lot of faith involved.

5) Discovering the skills architecture ~ Once the core tools were built, I focused on improving outputs. That's when I discovered the need for a skills system. Loading particular expertise at certain points in the process. But implementing that meant refactoring the architecture I'd already built.
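
Roughly, the shape of it ~ skill names and contents invented for illustration:

```typescript
// Sketch of on-demand skill loading; skill names and contents are invented.
// Each skill is specialised instruction text injected only when the current
// step needs it, keeping the base system prompt (and token bill) lean.
const SKILLS: Record<string, string> = {
  "pacing": "Cutting rhythm, hook placement, retention-curve guidelines...",
  "shot-design": "Lens choices, camera moves, framing vocabulary for Veo prompts...",
  "colour-grading": "Grade vocabulary: teal-orange, film emulation, contrast...",
};

// Exposed to the agent as a tool; the returned text arrives as a tool
// result, i.e. fresh context for the next generation step.
function loadSkill(name: string): string {
  return SKILLS[name] ?? `Unknown skill: ${name}`;
}
```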

6) Managing costs. Generating videos isn't cheap ~ roughly 4p per second. I had to find a balance between rigorous testing and credit management, which meant being strategic about when and how I tested.

7) Finding composable patterns. Thinking about the tooling architecture, the agent-human interface, and trying to build patterns that could be repeatable rather than one-off solutions. That took time to get right.

8) Generative UI state management. The agent generates UI on the fly, but users also interact with that generated UI. Managing the bidirectional state ~ agent creates something, user modifies it, agent needs to understand the modification ~ was a subtle but significant challenge.
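
One way to keep both sides in sync is to funnel every change through a single mutation log the agent can re-read ~ a hypothetical sketch, not Supanova's actual implementation:

```typescript
// Hypothetical sketch: route every change, agent- or user-originated,
// through one mutation log the agent re-reads before its next turn.
type Mutation = {
  source: "agent" | "user";
  sceneId: string;
  change: { kind: "duration"; frames: number } | { kind: "overlay"; text: string };
};

const log: Mutation[] = [];

function apply(m: Mutation): void {
  log.push(m); // record the change regardless of who made it
  // ...then update the VideoSpec and the rendered UI...
}

// Summarised into the agent's context before its next turn, so it "sees"
// what the human changed in the UI it generated.
const userEditsSince = (idx: number): Mutation[] =>
  log.slice(idx).filter((m) => m.source === "user");
```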

Accomplishments that we're proud of

1) It works end-to-end. Type a prompt, get a rendered video. The entire pipeline ~ from "create a cinematic sci-fi trailer" to downloading an MP4 ~ functions autonomously. That might sound obvious, but getting all the pieces to connect reliably was the hardest part.

2) Zero non-Gemini models in production. I went all-in on the Google AI stack. Orchestration, image generation, video generation, vision analysis, translation ~ all Gemini family. This wasn't a constraint ~ it was a choice. The models work together seamlessly when you design the system around their strengths. 29 tools, one conversation: the agent can create subjects, generate reference images, build video plans, produce scenes, animate clips, edit specifications, and render finals ~ all through natural language. No mode switching, no separate interfaces.

3) I built this solo. No team, no co-founders, just me, Claude Code and a vision. The entire architecture, every tool, every prompt iteration, every UI component ~ shipped in under a month.

4) It's not just localhost:3000 ~ it's live. Making something work in your local environment is one thing. Making it work in production is another. Testing, security considerations, attack surfaces, infrastructure ~ this isn't a demo sitting on my machine. It's a production-ready application, live at iamsupernova.com.

5) The decision-making ~ Every build is a series of decisions, and when you're working with parts of the tech stack you don't know, trying to do something that hasn't been done before, you're navigating in the dark. I set out with product principles and technical principles, and I'm proud that I stayed true to both throughout. Those principles became guiding mechanisms ~ when I wasn't sure which way to go, I went back to the principles. And I think that discipline shows in the final product.

6) The CDE vision is real. I set out to build a Creative Development Environment that felt like Cursor or Windsurf but for video. Looking at what exists now, I think that vision is actually taking shape. It's not finished, but it's real.

What we learned

1) A renewed appreciation for the fundamentals. When I did CS50, I learned about C, binaries, zeros and ones, how images are encoded and decoded, compression, storage. You learn that a video is just one layer ~ underneath it's megabytes, and underneath that it's just zeros and ones structured in particular ways. I thought that was fascinating back then, but actually applying it now ~ manipulating video at the system level ~ brought that appreciation to a whole new place.

2) Respect for agentic IDEs. Building Supanova gave me a deep appreciation for Windsurf, Claude Code, and these agentic development environments. The amount of thought that goes into the agent is immense. And honestly, what they're doing is in some ways easier ~ coding tools are primitive and well-established. Grep searches, ls commands, cd to navigate. There's decades of history there. I was building tools for a domain without that foundation.

3) System prompts are everything when you can't fine-tune. The agent is non-deterministic ~ that's the beauty of it, it can adapt to many different situations. But when you want it to specialise in a particular domain without fine-tuning, you have to achieve that through system prompts. That's a craft in itself.

4) Feedback loops for error recovery. There were so many instances where tools I created would fail. But I learned that if you include the reasoning for tool calls in the response, the agent has that data and can figure out how to proceed. So I stopped thinking "how do I prevent failures" and started thinking "how do I help the agent recover from failures." That shift was important.
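
Concretely, the tool-result envelope might look like this ~ the shape is illustrative:

```typescript
// Illustrative tool-result envelope: failures carry reasoning back to the
// agent, so it can plan a recovery instead of seeing an opaque error.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string; reasoning: string; suggestion?: string };

// A hypothetical failure from the animation tool:
const failedAnimation: ToolResult<never> = {
  ok: false,
  error: "Veo animation timed out after 120s",
  reasoning: "The source image was 4K; oversized inputs animate slowly.",
  suggestion: "Downscale the scene image to 1080p and retry.",
};
```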

5) Context engineering is just as important as prompt engineering. Especially if you want your agent to specialise in certain tasks. What you put in the context window ~ and what you leave out ~ shapes everything the agent does.

6) The decision tree from hell is navigable. When I started, I was overwhelmed by the sheer number of paths a user could take. They could have no assets, some assets, want to edit one video, produce something end-to-end ~ they could start literally anywhere. In typical UX, the paths are determined: landing page, button, signup, onboarding, dashboard, feature. Everything is mapped out. But with agentic design and generative UI, there are infinite entry points and infinite paths. I genuinely wondered if it was possible to handle all these scenarios gracefully. What I learned is that it is possible ~ you just need the right architecture and the right principles to guide the agent through the chaos.

What's next for Supanova

I consider Supanova's current state a prototype ~ proof that the idea works. The end goal remains the same: going from creative intent to production-ready videos. But along the way, I've discovered so many ways to improve the product, so many techniques to iterate on.

1) Visual consistency is solved. Voice consistency is next. Characters now look the same across every scene. The next frontier is making them sound the same too ~ consistent voices for AI characters, or cloning your own voice for narration. I know how to solve this, it's just a matter of building it.

2) No-assets is solved. Bring-your-own-assets is next. Right now, the product handles the case where you have nothing ~ you describe what you want, it generates everything. The next step is handling the case where you already have footage, photos, audio. Upload your assets, let the agent weave them together with generated content.

3) The editing experience is powerful. It's about to get more powerful. I've discovered techniques along this journey that I haven't implemented yet. The current editing capabilities work, but there's another level I can take it to. I'm not going to say what it is, but it's coming.
