Creative.AI

Drop a static ad. Get a Health Score, a real diagnosis, a 14-day forecast, an AI rebuild, and a live coach you can actually argue with — before you spend a dollar serving it.

Inspiration

The mobile advertising loop is famously wasteful. You ship a creative, wait days for enough impressions to mean anything, watch the numbers decay, scramble to replace it, and start over. Every team we know lives this cycle. We wanted to answer one question: can you tell whether an ad will work from the visual decisions alone, before any spend touches it?

The Smadex Creative Intelligence dataset gave us a rare window into that mapping — over a thousand mobile creatives joined to hundreds of thousands of rows of daily performance data, across dozens of advertisers and several verticals. That's enough signal to learn what a good ad actually looks like, not just guess.

What it does

You drop in a screenshot. Within seconds, the app returns a Health Score from zero to a hundred and a recommended action — scale it, keep running it, pivot, or pause — backed by a structured diagnosis written by a fine-tuned vision-language model. The diagnosis names what's working, what's hurting performance, why the creative is at risk of fatigue, and the single change that would help most.

Alongside that, you get a fourteen-day forecast of how CTR and ROAS are likely to evolve, a color palette pulled from the actual top-performing ads in your vertical, and an AI-rebuilt version of your creative that follows the model's own brief instead of hallucinating "better" from scratch. And if you want to dig deeper, there's a live draw-and-chat coach. Circle any element on the ad and Maya, our senior-creative-director persona, tells you whether it's working or pushes back with a concrete fix. You can chat with her, argue with her, ask why.

How we built it

The core of Creative.AI is three personalized models that chain into one another, each handing its output to the next.

The first is a soft-voting tabular ensemble. It learns the mapping from a multi-dimensional "creative genome" — tabular metadata, early-life performance signals, visual rubric scores, and image embeddings — to a four-class outcome. We spent real time on the data audit before we trained anything: hunting down columns that leaked future outcomes, deduplicating creatives, and splitting on campaign so the model can never memorize a campaign in training and recognize it in test.
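The two ideas in that paragraph, splitting on campaign and soft voting, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our training code: the `campaign_id` field name, the four class labels (borrowed from the recommended actions), and the equal model weights are all hypothetical.

```python
import random

def split_by_campaign(rows, test_fraction=0.2, seed=0):
    """Split rows so every campaign lands entirely in train or entirely in test."""
    campaigns = sorted({row["campaign_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(campaigns)
    n_test = max(1, round(len(campaigns) * test_fraction))
    test_campaigns = set(campaigns[:n_test])
    train = [r for r in rows if r["campaign_id"] not in test_campaigns]
    test = [r for r in rows if r["campaign_id"] in test_campaigns]
    return train, test

def soft_vote(probabilities, weights=None):
    """Average per-class probabilities across models; return (label, averaged)."""
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    classes = list(probabilities[0])
    averaged = {
        c: sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged

# Hypothetical rows: 20 creatives spread over 5 campaigns.
rows = [{"creative_id": i, "campaign_id": i % 5} for i in range(20)]
train, test = split_by_campaign(rows)

# Hypothetical per-model class probabilities for one creative.
model_outputs = [
    {"scale": 0.6, "keep": 0.2, "pivot": 0.1, "pause": 0.1},
    {"scale": 0.2, "keep": 0.5, "pivot": 0.2, "pause": 0.1},
    {"scale": 0.5, "keep": 0.3, "pivot": 0.1, "pause": 0.1},
]
label, averaged = soft_vote(model_outputs)
```

Because the split is made over campaign IDs rather than rows, no campaign can appear on both sides, which is the property the audit was protecting.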

The second model is a personalized vision-language model. We took SmolVLM-Instruct and ran a true full fine-tune — every parameter trainable, no adapter, no LoRA shortcuts — adapting the entire network to creative analysis. The training loop is self-distillation: the model first learns from pseudo-labels generated by a stronger external teacher, and then enters a self-improvement phase where the same network plays both student and teacher — the teacher just sees a richer, few-shot prompt. Because both roles are the same model, they improve together, and the network learns the new task without forgetting what it already knew.

The third is the image rebuilder. Most "AI rebuild your creative" tools fail because they let an LLM hallucinate what "better" looks like. We don't. The supervision comes from the ensemble's own counterfactual brief — the model says exactly which levers would help, in what direction. A teacher image model renders the improved variant, the ensemble re-scores it, and only the pairs that genuinely improved survive into training. We then fine-tune Flux with a low-rank adapter, followed by a preference-learning pass weighted by the observed lift. The preference signal is our own model, not human raters.
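The "only the pairs that genuinely improved survive" step can be pictured as a simple filter. The sketch below is hypothetical: the `RebuildPair` type, its field names, the 0 to 100 score scale, and the choice to normalise weights so they sum to one are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class RebuildPair:
    original_id: str
    variant_id: str
    original_score: float  # ensemble score of the source creative
    variant_score: float   # ensemble score of the re-scored variant

    @property
    def lift(self) -> float:
        return self.variant_score - self.original_score

def select_training_pairs(pairs, min_lift=0.0):
    """Keep pairs whose variant beat the original; weight each by its share of total lift."""
    if min_lift < 0:
        raise ValueError("min_lift must be non-negative")
    kept = [p for p in pairs if p.lift > min_lift]
    total = sum(p.lift for p in kept)
    return [(p, p.lift / total) for p in kept]

# Hypothetical candidates: the second variant scored worse and is dropped.
candidates = [
    RebuildPair("a", "a_v1", 50.0, 70.0),
    RebuildPair("b", "b_v1", 60.0, 55.0),
    RebuildPair("c", "c_v1", 40.0, 50.0),
]
weighted = select_training_pairs(candidates)
```

The weights could then scale each pair's contribution in the preference-learning pass, so larger observed lifts count for more.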

Wrapping it all up is a frontend that ships with everything precomputed, so the gallery, stats, and explorer pages render even with the backend off. Live AI calls are routed through OpenRouter to fast multimodal endpoints. Maya, the live coach, doesn't just see the circled region. She sees the entire campaign context, including the predicted status, the health score, the structured weaknesses, and the palette. That grounding is what keeps her replies specific instead of generic.

Challenges we ran into

The hardest part wasn't the models — it was getting them to be honest. Early versions of our counterfactual engine reported absurdly large lifts because the cohorts they anchored to were too small to be meaningful. We rewrote it to require a minimum cohort size and to cap the lift it would ever claim, even if the math suggested otherwise.
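A minimal sketch of that guard, assuming lift is measured as relative CTR change against a baseline. The function name, the threshold of 30 creatives, and the 50% cap are illustrative values, not the ones we shipped.

```python
from statistics import mean

def estimate_lift(baseline_ctr, cohort_ctrs, min_cohort_size=30, max_lift=0.5):
    """Relative lift of a cohort over the baseline, or None when it cannot be trusted.

    Returns None if the cohort is smaller than min_cohort_size or the baseline
    is not positive. Otherwise the raw lift is capped at max_lift.
    """
    if len(cohort_ctrs) < min_cohort_size or baseline_ctr <= 0:
        return None
    raw_lift = mean(cohort_ctrs) / baseline_ctr - 1.0
    return min(raw_lift, max_lift)

# Five creatives are too few to anchor a claim, however good they look.
too_small = estimate_lift(0.01, [0.05] * 5)

# A large cohort with a 5x CTR still only reports the capped value.
capped = estimate_lift(0.01, [0.05] * 40)

# A modest, well-supported lift passes through unchanged.
modest = estimate_lift(0.02, [0.022] * 40)
```

Returning `None` rather than zero keeps "no trustworthy estimate" distinct from "no lift", which matters when the result is shown to a user.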

The fourteen-day forecast had a similar story: a per-sample regressor we trained didn't generalize cleanly, so instead of shipping a precise-looking-but-misleading prediction, we ship the average of real curves from similar creatives. Less impressive sounding, more truthful.
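The averaging approach amounts to a nearest-neighbour lookup followed by a per-day mean. In this sketch, the feature vectors, Euclidean distance as the similarity measure, and the value of `k` are assumptions. The writeup does not pin down how "similar" is defined.

```python
import math

def forecast_from_neighbors(query_features, library, k=3, horizon=14):
    """Average the observed daily curves of the k creatives closest to the query."""
    if not library:
        raise ValueError("library must contain at least one creative")
    ranked = sorted(
        library,
        key=lambda item: math.dist(query_features, item["features"]),
    )
    neighbors = ranked[:k]
    return [
        sum(n["curve"][day] for n in neighbors) / len(neighbors)
        for day in range(horizon)
    ]

# Hypothetical library with 3-day CTR curves (in percent) for brevity.
library = [
    {"features": [0.0, 1.0], "curve": [1.0, 2.0, 3.0]},
    {"features": [1.0, 0.0], "curve": [3.0, 4.0, 5.0]},
    {"features": [10.0, 10.0], "curve": [100.0, 100.0, 100.0]},
]
forecast = forecast_from_neighbors([0.0, 0.0], library, k=2, horizon=3)
```

Every value in the output is a mean of curves that were actually observed, so the forecast can never leave the range of real behaviour among the chosen neighbours.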

LLMs were a constant source of edge cases. They emit markdown fences, trailing commas, single quotes, unquoted keys, and truncated outputs, sometimes all at once. We ended up writing a multi-stage progressive-repair JSON parser that applies each successive repair only when the previous attempt still fails to parse. We also fought a sneaky React StrictMode bug where an in-flight image generation was being silently discarded by a cleanup function on the second mount.
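A progressive-repair parser can be sketched as an ordered list of repairs, with a parse attempt after each one. This is a simplified illustration rather than our parser: the repair functions, their order, and the regexes are assumptions, and the regex-based steps are deliberately naive (they can corrupt string values that contain apostrophes or key-like text).

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

def strip_fences(text):
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)]
    return text.strip()

def drop_trailing_commas(text):
    return re.sub(r",\s*([}\]])", r"\1", text)

def single_to_double_quotes(text):
    return text.replace("'", '"')  # naive: breaks apostrophes inside strings

def quote_bare_keys(text):
    return re.sub(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)', r'\1"\2"\3', text)

def close_truncated(text):
    """Close an unterminated string, then append any missing brackets."""
    stack, in_string, escaped = [], False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        text += '"'
    return text + "".join(reversed(stack))

REPAIRS = [strip_fences, drop_trailing_commas, single_to_double_quotes,
           quote_bare_keys, close_truncated]

def parse_with_repairs(raw):
    """Try the raw text first, then apply repairs one at a time until it parses."""
    candidate = raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    for repair in REPAIRS:
        candidate = repair(candidate)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("could not repair model output into valid JSON")

messy = FENCE + "json\n{'score': 72, tags: ['bold', 'clean',],}\n" + FENCE
parsed = parse_with_repairs(messy)
```

Well-formed output is returned on the first attempt without being modified, and the riskier repairs only run when the cheaper ones were not enough.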

And there were UX details that took longer than we expected. The lasso tool in the live coach kept auto-closing into a weird straight segment. The chat with Maya started out generic before we wired the campaign context through. The AI improver was a popup modal at first, before we made it a full-screen transition. Every one of those was a small thing that mattered a lot.

Accomplishments we're proud of

The pipeline is leakage-free from raw CSVs to trained ensemble, and reproducible end-to-end on a laptop in a few minutes. The three models chain coherently — the ensemble's brief drives the image edit, the analysis grounds the live coach, the palette gates the AI editor. Nothing is bolted on.

We're especially proud of the live coach. Most demo "AI assistants" feel like a wrapper around a generic chat model. Maya cites the predicted status, the health score, the specific weakness, and the palette in her replies. She remembers what she has already said and refuses to repeat herself. She rotates through review lenses (eye-flow, contrast, fatigue, novelty) so back-to-back tips never sound the same. It feels like talking to someone who has read the brief.

What we learned

We learned that synthetic datasets are great teachers and dangerous benchmarks — they let you iterate on architecture for free, but you can't trust the headline numbers as production performance. We learned that self-distillation actually works when you don't have more labeled data to throw at a model. And we learned that the most important decision in a "preference learning" pipeline is not the algorithm but the source of the preference signal. Using our own ensemble's lift score as the preference, instead of human raters or aesthetic proxies, was the move that made the AI rebuild meaningful.

On the engineering side, we learned the same lesson everyone learns: the model is a small fraction of the work. Data audits, leakage hunts, JSON repair, UX details, and honest caveats are most of what makes a system feel real instead of like a toy.

What's next for Creative.AI

We want stronger cold-start visual priors, so the model is sharper on ads with no early-life data. We'd love to validate everything against a real-impression dataset from a partner. The local SmolVLM and Flux adapters are both already designed as drop-in replacements for the runtime AI calls, so on-device inference is mostly a packaging job. And the endgame is a closed loop where the AI rebuild gets served as a real variant, the observed lift feeds back as fresh preference data, and the model keeps getting better the more it's used.