Inspiration
We started with a simple question: what if ad platforms didn’t just track outcomes, but actually understood the creative itself? AppLovin’s Axon optimises in real time, yet the “what’s inside the ad” signal is often shallow—logos, a few labels, maybe some OCR. We wanted to turn every image and video into a rich, consistent feature vector that a recommender can trust: emotion, layout quality, motion/face dynamics, brand cues, text hierarchy, and alignment between visuals and copy.
What it does
AdMind ingests an image or video and outputs a single, schema-stable feature record plus an executive “Insight”:
Core signals: dominant/final emotion, faces per second, object mix, color palette, layout balance, OCR text length/excerpts, brand cues from text, CLIP visual–text alignment, simple motion proxies (video).
Safety & quality: NSFW safety summary and a heuristic Creative Score combining balance, faces, text presence, object richness, alignment, branding, and safety.
Insight layer: a compact JSON summary with emotion, weaknesses, and 3–5 actionable suggestions (generated by Gemini, with a deterministic fallback).
Interfaces: a web UI for single creatives and a batch runner that processes full archives into CSV/JSONL in minutes.
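To make the Creative Score idea concrete, here is a minimal sketch of a weighted heuristic over the signals listed above. The signal names, the weights, and the 0–100 scale are illustrative assumptions; this write-up does not give the actual formula.

```python
from typing import Dict

# Hypothetical weights (assumption): one entry per signal named in the
# Creative Score description. The real AdMind weights are not published here.
WEIGHTS: Dict[str, float] = {
    "layout_balance": 0.20,
    "has_faces": 0.15,
    "has_text": 0.10,
    "object_richness": 0.15,
    "clip_alignment": 0.20,
    "has_brand_cue": 0.10,
    "safety": 0.10,
}


def clamp01(value: float) -> float:
    """Clamp a signal into the [0, 1] range."""
    return max(0.0, min(1.0, value))


def creative_score(signals: Dict[str, float]) -> float:
    """Weighted average of [0, 1] signals, scaled to 0-100.

    Missing signals count as 0, so the score stays defined for every asset.
    """
    total_weight = sum(WEIGHTS.values())
    weighted = sum(
        weight * clamp01(float(signals.get(name, 0.0)))
        for name, weight in WEIGHTS.items()
    )
    return round(100.0 * weighted / total_weight, 1)
```

Treating missing signals as 0 keeps the output schema stable, at the cost of penalising assets where a detector did not run.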
How we built it
ML engine (Python/Flask): OpenCV for video sampling, YOLOv8 for objects/faces, FER for emotions, EasyOCR for text, and SentenceTransformers (CLIP) for visual–text alignment, plus saliency heatmaps and brand heuristics derived from OCR text.
Video emotions: 1 FPS sampling → face crops → per-second FER → exponential moving average → final emotion distribution and faces/sec.
Backend (Node/Express): Media upload, ML proxying, Gemini integration with model fallbacks, safety settings, health checks, and strict JSON parsing.
Frontend (React): “AdMind Analyzer” with previews, color palettes, detected objects/brands, heatmaps, timelines, and parsed Insights.
Batch tooling: Multithreaded batch_run.py with retries, timeouts, and timing summaries. Outputs features.csv and features.jsonl.
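The video-emotion step described above (per-second FER, then an exponential moving average) can be sketched as follows. This assumes each sampled second yields a probability dict over the seven FER2013 emotion labels, or None when no face is found. The function names, the alpha value, and the choice to skip face-less seconds are assumptions for illustration, not AdMind's actual code.

```python
from typing import Dict, List, Optional

# The seven emotion labels of the FER2013 dataset.
EMOTIONS = ("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral")


def smooth_emotions(
    per_second: List[Optional[Dict[str, float]]], alpha: float = 0.3
) -> Dict[str, float]:
    """Fold per-second emotion probabilities into one EMA-smoothed distribution.

    Seconds without a detected face (None) leave the running state unchanged.
    Returns an empty dict if no second contained a face.
    """
    state: Optional[Dict[str, float]] = None
    for probs in per_second:
        if probs is None:
            continue
        if state is None:
            # Seed the average with the first observed distribution.
            state = {e: probs.get(e, 0.0) for e in EMOTIONS}
        else:
            state = {
                e: alpha * probs.get(e, 0.0) + (1.0 - alpha) * state[e]
                for e in EMOTIONS
            }
    if state is None:
        return {}
    total = sum(state.values()) or 1.0
    return {e: v / total for e, v in state.items()}


def faces_per_second(face_counts: List[int]) -> float:
    """Average number of detected faces across sampled seconds."""
    return sum(face_counts) / len(face_counts) if face_counts else 0.0
```

A higher alpha reacts faster to the latest second and a lower alpha smooths more. The dominant emotion is then the key with the largest value in the returned distribution.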
Challenges we ran into
Heterogeneous media: Inconsistent codecs, FPS, and aspect ratios required resilient decoding and capped sampling.
Noisy OCR: Stylized fonts and overlays demanded post-processing and token budget management for LLM prompts.
Temporal noise in emotions: Per-frame FER is jittery; EMA smoothing and face-crop prioritization stabilized results.
API/version drift: Different Gemini SDKs and safety flags required layered fallbacks to guarantee an insight for every asset.
Feature orthogonality: We iterated to reduce correlation between “text density,” “alignment,” and “brand cues.”
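As an illustration of the capped sampling mentioned under heterogeneous media, the sketch below picks roughly one frame per second and caps the total. It assumes the frame count and FPS come from the decoder (for example OpenCV's `CAP_PROP_FRAME_COUNT` and `CAP_PROP_FPS`, which can report 0 for some files). The cap of 60 and the 30 FPS fallback are assumed values, not the ones we shipped.

```python
import math
from typing import List, Optional


def sample_frame_indices(
    frame_count: int,
    fps: Optional[float],
    max_samples: int = 60,
    fallback_fps: float = 30.0,
) -> List[int]:
    """Return frame indices at about one per second, capped at max_samples."""
    if frame_count <= 0 or max_samples <= 0:
        return []
    # Some containers report 0, NaN, or nothing for FPS; fall back to a default.
    if not fps or not math.isfinite(fps) or fps <= 0:
        fps = fallback_fps
    step = max(1, round(fps))
    indices = list(range(0, frame_count, step))
    if len(indices) > max_samples:
        # Keep max_samples picks spread evenly across the whole video.
        stride = len(indices) / max_samples
        indices = [indices[int(i * stride)] for i in range(max_samples)]
    return indices
```

Spreading the capped picks across the full duration, rather than truncating, keeps late-video content such as end cards in the sample.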
Accomplishments that we're proud of
A schema-stable feature set that’s immediately usable in ranking models.
Under-5-minute batch processing on the provided dataset with parallelism and retries.
Deterministic insight fallback, so every creative yields an executive summary even if the LLM is unavailable.
A clean, usable UI that makes model outputs interpretable to marketers and engineers alike.
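The parallelism-with-retries pattern behind the batch runner might look like the sketch below. It is not the contents of batch_run.py: the function names, worker count, and retry policy are assumptions, and per-request timeouts are left to the `fn` callable that analyses one asset.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple


def run_with_retries(
    fn: Callable[[Any], Any], item: Any, retries: int = 2, backoff_s: float = 0.0
) -> Any:
    """Call fn(item), retrying up to `retries` extra times on any exception."""
    for attempt in range(retries + 1):
        try:
            return fn(item)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(backoff_s * (attempt + 1))  # linear backoff


def run_batch(
    items: Iterable[Hashable],
    fn: Callable[[Any], Any],
    workers: int = 8,
    retries: int = 2,
) -> Tuple[Dict[Any, Any], Dict[Any, str], float]:
    """Process items in a thread pool; return (results, failures, seconds)."""
    results: Dict[Any, Any] = {}
    failures: Dict[Any, str] = {}
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(run_with_retries, fn, item, retries): item
            for item in items
        }
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                failures[item] = repr(exc)
    return results, failures, time.perf_counter() - start
```

Threads suit this workload when each call mostly waits on an HTTP request to the ML service. Collecting failures separately lets the batch finish and report which assets need a rerun.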
What we learned
Thoughtful feature design beats model size for downstream performance.
Temporal aggregation (EMAs, per-second summaries) is essential for robust video understanding.
Strict JSON prompts and prompt compaction prevent LLM flakiness and keep UI parsers happy.
Operational details—health checks, timeouts, safety settings—are as important as the models.
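The strict-JSON and deterministic-fallback lessons can be illustrated with a small parser sketch. The schema keys mirror the Insight fields described earlier (emotion, weaknesses, suggestions). The feature names, thresholds, and fallback wording are hypothetical and stand in for whatever rules the real fallback uses.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical generic advice used to pad the fallback to three suggestions.
GENERIC_SUGGESTIONS = [
    "Test a variant with a clearer call to action.",
    "Tighten the on-screen copy to one main message.",
    "Make the brand visible in the opening frame.",
]


def fallback_insight(features: Dict[str, Any]) -> Dict[str, Any]:
    """Rule-based insight built only from extracted features (no LLM)."""
    weaknesses, suggestions = [], []
    if features.get("ocr_text_length", 0) == 0:
        weaknesses.append("No readable text detected.")
        suggestions.append("Add a short headline or call to action.")
    if features.get("clip_alignment", 1.0) < 0.2:  # assumed threshold
        weaknesses.append("Visuals and copy are weakly aligned.")
        suggestions.append("Match the imagery to the written message.")
    for generic in GENERIC_SUGGESTIONS:
        if len(suggestions) >= 3:
            break
        suggestions.append(generic)
    return {
        "emotion": features.get("dominant_emotion", "neutral"),
        "weaknesses": weaknesses,
        "suggestions": suggestions,
        "source": "fallback",
    }


def parse_insight(raw_text: Optional[str], features: Dict[str, Any]) -> Dict[str, Any]:
    """Validate LLM output strictly; use the fallback on any deviation."""
    if not isinstance(raw_text, str):
        return fallback_insight(features)
    # Tolerate code fences or chatter around the JSON object.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return fallback_insight(features)
    try:
        data = json.loads(raw_text[start : end + 1])
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return fallback_insight(features)
    suggestions = data.get("suggestions") if isinstance(data, dict) else None
    valid = (
        isinstance(data, dict)
        and isinstance(data.get("emotion"), str)
        and isinstance(data.get("weaknesses"), list)
        and isinstance(suggestions, list)
        and 3 <= len(suggestions) <= 5
        and all(isinstance(s, str) for s in suggestions)
    )
    if not valid:
        return fallback_insight(features)
    return {
        "emotion": data["emotion"],
        "weaknesses": data["weaknesses"],
        "suggestions": suggestions,
        "source": "llm",
    }
```

Because both paths return the same keys, the UI parser never needs a special case for LLM outages.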
What's next for AdMind
Audio features: speech detection, music energy, tempo, and prosody for first-second attention prediction.
CTA & layout semantics: button/price/tagline detectors; text hierarchy scoring.
Logo recognition & brand embeddings: learned brand vectors for similarity search and compliance.
Motion/scene dynamics: cut detection, motion intensity curves, and “hook” quality in the first second.
Online learning hooks: log features with outcomes to learn causal weights and close the creative feedback loop.
Built With
- express.js
- gemini
- keras
- node.js
- python
- react
- tensorflow

