Inspiration

Creative teams guess which ads work; we'd much prefer fast, explainable signals across video, image, and audio to pick winners before spend.

What it does

Extracts features from images, videos, and audio.

Merges video+audio per video; outputs JSON with interpret-able metrics (shots, motion, text/CTA timing, product reveal, end-card, thumbnails, faces, color, clutter, platform fit, loop-readiness, audio speech/music).

How we built it

We created a modular pipeline:

image/: visual semantics, attention maps, OCR, optional LLM captioning. videos/: per-metric modules with parallel per-video execution and dependency waves. Sub-tasks are parallelize-able audio/: separation, ASR, speech/music/text analysis.

Orchestrator that runs image, and in parallel runs video and audio, then merges by filename stem.

Challenges we ran into

  • Hitting sub‑second per-metric latency without GPUs. Aggregation can almost entirely be parallelized so the result will only take the longest of each metric, summed if dependencies exist.
  • Avoiding redundant computation across steps (saliency/gray/MSER reuse).

Accomplishments that we're proud of

End‑to‑end, explainable feature set with per‑metric breakdowns, parallel per‑video tasking and A/V concurrency, tweakable speed/quality knobs and weights, simple JSON outputs ready for ranking/ML.

What we learned

  • Lightweight CV proxies (saliency/MSER/flow) are strong baselines.
  • Caching + downsampling matter more than we expected.
  • Clear dependency waves simplify parallelism and debugging.

What's next for AdJacent

We still need to add lightweight logo/face detectors and text OCR keywords for CTAs. For full functionality, we should train a learned ranker on historical performance. We also need to add web UI for previews, breakdowns, and A/B comparisons.

Integrating this with AppLovin would be the ultimate step!

Share this project:

Updates