AppLovin Query Planner - Rollup + DuckDB Query Engine

🏆 1,600× Faster than Baseline — 39ms vs 62 seconds on 5 example queries

Inspiration

Ad networks process massive volumes of event data. AppLovin needed instant query execution on 245M rows of ad events (~20GB). Traditional full-scan approaches were too slow; we needed a smarter architecture that could handle arbitrary queries without sacrificing speed.

What it does

A two-tier query engine that executes ad-hoc analytics queries on massive event datasets in <100ms:

  • 11 pre-aggregated rollup tables (Arrow IPC, ~340MB) handle (hopefully!) 80–90% of queries in <10ms via direct lookup
  • Sorted DuckDB fallback (~2.5GB) handles complex multi-dimensional queries in <400ms
  • Smart query router analyzes query patterns and routes to the optimal data structure
  • Supports 20+ concurrent queries with <200ms total execution time
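The routing decision described above can be sketched in a few lines of plain Python. This is an illustrative model only: the rollup names, dimension sets, and the `route` helper are assumptions, not the project's actual API.

```python
# Hypothetical rollup registry: name -> set of dimensions it pre-aggregates.
ROLLUPS = {
    "day_country": {"day", "country"},
    "day_advertiser_type": {"day", "advertiser", "type"},
}

def route(query: dict) -> str:
    """Pick the narrowest rollup covering all dimensions the query
    touches; otherwise fall back to the sorted DuckDB table."""
    needed = set(query.get("group_by", [])) | set(query.get("filters", {}))
    candidates = [name for name, dims in ROLLUPS.items() if needed <= dims]
    if candidates:
        # Prefer the covering rollup with the fewest extra dimensions.
        return min(candidates, key=lambda n: len(ROLLUPS[n]))
    return "duckdb_fallback"

print(route({"group_by": ["day"], "filters": {"country": "US"}}))   # day_country
print(route({"group_by": ["publisher", "hour"], "filters": {}}))    # duckdb_fallback
```

The "falls back safely" property comes for free: any query whose dimensions no rollup covers lands on the fallback path.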

How we built it

Phase 1: Rollup Design

  • Analyzed baseline query patterns to identify high-impact dimensions (day, country, advertiser, publisher, type)
  • Pre-aggregated data into 11 rollup tables covering 80–90% of query patterns
  • Each rollup stores pre-computed aggregates: sum, count, avg, min, max
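Conceptually, building one rollup looks like the stdlib-only sketch below (the real build uses Polars over 245M rows; the column names and toy data are illustrative). One subtlety worth showing: avg is not stored directly, because averages are not mergeable; keeping sum and count makes every stat re-aggregable later.

```python
from collections import defaultdict

# Toy event rows standing in for the 245M-row dataset.
events = [
    {"day": "2024-01-01", "country": "US", "revenue": 2.0},
    {"day": "2024-01-01", "country": "US", "revenue": 4.0},
    {"day": "2024-01-01", "country": "DE", "revenue": 1.0},
]

# One (day, country) rollup: only additive / mergeable stats are kept;
# avg is derived as sum/count at query time.
rollup = defaultdict(lambda: {"sum": 0.0, "count": 0,
                              "min": float("inf"), "max": float("-inf")})
for e in events:
    g = rollup[(e["day"], e["country"])]
    g["sum"] += e["revenue"]
    g["count"] += 1
    g["min"] = min(g["min"], e["revenue"])
    g["max"] = max(g["max"], e["revenue"])

us = rollup[("2024-01-01", "US")]
print(us["sum"] / us["count"])  # avg for US on Jan 1 -> 3.0
```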

Phase 2: Build Optimization

  • Single-pass incremental folding over 49 CSV files (reduces peak memory to ~10GB)
  • Parallel CSV reading (8 threads) + Polars multi-threading
  • Arrow IPC format with LZ4 compression (3:1 ratio, instant memory-mapping)
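The incremental-folding step can be sketched as follows, assuming dict-based partial aggregates (the real implementation folds Polars frames; `merge` and the chunk layout are hypothetical). Each chunk produces a small partial aggregate that is folded into one accumulator, so raw rows never pile up in memory.

```python
def merge(acc: dict, part: dict) -> dict:
    """Fold one chunk's partial aggregates into the running accumulator.
    Because the stats are additive, peak memory is bounded by the number
    of distinct group keys, not the number of raw rows."""
    for key, stats in part.items():
        if key not in acc:
            acc[key] = dict(stats)
        else:
            g = acc[key]
            g["sum"] += stats["sum"]
            g["count"] += stats["count"]
    return acc

# Two "chunks" standing in for batches of the 49 CSV files.
chunk_a = {("w1", "US"): {"sum": 6.0, "count": 2}}
chunk_b = {("w1", "US"): {"sum": 4.0, "count": 1},
           ("w1", "DE"): {"sum": 1.0, "count": 1}}

acc = {}
for part in (chunk_a, chunk_b):
    acc = merge(acc, part)

print(acc[("w1", "US")])  # {'sum': 10.0, 'count': 3}
```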

Phase 3: Query Execution

  • Query router parses dimensions and filters, matches against rollup definitions
  • Rollup queries: memory-mapped load → filter → aggregate, with no further processing in the path
  • Fallback queries: DuckDB sorted scans (sorted by week, country, type for efficient range queries)
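The rollup query path itself can be illustrated with plain Python (the real path filters a memory-mapped Arrow table; `run_rollup_query` and the row layout are assumptions). Filtered pre-aggregated rows are combined by summing the additive stats, and avg falls out as sum/count at the end:

```python
# Pre-aggregated rollup rows (toy data; real rollups live in Arrow IPC).
rollup_rows = [
    {"day": "2024-01-01", "country": "US", "sum": 6.0, "count": 2},
    {"day": "2024-01-02", "country": "US", "sum": 10.0, "count": 5},
    {"day": "2024-01-01", "country": "DE", "sum": 1.0, "count": 1},
]

def run_rollup_query(rows, filters):
    """Filter pre-aggregated groups, then combine their additive stats."""
    hit = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    total = sum(r["sum"] for r in hit)
    n = sum(r["count"] for r in hit)
    return {"sum": total, "count": n, "avg": total / n if n else None}

print(run_rollup_query(rollup_rows, {"country": "US"}))
```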

Tech Stack: Python 3.9+, Polars, DuckDB, Apache Arrow, LZ4 compression

Challenges we ran into

  1. Memory explosion during build → Solved with incremental folding (merge accumulators every 50 batches)
  2. Build time >15 min on M3 Air → Attempted multiprocessing but file-level parallelism added merge overhead; reverted to sequential single-pass (more efficient)
  3. Choosing which dimensions to pre-aggregate → Analyzed baseline patterns; 11 rollups cover 80–90% with diminishing returns beyond
  4. Query routing accuracy → Built intelligent matcher that handles partial dimension sets and falls back to DuckDB safely
  5. Storage format trade-offs → Arrow IPC (memory-mappable, instant zero-copy reads) vs Parquet; chose IPC for 5–10× faster load

Accomplishments that we're proud of

  • 1,600× speedup (39ms vs 62s) on example queries
  • Sub-100ms query execution on 245M rows
  • Compact storage (~3GB rollups + fallback vs ~20GB raw)
  • Incremental folding keeps peak memory at ~10GB (safe on 12–16GB systems)
  • Intelligent routing — no manual query tuning needed
  • Validated against baseline — results are identical down to the last row

What we learned

  • Pre-aggregation is powerful but requires careful dimension selection (diminishing returns beyond 11 rollups)
  • Sequential single-pass is faster than parallelism for merge-heavy workloads (join overhead outweighs parallelism gains)
  • Physical layout beats indexes — sorting the DuckDB table by (week, country, type) makes range scans fast without any indexes, which also cuts build time
  • Memory-mapped I/O matters — Arrow IPC's zero-copy semantics cut load time from 2–3s to 450ms
  • Constraints drive design — 16GB RAM limit forced us toward incremental folding and bounded memory strategies
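The "physical layout beats indexes" lesson can be demonstrated without DuckDB at all: on data sorted by key, a range query is just a binary search plus one contiguous slice. A stdlib sketch, with keys simplified to (week, country):

```python
import bisect

# Rows stored sorted by (week, country): a week range maps to one
# contiguous slice, so no index structure is needed.
rows = sorted([
    ((3, "DE"), 1.0), ((1, "US"), 6.0), ((2, "US"), 10.0),
    ((1, "DE"), 2.0), ((2, "DE"), 4.0),
])
keys = [k for k, _ in rows]

def week_range(lo_week: int, hi_week: int) -> float:
    """Sum revenue for weeks in [lo_week, hi_week] via one slice scan."""
    start = bisect.bisect_left(keys, (lo_week,))
    stop = bisect.bisect_right(keys, (hi_week, "\uffff"))
    return sum(v for _, v in rows[start:stop])

print(week_range(1, 2))  # weeks 1-2 across all countries -> 22.0
```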

What's next for AppLovin Query Planner

  • Dynamic rollup generation — auto-detect optimal rollup dimensions from query workloads
  • Adaptive routing — learn query patterns over time and reorganize rollups on-the-fly
  • Distributed execution — scale to multi-machine clusters for even larger datasets
  • Real-time updates — stream new events into rollups without rebuilding
  • Query optimizer — cost-based planner to choose between rollup vs fallback more intelligently

Quick Start

# Install dependencies
pip install -r requirements.txt

# Build rollups and DuckDB fallback (~11–15 min on M3 Air)
python3 prepare.py --data-dir ./data

# Run example queries (~40ms)
python3 run.py

# Or run your own 20 queries (JSON array format)
python3 run.py --query-file ./your_queries.json
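The exact query-file schema isn't shown here; as a purely hypothetical illustration, a JSON array of queries might look like the following (every field name is an assumption, not the project's actual format):

```json
[
  {
    "group_by": ["day", "country"],
    "filters": {"type": "impression"},
    "aggregate": {"revenue": "sum"}
  }
]
```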
