AppLovin Query Planner - Rollup + DuckDB Query Engine

🏆 1,600× Faster than Baseline — 39ms vs 62 seconds on 5 example queries

Inspiration

Ad networks process massive volumes of event data. AppLovin needed instant query execution on 245M rows of ad events (~20GB). Traditional full-scan approaches were too slow; we needed a smarter architecture that could handle arbitrary queries without sacrificing speed.

What it does

A two-tier query engine that executes ad-hoc analytics queries on massive event datasets in <100ms:

  • 11 pre-aggregated rollup tables (Arrow IPC, ~340MB) handle (hopefully!) 80–90% of queries in <10ms via direct lookup
  • Sorted DuckDB fallback (~2.5GB) handles complex multi-dimensional queries in <400ms
  • Smart query router analyzes query patterns and routes to the optimal data structure
  • Supports 20+ concurrent queries with <200ms total execution time
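The routing decision described above can be sketched in a few lines of plain Python. This is an illustrative model only: the rollup names, dimension sets, and the `route` helper are assumptions, not the project's actual API.

```python
# Hypothetical rollup registry: name -> set of dimensions it pre-aggregates.
ROLLUPS = {
    "day_country": {"day", "country"},
    "day_advertiser_type": {"day", "advertiser", "type"},
}

def route(query: dict) -> str:
    """Pick the narrowest rollup covering all dimensions the query
    touches; otherwise fall back to the sorted DuckDB table."""
    needed = set(query.get("group_by", [])) | set(query.get("filters", {}))
    candidates = [name for name, dims in ROLLUPS.items() if needed <= dims]
    if candidates:
        # Prefer the covering rollup with the fewest extra dimensions.
        return min(candidates, key=lambda n: len(ROLLUPS[n]))
    return "duckdb_fallback"

print(route({"group_by": ["day"], "filters": {"country": "US"}}))   # day_country
print(route({"group_by": ["publisher", "hour"], "filters": {}}))    # duckdb_fallback
```

The "falls back safely" property comes for free: any query whose dimensions no rollup covers lands on the fallback path.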

How we built it

Phase 1: Rollup Design

  • Analyzed baseline query patterns to identify high-impact dimensions (day, country, advertiser, publisher, type)
  • Pre-aggregated data into 11 rollup tables covering 80–90% of query patterns
  • Each rollup stores pre-computed aggregates: sum, count, avg, min, max
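Conceptually, building one rollup looks like the stdlib-only sketch below (the real build uses Polars over 245M rows; the column names and toy data are illustrative). One subtlety worth showing: avg is not stored directly, because averages are not mergeable; keeping sum and count makes every stat re-aggregable later.

```python
from collections import defaultdict

# Toy event rows standing in for the 245M-row dataset.
events = [
    {"day": "2024-01-01", "country": "US", "revenue": 2.0},
    {"day": "2024-01-01", "country": "US", "revenue": 4.0},
    {"day": "2024-01-01", "country": "DE", "revenue": 1.0},
]

# One (day, country) rollup: only additive / mergeable stats are kept;
# avg is derived as sum/count at query time.
rollup = defaultdict(lambda: {"sum": 0.0, "count": 0,
                              "min": float("inf"), "max": float("-inf")})
for e in events:
    g = rollup[(e["day"], e["country"])]
    g["sum"] += e["revenue"]
    g["count"] += 1
    g["min"] = min(g["min"], e["revenue"])
    g["max"] = max(g["max"], e["revenue"])

us = rollup[("2024-01-01", "US")]
print(us["sum"] / us["count"])  # avg for US on Jan 1 -> 3.0
```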

Phase 2: Build Optimization

  • Single-pass incremental folding over 49 CSV files (reduces peak memory to ~10GB)
  • Parallel CSV reading (8 threads) + Polars multi-threading
  • Arrow IPC format with LZ4 compression (3:1 ratio, instant memory-mapping)
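The incremental-folding step can be sketched as follows, assuming dict-based partial aggregates (the real implementation folds Polars frames; `merge` and the chunk layout are hypothetical). Each chunk produces a small partial aggregate that is folded into one accumulator, so raw rows never pile up in memory.

```python
def merge(acc: dict, part: dict) -> dict:
    """Fold one chunk's partial aggregates into the running accumulator.
    Because the stats are additive, peak memory is bounded by the number
    of distinct group keys, not the number of raw rows."""
    for key, stats in part.items():
        if key not in acc:
            acc[key] = dict(stats)
        else:
            g = acc[key]
            g["sum"] += stats["sum"]
            g["count"] += stats["count"]
    return acc

# Two "chunks" standing in for batches of the 49 CSV files.
chunk_a = {("w1", "US"): {"sum": 6.0, "count": 2}}
chunk_b = {("w1", "US"): {"sum": 4.0, "count": 1},
           ("w1", "DE"): {"sum": 1.0, "count": 1}}

acc = {}
for part in (chunk_a, chunk_b):
    acc = merge(acc, part)

print(acc[("w1", "US")])  # {'sum': 10.0, 'count': 3}
```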

Phase 3: Query Execution

  • Query router parses dimensions and filters, matches against rollup definitions
  • Rollup queries: memory-mapped load → filter → aggregate, with no further processing in the path
  • Fallback queries: DuckDB sorted scans (sorted by week, country, type for efficient range queries)
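The rollup query path itself can be illustrated with plain Python (the real path filters a memory-mapped Arrow table; `run_rollup_query` and the row layout are assumptions). Filtered pre-aggregated rows are combined by summing the additive stats, and avg falls out as sum/count at the end:

```python
# Pre-aggregated rollup rows (toy data; real rollups live in Arrow IPC).
rollup_rows = [
    {"day": "2024-01-01", "country": "US", "sum": 6.0, "count": 2},
    {"day": "2024-01-02", "country": "US", "sum": 10.0, "count": 5},
    {"day": "2024-01-01", "country": "DE", "sum": 1.0, "count": 1},
]

def run_rollup_query(rows, filters):
    """Filter pre-aggregated groups, then combine their additive stats."""
    hit = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    total = sum(r["sum"] for r in hit)
    n = sum(r["count"] for r in hit)
    return {"sum": total, "count": n, "avg": total / n if n else None}

print(run_rollup_query(rollup_rows, {"country": "US"}))
```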

Tech Stack: Python 3.9+, Polars, DuckDB, Apache Arrow, LZ4 compression

Challenges we ran into

  1. Memory explosion during build → Solved with incremental folding (merge accumulators every 50 batches)
  2. Build time >15 min on M3 Air → Attempted multiprocessing but file-level parallelism added merge overhead; reverted to sequential single-pass (more efficient)
  3. Choosing which dimensions to pre-aggregate → Analyzed baseline patterns; 11 rollups cover 80–90% with diminishing returns beyond
  4. Query routing accuracy → Built intelligent matcher that handles partial dimension sets and falls back to DuckDB safely
  5. Storage format trade-offs → Arrow IPC (memory-mappable, instant zero-copy reads) vs Parquet; chose IPC for 5–10× faster load

Accomplishments that we're proud of

  • 1,600× speedup (39ms vs 62s) on example queries
  • Sub-100ms query execution on 245M rows
  • Compact storage (~3GB rollups + fallback vs ~20GB raw)
  • Incremental folding keeps peak memory at ~10GB (safe on 12–16GB systems)
  • Intelligent routing — no manual query tuning needed
  • Validated against baseline — results are identical down to the last row

What we learned

  • Pre-aggregation is powerful but requires careful dimension selection (diminishing returns beyond 11 rollups)
  • Sequential single-pass is faster than parallelism for merge-heavy workloads (join overhead outweighs parallelism gains)
  • Physical layout beats indexes — sorting the DuckDB table by (week, country, type) makes range scans fast without any indexes, which also cuts build time
  • Memory-mapped I/O matters — Arrow IPC's zero-copy semantics cut load time from 2–3s to 450ms
  • Constraints drive design — 16GB RAM limit forced us toward incremental folding and bounded memory strategies
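The "physical layout beats indexes" lesson can be demonstrated without DuckDB at all: on data sorted by key, a range query is just a binary search plus one contiguous slice. A stdlib sketch, with keys simplified to (week, country):

```python
import bisect

# Rows stored sorted by (week, country): a week range maps to one
# contiguous slice, so no index structure is needed.
rows = sorted([
    ((3, "DE"), 1.0), ((1, "US"), 6.0), ((2, "US"), 10.0),
    ((1, "DE"), 2.0), ((2, "DE"), 4.0),
])
keys = [k for k, _ in rows]

def week_range(lo_week: int, hi_week: int) -> float:
    """Sum revenue for weeks in [lo_week, hi_week] via one slice scan."""
    start = bisect.bisect_left(keys, (lo_week,))
    stop = bisect.bisect_right(keys, (hi_week, "\uffff"))
    return sum(v for _, v in rows[start:stop])

print(week_range(1, 2))  # weeks 1-2 across all countries -> 22.0
```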

What's next for AppLovin Query Planner

  • Dynamic rollup generation — auto-detect optimal rollup dimensions from query workloads
  • Adaptive routing — learn query patterns over time and reorganize rollups on-the-fly
  • Distributed execution — scale to multi-machine clusters for even larger datasets
  • Real-time updates — stream new events into rollups without rebuilding
  • Query optimizer — cost-based planner to choose between rollup vs fallback more intelligently

Quick Start

# Install dependencies
pip install -r requirements.txt

# Build rollups and DuckDB fallback (~11–15 min on M3 Air)
python3 prepare.py --data-dir ./data

# Run example queries (~40ms)
python3 run.py

# Or run your own 20 queries (JSON array format)
python3 run.py --query-file ./your_queries.json
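The exact query-file schema isn't shown here; as a purely hypothetical illustration, a JSON array of queries might look like the following (every field name is an assumption, not the project's actual format):

```json
[
  {
    "group_by": ["day", "country"],
    "filters": {"type": "impression"},
    "aggregate": {"revenue": "sum"}
  }
]
```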
