AppLovin Query Planner - Rollup + DuckDB Query Engine
🏆 1,600× Faster than Baseline — 39ms vs 62 seconds on 5 example queries
Inspiration
Ad networks process massive volumes of event data. AppLovin needed instant query execution on 245M rows of ad events (~20GB). Traditional full-scan approaches were too slow; we needed a smarter architecture that could handle arbitrary queries without sacrificing speed.
What it does
A two-tier query engine that executes ad-hoc analytics queries on massive event datasets in <100ms:
- 11 pre-aggregated rollup tables (Arrow IPC, ~340MB) handle an estimated 80–90% of queries in <10ms via direct lookup
- Sorted DuckDB fallback (~2.5GB) handles complex multi-dimensional queries in <400ms
- Smart query router analyzes query patterns and routes to the optimal data structure
- Supports 20+ concurrent queries with <200ms total execution time
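The routing rule behind the two tiers can be sketched in a few lines: a query is rollup-eligible only if every dimension and filter column it touches is covered by some rollup's key, otherwise it falls through to DuckDB. A minimal sketch — the rollup names and the "smallest covering rollup wins" tie-break here are illustrative, not the actual matcher:

```python
# Hypothetical rollup catalog: rollup name -> set of key columns it covers.
ROLLUPS = {
    "day_country": {"day", "country"},
    "day_advertiser": {"day", "advertiser"},
    "day_country_type": {"day", "country", "type"},
}

def route(query_dims: set[str], filter_cols: set[str]) -> str:
    """Return the name of a covering rollup, or 'duckdb' for the fallback."""
    needed = query_dims | filter_cols
    # Prefer the smallest rollup that covers every needed column
    # (fewer key columns means fewer rows to scan).
    candidates = [name for name, dims in ROLLUPS.items() if needed <= dims]
    if candidates:
        return min(candidates, key=lambda n: len(ROLLUPS[n]))
    return "duckdb"

print(route({"day"}, {"country"}))         # covered by a rollup
print(route({"day", "publisher"}, set()))  # uncovered -> fallback
```

The safety property is the important part: an unmatched query never errors, it just takes the slower path.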
How we built it
Phase 1: Rollup Design
- Analyzed baseline query patterns to identify high-impact dimensions (day, country, advertiser, publisher, type)
- Pre-aggregated data into 11 rollup tables covering 80–90% of query patterns
- Each rollup stores pre-computed aggregates: sum, count, avg, min, max
Phase 2: Build Optimization
- Single-pass incremental folding over 49 CSV files (reduces peak memory to ~10GB)
- Parallel CSV reading (8 threads) + Polars multi-threading
- Arrow IPC format with LZ4 compression (3:1 ratio, instant memory-mapping)
Phase 3: Query Execution
- Query router parses dimensions and filters, matches against rollup definitions
- Rollup queries: memory-mapped load → filter → aggregate → return
- Fallback queries: DuckDB sorted scans (sorted by week, country, type for efficient range queries)
Tech Stack: Python 3.9+, Polars, DuckDB, Apache Arrow, LZ4 compression
Challenges we ran into
- Memory explosion during build → Solved with incremental folding (merge accumulators every 50 batches)
- Build time >15 min on M3 Air → Attempted multiprocessing but file-level parallelism added merge overhead; reverted to sequential single-pass (more efficient)
- Choosing which dimensions to pre-aggregate → Analyzed baseline patterns; 11 rollups cover 80–90% with diminishing returns beyond
- Query routing accuracy → Built intelligent matcher that handles partial dimension sets and falls back to DuckDB safely
- Storage format trade-offs → Arrow IPC (memory-mappable, instant zero-copy reads) vs Parquet; chose IPC for 5–10× faster load
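The incremental-folding fix is the key memory trick: instead of holding all batches, keep one bounded accumulator and merge each batch's partial aggregates into it. A pure-Python sketch of the merge rule (the tuple layout is illustrative):

```python
# Each value is a partial aggregate: (sum, count, min, max).
# Folding a batch in keeps peak memory proportional to distinct keys,
# not to total rows read.
def fold(acc: dict, batch: dict) -> dict:
    """Merge a batch of partial aggregates into the running accumulator."""
    for key, (s, c, lo, hi) in batch.items():
        if key in acc:
            s0, c0, lo0, hi0 = acc[key]
            acc[key] = (s0 + s, c0 + c, min(lo0, lo), max(hi0, hi))
        else:
            acc[key] = (s, c, lo, hi)
    return acc

acc = {}
batches = [
    {("2024-01-01", "US"): (3.0, 2, 1.0, 2.0)},
    {("2024-01-01", "US"): (5.0, 1, 5.0, 5.0),
     ("2024-01-02", "DE"): (3.0, 1, 3.0, 3.0)},
]
for b in batches:
    acc = fold(acc, b)
print(acc[("2024-01-01", "US")])  # (8.0, 3, 1.0, 5.0)
```

Note that avg is derivable as sum/count, which is why the accumulator never stores it directly — averages don't merge, but sums and counts do.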
Accomplishments that we're proud of
✅ 1,600× speedup (39ms vs 62s) on example queries
✅ Sub-100ms query execution on 245M rows
✅ Compact storage (~3GB rollups + fallback vs ~20GB raw)
✅ Incremental folding keeps peak memory at ~10GB (safe on 12–16GB systems)
✅ Intelligent routing — no manual query tuning needed
✅ Validated against baseline — results match the baseline output exactly
What we learned
- Pre-aggregation is powerful but requires careful dimension selection (diminishing returns beyond 11 rollups)
- Sequential single-pass beats parallelism for merge-heavy workloads (merge overhead outweighs the parallelism gains)
- Physical layout beats indexes — sorting the DuckDB table by (week, country, type) removes the need for indexes and substantially reduces build time
- Memory-mapped I/O matters — Arrow IPC's zero-copy semantics cut load time from 2–3s to 450ms
- Constraints drive design — 16GB RAM limit forced us toward incremental folding and bounded memory strategies
What's next for AppLovin Query Planner
- Dynamic rollup generation — auto-detect optimal rollup dimensions from query workloads
- Adaptive routing — learn query patterns over time and reorganize rollups on-the-fly
- Distributed execution — scale to multi-machine clusters for even larger datasets
- Real-time updates — stream new events into rollups without rebuilding
- Query optimizer — cost-based planner to choose between rollup vs fallback more intelligently
Quick Start
# Install dependencies
pip install -r requirements.txt
# Build rollups and DuckDB fallback (~11–15 min on M3 Air)
python3 prepare.py --data-dir ./data
# Run example queries (~40ms)
python3 run.py
# Or run your own 20 queries (JSON array format)
python3 run.py --query-file ./your_queries.json
