Inspiration
We were inspired by how close this challenge is to a real DSP problem: you only have a few milliseconds to decide whether to bid for an impression, most installs will never buy anything, and a tiny fraction of users generate almost all of the in-app revenue. That zero-inflated, heavy-tailed setting is very different from a textbook regression task. We wanted to build something that does not just look good on a leaderboard, but could realistically help decide where to spend money by spotting high-value in-app buyers as early as possible.
What it does
Our solution predicts the 7-day in-app purchase revenue per install (iap_revenue_d7) and, more importantly, ranks users by their future value.
Concretely:
- Given features from the install request and user behaviour history,
- A single LightGBM regressor outputs a score for expected 7-day revenue, trained on a log-transformed target
- This score can be used to:
- estimate revenue for reporting and analytics, and
- rank installs so the system can prioritise which users are worth higher bids or special treatment.
On our validation split, the model underestimates total revenue, but it concentrates about 72.6% of real in-app revenue in the top 10% of users it scores highest, which is exactly what matters for budget allocation.
How we built it
We started by designing a data pipeline that could handle the scale and format of the dataset:
- We used Dask to read Parquet files in chunks, instead of loading everything into memory at once.
- We initially worked on a sample of the data to iterate faster on features and models.
- We created a train and validation split to evaluate offline, independent of the leaderboard.
For preprocessing, we wrapped everything in a single ColumnTransformer:
- Numeric features
- Imputed missing values with the median.
- Categorical features
- Imputed with the most frequent value.
- Encoded with an OrdinalEncoder, mapping unseen categories at inference time to minus one.
This gave us a reusable preprocessor object so that train, validation and test all go through exactly the same transformations.
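A sketch of that preprocessor, with hypothetical column names in place of our real feature lists:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature lists standing in for the real ones.
numeric_cols = ["bid_floor"]
categorical_cols = ["country"]

preprocessor = ColumnTransformer([
    # Numeric: fill missing values with the column median.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical: most-frequent imputation, then ordinal codes;
    # categories unseen at fit time map to -1 at inference.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)),
    ]), categorical_cols),
])

train = pd.DataFrame({"bid_floor": [1.0, None, 3.0],
                      "country": ["US", "US", "DE"]})
preprocessor.fit(train)
# An unseen country at inference time is encoded as -1.
encoded = preprocessor.transform(pd.DataFrame({"bid_floor": [2.0],
                                               "country": ["FR"]}))
```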
Because the target revenue is extremely skewed and the leaderboard metric is MSLE, we trained on the log-transformed target log(1 + y).
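This works because squared error on the log1p scale is exactly MSLE on the original scale, so a regressor trained on log1p(y) with an L2 objective optimises the leaderboard metric directly. A small numeric check:

```python
import numpy as np

y = np.array([0.0, 0.0, 4.99, 120.0])   # heavy-tailed revenue target
y_log = np.log1p(y)                     # what the regressor is trained on

preds_log = y_log * 0.9                 # stand-in for model output
preds = np.expm1(preds_log)             # back-transform to revenue scale

# Squared error on the log1p scale equals MSLE on the original scale.
msle = np.mean((np.log1p(y) - np.log1p(preds)) ** 2)
mse_log = np.mean((y_log - preds_log) ** 2)
```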
On top of this pipeline, we explored three modelling ideas:
Hurdle model in two steps
- First, classify buyer versus non-buyer (buyer_d7 or revenue greater than zero).
- Then, regress revenue only for buyers.
- Combine P(buyer) and E[revenue | buyer].
- Conceptually clean for zero-inflated data, but doubles complexity (two models, two tunings) and did not clearly outperform a strong single model.
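A minimal sketch of the two-step combination, using simple linear models on synthetic data rather than our actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Synthetic zero-inflated data: few buyers, revenue only for buyers.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
buyer = (X[:, 0] + rng.normal(size=1000)) > 1.5
revenue = np.where(buyer, np.exp(X[:, 1]) * 5.0, 0.0)

# Step 1: classify buyer vs non-buyer -> P(buyer).
clf = LogisticRegression().fit(X, buyer)
# Step 2: regress log1p(revenue) on buyers only -> E[revenue | buyer].
reg = LinearRegression().fit(X[buyer], np.log1p(revenue[buyer]))

# Combined expectation: P(buyer) * E[revenue | buyer].
expected = clf.predict_proba(X)[:, 1] * np.expm1(reg.predict(X))
```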
DNN with embeddings
- Learn embeddings for high-cardinality categorical features such as device or country.
- Concatenate embeddings and numeric features and feed them into a deep network.
- More expressive, but harder to tune within the datathon, and heavier at inference.
Single LightGBM regressor (final choice)
- A single LightGBM model predicting log(1 + revenue) from a compact set of features.
- Strong performance on tabular data, very fast inference, simple to deploy.
After comparing performance and complexity, we chose the single LightGBM as our final solution.
Challenges we ran into
We faced several challenges:
Zero-inflated, noisy target
Most installs have iap_revenue_d7 equal to zero, and a tiny number have large values. This makes classical regression metrics such as R squared look pessimistic and pushes the model to be conservative. On validation, we got:
- MSLE approximately 0.175
- RMSLE approximately 0.430
- R squared approximately 0.0006
- Real revenue about 397k dollars, predicted about 38k dollars (the model underestimates total revenue).
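A toy version of this offline evaluation, with made-up arrays in place of our real validation predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

# Made-up validation arrays: mostly zeros, conservative predictions.
y_true = np.array([0.0, 0.0, 0.0, 4.99, 120.0])
y_pred = np.array([0.05, 0.0, 0.1, 1.2, 15.0])

msle = mean_squared_log_error(y_true, y_pred)
rmsle = np.sqrt(msle)
r2 = r2_score(y_true, y_pred)

# Comparing totals exposes the systematic under-estimation.
real_total, pred_total = y_true.sum(), y_pred.sum()
```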
Scale and memory
Working with large Parquet files required careful handling (Dask, sampling, and a lean feature set) to avoid out-of-memory issues.
Balancing ambition and constraints
The hurdle model and the DNN with embeddings were attractive from a modelling perspective, but they:
- Increased complexity and tuning effort.
- Made it harder to guarantee millisecond-level inference.
- Did not clearly outperform a well-tuned LightGBM in our setup.
The main challenge was deciding when to stop adding complexity and focus on a solution that is simple, robust and fast.
Accomplishments that we are proud of
We are proud of several aspects of the project:
We built a clean end-to-end pipeline: Dask ingestion, consistent preprocessing with ColumnTransformer, a single LightGBM model, and an inference loop that processes Parquet test files in batches and writes predictions to submission format.
Even though the model underestimates total revenue, as a ranking model it performs very well:
- AUC for buyers versus non-buyers: 0.885
- Average Precision (PR AUC): 0.230
- Top 10% of users by predicted revenue contain 72.6% of the real revenue.
- Top 5% contain 55.9% of the real revenue.
- Buyer rate: 3.1% overall, 17.7% in the top 10%, 23.2% in the top 5%.
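The top-K concentration numbers can be computed with a small helper like this, shown here on synthetic scores rather than our validation data:

```python
import numpy as np

def topk_revenue_share(y_true, scores, frac=0.10):
    """Share of total real revenue captured by the top `frac` of users
    when ranked by predicted score (descending)."""
    k = max(1, int(len(scores) * frac))
    top_idx = np.argsort(scores)[::-1][:k]
    return y_true[top_idx].sum() / y_true.sum()

# Toy data: ~3% buyers, revenue loosely correlated with the score.
rng = np.random.default_rng(0)
scores = rng.random(1000)
y_true = np.where(rng.random(1000) < 0.03, scores * 100.0, 0.0)

share_top10 = topk_revenue_share(y_true, scores, frac=0.10)
```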
These numbers show that, despite being conservative on absolute values, the model is very good at finding and prioritising the high-value in-app buyers, which is exactly what matters for bidding and budgeting.
What we learned
We learned that in this kind of problem, ranking and value concentration can be more important than predicting exact revenue for every user. A model that minimises MSLE but treats all users similarly would not be very useful; a model that puts most of the real revenue into the top K scored users is.
We also learned how important it is to align the loss and the analysis with the business goal: training on log(1 + y) and optimising MSLE made the model more stable, but we needed additional metrics (AUC for buyers, top K revenue curves) to see whether we were actually helping the business. Finally, we saw that a simple, well-engineered LightGBM pipeline can beat more complex architectures in terms of practicality when you care about latency, maintainability and development speed.
What's next for Smadex Challenge: Predict the Revenue
If we had more time, we would like to:
Revisit the hurdle model idea, but using LightGBM in both stages, to explicitly model:
- P(buyer | install), and
- E[revenue | buyer, install], while still keeping inference fast and simple.
Incorporate richer user history features (for example recency-weighted statistics and simple sequence summaries) and measure how much extra lift we get in top K revenue, under strict latency limits.
Explore custom losses or objectives that directly reward good ranking in the high-value tail (for example losses emphasising the top decile of revenue), and compare them with MSLE.
Work on calibration and post-processing so that predicted revenue is not only useful for ranking, but also closer to the real scale, making it easier to plug into bidding and budgeting strategies in a production system.
Built With
- python
- tensorflow