This project was inspired by the real constraints at Smadex: making predictions in just a few milliseconds while dealing with massive datasets, sparse user behavior signals, and highly imbalanced revenue distributions. The challenge was to design something simple, fast, and accurate—a model that could realistically run in production.
🛠️ How We Built It
We built the solution around a FAISS-powered approximate KNN model designed for high-speed retrieval. The workflow includes:
- Lightweight preprocessing (target encoding, numeric flags, scaling)
- FAISS IVF indexing for fast neighbor search
- A two-headed prediction formula:
  - P(buyer) from neighbor buyer ratios
  - E(revenue | buyer) from neighbor buyer-only revenues
- Zero-aware logic to avoid diluting predictions with non-buyers
- Streaming, low-memory processing of test data
The full pipeline is optimized for speed, simplicity, and interpretability.
🎓 What We Learned
- How much value there is in neighbor-based models when engineered well
- The tradeoffs between accuracy and inference latency
- How critical feature preprocessing is for high-dimensional KNN
- How FAISS indexing can turn a slow idea into a production-grade approach
- That simple models can outperform complex ones when paired with the right constraints
🚧 Challenges
- Handling large-scale parquet datasets without blowing up memory
- Extracting and normalizing deeply nested dictionary fields
- Ensuring the scaled features aligned perfectly across train and test
- Balancing K, nlist, and nprobe for the best speed/accuracy tradeoff
- Implementing a streaming prediction pipeline that stays efficient at scale
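For the nested-field challenge, a recursive flattener along these lines turns dictionary columns into flat, joinable feature names. This is a stdlib-only sketch with hypothetical field names, not the exact extraction code used in the pipeline.

```python
def flatten(record, prefix="", sep="."):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))    # recurse into nested dicts
        else:
            flat[name] = value                        # leaf value becomes a column
    return flat

# Hypothetical nested field, similar in shape to the dictionary
# columns found in the parquet rows.
row = {"device": {"os": "ios", "geo": {"country": "ES"}}, "bid": 0.7}
print(flatten(row))
# {'device.os': 'ios', 'device.geo.country': 'ES', 'bid': 0.7}
```

Applied row by row inside the streaming loop, this keeps only the current batch's flattened records in memory rather than materializing the whole dataset.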