This project was inspired by the real constraints at Smadex: making predictions in just a few milliseconds while dealing with massive datasets, sparse user behavior signals, and highly imbalanced revenue distributions. The challenge was to design something simple, fast, and accurate: a model that could realistically run in production.

🛠️ How We Built It

We built the solution around a FAISS-powered approximate KNN model designed for high-speed retrieval. The workflow includes:

  • Lightweight preprocessing (target encoding, numeric flags, scaling)
  • FAISS IVF indexing for fast neighbor search
  • A two-headed prediction formula:
    • P(buyer) from neighbor buyer ratios
    • E(revenue | buyer) from the revenues of buyer neighbors only
  • Zero-aware logic to avoid diluting predictions with non-buyers
  • Streaming, low-memory processing of test data

The full pipeline is optimized for speed, simplicity, and interpretability.
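
To make the two heads concrete, here is a minimal sketch of how they could be computed from the revenues of the K retrieved neighbors. The function name, the "revenue > 0 means buyer" rule, and the final product are assumptions for illustration, not the project's exact code.

```python
def two_headed_prediction(neighbor_revenues):
    """Return (p_buyer, revenue_given_buyer) for one query.

    neighbor_revenues: revenues of the K nearest training rows.
    Assumption: a neighbor counts as a buyer when its revenue is > 0.
    """
    k = len(neighbor_revenues)
    if k == 0:
        return 0.0, 0.0

    # Head 1: share of neighbors that are buyers.
    buyer_revenues = [r for r in neighbor_revenues if r > 0]
    p_buyer = len(buyer_revenues) / k

    # Head 2: mean revenue over buyer neighbors only (zero-aware),
    # so non-buyers do not drag the estimate toward zero.
    if buyer_revenues:
        revenue_given_buyer = sum(buyer_revenues) / len(buyer_revenues)
    else:
        revenue_given_buyer = 0.0

    return p_buyer, revenue_given_buyer


# Five neighbors, two of them buyers.
p_buyer, revenue_given_buyer = two_headed_prediction([0.0, 0.0, 0.0, 10.0, 30.0])

# One hypothetical way to combine the heads.
expected_revenue = p_buyer * revenue_given_buyer
```

With unweighted neighbors, the plain product of the two heads equals the simple mean of all neighbor revenues. Keeping the heads separate is what allows a later step to treat them differently, for example by thresholding or reweighting P(buyer).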

🎓 What We Learned

  • How much value there is in neighbor-based models when engineered well
  • The tradeoffs between accuracy and inference latency
  • How critical feature preprocessing is for high-dimensional KNN
  • How FAISS indexing can turn a slow idea into a production-grade approach
  • That simple models can outperform complex ones when paired with the right constraints

🚧 Challenges

  • Handling large-scale parquet datasets without blowing up memory
  • Extracting and normalizing deeply nested dictionary fields
  • Ensuring the scaled features aligned perfectly across train and test
  • Balancing K, nlist, and nprobe for the best speed/accuracy tradeoff
  • Implementing a streaming prediction pipeline that stays efficient at scale
