About the Project

This project was inspired by the growing need to process and analyze massive datasets without the overhead of managing infrastructure. Traditional ML pipelines often require moving large volumes of data to compute environments, which increases cost, latency, and complexity. We wanted to explore whether serverless technologies could run end-to-end machine learning workflows efficiently and at scale, directly within the data lake.

What We Built

We developed a serverless ML pipeline for clustering large-scale datasets using Amazon Athena and AWS Lambda:

  • Distance Calculation & Filtering (Athena):
    Using SQL queries in Athena, we compute pairwise distances and apply filters to reduce the dataset size. This step is crucial for minimizing data transfer and preparing a manageable input for clustering.

  • Clustering (Lambda + scikit-learn):
    Pre-filtered data is passed to a Lambda function that runs a clustering algorithm from scikit-learn. This approach allows us to leverage Lambda's scalability while staying within its memory and execution time constraints.
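
To make the Athena step more concrete, here is a minimal sketch of how a distance-and-filter query could be generated. It is an illustration under assumptions, not the query we run: the table name, the `id`, `x`, and `y` columns, the Euclidean metric, and the `build_distance_query` helper are all hypothetical.

```python
def build_distance_query(table: str, max_distance: float) -> str:
    """Return Athena SQL that keeps only point pairs within max_distance.

    Assumes a hypothetical table with columns id, x, y and a Euclidean metric.
    """
    # Reject anything that is not a plain (optionally schema-qualified) name,
    # since the table name is interpolated into the SQL text.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return f"""
SELECT a.id AS id_a,
       b.id AS id_b,
       sqrt(power(a.x - b.x, 2) + power(a.y - b.y, 2)) AS distance
FROM {table} AS a
JOIN {table} AS b
  ON a.id < b.id  -- each unordered pair once, no self-pairs
WHERE sqrt(power(a.x - b.x, 2) + power(a.y - b.y, 2)) <= {float(max_distance)}
""".strip()


if __name__ == "__main__":
    print(build_distance_query("events.points", 0.5))
```

The resulting string could then be submitted with the boto3 Athena client's `start_query_execution` call. Filtering in the `WHERE` clause means only nearby pairs ever leave the data lake, which is what keeps the input to the clustering step small.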

What We Learned

  • Serverless tools like Athena and Lambda are surprisingly powerful when used together for ML preprocessing and execution.
  • Pushing computation to the data lake using SQL (Athena) greatly reduces the need for data movement.
  • Designing ML pipelines for Lambda requires careful attention to memory usage and time limits, but can work well for bounded tasks.
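
A bounded clustering task of this kind could look like the sketch below inside a Lambda handler. The event shape (`n_points`, `pairs`, `eps`, `min_samples`), the `pairs_to_matrix` helper, and the choice of DBSCAN are assumptions made for illustration; the write-up only states that a scikit-learn clustering algorithm is used.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def pairs_to_matrix(n_points, pairs, far):
    """Build a dense symmetric distance matrix from (i, j, distance) triples.

    Pairs that were filtered out upstream get the sentinel distance `far`,
    which must be larger than eps so they never count as neighbours.
    """
    matrix = np.full((n_points, n_points), far, dtype=float)
    np.fill_diagonal(matrix, 0.0)
    for i, j, distance in pairs:
        matrix[i, j] = matrix[j, i] = distance
    return matrix


def handler(event, context=None):
    """Hypothetical Lambda entry point: cluster pre-filtered pairwise distances."""
    eps = float(event["eps"])
    matrix = pairs_to_matrix(
        int(event["n_points"]), event["pairs"], far=eps * 10 + 1.0
    )
    model = DBSCAN(
        eps=eps,
        min_samples=int(event.get("min_samples", 2)),
        metric="precomputed",  # the input is already a distance matrix
    )
    labels = model.fit_predict(matrix)
    return {"labels": labels.tolist()}  # -1 marks noise points


if __name__ == "__main__":
    event = {
        "n_points": 6,
        "eps": 0.3,
        "pairs": [(0, 1, 0.1), (1, 2, 0.1), (0, 2, 0.2), (3, 4, 0.1)],
    }
    print(handler(event))
```

Because the matrix is dense, memory grows with the square of the number of points, so this form only suits inputs that have already been reduced upstream.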

Challenges We Faced

  • Data volume: Processing billions of rows meant optimizing queries for performance and cost.
  • Lambda constraints: We had to tune the clustering code to fit within Lambda’s memory and timeout limits.
  • scikit-learn packaging: We used the imperva/aws-lambda-layer project to create a Lambda Layer for scikit-learn and its dependencies, since Lambda's default environment does not provide them.
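
The Lambda memory constraint can be reasoned about with simple arithmetic before any code is deployed. The sketch below assumes the clustering input is held as a dense float64 distance matrix and that only part of the configured memory is available for it; both the helper names and the 50% default are illustrative assumptions, not measurements from our pipeline.

```python
import math

BYTES_PER_FLOAT64 = 8


def dense_matrix_mib(n_points: int) -> float:
    """Memory in MiB needed for an n x n float64 distance matrix."""
    return n_points * n_points * BYTES_PER_FLOAT64 / (1024 ** 2)


def max_points_for_budget(memory_mib: int, usable_fraction: float = 0.5) -> int:
    """Largest n whose dense matrix fits in the usable share of Lambda memory."""
    budget_bytes = memory_mib * 1024 ** 2 * usable_fraction
    return math.isqrt(int(budget_bytes // BYTES_PER_FLOAT64))


if __name__ == "__main__":
    for memory_mib in (1024, 4096, 10240):
        print(memory_mib, "MiB ->", max_points_for_budget(memory_mib), "points")
```

Since memory grows quadratically with the number of points, doubling the Lambda memory setting raises the point limit by only about 41%, which is why reducing the input in Athena matters more than raising the memory setting.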

Accomplishments that we're proud of

  • Pipelines in production: We deployed the serverless ML pipeline to production, where it runs daily and processes large datasets.

What's next for Serverless ML clustering pipeline

  • Generalizing: We plan to add more pipelines and generalize the framework to support different clustering algorithms, inputs, and data types.
