ScrubData - Automated Data Cleaning

Inspiration

The inspiration for this project came from our experience as data scientists and how tedious cleaning data can be. Due to how time-consuming data cleaning can be, we decided to create a project where we can upload a .CSV file and return it with all unnecessary data cleaned. This project will help many industries, such as finance, healthcare, and e-commerce, as there is a significant demand for tools that can help streamline the process of cleaning data. By utilizing Databricks, Amazon S3, and a custom web application, we plan to create a platform for data cleaning that can solve this issue. What it does Our application allows users to upload raw data in CSV format, select specific cleaning options (such as removing NA values and handling outliers), and submit it for processing. The uploaded file is stored in Amazon S3, and a Databricks job is triggered to clean the data based on the selected options. Once processed, the cleaned file is stored back in S3, and the user receives a download link for the cleaned dataset.

How we built it

We built the project using a full-stack web application framework: Frontend HTML - CSS - JavaScript Backend: Node.js - Express - Amazon S3 - Databricks Databricks: Processes the data by reading the file from S3, applying cleaning operations as specified, and saving the cleaned file back to S3. Amazon S3: S3 stores both the raw uploaded files and the processed, cleaned files.

Challenges we ran into

We encountered several challenges throughout this project: Setting up AWS: Our first time using AWS resulted in us needing to watch and read a lot about how to create buckets and interact with them. Connecting Databricks with S3: Ensuring Databricks could connect to S3 was vital, using this environment was also new and we had to make it so that databricks could process several different options. Front-End - Back-End: Connecting the 2 proved to be quite difficult, as we had to figure out how to accept files, pass them to the cleaning, and then back to the front end. Accomplishments that we're proud of We're proud to have developed a fully functional data cleaning pipeline that integrates multiple technologies effectively. The application enables users with minimal technical background to clean large datasets without complex setup. Successfully orchestrating AWS and Databricks with a responsive frontend and robust backend is a significant achievement for our team.

What we learned

Through this project, we gained valuable experience in: Cloud Storage: We cultivated our understanding of Amazon S3 specializing in file storage. Databricks Workflows: We learned how to set up and manage Databricks jobs, parameterize notebooks, and handle data transformation at scale. Full-Stack Development: We strengthened our understanding of how front-end can interact with back-end logic and external APIs.

What's next for Data Cleaning Using Databricks

In the future, we plan to expand this project by adding: Additional Data Cleaning Options: Including additional data cleaning tasks, such as standardization, normalization, etc. User Authentication and History: Creating user authenticator/login and a history management system of files and results. Visualization of Cleaning Results: Creating functionality that shows users how data has been changed, cleaned, improved.

Built With

Updates

Manovay Sharma started this project — Nov 09, 2024 05:49 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.