Datasets, data lakes, data repositories; everyone loves data these days! But no one really likes to work with unorganized and messy data. Thus, we tasked ourselves to create a tool that will analyze data lakes and data repositories to help quantify and improve the cleanliness of a data ecosystem!
This is a command-line tool for analyzing large datasets by clustering their files. The tool uses file metadata to identify text and tabular data, then analyzes the contents of those documents using NLP. It then rates how clean your data repository is based on similarity and hierarchy metrics. In addition, the pipeline proposes an organization structure that more intuitively groups like files together.
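To make the first step concrete, here is a minimal sketch of how files might be sorted into text and tabular groups from metadata alone. It assumes the file extension is the only metadata used, and the extension sets and the `classify_file` name are hypothetical, not taken from the project.

```python
from pathlib import Path

# Hypothetical extension sets; the real tool may recognize a different list.
TABULAR_EXTENSIONS = {".csv", ".tsv", ".xls", ".xlsx"}
TEXT_EXTENSIONS = {".txt", ".md", ".pdf", ".docx"}


def classify_file(path):
    """Return 'tabular', 'text', or 'other' based only on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in TABULAR_EXTENSIONS:
        return "tabular"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return "other"
```

Files labeled "other" would simply be left out of clustering in this sketch; how the actual pipeline treats them is not specified here.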
How we built it
This tool was built with love in Python, with its command-line interface implemented using argparse.
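As a rough illustration of the argparse side, the sketch below defines a parser for the flags that appear in the example command. The flag names come from that command; the types, defaults, help strings, and the `yes_no` and `build_parser` helpers are assumptions, and the real definitions in options.py may differ.

```python
import argparse


def yes_no(value):
    """Convert a 'y'/'n' command-line value to a bool (assumed convention)."""
    value = value.lower()
    if value in ("y", "yes"):
        return True
    if value in ("n", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected y or n, got {value!r}")


def build_parser():
    """Build a parser for the flags seen in the example command."""
    parser = argparse.ArgumentParser(
        description="Cluster a dataset and estimate its cleanliness."
    )
    parser.add_argument("--dataset_path", required=True,
                        help="root directory of the dataset to analyze")
    # All defaults below are illustrative assumptions.
    parser.add_argument("--cluster_struct", type=yes_no, default=False)
    parser.add_argument("--cluster_text", type=yes_no, default=False)
    parser.add_argument("--minibatch_kmeans", type=yes_no, default=False)
    parser.add_argument("--num_clusters_start", type=int, default=8)
    parser.add_argument("--num_clusters_end", type=int, default=18)
    parser.add_argument("--overwrite_tokens_text", type=yes_no, default=False)
    parser.add_argument("--overwrite_clusters_text", type=yes_no, default=False)
    return parser
```

Calling `build_parser().parse_args()` inside main.py would then yield a namespace such as `args.cluster_text` for the rest of the pipeline to read.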
The journey wasn't without some bumps. Development was full of issues with integrating different packages. The theory behind the "cleanliness score" also took quite a bit of thought.
And thus we present to you, a tool for data scientists! This tool will analyze how disorganized your data really is by examining dataset content and file hierarchy. Then, by proposing clusters of files, it can also provide an intuition for a better file organization.
main.py runs the entire clustering and cleanliness evaluation pipeline. All available arguments are defined in options.py. The pipeline does not modify, rename, or delete any files in the source dataset. However, it does copy some files into a separate directory in order to convert the data to a single format. It can analyze the file extension composition of the dataset, preprocess data for clustering, cluster all tabular and text data, print the results and cluster distributions to a .pdf, and compute an estimate of the cleanliness of the dataset.
python3 main.py --dataset_path /u/jzhang56/HackMIT/Archive --cluster_struct n --cluster_text y --minibatch_kmeans y --num_clusters_start 8 --num_clusters_end 18 --overwrite_tokens_text y --overwrite_clusters_text y
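For a sense of what the file extension composition step might compute, here is a small read-only sketch that walks a dataset directory and reports the share of files per extension. The `extension_composition` function and the "(none)" label are hypothetical; the tool's own implementation and output format may look quite different.

```python
import os
from collections import Counter


def extension_composition(dataset_path):
    """Return a dict mapping each file extension to its share of all files.

    Only reads directory listings; nothing in the dataset is modified.
    """
    counts = Counter()
    for _root, _dirs, files in os.walk(dataset_path):
        for name in files:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ext: n / total for ext, n in counts.items()}
```

A dataset dominated by one or two extensions is easier to convert to a single format, which is one reason this breakdown is a useful first look before clustering.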