Many schools are taking advantage of cloud services such as Google Drive to let students collaborate on homework assignments and projects. The problem is that, given the multitude of subjects students take at every education level, it becomes tedious to either sort all the files manually or sift through hundreds of them to find a specific one. With this app, users can sort (or re-sort) all their files in seconds using a machine learning algorithm. As users fix the algorithm's mistakes, we feed those corrections back as training data to improve its accuracy and enable it to label folders automagically - yes, magically :P.
What it does
We built an iOS app where the user signs in with their Google account and specifies how many categories or subjects the files should be sorted into. The app then uses the Google Drive API to find all documents and read their contents. Next, we process each document with NLTK, cluster the resulting document vectors into groups using k-means clustering, and finally copy each file into a new sorted folder.
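The clustering step described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code the app runs: the regex tokenizer stands in for the NLTK preprocessing, TF-IDF weighting is an assumed choice of document vector, and the function names (`tfidf_vectors`, `kmeans`, `sort_into_folders`) and the sample file names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Simple regex tokenizer; a stand-in for the NLTK preprocessing in the app.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    # Build one dense, length-normalised TF-IDF vector per document.
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        vectors.append([w / norm for w in vec])
    return vectors


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=50):
    # Deterministic farthest-point seeding (an assumption, chosen so the
    # sketch gives reproducible results without a random seed).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sort_into_folders(named_docs, k):
    # Map each cluster label to the list of file names assigned to it.
    names = list(named_docs)
    labels = kmeans(tfidf_vectors([named_docs[n] for n in names]), k)
    folders = defaultdict(list)
    for name, label in zip(names, labels):
        folders[label].append(name)
    return dict(folders)


# Hypothetical sample files: name -> extracted text.
files = {
    "bio_notes.txt": "Plant cells use chlorophyll for photosynthesis",
    "cells.txt": "Mitochondria power plant cells",
    "calc_hw.txt": "Integral calculus: find the limit of a function",
    "limits.txt": "Derivative calculus: the limit of a function slope",
}
print(sort_into_folders(files, k=2))
```

Here `k` plays the role of the number of categories the user supplies. A production version would more likely lean on a library implementation of k-means with random restarts than on the fixed seeding used here.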
How we built it
The front end is an iOS app that sends requests to our Google Cloud Functions backend, which parses each request to sort the data as the user wants. We use a NodeJS cloud function to receive all of the requests from the app; it then sends the appropriate sub-request to other methods that use NodeJS and Python (see below for why).
Challenges we ran into
We planned on doing everything other than language processing with NodeJS, but we quickly ran into an issue with our Google Drive API quota. When we tried to read all the files from the user's drive, it kept failing after a certain number of files because we were sending requests too fast. The solution was to batch our requests, but we found this a lot easier to do in Python than in NodeJS, so we ended up switching more of the backend to Python than we had expected.
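The batching idea can be sketched as below. This is an illustration under stated assumptions, not our actual backend code: `fetch_batch` is a hypothetical callable that takes a list of file IDs and returns a dict of contents (in practice it could wrap a batch built with the Google API Python client), `RateLimitError` is a hypothetical stand-in for whatever quota error the API raises, and the default batch size and backoff delays are assumed values.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a quota / rate-limit error from the API."""


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all(file_ids, fetch_batch, batch_size=100, max_retries=5,
              base_delay=1.0, sleep=time.sleep):
    # Send one request per batch instead of one per file, and back off
    # exponentially whenever a batch is rejected for exceeding the quota.
    results = {}
    for batch in chunked(file_ids, batch_size):
        for attempt in range(max_retries):
            try:
                results.update(fetch_batch(batch))
                break
            except RateLimitError:
                sleep(base_delay * 2 ** attempt)
        else:
            raise RuntimeError("batch still rate-limited after retries")
    return results


# Usage with a fake fetcher (no network access involved).
def fake_fetch_batch(batch):
    return {file_id: "contents of " + file_id for file_id in batch}


contents = fetch_all(["a", "b", "c"], fake_fetch_batch, batch_size=2)
print(contents)
```

Passing `sleep` in as a parameter keeps the retry logic easy to test without actually waiting.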
Accomplishments that we're proud of
We're most proud that we were able to batch our requests (using the client library for Python) so that we didn't hit the rate limit, which seemed really daunting at first. We're also proud that we were able to implement, for the most part, everything in our original idea without having to compromise due to time. Additionally, we were happy that the results we obtained using solely unsupervised clustering revealed meaningful patterns between documents. This suggests the project is viable and could produce even better results given ample training data and a supervised approach.
What we learned
The biggest thing we learned is the importance of optimizing our requests so that we don't reach our quota every time we run the app. We also learned how to use many of the services offered by Google Cloud Platform, including Cloud Functions, Cloud Storage, Firebase, and Google service APIs, and how to have different cloud functions interact with each other RESTfully.
What's next for AutoSort.it
- Automatically label folders using the data generated by users
- Improve sorting using data generated by users (the feedback loop mentioned in the intro)
- Dropbox & Office Online integration
- (Possible) Image support via Computer Vision
- Allow additional sort parameters (e.g. date ranges)
- Web app to extend the functionality offered in the iOS app