We chose to work on a crowd-sourced tool for data labeling, to simplify the data pipeline for deep learning models.
Nowadays we have amazing tools at our disposal: systems for language translation, movie recommendation, computer vision, autonomous driving, and much more! But all these systems are very data-hungry, and a Deep Learning engineer can spend most of their time on data collection and preparation. With the right tools, like ours, this time can be reduced.
Quantity alone is not enough, though: we also need to ensure that the quality of the data is high. From this perspective, leveraging an open-source, crowd-sourced platform is a unique opportunity to achieve a gold standard in data quality and to work towards building automated systems that are more fair, inclusive, and resilient to bias.
Thus, Crowd-Sourced is a free platform that lets users worldwide contribute to labeling textual and graphical data, as an open-source way of collecting human-labeled, gold-standard data for machine learning models.
This project comes with basic UI support, an easy-to-use interface, and a set of ready-to-use interactions. Most importantly, it takes care of serving the content to users for labeling, so you can put whatever data you like in the database without worrying about how it will be served and labeled.
What it does
It's a platform that lets users help label datasets really quickly, in either graphical or textual formats, by giving them a window that shows the data alongside different ways of inputting the labels. The final label for each item is then decided by consensus, based on the majority of the votes. In short, it enables an easy and quick way to label a dataset, much like Google's reCAPTCHA does with its image labeling feature.
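The majority-vote consensus described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `consensus_label` and the `min_votes` threshold are hypothetical, and it assumes a label is only accepted once it holds a strict majority of the votes cast.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(votes: Sequence[str], min_votes: int = 3) -> Optional[str]:
    """Return the label chosen by a strict majority of votes, or None.

    `min_votes` is an assumed threshold: with fewer votes than this,
    the item is treated as not yet decided.
    """
    if not votes or len(votes) < min_votes:
        return None  # not enough votes to decide yet
    label, count = Counter(votes).most_common(1)[0]
    # Accept the top label only if more than half of the voters chose it.
    return label if count > len(votes) / 2 else None


print(consensus_label(["cat", "cat", "dog"]))   # cat
print(consensus_label(["cat", "dog", "bird"]))  # None (no majority)
print(consensus_label(["cat"]))                 # None (too few votes)
```

Requiring a strict majority rather than a simple plurality means ambiguous items stay unlabeled until more users have voted, which trades labeling speed for label quality.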
The web application was built with the following aspects in mind:
🎁 Modern – Project created using the latest features of React (State management using Hooks)
💻 Responsive – Highly responsive, reusable UI components that change depending on the provided props; the UI library used is Material UI, which provides responsive components out of the box
🚀 Fast – Buttery smooth experience thanks to a lightweight implementation that follows ReactJS best practices
⚙️ Maintainable – The project is built with Docker Compose, which makes it easy to add and remove services and keeps the code easy to extend
How we built it
This section lists the technologies used in the making of this awesome project:
- Makefile ❤️ scripts for automating many of the processes
- Black (formatting) ❤️ Flake8 (linting)
- FlaskAPI ❤️ MongoEngine (ODM) ❤️ VirtualEnv
- React ❤️ Material UI ❤️ yarn
- ESLint with React to catch bugs early
- GitHub ❤️ with issue and pull request templates
- MongoDB as the database
- Docker/Docker Compose
- Linux ❤️ wget ❤️ zip for automating dataset generation and setup
- Python ❤️ requests library, for calling the Unsplash API
Challenges we ran into
Quite a lot:
- Deciding on the details of the workflow
- Deciding on the technology, and making it easy for everyone to follow along with all the issues and the work that needed to be done
- Deciding on the UI, and keeping the whole team up to speed with the steep learning curve and the idea
- Trimming the idea down to something as close to an MVP as possible
- Maintaining best practices with branches, GitHub issues & PRs
- Making sure everyone was on the same page
- Dealing with hidden bugs in Docker Compose, MongoDB, and especially the server
- Having to deploy the frontend somewhere
- Linting and formatting, to keep the code quality high
- Simplifying many processes by using a Makefile
Accomplishments that we're proud of
We're proud of a couple of things:
- Very rapid development
- Rapid learning and understanding of the solution
- Quickly adapting to a workflow
- Not getting overwhelmed by the feeling that we wouldn't make it
- Including so many technologies, stacks, and overall ideas to get this MVP out there
- Discussing very frequently and keeping in touch with everyone to make sure good progress is made
- A GitHub issue/PR/branch workflow. As of now, we have 6 closed issues, 2 open issues, and 16 closed PRs, with a total of 70 commits
- Overall, having fun!
What we learned
A lot, for all of us:
- On the technical side: JS, Python, automation, Bash scripting, dataset generation, API calls, Makefiles
- From a people perspective: communicating across time zones, and deciding on a single solid idea and building on top of it
What's next for CrowdSource
We want to expand a bit more on the idea: letting users upload their own data, and letting third-party websites use this functionality in their own web applications, a bit like Google's reCAPTCHA. We might go on to add a personal profile for each user, with proper authentication, so that users can upload their own datasets to be labeled. From there, we could grow into a more expansive platform that also handles audio and other dataset formats.