One of the well-known challenges in supervised learning is dataset collection and annotation. A plethora of tools already exist for annotating data, notably images and videos, for computer vision algorithms. This project pushes the idea further by enabling data scientists to collect usable anonymised data from end users, or to submit a dataset for cleaning or annotation tasks without revealing the whole dataset.
What it does
A data scientist wants to collect data (records or images for now). She specifies the format of the data she wants back and publishes an offer. Willing contributors who have the data she wants "upload" it to the platform (the data does not leave their premises until it is anonymised).
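A collection offer could be expressed as a small schema that contributor uploads are validated against before anonymisation. This is a minimal sketch under assumed field names and types (the `OFFER` structure and `validate_record` helper are hypothetical, not the platform's actual API):

```python
# Hypothetical offer spec: the data scientist declares the fields she
# expects back; contributor records are checked against it on upload.
OFFER = {
    "title": "Street-sign photos",
    "fields": {
        "image": "jpeg",   # raw image bytes
        "label": "str",    # free-text label
        "width_px": "int", # image width in pixels
    },
}

# Map the offer's declared kinds to Python types for validation.
TYPE_CHECKS = {"str": str, "int": int, "jpeg": bytes}

def validate_record(record: dict, offer: dict) -> bool:
    """Accept a contributor record only if it exactly matches the offer's format."""
    fields = offer["fields"]
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], TYPE_CHECKS[kind])
               for name, kind in fields.items())
```

Validating at upload time keeps malformed or extra fields (which could leak identifying information) out of the pipeline before any anonymisation step runs.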
A data scientist has an unannotated raw dataset that she wants to use. She specifies the task to be done and whether the data has to be prepared before review, then publishes an offer. The data may be anonymised if it is sensitive; it is then shuffled and dispersed in chunks to willing contributors, who execute the requested task. To make the review process trustworthy (and to guarantee that the task gets done), a record may be sent to several reviewers, who must reach a consensus.
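The shuffle-disperse-agree flow above can be sketched as follows. This is an illustrative sketch, not the project's actual implementation: the redundancy factor, quorum size, and helper names are assumptions:

```python
import random
from collections import Counter

def disperse(records, reviewers, redundancy=3, seed=None):
    """Shuffle records and assign each to `redundancy` distinct reviewers,
    so no single contributor sees the dataset in order or in full."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    assignments = {r: [] for r in reviewers}
    for rec in shuffled:
        # Pick `redundancy` different reviewers for this record.
        for reviewer in rng.sample(reviewers, redundancy):
            assignments[reviewer].append(rec)
    return assignments

def consensus(labels, quorum=2):
    """Accept a reviewed label only if at least `quorum` reviewers agree;
    otherwise return None so the record can be re-dispatched."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= quorum else None
```

Sending each record to several reviewers trades throughput for trust: a lone malicious or careless reviewer cannot decide a label on their own.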
How I built it
Using principles from decentralized systems for anonymous communication, consensus, and traceability; from privacy-enhancing technology (secure multiparty computation and/or homomorphic encryption); and from software security.
Challenges I ran into
- Making sure that collected data is anonymous
- Making sure that attackers cannot obtain sensitive data
- Making sure that attackers cannot influence the dataset
- Making sure that the data recipient can use the data without learning too much about it
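To illustrate the first challenge, a k-anonymity check is one simple (and by itself insufficient) test of whether collected records can be singled out by a combination of fields. The quasi-identifier fields below are hypothetical examples, not the platform's actual schema:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k=3):
    """Return True if every combination of quasi-identifier values occurs at
    least k times, so no record is unique with respect to those fields."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())
```

Stronger guarantees (e.g. against linkage or inference attacks) need techniques beyond k-anonymity, such as differential privacy, which is part of why this challenge is hard.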