What it does

It scans files of many different formats and flags those likely to contain sensitive data, such as IBANs or other personal information.
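For illustration, the core of such a check can be as simple as a pattern match. This is a minimal sketch, not our exact detector; the regex covers only the rough IBAN shape, not country-specific lengths or the mod-97 checksum:

```python
import re

# Rough IBAN shape: two letters, two check digits, then 11-30 alphanumerics.
# Illustrative only; real validation also checks per-country lengths and mod-97.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def find_ibans(text: str) -> list[str]:
    """Return candidate IBAN substrings found in the text."""
    # IBANs are often written with spaces between groups; strip them first.
    return IBAN_RE.findall(text.replace(" ", ""))
```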

How we built it

We carefully inspected the training set for correlations that could be exploited to improve performance, and we approached each file type differently. In general, rigid formats such as .xml and .csv were much easier to handle because of their predictable structure.
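For the rigid formats, a structured scan is straightforward because every value sits in a known cell under a known header. A minimal sketch of what a CSV pass might look like (the regex and function names are illustrative, not our exact code):

```python
import csv
import io
import re

# Rough IBAN shape; see the note above about it being illustrative only.
SENSITIVE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def scan_csv(data: str) -> list[tuple[int, str]]:
    """Return (line_number, column_header) pairs whose cells look sensitive."""
    reader = csv.DictReader(io.StringIO(data))
    hits = []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        for col, cell in row.items():
            if cell and SENSITIVE.search(cell.replace(" ", "")):
                hits.append((line_no, col))
    return hits
```

Because the column header travels with every hit, a human reviewer can jump straight to the offending field.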

Challenges we ran into

Compute-intensive tasks such as text recognition from images or speech recognition from MP3 files were slowing us down too much. For MP3s, we fell back on metadata, reasoning that audio files with spoken IBANs are rare in practice. For images, we tried to classify whether the image could be a document (a photo is less likely to contain sensitive information).

Accomplishments that we're proud of

We worked well as a team, parallelised tasks effectively, and learned a lot about file formats, riddles and privacy!

What we learned

  • Biased classifiers are good for some applications!
  • It's better to review than to miss.
  • Relying on proxies such as image background, MP3 metadata or file structure is generally considered a case of spurious correlation and bad taste. However, in applications where the cost of a false "Review" is not too high, such practices can be used successfully by practitioners.
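The asymmetric-cost point can be made concrete with a toy decision rule: flag a file for review whenever the expected cost of a miss outweighs the cost of a human look. The cost values below are illustrative assumptions, not figures from the workshop:

```python
def decide(p_sensitive: float, cost_miss: float = 50.0, cost_review: float = 1.0) -> str:
    """Flag for review whenever the expected miss cost exceeds the review cost.

    Expected cost of clearing the file: p_sensitive * cost_miss.
    Expected cost of flagging it:       cost_review (a human always looks).
    """
    return "Review" if p_sensitive * cost_miss >= cost_review else "Clear"
```

With a 50:1 cost ratio, even a 2% suspicion is enough to send a file to a human, which is exactly the deliberate bias described above.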

What's next for Julius Bare - Scan the Bank workshop

An interesting next step would be to account for the cost of "Review"-flagged examples, since each flagged file must be checked by a human and that takes time. Another would be to try text classifiers that operate on compressed versions of the data; that would also let us preload machine learning models while staying well within the 5 GB limit.
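The compression idea can be sketched with normalized compression distance (NCD), in the spirit of the recent "gzip classifier" line of work. Everything here (function names, toy training data) is illustrative:

```python
import gzip

def clen(s: str) -> int:
    """Compressed length of a string, our stand-in for its information content."""
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share structure."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(sample: str, labelled: list[tuple[str, str]]) -> str:
    """1-nearest-neighbour under NCD over (text, label) pairs."""
    return min(labelled, key=lambda item: ncd(sample, item[0]))[1]
```

The appeal is that the "model" is just gzip plus a handful of labelled examples, so it loads instantly and costs almost no disk space.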
