Privacy policies are long. Really long. In an era of digital distrust, sparse internet governance, and opaque corporate data practices, there has never been a better time for consumers to understand exactly what they’re agreeing to.

A 2016 study showed that people will agree to almost anything to get past a wall of verbose legal terms: a shocking 98% of participants didn’t notice that the policy for a mock social media site handed their personal information to the NSA and even required them to surrender their first-born child. As consumers ourselves, we understand why people never read Terms and Conditions (after all, who has the time to read Facebook’s 14,000-word Privacy Policy?).

That’s why we came up with a solution that summarizes complex legalese with nothing more than the click of a button. When consumers visit their favorite site, they can click the PrivacyX button, which redirects them to the Privacy Policy page (which they might otherwise have skipped!). It presents them with an extractive summary in which each sentence is rated ‘thumbs up’ or ‘thumbs down’, with key phrases and actions highlighted. We also include complexity and readability scores to help consumers gauge how reader-friendly the Terms and Conditions are.

What it does

PrivacyX uses Natural Language Processing to condense wordy Privacy Policies into easily digestible summaries. Click the button on the top right of any page to jump straight to that site’s privacy policy page, where a summary with ratings will pop up. The pop-up also appears if you navigate to a privacy policy or terms-and-conditions page manually.

How we built it

The backend server was built and deployed on a cloud virtual machine as a microservice. Its inference endpoints use NLP models fine-tuned for our use case.

Extractive summarization of the T&Cs and policies is done with the unsupervised TextRank algorithm, which identifies the key sentences in the document. We also classify each sentence of the document, much like sentiment analysis, to recognize its polarity; this was done with a Transformer model (BERT) topped with a logistic-regression head, fine-tuned on a labeled dataset for our domain-specific task. Readability and complexity scores are calculated with a publicly available API that applies the Flesch–Kincaid grading method. Finally, a pre-trained Named Entity Recognition model identifies key entities that should be brought to the consumer’s attention, such as the organizations involved and the actions the policy entails.

All of these processes run on our backend server, exposed through a REST API that our front-end communicates with.
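To make the extractive-summarization step concrete, here is a minimal, dependency-free sketch of the TextRank idea: sentences become graph nodes, word-overlap similarity forms weighted edges, and a PageRank-style power iteration scores each sentence. Our actual pipeline uses NLP library tokenizers and tuned parameters; the function names, the Jaccard similarity choice, and the damping value below are illustrative assumptions, not our production code.

```python
import re

def split_sentences(text):
    # Naive splitter on sentence-ending punctuation; a real pipeline
    # would use a trained tokenizer such as nltk's punkt.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    # Jaccard overlap of lowercased word sets as a cheap edge weight.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def textrank_summary(text, n=2, damping=0.85, iters=50):
    sents = split_sentences(text)
    m = len(sents)
    # Pairwise similarity matrix (zero diagonal).
    W = [[similarity(sents[i], sents[j]) if i != j else 0.0
          for j in range(m)] for i in range(m)]
    scores = [1.0 / m] * m
    # PageRank-style power iteration over the similarity graph.
    for _ in range(iters):
        new = []
        for i in range(m):
            rank = 0.0
            for j in range(m):
                out = sum(W[j])
                if W[j][i] > 0 and out > 0:
                    rank += W[j][i] / out * scores[j]
            new.append((1 - damping) / m + damping * rank)
        scores = new
    top = sorted(range(m), key=lambda i: scores[i], reverse=True)[:n]
    # Re-emit the chosen sentences in document order.
    return [sents[i] for i in sorted(top)]
```

Because TextRank is unsupervised, this step needs no labeled data; only the sentence-polarity classifier required a labeled dataset.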

On the client side, we wrote a jQuery userscript for use in a manager such as Tampermonkey or Greasemonkey. It adds a button in the top right corner of the page which, when clicked, filters through all the `<a>` tags on the page to find the link to the privacy policy page. Once that page is open, the script confirms that the user is in the right place and sends an HTTP GET request to the backend with the page URL. The server’s response is parsed and displayed, showing the summary along with ratings for the policy in a responsive layout.
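The actual client is jQuery, but the link-finding heuristic can be sketched in any language; here it is in Python using the standard-library HTML parser. The keyword list and class names are illustrative assumptions, not taken from our userscript.

```python
from html.parser import HTMLParser

# Anchor text we treat as pointing at a policy page (illustrative).
KEYWORDS = ("privacy", "terms")

class PolicyLinkFinder(HTMLParser):
    """Collects <a> links whose visible text mentions a policy keyword."""
    def __init__(self):
        super().__init__()
        self.links = []      # (href, visible text) of candidate links
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip().lower()
            if any(k in text for k in KEYWORDS):
                self.links.append((self._href, text))
            self._href = None

def find_policy_link(html):
    # Return the first matching href, or None if no candidate exists.
    parser = PolicyLinkFinder()
    parser.feed(html)
    return parser.links[0][0] if parser.links else None
```

In the userscript, the equivalent filter runs over `$("a")` and opens the first match; the backend then receives that page’s URL via GET.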

Challenges we ran into

We had no experience building browser extensions and ran into cross-browser compatibility issues, so we opted to develop a userscript for the client side instead. Since this is a relatively untouched area, labeled datasets relevant to our use case are limited.

Accomplishments that we're proud of

  1. The userscript is able to find the privacy policy page on most websites.
  2. The algorithm summarizes quite well even though it was trained on a limited dataset.
  3. DevOps: smooth collaboration between the front-end and back-end teams let us integrate both sides seamlessly. A modular programming style ensured that changes to one team’s code wouldn’t affect the other’s progress.

What we learned

Before even beginning development, we had an important discussion that ultimately led us in the right direction: by splitting client and server development and connecting the two through a REST API, we were able to develop simultaneously, and we learned how important it is for the two dev teams to communicate well.

Being new to hackathons, some of us had no prior experience in software development. We learned best practices and how to communicate between the front-end and back-end teams, often picking up new concepts on the fly, such as NLP packages like nltk. Through the pre-hackathon workshops hosted by Stripe and Microsoft, we learned valuable skills such as the JavaScript Fetch API and Azure’s ML packages.

What's next for PrivacyX

We hope to improve the summarization and scoring, and to build a more detailed dashboard that helps users visualize what happens to their data. Given more time, we could take the project a step further by creating our own labeled dataset, which would be crucial for a more robust model. With such a dataset we could do abstractive text summarization, which can paraphrase complex legalese into easy-to-read sentences. More advanced NLP techniques could refine our summaries further, using everyday vocabulary and even colloquialisms to make them truly user-friendly.

We could also devise a scoring model with more parameters, allowing us to grade privacy policies in a more detailed manner.
