Privacy policies are long. Really long. In an era of digital distrust, sparse internet governance, and opaque corporate data practices, there has never been a more important time for consumers to understand exactly what they’re agreeing to.
What it does
How we built it
The backend server was built and deployed as a microservice on a cloud virtual machine. Its inference endpoints use NLP models fine-tuned for our use case:

- Extractive summarization: the unsupervised TextRank algorithm identifies the key sentences in the T&Cs or policy document.
- Text classification: similar to sentiment analysis, we recognize the polarity of each sentence in the policy. This uses a Transformer model (BERT) with a logistic regression head, fine-tuned on a labelled dataset for our domain-specific task.
- Readability scoring: the document's readability and complexity scores are calculated using a publicly available API based on the Flesch–Kincaid grading method.
- Named entity recognition: a pre-trained NER model identifies key entities in the document that should be brought to the consumer's attention, such as the organizations involved and the actions the policy entails.

All of these processes run on our backend server, which exposes a REST API for communication with the front end.
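To make the extractive step concrete, here is a minimal TextRank-style sketch. It is not our production code: it uses a crude whitespace tokenizer and bag-of-words cosine similarity where the real pipeline's tokenization and similarity measure may differ, but the graph-ranking idea is the same.

```python
import re
from collections import Counter
from math import sqrt

def sentence_similarity(a, b):
    """Cosine similarity over raw bag-of-words counts (crude tokenization)."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(wa[w] * wb[w] for w in set(wa) & set(wb))
    den = (sqrt(sum(v * v for v in wa.values()))
           * sqrt(sum(v * v for v in wb.values())))
    return num / den if den else 0.0

def textrank_summary(text, top_k=2, damping=0.85, iters=50):
    """Rank sentences with a PageRank-style iteration and keep the top k."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sents)
    if n <= top_k:
        return sents
    # Weighted graph: edge (j -> i) carries the sentence similarity.
    sim = [[sentence_similarity(sents[i], sents[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(sim[j][i] / (sum(sim[j]) or 1.0) * scores[j]
                                  for j in range(n))
                  for i in range(n)]
    # Return the top-k sentences in their original document order.
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k])
    return [sents[i] for i in top]
```

Sentences that share vocabulary with many others accumulate rank, which is why repeated themes in a policy (data collection, sharing) surface in the summary.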
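The classification step can be sketched as a logistic regression head trained on frozen sentence embeddings. In this toy version a hashed bag-of-words vector stands in for the BERT encoder, and the labels (1 = user-hostile clause, 0 = benign) are illustrative, not our actual label scheme:

```python
import math

DIM = 32  # embedding dimensionality of the stand-in encoder

def embed(sentence):
    """Stand-in for a frozen BERT encoder: hashed bag-of-words vector."""
    v = [0.0] * DIM
    for w in sentence.lower().split():
        v[hash(w) % DIM] += 1.0
    return v

def train_head(examples, lr=0.5, epochs=300):
    """Fit the logistic regression head by plain gradient descent on log-loss."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for text, y in examples:
            x = embed(text)
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, sentence):
    x = embed(sentence)
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return 1 if p >= 0.5 else 0
```

In the real system the encoder is fine-tuned alongside the head on a labelled policy dataset; only the head is shown here because that is the part that is cheap to illustrate.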
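The Flesch–Kincaid grade level is simple enough to compute locally. This sketch shows the standard formula; the vowel-group syllable counter is a rough heuristic, not the exact method the API we call uses:

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (y included)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
```

Dense legalese, with its long sentences and polysyllabic vocabulary, scores a far higher grade than plain prose, which is exactly the signal we surface to the user.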
On the client side, we wrote a jQuery userscript for use in a manager such as Tampermonkey or Greasemonkey. It adds a button to the top-right corner of the page which, when clicked, sends a GET request to the backend with the page URL. The server's response is then parsed and displayed through a responsive design, showing the summary along with ratings for the policy.
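The contract between the userscript and the backend is just a JSON payload. The field names below are illustrative assumptions, not our actual schema, but they show the kind of response the client parses:

```python
import json

def build_response(url, summary_sentences, polarity_counts, grade, entities):
    """Assemble the JSON body the REST endpoint returns for a policy URL.

    Field names here are hypothetical; the real schema may differ.
    """
    return json.dumps({
        "url": url,
        "summary": summary_sentences,            # key sentences from TextRank
        "sentiment": polarity_counts,            # e.g. {"negative": 4, "neutral": 10}
        "readability_grade": round(grade, 1),    # Flesch-Kincaid grade level
        "entities": entities,                    # e.g. [{"text": "Acme Inc", "label": "ORG"}]
    })
```

Keeping the contract to plain JSON is what let the two teams develop the client and server independently, as described below.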
Challenges we ran into
We had no experience making browser extensions and ran into cross-browser compatibility issues, so we opted to develop a userscript for the client-side software instead. Since this is a relatively untouched area, there are few labelled datasets relevant to our use case.
Accomplishments that we're proud of
- The algorithm is able to summarize quite well even though it was trained on a limited dataset.
- DevOps: smooth collaboration between the frontend and backend dev teams allowed us to integrate both sides seamlessly. We adopted a modular programming style so that changes to one team's code wouldn't affect the other team's progress.
What we learned
Before development even began, we had an important discussion that ultimately led us in the right direction: by splitting client and server development and connecting the two through a REST API, we were able to develop simultaneously. We also learned how important it is for the two dev teams to communicate well.
Being new to hackathons, some of us had no prior experience in software development. We learned best practices and how to coordinate between the front-end and back-end teams, often picking up new concepts on the fly, such as working with NLP packages.
What's next for PrivacyX
We hope to improve the summarization and scoring, and to build a more detailed dashboard that helps users visualize what happens to their data. Given more time, we could take this project one step further by creating our own labelled dataset, which would be crucial for building a more robust model. With such a dataset, we could move to abstractive text summarization, which can paraphrase complex legalese into easy-to-read sentences. More advanced NLP techniques could refine our summaries further, drawing on everyday vocabulary and even colloquialisms to make them truly user-friendly.
We could also devise a scoring model with more parameters, allowing us to grade privacy policies in a more detailed manner.