Inspiration

Tell me the truth! Have you ever read a piece of software's terms and conditions all the way through? Probably not. That is one reason your data can end up shared or sold without your knowledge. Whatever people are doing (creating a new account, downloading software, getting a license), they tick the checkbox without thinking, or at most read the first few sentences. And in a second, your data can vanish into the blackness of the dark web.

Take a few examples:

  1. Apple is notoriously wordy: you will never read 20,000 words before using the iTunes Store.
  2. Facebook splits its Terms & Conditions across multiple pages; we stopped after copying and pasting them into a single 15,000-word document.
  3. Google's terms are shorter, at about 5,000 words, but they are still another way to collect a lot of your data. You might even wonder sometimes: how does Google know what I'm thinking? Simply because they track your data for "IMPROVEMENT PURPOSES" lol XD.

So, after thorough research and brainstorming, we built this app. Light House has entered the chat.

What it does

Light House is an NLP-based website with several features that together make it a complete app for combating data and privacy infringement. The main feature lets you enter a link to a terms and conditions page, or paste the text directly, and the app summarises it so you can actually read and understand it (for scale, roughly 5,000 words are condensed to about 1,000). It also rates the document on 4 components and gives you an overall rating and a Flesch readability score. There is a privacy and data law map as well: when you click on a country, it shows that country's data and privacy laws. Finally, there is a database of previous data breaches that you can browse. Overall, the app is a digital privacy companion :)

How we built it

Our project consists of two parts: the client and the server.

  1. The client is built with HTML, SASS, JavaScript, and Font Awesome.

  2. The server handles all requests from the client and does the processing. It is built with the Django framework and uses SQLite3 as the database.

  3. For NLP, the server applies these methods in a sentence-by-sentence analysis:

    i) Static Keyword Matching: We built our own list of keywords that point to information that matters to the general user.

    ii) Sentence Classes: We defined 5 classes: Privacy, Copyright, Pricing, Content Sharing, and Termination. After extracting the sentences that contain keywords from our corpus, we assign each one to one of these classes. We used the Gensim and NLTK libraries for this.

    iii) Flesch Readability Score: We calculate the Flesch readability score of the ToS using the textstat Python library (graded out of 100).

    iv) Overall Score: After assessing all of the above, the NLP pipeline averages the results to determine the overall score of the terms and conditions (graded out of 5).

  4. It also checks some of the website's main privacy-related practices, namely whether it:

    • Uses cookies
    • Uses and shares personal data
    • Stores data anywhere other than its own database
    • Tracks you

  5. Policy Law Map: We used the amCharts library to plot an interactive map. When you click inside a country's boundary, the app fetches that country's policy laws from the database and shows them to the user.

  6. Previous Data Breaches: We used the Have I Been Pwned API to check whether a particular domain or website has had a data breach in the past and, if so, to show the details of that breach.
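
The sentence-by-sentence analysis in step 3 could look roughly like the minimal sketch below. This is not the project's actual code: the keyword lists, the `CLASS_KEYWORDS` name, and the `split_sentences` and `classify_sentences` helpers are all hypothetical, and a simple regex splitter stands in for the NLTK tokenizer so the example runs without any downloads.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists; the project's real keyword corpus is not shown here.
CLASS_KEYWORDS = {
    "Privacy": ["personal data", "cookies", "third parties", "tracking"],
    "Copyright": ["copyright", "intellectual property", "license"],
    "Pricing": ["fee", "payment", "refund", "subscription"],
    "Content Sharing": ["your content", "upload", "share"],
    "Termination": ["terminate", "suspend", "deactivate"],
}


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (stand-in for NLTK)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def classify_sentences(text):
    """Assign each keyword-bearing sentence to the class with the most hits."""
    result = defaultdict(list)
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Plain substring matching, so "fee" would also match "feed".
        hits = {
            label: sum(kw in lowered for kw in keywords)
            for label, keywords in CLASS_KEYWORDS.items()
        }
        best = max(hits, key=hits.get)
        if hits[best] > 0:  # skip sentences that match no keyword at all
            result[best].append(sentence)
    return dict(result)


sample = (
    "We use cookies and share personal data with third parties. "
    "You will be charged a monthly subscription fee. "
    "We may terminate your account at any time. "
    "Thanks for reading."
)
classified = classify_sentences(sample)
for label, sentences in classified.items():
    print(label, "->", sentences)
```

Sentences with no keyword hits are dropped, which is what keeps the extracted set much shorter than the original document. Ties between classes go to whichever class is listed first, a simplification a real classifier would want to handle more carefully.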

All of these features are chained together into a fully functional web application.
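
For readers unfamiliar with the Flesch score mentioned in step 3, the sketch below spells out the standard Flesch Reading Ease formula. The project itself relies on textstat (which exposes `textstat.flesch_reading_ease(text)`); the `count_syllables` helper here is a deliberately crude, hypothetical vowel-group counter, so its numbers will not match textstat exactly. It is only meant to show what goes into the score.

```python
import re


def count_syllables(word):
    """Crude estimate: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "We keep your data safe. We do not sell it."
hard = (
    "Notwithstanding any contrary provision, the licensee irrevocably "
    "authorizes the unrestricted dissemination of personally identifiable "
    "information to unaffiliated commercial organizations."
)
easy_score = flesch_reading_ease(easy)
hard_score = flesch_reading_ease(hard)
print(f"easy: {easy_score:.2f}, hard: {hard_score:.2f}")
```

Short sentences built from short words score high, while long sentences full of polysyllabic legal terms score low. Typical text lands between 0 and 100, although the raw formula can fall outside that range for extreme inputs.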

Challenges we ran into

The team ran into several challenges. Here they are:

  1. Time zone differences: Because of the large time zone differences between teammates, we ran into a lot of problems, especially on the policy work. Coordinating in time was hard.
  2. Training the NLP: As you probably know, training an NLP model is painful. It was very tedious, and we had to create our own dataset by going through more than 20 different terms and conditions documents.
  3. Cooperating as a team

Accomplishments that we're proud of

  1. Accurate Summarization: We were able to summarize documents and extract the important information.
  2. International Laws Map: We worked with APIs and interactive charts for the first time and learnt a lot.
  3. Main Goal: We believe we accomplished our main goal, at least to some extent, by summarizing long documents, extracting useful insights, and making general users more aware of data privacy.
  4. NLP Techniques: We also learnt new NLP techniques, which we applied in this project to extract the important sentences.

What we learned

  1. We learnt how to use NLP techniques to extract important information from a large amount of text and how to summarize it.
  2. We learnt how to interact with REST APIs and about readability scores such as the Flesch readability score.
  3. We learnt a lot about data privacy laws while researching the topic.
  4. We learnt about more advanced NLP techniques such as LDA topic modelling and TextRank summarization and implemented them, but their output wasn't that good.

What's next for Light House

  1. Try out machine learning and deep learning techniques to get more accurate summaries.
  2. Revamp the UI.
  3. Create a browser extension that activates when the user opens a terms and conditions page, to make the experience smoother.
  4. Add more techniques to improve the analysis and provide better insights.
