Most people lead very busy lives these days, with little time to spend on long newspaper and news articles. This site was designed to help them catch up with the world without spending a ton of time reading full articles!
What it does
It provides a collection of articles, each with a simplified AI-generated summary, a link to the original article, and an audio version of the news that users can listen to as they start their day.
How we built it
We built a Python web server that crawls news sites, generates summaries using AI, and serves the articles to the end user in a simple, clean layout. The app is hosted on Google Cloud App Engine and uses Firebase Cloud Firestore as its NoSQL data store. We also use the Google Cloud Text-to-Speech API to convert the generated summary to audio.
The summarizer we developed is an extractive text summarizer. Extractive text summarization selects phrases and sentences from the source document to make up the new summary. The technique ranks the relevance of phrases and keeps only those most relevant to the meaning of the source. The algorithm uses several features to score the relevance of a sentence:
- TF-IDF: a numerical statistic that reflects how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
- Similarity to headline
- Position of the sentence in the article: this assumes that sentences at the beginning and at the end are more important
- Length of the sentence
All of these features are weighted to produce a single score for each sentence in the article, and the highest-scoring sentences are picked to form the summary.
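The scoring described above can be sketched roughly as follows. This is a minimal illustration and not our exact implementation: the function names, the weights, the ideal sentence length of 20 words, the Jaccard overlap used for headline similarity, and the choice to treat each sentence as its own "document" when computing TF-IDF are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_scores(sentences_tokens):
    """Mean TF-IDF per sentence (assumption: each sentence is a 'document')."""
    n = len(sentences_tokens)
    df = Counter()
    for tokens in sentences_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in sentences_tokens:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        total = sum(
            (count / len(tokens)) * math.log((1 + n) / (1 + df[word]))
            for word, count in tf.items()
        )
        scores.append(total / len(tf))
    return scores


def headline_similarity(tokens, headline_tokens):
    """Jaccard overlap between sentence words and headline words."""
    a, b = set(tokens), set(headline_tokens)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def position_score(index, total):
    """1.0 at the first and last sentence, falling to 0.0 in the middle."""
    if total <= 1:
        return 1.0
    return abs(2.0 * index / (total - 1) - 1.0)


def length_score(num_words, ideal=20):
    """1.0 at the (hypothetical) ideal length, falling linearly to 0.0."""
    return 1.0 - min(abs(num_words - ideal) / ideal, 1.0)


def summarize(headline, text, k=3, weights=(0.4, 0.3, 0.2, 0.1)):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    headline_tokens = tokenize(headline)
    tfidf = tfidf_scores(tokens)
    top = max(tfidf, default=0.0) or 1.0  # normalise TF-IDF into [0, 1]
    w_tfidf, w_head, w_pos, w_len = weights
    scored = []
    for i, toks in enumerate(tokens):
        score = (
            w_tfidf * tfidf[i] / top
            + w_head * headline_similarity(toks, headline_tokens)
            + w_pos * position_score(i, len(sentences))
            + w_len * length_score(len(toks))
        )
        scored.append((score, i))
    best = sorted(scored, reverse=True)[:k]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```

Calling `summarize(headline, article_text, k=3)` returns three sentences from the article. Keeping the selected sentences in their original order helps the summary read naturally.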
Challenges we ran into
We developed a crawler to crawl the web for the latest updates, but data quality and text cleaning became an issue. The text that we crawled had many external links, noise and ads that we couldn't get rid of in the time available. This degraded the quality of the summarizer, so we ended up crawling one specific news site and generating the summaries for it.
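For illustration, a simple line-based cleaning pass of the kind we would have needed might look like the sketch below. It is not code from the project: the function name, the keyword list and the `min_words` threshold are hypothetical, and a heuristic this crude would drop some legitimate lines too.

```python
import re

# Hypothetical patterns for links and common ad/boilerplate phrases.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
AD_RE = re.compile(
    r"\b(advertisement|sponsored|subscribe|sign up|read more)\b",
    re.IGNORECASE,
)


def clean_crawled_text(raw, min_words=5):
    """Strip URLs, then drop very short lines and lines that look like ads."""
    kept = []
    for line in raw.splitlines():
        line = URL_RE.sub("", line)                 # remove external links
        line = re.sub(r"\s+", " ", line).strip()    # collapse whitespace
        if len(line.split()) < min_words:           # menus, captions, stubs
            continue
        if AD_RE.search(line):                      # likely ad or call to action
            continue
        kept.append(line)
    return "\n".join(kept)
```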
Accomplishments that we’re proud of
- We successfully deployed the application to Google Cloud App Engine and used Firebase effectively to host it. This should help us scale the application easily!
- The summarizer efficiently summarizes the articles and presents their most important parts.
- Summary as speech gives busy users an excellent option to listen in, which helps with multitasking!
What we learned
We learned a lot about NLP concepts, hosting an application in Google Cloud, using Firestore as a datastore and much more! Learning to collaborate virtually is also worth a mention :D
What’s next for tech-insights - simplified news site for busy technologists
- Customize user-specific news feed based on categories of interest
- Crawler to dynamically crawl for new tech news
- Try abstractive text summarization, which involves generating entirely new phrases and sentences to capture the meaning of the source document.