Inspiration
I wanted to build a unified dashboard that shows incoming news articles in summarised form. I also wanted a system that extracts the key points of each article, giving a lay reader a good-enough overview.
What it does
Summarises news articles and finds keywords.
How I built it
This project relies on a few interesting algorithms. The summarisation system had to be as fast and lightweight as possible, which was a challenge, but discovering TextRank made it much easier.
This is an extract of the TextRank code, written with the help of tutorials from various websites.
    # Adapted from GeeksForGeeks and AnalyticsVidhya tutorials
    import networkx as nx
    from nltk.corpus import stopwords

    def generate_summary(text, top_n=1):
        """Summarizer workings:
        -> Read and split text
        -> Generate a sim matrix
        -> Rank the sentences using networkx pagerank (google search algorithm used since 1998)
        -> Sort and pick
        EXPLAINED IN GREATER DETAIL BELOW
        """
        stop_words = stopwords.words('english')
        summarize_text = []
        # read_article and build_sim_matrix are helpers not shown in this extract
        sens = read_article(text)
        sen_sim_matrix = build_sim_matrix(sens, stop_words)
        # Treat the similarity matrix as a weighted graph and score each sentence
        sen_sim_graph = nx.from_numpy_array(sen_sim_matrix)
        scores = nx.pagerank(sen_sim_graph)
        ranked_sen = sorted(((scores[i], s) for i, s in enumerate(sens)), reverse=True)
        # Never ask for more sentences than the article contains
        for i in range(min(top_n, len(ranked_sen))):
            summarize_text.append("".join(ranked_sen[i][1]))
        # OUTPUT THE TEXT HERE
        return ". ".join(summarize_text)
TextRank is simple. The first step is to read in the text and split it into sentences. It then builds a similarity matrix, where similarity means cosine similarity (one minus the cosine distance) between each pair of sentences.
To see how cosine similarity works, let's start with two sentences: "The quick brown fox jumps over the lazy dog" and "The fast fox hops over the relaxed dog".
The first step is removing "stopwords". Stopwords are words that carry little meaning on their own and are mostly there for the sake of grammar, such as "the" and "and".
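As a rough illustration of the stopword step, here is a minimal sketch applied to the two example sentences. The `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical and only for demonstration; the project itself uses the much longer list from nltk's `stopwords.words('english')`.

```python
# Hypothetical, tiny stopword list for illustration only.
# The real project uses nltk's stopwords.words('english').
STOP_WORDS = {"the", "and", "over"}


def remove_stopwords(sentence, stop_words=STOP_WORDS):
    """Lowercase the sentence, split on whitespace and drop stopwords."""
    return [word for word in sentence.lower().split() if word not in stop_words]


first = remove_stopwords("The quick brown fox jumps over the lazy dog")
second = remove_stopwords("The fast fox hops over the relaxed dog")

print(first)   # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
print(second)  # ['fast', 'fox', 'hops', 'relaxed', 'dog']
```

Only the content-bearing words are left, which is what the similarity step compares.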
The next step is turning each sentence into a vector (for example, of word counts). Then we compute the cosine similarity for every pair of sentences to create a similarity matrix and apply TextRank.
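To make the vector step concrete, here is a small sketch of cosine similarity on word-count vectors. It assumes the stopwords "the" and "over" have already been removed from the two example sentences, and the `cosine_similarity` helper is a hypothetical stand-in, not the project's own `build_sim_matrix`.

```python
import math


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two sentences given as lists of words."""
    # Shared vocabulary defines the dimensions of both vectors.
    vocab = sorted(set(words_a) | set(words_b))
    vec_a = [words_a.count(word) for word in vocab]
    vec_b = [words_b.count(word) for word in vocab]

    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


# Example sentences with "the" and "over" removed (assumed stopword list).
a = ["quick", "brown", "fox", "jumps", "lazy", "dog"]
b = ["fast", "fox", "hops", "relaxed", "dog"]

similarity = cosine_similarity(a, b)
print(round(similarity, 3))  # 0.365
```

The two sentences share "fox" and "dog", so the dot product is 2 and the similarity is 2 / sqrt(6 * 5). Repeating this for every pair of sentences fills in the similarity matrix.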
TextRank is very similar to Google's PageRank algorithm, except that it ranks sentences instead of web pages. I chose it because of its speed and elegance. The ranking step uses the pagerank function from networkx.
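To show the idea behind the ranking step, here is a simplified power-iteration sketch in plain Python. It is not networkx's implementation and ignores edge cases such as sentences with no similarity to any other; the `rank_sentences` function and the example matrix are hypothetical.

```python
def rank_sentences(sim_matrix, damping=0.85, iterations=100, tol=1e-6):
    """Simplified PageRank over a sentence similarity matrix."""
    n = len(sim_matrix)
    scores = [1.0 / n] * n
    # Total similarity weight leaving each sentence.
    out_weight = [sum(row) for row in sim_matrix]

    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each sentence j passes on its score in proportion to
            # how similar it is to sentence i.
            incoming = sum(
                scores[j] * sim_matrix[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        change = sum(abs(new - old) for new, old in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical similarity matrix for three sentences.
sim = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.4],
    [0.2, 0.4, 0.0],
]
scores = rank_sentences(sim)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 1
```

Sentence 1 is the most similar to the others overall, so it gets the highest score and would be picked first for the summary.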
The second part was the keyword/topic extraction system, which uses LDA (Latent Dirichlet Allocation). This was new to me but I was fascinated by it, so I watched videos to understand it properly.
https://www.youtube.com/watch?v=Cpt97BpI-t4
Challenges I ran into
The NewsAPI only returns part of each article's text.
Accomplishments that I'm proud of
I learned about frontend work and how to make cards in CSS.
What I learned
I learned about APIs. Web design was a challenge since I don't have much experience with it.
What's next for Coronavirus news dashboard
Perhaps extend it to use a better news API.
Built With
- flask
- networkx
- newsapi
- sklearn