CoronaNews

Inspiration

I wanted to make a unified dashboard that shows incoming news articles in summarised form. I also wanted a system that finds the key points of each article, so that a layperson gets a good enough overview.

What it does

Summarises news articles and finds keywords.

How I built it

This project relies on some very interesting algorithms. I needed to make the summarisation system as fast and light as possible, which was a challenge, but discovering TextRank made it easier.

This is an extract of the code for TextRank created with the help of tutorials from various websites.


# Adapted from tutorials on GeeksForGeeks and AnalyticsVidhya
import networkx as nx
from nltk.corpus import stopwords


def generate_summary(text, top_n=1):
    """Summarizer workings:

    -> Read and split text
    -> Generate a sim matrix
    -> Rank the sentences using networkx pagerank (google search algorithm used since 1998)
    -> Sort and pick
    EXPLAINED IN GREATER DETAIL BELOW
    """
    stop_words = stopwords.words('english')
    summarize_text = []

    # read_article and build_sim_matrix are helpers not shown in this extract
    sens = read_article(text)
    sen_sim_matrix = build_sim_matrix(sens, stop_words)
    sen_sim_graph = nx.from_numpy_array(sen_sim_matrix)
    scores = nx.pagerank(sen_sim_graph)

    # Sort on the score only, so tied scores never fall back to comparing sentences
    ranked_sen = sorted(((scores[i], s) for i, s in enumerate(sens)),
                        key=lambda pair: pair[0], reverse=True)

    # Never ask for more sentences than the article contains
    for i in range(min(top_n, len(ranked_sen))):
        summarize_text.append("".join(ranked_sen[i][1]))

    # OUTPUT THE TEXT HERE
    return ". ".join(summarize_text)

TextRank is simple. The first step is to read in the text and split it into sentences. It then builds a similarity matrix that scores every pair of sentences. By similarity I mean cosine similarity, which is one minus the cosine distance.

To dig deeper into cosine similarity, let's start with two sentences: "The quick brown fox jumps over the lazy dog" and "The fast fox hops over the relaxed dog".

The first step is removing stopwords. Stopwords are common words that carry little meaning of their own and are mostly there for the sake of grammar, such as "the" and "and".

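The next step is to turn each sentence into a vector (for example, a count of each word it contains). The cosine similarity of every pair of vectors fills in the similarity matrix, and TextRank is then applied to that matrix.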
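As a rough illustration of these steps, here is a minimal sketch that computes the cosine similarity of the two example sentences using only the standard library. The tiny STOP_WORDS set and the word_counts and cosine_similarity names are assumptions made for this example; the project itself uses NLTK's English stopword list and its own build_sim_matrix helper, which may work differently.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch);
# the project uses NLTK's much longer English list.
STOP_WORDS = {"the", "and", "over"}


def word_counts(sentence, stop_words=STOP_WORDS):
    """Lower-case the sentence, drop stopwords, and count the remaining words."""
    words = sentence.lower().split()
    return Counter(w for w in words if w not in stop_words)


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between the word-count vectors of two sentences."""
    counts_a = word_counts(sentence_a)
    counts_b = word_counts(sentence_b)

    # Both vectors are laid out over the combined vocabulary.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[w] for w in vocabulary]
    vec_b = [counts_b[w] for w in vocabulary]

    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sentence made only of stopwords is similar to nothing
    return dot / (norm_a * norm_b)


first = "The quick brown fox jumps over the lazy dog"
second = "The fast fox hops over the relaxed dog"
print(cosine_similarity(first, second))  # roughly 0.365
```

After the stopwords are removed, the two sentences share only "fox" and "dog", so the score is 2 / (sqrt(6) * sqrt(5)), or about 0.365. Identical sentences score 1 and sentences with no words in common score 0.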

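TextRank is very similar to Google's PageRank algorithm: sentences take the place of web pages and similarity scores take the place of links. I chose it because of its speed and elegance. The code above uses the pagerank function from networkx for the ranking step.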
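To give a feel for what the ranking step does, here is a minimal sketch of PageRank by power iteration over a small similarity matrix. It is not the networkx implementation the project calls. The function name pagerank_scores, the sample matrix, and the parameter values are assumptions made for illustration (0.85 is the conventional damping factor).

```python
def pagerank_scores(sim_matrix, damping=0.85, max_iter=100, tol=1e-8):
    """Rank the nodes of a weighted graph given as a square similarity matrix."""
    n = len(sim_matrix)
    # Total outgoing weight of each sentence, used to normalise its "votes".
    out_weight = [sum(row) for row in sim_matrix]
    scores = [1.0 / n] * n

    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_weight[j] > 0:
                    # Sentence j passes on its score in proportion to similarity.
                    rank += scores[j] * sim_matrix[j][i] / out_weight[j]
                else:
                    # A sentence similar to nothing spreads its score evenly.
                    rank += scores[j] / n
            new_scores.append((1 - damping) / n + damping * rank)

        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical similarity matrix for three sentences (zero on the diagonal).
sim = [
    [0.0, 0.5, 0.1],
    [0.5, 0.0, 0.4],
    [0.1, 0.4, 0.0],
]
scores = pagerank_scores(sim)
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, scores)
```

In this sample the middle sentence is the most similar to the other two, so it receives the highest score and would be picked first for the summary.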

The second part was the keyword/topic extraction system, which uses LDA (Latent Dirichlet Allocation). This was new to me but I was fascinated by it, so I watched videos to understand it properly.

https://www.youtube.com/watch?v=Cpt97BpI-t4