Inspiration
Google, while a wonderful resource for quick-fix questions, starts repeating search results after the third page; this is especially true for news articles. A major cause is duplicate content from syndication agencies such as the Associated Press and the many newspapers that republish its articles, along with reshares and reposts, all of which artificially inflate an article's volume and apparent importance. Reposts and reshares are also often changed slightly so Google does not recognize them as duplicates. The result is inflated importance for some posts (going "viral" unnecessarily) and a noisy Google search experience that can hide more relevant news articles from end users.
What it does
News articles with the same content are identified and associated with each other to prevent inflation of information importance. We attempt to identify duplicate news articles scraped from Google or other internet search results, and to determine which sources those duplicates commonly come from. This information helps the public make sure they are getting the most important and diverse information.
How we built it
To scope this solution, we took 30-50 news articles and posts (an even distribution where possible) and created a hypergraph of duplicate or near-duplicate articles. We then assessed the metadata for duplicate information to produce a similarity score; the most similar articles/posts form the cluster associated with each hyper-node. We set the similarity threshold at 80% (a 0.80 F-score) similarity on metadata fields. Representing metadata similarity between hyper-nodes allows the solution to scale: as new articles/posts are published, their metadata can be queried to determine whether each new item is a duplicate of an existing article/post or a genuinely new one.
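As an illustration of the pairwise scoring step, here is a minimal Python sketch that averages per-field string similarity across metadata fields and applies the 0.80 cutoff. The field names and the use of `difflib` are assumptions for demonstration; the actual model in the project may weigh fields differently.

```python
from difflib import SequenceMatcher

# Hypothetical metadata fields; the project's real schema may differ.
FIELDS = ["title", "author", "source", "description"]
THRESHOLD = 0.80  # the 80% similarity cutoff described above


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized metadata strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def article_similarity(art_a: dict, art_b: dict) -> float:
    """Average similarity across the shared metadata fields."""
    scores = [field_similarity(art_a.get(f, ""), art_b.get(f, "")) for f in FIELDS]
    return sum(scores) / len(scores)


def are_duplicates(art_a: dict, art_b: dict) -> bool:
    """Flag the pair as duplicates if the average score meets the threshold."""
    return article_similarity(art_a, art_b) >= THRESHOLD
```

A syndicated article whose title, author, and description match, with only the source name differing slightly (e.g. "AP" vs. "AP News"), would score above 0.80 and be clustered together.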
The solution has two outputs: the first is the populated hyper-graph and the second is the similarity model, likely a machine learning model. The hyper-graph is scoped to 30-50 hyper-nodes, each with at least 2 duplicate or near-duplicate articles associated with it (a total dataset of 60-100 individual articles/posts and their metadata). Each hyper-node holds the normalized metadata of the articles it represents, the similarity scores among those articles, and the similarity score between hyper-nodes. The machine learning model is open source on GitHub and is flexible enough to be pointed at any news dataset with standard metadata, such as Google News or social media news feeds like Twitter, and allows the similarity threshold to be modified.
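The incremental lookup described above can be sketched as follows. This is a minimal in-memory stand-in for the graph query (in the actual project this runs against the TigerGraph hyper-graph); the node structure and the `similarity` callable are assumptions for illustration.

```python
THRESHOLD = 0.80  # same similarity cutoff used when building the graph


def ingest(hyper_nodes: list, new_article: dict, similarity) -> dict:
    """Attach new_article to the best-matching hyper-node, or create a new one.

    `similarity` is any callable returning a score in [0, 1] for two
    metadata dicts (e.g. the similarity model described above).
    """
    best_node, best_score = None, 0.0
    for node in hyper_nodes:
        score = similarity(node["metadata"], new_article)
        if score > best_score:
            best_node, best_score = node, score

    if best_node is not None and best_score >= THRESHOLD:
        # The article is a duplicate of an existing story.
        best_node["articles"].append(new_article)
        return best_node

    # Otherwise it is a genuinely new story: create a fresh hyper-node.
    node = {"metadata": new_article, "articles": [new_article]}
    hyper_nodes.append(node)
    return node
```

Because each new article is compared against hyper-node metadata rather than every individual article, the check stays proportional to the number of distinct stories, not the total number of posts.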
Challenges we ran into
This was our first time using TigerGraph, so it wasn't easy to handle, but we managed it anyway.
Accomplishments that we're proud of
With TigerGraph we are able to: 1) identify duplicates in a static news dataset in the graph, 2) identify whether a new article is a duplicate of an existing article in the graph, 3) enable others to use the similarity model on their own news datasets, and 4) allow a search engine to traverse the graph and retrieve a hyper-node (and the articles/posts it relates to) for display.
What we learned
We learned that a graph database runs faster than the alternatives we tried. Truth be told, we didn't know we could use machine learning algorithms in TigerGraph until we discovered that it includes everything related to graph analytics, which is awesome.
What's next for ReduceNoiseOfNewsSearch
The next step for our solution is to develop something like Connected Papers or Google Scholar to represent articles through search. End users will then be able to use ReduceNoiseOfNewsSearch directly.
Built With
- gsql-graph-algorithms
- gsql-language
- newsapi.org
- python
- tgcloud.io
- tigergraph