Inspiration
Google, while a wonderful resource for quick-fix questions, starts repeating search results after the third page; this is especially true for news articles. A major cause is duplicate content from syndication agencies such as the Associated Press and the many newspapers that republish its articles, along with reshares and reposts, all of which artificially inflate an article's volume and apparent importance. Reposts and reshares are also often changed slightly so Google does not recognize them as duplicates. The result is inflated importance for some posts (going "viral" unnecessarily) and a noisy Google search experience that can hide more relevant news articles from end users.
What it does
News articles with the same content are identified and associated with each other to prevent inflation of information importance. We attempt to identify duplicate news articles scraped from Google or other internet search results, and to determine which sources those duplicates commonly come from. This information helps the public make sure they are getting the most important and diverse information.
How we built it
To scope this solution, we took 30-50 news articles and posts (an even distribution where possible) and created a hypergraph of duplicate or near-duplicate articles. We then assessed the metadata for duplicate information to produce a similarity score; the most similar articles/posts form the cluster associated with each hyper-node. We set the similarity threshold at 80% (a 0.80 F-score) similarity on metadata fields. Representing metadata similarity between hyper-nodes allows the solution to scale: as new articles/posts are published, their metadata can be queried to determine whether each new item is a duplicate of an existing article/post or a genuinely new one.
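As an illustration of the pairwise scoring step, here is a minimal Python sketch that averages per-field string similarity across metadata fields and applies the 0.80 cutoff. The field names and the use of `difflib` are assumptions for demonstration; the actual model in the project may weigh fields differently.

```python
from difflib import SequenceMatcher

# Hypothetical metadata fields; the project's real schema may differ.
FIELDS = ["title", "author", "source", "description"]
THRESHOLD = 0.80  # the 80% similarity cutoff described above


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized metadata strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def article_similarity(art_a: dict, art_b: dict) -> float:
    """Average similarity across the shared metadata fields."""
    scores = [field_similarity(art_a.get(f, ""), art_b.get(f, "")) for f in FIELDS]
    return sum(scores) / len(scores)


def are_duplicates(art_a: dict, art_b: dict) -> bool:
    """Flag the pair as duplicates if the average score meets the threshold."""
    return article_similarity(art_a, art_b) >= THRESHOLD
```

A syndicated article whose title, author, and description match, with only the source name differing slightly (e.g. "AP" vs. "AP News"), would score above 0.80 and be clustered together.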
The solution has two outputs: the first is the populated hyper-graph and the second is the similarity model, likely a machine learning model. The hyper-graph is scoped to 30-50 hyper-nodes, each with at least 2 duplicate or near-duplicate articles associated with it (a total dataset of 60-100 individual articles/posts and their metadata). Each hyper-node holds the normalized metadata of the articles it represents, the similarity scores among those articles, and the similarity score between hyper-nodes. The machine learning model is open source on GitHub and is flexible enough to be pointed at any news dataset with standard metadata, such as Google News or social media news feeds like Twitter, and allows the similarity threshold to be modified.
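The incremental lookup described above can be sketched as follows. This is a minimal in-memory stand-in for the graph query (in the actual project this runs against the TigerGraph hyper-graph); the node structure and the `similarity` callable are assumptions for illustration.

```python
THRESHOLD = 0.80  # same similarity cutoff used when building the graph


def ingest(hyper_nodes: list, new_article: dict, similarity) -> dict:
    """Attach new_article to the best-matching hyper-node, or create a new one.

    `similarity` is any callable returning a score in [0, 1] for two
    metadata dicts (e.g. the similarity model described above).
    """
    best_node, best_score = None, 0.0
    for node in hyper_nodes:
        score = similarity(node["metadata"], new_article)
        if score > best_score:
            best_node, best_score = node, score

    if best_node is not None and best_score >= THRESHOLD:
        # The article is a duplicate of an existing story.
        best_node["articles"].append(new_article)
        return best_node

    # Otherwise it is a genuinely new story: create a fresh hyper-node.
    node = {"metadata": new_article, "articles": [new_article]}
    hyper_nodes.append(node)
    return node
```

Because each new article is compared against hyper-node metadata rather than every individual article, the check stays proportional to the number of distinct stories, not the total number of posts.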
Challenges we ran into
This was our first time using TigerGraph, so it wasn't easy to handle, but we managed it anyway.
Accomplishments that we're proud of
With TigerGraph we are able to: 1) identify duplicates in a static news dataset in the graph, 2) identify whether a new article is a duplicate of an existing article in the graph, 3) enable others to use the similarity model on their own news datasets, and 4) allow a search engine to traverse the graph and retrieve a hyper-node (and the articles/posts it relates to) for display.
What we learned
We learned that a graph database runs faster than the alternatives we tried. Truth be told, we didn't know we could use machine learning algorithms in TigerGraph until we discovered that it includes everything related to graph analytics, which is awesome.
What's next for ReduceNoiseOfNewsSearch
The next step for our solution is to develop something like Connected Papers or Google Scholar to represent articles through search. End users will then be able to use ReduceNoiseOfNewsSearch directly.
Built With
- gsql-graph-algorithms
- gsql-language
- newsapi.org
- python
- tgcloud.io
- tigergraph