This project uses code, data, and resources that were not made by me or any collaborators. All non-original works (including code, works of art, writings, data, and resources) are the copyright of their original owners and are used here for non-commercial, open-source purposes.


-Document Collection Analysis for Microsoft
-Microsoft Azure ML Studio
-Bing Custom Search
-Bing News API
-Python 3.6

How We Came Up With The Idea

Fake news is a real issue, and we decided to tackle it head on. Rather than classify talking points that recur across sources as "facts", we treat them as consensus information. If a majority of the scraped sources share similar ideas, we label the data as being in consensus.
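
The majority rule described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `label_consensus`, the `threshold` parameter, and the assumption that each scraped source has already been reduced to a set of talking points are all hypothetical.

```python
from collections import Counter


def label_consensus(source_points, threshold=0.5):
    """Label each talking point as 'consensus' or 'no consensus'.

    source_points: list of sets, one set of talking points per scraped
    source (a hypothetical input format, assumed for this sketch).
    threshold: fraction of sources a point must exceed to count as a majority.
    """
    if not source_points:
        return {}
    counts = Counter()
    for points in source_points:
        counts.update(set(points))  # count each point at most once per source
    total = len(source_points)
    return {
        point: "consensus" if count / total > threshold else "no consensus"
        for point, count in counts.items()
    }


# Made-up example data: three sources, one shared talking point.
sources = [
    {"rates raised", "markets fell"},
    {"rates raised"},
    {"rates raised", "ceo resigned"},
]
labels = label_consensus(sources)
```

Here "rates raised" appears in all three sources and is labeled as consensus, while the points that appear in only one source are not. Deciding when two differently worded talking points count as "similar" is the hard part, and this sketch leaves it out.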


None of us are machine learning specialists (far from it!), so everything we did was completely new to us. We come to hackathons to learn something new, not to do what we are comfortable with. We also didn't know how to build a web scraper, or how to link any of the components together. While we don't want to complain, traveling from California did take its toll: the time difference and red-eye jet lag left us tired at odd times.


We plan to keep working on this after the hackathon. A project of this scope needs much more than 36 hours to get right. Additionally, the ML model we used from Microsoft is documented for a deprecated framework. Before going any further, we are excited to migrate the algorithm completely to Azure ML Studio, so that others can start building with it with less prep time than we needed; we spent countless hours learning how to configure it.

Link to the Microsoft DOCS site

The detailed documentation for this real world scenario includes the step-by-step walkthrough:

Link to the Gallery GitHub repository

The public GitHub repository for this real world scenario contains all materials, including code samples, needed for this example:

This scenario demonstrates how to summarize and analyze a large collection of documents, including techniques such as phrase learning, topic modeling, and topic model analysis using Azure ML Workbench. Azure Machine Learning Workbench makes it easy to scale up to very large document collections and provides mechanisms to train and tune models within a variety of compute contexts, ranging from local compute to Data Science Virtual Machines to Spark clusters. Jupyter notebooks within Azure Machine Learning Workbench make development easy.
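
To give a rough feel for what phrase learning means, here is a minimal sketch that merges adjacent word pairs into a single token when they occur often enough. It is a simple frequency-based stand-in written for illustration, not the algorithm the Microsoft scenario uses; the function name `learn_phrases` and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def learn_phrases(docs, min_count=2):
    """Merge adjacent word pairs seen at least min_count times into one token."""
    tokenized = [doc.lower().split() for doc in docs]

    # Count every adjacent word pair across the whole collection.
    pair_counts = Counter()
    for tokens in tokenized:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    # Rewrite each document, joining frequent pairs with an underscore.
    merged_docs = []
    for tokens in tokenized:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Made-up example documents.
docs = [
    "machine learning is fun",
    "we study machine learning",
    "learning takes time",
]
result = learn_phrases(docs)
```

In this toy collection only the pair "machine learning" occurs twice, so it becomes the single token `machine_learning`, and a later topic model would see it as one term instead of two unrelated words.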


The prerequisites to run this example are as follows:


This advanced scenario for Document Collection Analysis collects usage data and sends it to Microsoft to help improve our products and services. Read our privacy statement to learn more.


This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (for example, label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact with any additional questions or comments.

