Inspiration
In recent days and weeks, news emanating from the COVID-19 pandemic has been, well, bleak. Just like a lighthouse on the water, Beacon hopes to be a beacon of light in the darkness by providing some of the more uplifting news articles about COVID-19 to people around the world.
What it does
Beacon scrapes the web for positive news articles related to the Coronavirus pandemic and filters those articles using sentiment analysis and machine learning. It then regularly checks Johns Hopkins University's open data set to determine the number of people who have recovered from COVID-19. Every time an additional 10,000 people recover, Beacon sends a newsletter of positive articles regarding the outbreak to anyone who would like to add a smile to their day.
How I built it
Beacon is built mainly in Python 3, leveraging several external libraries to get the job done. Web scraping is conducted with Selenium, and article parsing is made very simple thanks to the newspaper3k library, which can be found at https://newspaper.readthedocs.io/en/latest/.
Sentiment analysis is handled by the TextBlob library, and machine learning is conducted with the help of the scikit-learn library.
All of these technologies come together as a Google search is first executed looking for positive articles. I like to refer to this use of Google as a "stage 0 filter." The list of links returned by the search is then compiled into a Python list object. Once this is complete, Beacon iterates over every link, retrieving the source HTML of each article page. This source code is parsed with the newspaper3k library to extract only the article's text and title. Sentiment analysis is then conducted to calculate the polarity (i.e. the positivity, or lack thereof) of the article and its title.
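A minimal sketch of that pipeline is below. It assumes a Chrome WebDriver is available, and the helper names (`collect_result_links`, `score_article`) are illustrative rather than taken from the actual project:

```python
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By
from newspaper import Article
from textblob import TextBlob


def collect_result_links(query):
    """Stage 0 filter: let Google surface candidate articles, then grab every link on the results page."""
    driver = webdriver.Chrome()
    driver.get("https://www.google.com/search?q=" + quote_plus(query))
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    driver.quit()
    return [link for link in links if link]  # drop empty hrefs


def score_article(url):
    """Parse one article with newspaper3k and return its title plus title/text polarity from TextBlob."""
    article = Article(url)
    article.download()
    article.parse()
    title_polarity = TextBlob(article.title).sentiment.polarity
    text_polarity = TextBlob(article.text).sentiment.polarity
    return article.title, title_polarity, text_polarity
```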
After polarity is calculated, all articles with a negative polarity are dropped from the set to be inspected. This is what I like to refer to as my "stage 1 filter," and it helps with both the accuracy and the efficiency of the program. The combination of an article's title and text polarities is then used as the main feature set for a machine learning algorithm, Gaussian Naive Bayes (GNB), which classifies the article as either positive or not. I refer to this final refinement step as my "stage 2 filter."
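A sketch of those last two filters, assuming the two-feature (title polarity, text polarity) representation described above and the hypothetical `score_article` output from the earlier snippet:

```python
from sklearn.naive_bayes import GaussianNB


def train_classifier(training_rows):
    """training_rows: list of ([title_polarity, text_polarity], label) pairs, where 0 = positive, 1 = negative."""
    X = [features for features, _ in training_rows]
    y = [label for _, label in training_rows]
    clf = GaussianNB()
    clf.fit(X, y)
    return clf


def filter_positive(scored_articles, clf):
    """scored_articles: list of (title, title_polarity, text_polarity) tuples."""
    kept = []
    for title, title_pol, text_pol in scored_articles:
        # Stage 1 filter: drop anything with a negative polarity outright.
        if title_pol < 0 or text_pol < 0:
            continue
        # Stage 2 filter: keep only articles the GNB model labels as positive (class 0).
        if clf.predict([[title_pol, text_pol]])[0] == 0:
            kept.append(title)
    return kept
```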
Once a list of positive articles is compiled, these articles are added to an email message crafted in HTML.
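The email itself can be assembled with Python's standard library. The sketch below assumes a Gmail SMTP server and a simple (title, url) list; the real newsletter layout and mail provider may differ:

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def send_newsletter(articles, sender, password, recipients):
    """Build a simple HTML digest of (title, url) pairs and send it via SMTP over SSL."""
    items = "".join(f'<li><a href="{url}">{title}</a></li>' for title, url in articles)
    html = f"<html><body><h2>Some good news about COVID-19</h2><ul>{items}</ul></body></html>"

    msg = MIMEMultipart("alternative")
    msg["Subject"] = "Beacon: 10,000 more recoveries"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.attach(MIMEText(html, "html"))

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.sendmail(sender, recipients, msg.as_string())
```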
Every 24 hours, Beacon checks the open source data provided by Johns Hopkins University to determine how many people have recovered from COVID-19 since the data was last checked. If 10,000 additional people have recovered from the disease, an email blast with positive news articles regarding the global pandemic is sent to the mailing list.
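Roughly, the daily check could look like the following. The CSV path is my assumption about which JHU CSSE file is used, and the bookkeeping of the last-seen count is simplified:

```python
import time

import pandas as pd

# Assumed location of the JHU CSSE recovered-cases time series; the real project may read a different file.
JHU_RECOVERED_CSV = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"
)


def total_recovered():
    """Sum the most recent date column across all countries."""
    df = pd.read_csv(JHU_RECOVERED_CSV)
    return int(df.iloc[:, -1].sum())


def run_daily(last_count):
    """Every 24 hours, compare the new total against the last one and fire the newsletter on +10,000."""
    while True:
        current = total_recovered()
        if current - last_count >= 10_000:
            # send_newsletter(...) would be called here with the latest positive articles
            last_count = current
        time.sleep(24 * 60 * 60)
```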
Challenges I ran into
The first odd challenge I ran into during this project was the presence of invisible links in Google search results. Beacon first executes a Google search via Selenium and finds all the links on the page. To avoid analyzing random Google support links, I simply drop any link containing the word "Google" from the set to be analyzed.
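In code this amounts to a single list-comprehension filter over the scraped hrefs (variable names here are illustrative):

```python
# Drop Google's own navigation/support links before any articles are analyzed.
article_links = [link for link in links if "google" not in link.lower()]
```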
Ironically, the word "positive" itself does not indicate a good or uplifting article. This was an issue especially when analyzing articles about testing for the virus: for example, "Boris Johnson tests positive for COVID-19" is not good news. This was mostly handled by expanding the training set used for GNB via more web scraping.
Somewhat unsurprisingly, some sites do not like being scraped and throw 403 Forbidden errors when Beacon attempts to get their contents. Repeated requests to the same site often caused this error. To remedy the situation, I simply wrapped all GET requests in try statements so that failed requests fail gracefully.
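The pattern is the same regardless of which library actually performs the download; the sketch below uses `requests` purely for illustration:

```python
import requests


def fetch_html(url):
    """Return page HTML, or None if the site blocks us (e.g. 403 Forbidden) or the request fails."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException:
        return None  # fail gracefully; the caller simply skips this article
```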
Another tough issue is the lack of COVID-19-related news articles pre-classified as positive or negative. As a quick fix so that I could use the Naive Bayes classification algorithm, I automatically scraped articles from positive and negative Google searches, labeling those from a positive search as a '0' and those from a negative search as a '1'. Although this approach can be faulty, I found that a large enough data set allowed for accuracy at an acceptable level.
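Roughly how that auto-labeling might look, reusing the hypothetical `collect_result_links` and `score_article` helpers sketched earlier (the query strings are placeholders, not the project's actual searches):

```python
def build_training_set():
    """Label articles by the search that produced them: 0 for a 'positive' query, 1 for a 'negative' query."""
    rows = []
    for query, label in [("positive coronavirus news", 0), ("negative coronavirus news", 1)]:
        for url in collect_result_links(query):
            try:
                _, title_pol, text_pol = score_article(url)
                rows.append(([title_pol, text_pol], label))
            except Exception:
                continue  # skip articles that fail to download or parse
    return rows
```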
What I learned
With no prior experience in web scraping, I was able to successfully pick up a large chunk of the sub-field in a short amount of time. I learned Selenium and was able to put it to good use quickly.
Before this project I had only ever heard of sentiment analysis; I learned the basics of NLP and sentiment analysis in order to complete it.
Aside from these new skills, I also learned how to send emails in Python, perform HTTP requests more properly, and run jobs on a timer.
What's next for Beacon: COVID-19 Optimism Bot
I believe higher-volume web scraping with a dedicated server will allow for a more accurate model. Combining this with testing of other supervised algorithms, such as support vector machines and linear regression models, should yield the most accurate model possible. GNB was employed because research suggested it offered the best mix of complexity (low) and accuracy (high) for the given problem. At least, it did for a group of Stanford undergrads!
Built With
- ai
- csv
- machine-learning
- newspaper
- pandas
- python
- scikit-learn
- selenium
- textblob