Analysis of Word Choices in News Articles

Inspiration

News media are important sources of information on the Covid-19 pandemic: they report safety guidelines, health data, government actions, and public opinion. Yet news sources may cover different topics or change their writing style based on their political affiliation to cater to their audience. The goal of our project is to discover whether news sources truly differ in their reporting in order to further their political affiliations’ agenda. The ramifications of this type of reporting include an increased social divide among the public.

What it does

Classifies the news source of an article and prints the accuracy score (84%).
Finds the most frequent words from each news source and produces word clouds.
Tests whether the differences in average term frequencies between news sources are statistically significant.
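The word counting and significance check can be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not our actual code: the sample sentences are made-up placeholders, the stop-word list is deliberately tiny, the helper names (tokenize, top_words, term_frequency, welch_t) are hypothetical, and Welch's t-test is an assumed choice of test.

```python
import math
import re
from collections import Counter
from statistics import mean, variance

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def top_words(articles, n=5):
    """Return the n most frequent words across a list of article strings."""
    counts = Counter()
    for text in articles:
        counts.update(tokenize(text))
    return counts.most_common(n)

def term_frequency(text, term):
    """Share of an article's tokens that equal `term`."""
    tokens = tokenize(text)
    return tokens.count(term) / len(tokens) if tokens else 0.0

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

# Made-up placeholder articles, NOT scraped data.
source_a = [
    "vaccine rollout expands as vaccine supply grows",
    "officials urge vaccine boosters",
    "vaccine mandate debate continues",
]
source_b = [
    "lockdown rules spark protest",
    "economy reopens after lockdown",
    "new vaccine study released",
]

most_common_a = top_words(source_a, n=1)
tf_a = [term_frequency(t, "vaccine") for t in source_a]
tf_b = [term_frequency(t, "vaccine") for t in source_b]
t_stat = welch_t(tf_a, tf_b)
```

The t statistic alone does not give a p-value; with SciPy installed, scipy.stats.ttest_ind(tf_a, tf_b, equal_var=False) would return both.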

How we built it

We used a TF-IDF vectorizer and a Random Forest classifier to build the machine learning model, word clouds to visualize the most frequent words, and catplots to visualize variability in our data.
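A minimal sketch of how a TF-IDF vectorizer and a Random Forest can be combined in scikit-learn is shown here. The use of scikit-learn, the pipeline layout, the hyperparameters, and the toy corpus with its labels are all assumptions for illustration; they are not our exact configuration or data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy placeholder corpus and labels (assumption), not the scraped articles.
texts = [
    "vaccine rollout expands as supply grows",
    "officials urge vaccine boosters this winter",
    "vaccine mandate debate continues in congress",
    "lockdown rules spark protest downtown",
    "economy reopens after long lockdown",
    "business owners criticize lockdown extension",
]
labels = ["source_a", "source_a", "source_a",
          "source_b", "source_b", "source_b"]

# The vectorizer turns each article into TF-IDF weights; the forest
# then learns which weighted terms separate the two sources.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

prediction = model.predict(["new vaccine boosters arrive"])[0]
classes = list(model.named_steps["randomforestclassifier"].classes_)
```

Keeping both steps in one pipeline means the same vocabulary fitted on the training articles is reused when new text is classified.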

Challenges we ran into

It was hard to scrape COVID articles from CNN because the website restricts which parts of a page's content we are allowed to scrape. The Random Forest algorithm was hard to understand. Our classification model overfit initially.

Accomplishments that we're proud of

We successfully built a classification model with 84% accuracy, made cool word clouds, and applied our statistical knowledge.

What we learned

We learned more about how to scrape precise parts of a page, how to modularize our code with clean formatting and concise comments, and how to use a lot of different ML libraries.

What's next for Analysis of Word Choices in News Articles

We want to be able to pass a random article's URL to our classification model and let it guess whether it comes from Fox or CNN. We could also use more NLP techniques to further break down the differences in writing style between the two sources.
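One possible shape for the URL idea is sketched below. It assumes the newspaper library's Article class for downloading and parsing, and a fitted model with a scikit-learn style predict method; the function names fetch_article_text and guess_source are hypothetical, and none of this is implemented yet.

```python
def fetch_article_text(url):
    """Download and parse one article with the newspaper library (needs network access)."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from newspaper import Article
    article = Article(url)
    article.download()
    article.parse()
    return article.text

def guess_source(url, model, fetch_text=fetch_article_text):
    """Return the model's predicted source label for the article at `url`.

    `model` is assumed to expose predict(list_of_texts), as a fitted
    scikit-learn pipeline does. `fetch_text` can be swapped for testing.
    """
    text = fetch_text(url)
    if not text.strip():
        raise ValueError(f"No article text extracted from {url}")
    return model.predict([text])[0]
```

Passing the fetcher as a parameter keeps the scraping step replaceable, which helps when a site restricts what can be scraped.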

Built With

beautiful-soup
classification
newspaper
nltk
python
statistics
wordcloud

Updates

Johnny He started this project — Mar 16, 2022 07:38 PM EDT
