Bias in News Media

Political Bias Classifier – MAIS 202 Winter 2025

This project presents a tool to help users characterize the underlying bias of news articles they read. It uses a machine-learning model trained on a dataset of over 3,400 labeled articles to automatically classify underlying political bias (left, leaning left, center, right, leaning right) in a given piece of media.

📌 Project Motivation

Being able to identify fact vs. opinion has become more difficult in the current media landscape, and underlying political bias, even in articles not obviously political in nature, can be difficult to identify. In this context, we wanted to offer tools to empower readers to more confidently use information drawn from the media. This project aims to provide such a tool, using natural language processing (NLP) and deep learning to help readers identify bias in news texts.

📁 Dataset

We used a Kaggle dataset containing:

Title – Headline of the article
Text – Full content
Bias – Label (left, lean left, center, lean right, right)

Link: Kaggle Political Bias Dataset

🧠 Methodology

Text Preprocessing: Cleaning, tokenization, BERT embeddings
Embeddings: Generated using bert-base-uncased via HuggingFace Transformers
Model: Custom PyTorch neural network (BERT_Arch) trained on embeddings
Training Split: 70/15/15 train/validation/test using train_test_split

📊 Evaluation Metrics

Mean Squared Error (MSE): 2.67
F1 Scores: Between 0.55 and 0.71 across classes
Additional evaluation via confusion matrix, accuracy, and recall

🛠️ Tech Stack

Task	Tools Used
Data processing	pandas, scikit-learn
NLP	HuggingFace Transformers (BERT)
Modeling	PyTorch
Evaluation	scikit-learn metrics
Embedding generation	BERT (bert-base-uncased)

Screenshot 2025-04-08 at 7 59 18 PM

We generated this confusion matrix using sklearn.metrics after several epochs of training. This chart makes it clear that there are some issues with our dataset. Namely, upon reviewing the article bias sheet again, it became clear that the vast majority of entries (1838 out of 3371) being classified as left, with an additional 519 entries being classified as leaning left. The effect of this is displayed in the confusion matrix above, as the model has learned that its best strategy is to almost always assume a left label, with only minor allowance given to right labeled articles whose contents are exceedingly dissimilar. To attain better results, it might be preferable to either work with a dataset with a broader array of articles, or to better tailor our training data to accommodate for this discrepancy.

Built With

bert
python

Updates

rtsej Sejnoha started this project — Apr 09, 2025 02:27 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.