Introduction:
Bias infiltrates many aspects of spoken and written language. In politics, media outlets and individual journalists often have strong political motivations that can lead to biased sentences in their published work. In the current polarized political climate, the volume of media produced by both conservative and liberal outlets has risen greatly, which makes automated detection of political bias a useful challenge for NLP researchers. In the paper we are implementing, Political News Bias Detection using Machine Learning, the author, Minh Vu, sets out to build a multilayer perceptron (MLP) model that characterizes articles by their percentages of liberal, conservative, and neutral sentences. This is an opinion classification problem. We believe this area of research matters because the political biases we are fed can shape our own internalized biases, which can in turn lead us to consume more political media with those same biases. It is therefore important to be able to recognize the biases present in the content we consume.
Challenges:
So far, the hardest parts of the project have been preprocessing, settling on a data structure, and deciding which parts of the data to keep and which to discard. Our data came as a pickled file organized as parse trees. After the initial extraction, we had a list of “trees”, where each tree's nodes hold the phrases of one sentence. Each sentence had a label, but so did each sub-phrase. One of the more confusing aspects was that a sentence and its phrases could have different labels (e.g., a sentence coded as liberal could contain sub-phrases coded as conservative). We therefore decided to focus on phrases for the initial training portion of the project, and we plan to return to sentences later for comparison if time allows. We extracted only the labeled phrases into one large corpus: a list of length len(corpus) in which each element is a pair holding a phrase (string) and its accompanying label.
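The phrase extraction described above could look roughly like the sketch below. The Node class is a hypothetical stand-in for the dataset's tree nodes (the real pickled objects may expose different attributes), and the label values other than 0 for liberal are assumed for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical parse-tree node; not the dataset's actual class."""
    text: str                      # the words covered by this node
    label: Optional[int] = None    # None means the phrase was not annotated
    children: List["Node"] = field(default_factory=list)


def labeled_phrases(root: Node) -> list:
    """Collect [phrase, label] pairs from one tree, skipping unlabeled nodes."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.label is not None:
            pairs.append([node.text, node.label])
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.children))
    return pairs


# Assumed encoding for this example: 0 = liberal, 1 = conservative, 2 = neutral.
tree = Node("A B C", 0, [
    Node("A"),                             # unlabeled, so it is dropped
    Node("B C", 1, [Node("C", 2)]),
])
corpus = labeled_phrases(tree)
```

Running this over every tree and concatenating the results would give the flat phrase corpus. Note that a sentence-level label and the labels of its sub-phrases are kept independently, so they are free to disagree.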
Insights:
To preprocess our data, we had to decide how best to separate it into training and testing splits. The IBC dataset is organized so that each annotated sentence is composed of annotated phrases (liberal, conservative, or neutral); however, not all sentences are annotated, nor are all phrases. We initially organized the data as a dictionary mapping each labeled sentence to a list of its labeled phrases, but we recognized that this was not the most effective way to make use of all the data. Instead, we decided to look solely at phrases, which increases the amount of data we have to work with. We settled on a 90/10 split of training to testing data, which translates to 20,358 phrases in the training dataset and 2,263 in the testing dataset. Additionally, although previous implementations have discarded neutral-labeled phrases and sentences, we decided it was better to keep them and introduce a third class. Here is an example of three phrases from the testing dataset, all originating from the same sentence, that are annotated as liberal (label 0).
[['Forcing middle-class workers to bear a greater share of the cost of government weakens their support for needed investments and stirs resentment toward those who depend on public services the most .', 0], ['weakens their support for needed investments and stirs resentment toward those who depend on public services the most', 0], ['stirs resentment toward those who depend on public services the most', 0]]
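A 90/10 split like the one described above can be sketched as follows. This is only an illustration: the function name, the fixed seed, and the choice to shuffle at the phrase level are assumptions, not a record of how the split was actually produced.

```python
import random


def split_corpus(corpus, train_fraction=0.9, seed=0):
    """Shuffle a copy of the corpus and cut it into (train, test) lists."""
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder corpus with the same total size as the two reported counts
# (20,358 + 2,263 = 22,621); the phrases and labels here are dummies.
dummy_corpus = [[f"phrase {i}", i % 3] for i in range(22621)]
train, test = split_corpus(dummy_corpus)
```

With 22,621 phrases, truncating 90% to an integer gives 20,358 training phrases and leaves 2,263 for testing, which matches the counts reported above.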
Plan:
Now that our data is split, labeled, and processed, we are ready to begin training our model. We will build a multilayer perceptron, along with a novel embedding matrix that turns text into the vectors the model takes as input. Our data comes as both sentences and phrases. We plan to train on phrases first and, if time allows, go back and train on sentences to compare performance. We are choosing phrases because they give us a larger corpus with more specific text.
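To make the plan concrete, here is a minimal, untrained sketch of how an embedding matrix could feed a multilayer perceptron with three output classes. Everything in it is an assumption made for illustration: the whitespace tokenizer, averaging token embeddings into one phrase vector, the single ReLU hidden layer, and the layer sizes. It shows only the forward pass, not the training loop, and is not the architecture from the paper.

```python
import math
import random


def build_vocab(phrases):
    """Map each lowercase token to a row index; row 0 is for unknown words."""
    vocab = {"<unk>": 0}
    for phrase in phrases:
        for token in phrase.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def init_matrix(rows, cols, rng):
    """Small random weights, stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed(phrase, vocab, embedding):
    """Average the embedding rows of the phrase's tokens into one vector."""
    rows = [embedding[vocab.get(t, 0)] for t in phrase.lower().split()]
    if not rows:
        return [0.0] * len(embedding[0])
    return [sum(col) / len(rows) for col in zip(*rows)]


def dense(x, weights, bias):
    """Affine layer; weights has shape len(x) x len(bias)."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(phrase, vocab, params):
    """Embedding average -> ReLU hidden layer -> softmax over 3 classes."""
    x = embed(phrase, vocab, params["embedding"])
    hidden = [max(0.0, v) for v in dense(x, params["w1"], params["b1"])]
    return softmax(dense(hidden, params["w2"], params["b2"]))


rng = random.Random(0)
phrases = ["stirs resentment toward those who depend on public services the most"]
vocab = build_vocab(phrases)
EMBED_DIM, HIDDEN, CLASSES = 8, 16, 3   # illustrative sizes only
params = {
    "embedding": init_matrix(len(vocab), EMBED_DIM, rng),
    "w1": init_matrix(EMBED_DIM, HIDDEN, rng), "b1": [0.0] * HIDDEN,
    "w2": init_matrix(HIDDEN, CLASSES, rng), "b2": [0.0] * CLASSES,
}
probs = forward(phrases[0], vocab, params)   # one probability per class
```

In practice the embedding matrix and both weight layers would be learned jointly by gradient descent in a deep learning framework; the pure-Python version here is only meant to show how the pieces connect.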