Link to Write-up (which has better formatting!): https://docs.google.com/document/d/1xRCduspkaTA25XRKHI0cayyNxUOc6P0OLML8hz1XkXI/edit?usp=sharing

Github: https://github.com/jason-ni-0/1470-dl-final

Introduction

When a new drug is released onto the market, reviews provide insight into its safety profile (potential side effects, possible adverse reactions, etc.) and additional information on its efficacy in treating the specified condition. These drugs are all tested in clinical trials and are FDA approved prior to release, but continual monitoring via drug reviews is essential to ensure their performance is as expected. Furthermore, these reviews provide essential information that helps healthcare professionals and patients make informed decisions about medication use.

We perform sentiment analysis on a large collection of drug reviews, which includes both text corpora and quantitative measures, to evaluate the sentiment concerning various drugs. Specifically, we feed a hybrid feature vector that combines Aspect Related Features (ARF) and Review Related Features (RRF) into an LSTM classifier. This methodology aims to benefit various stakeholders, from physicians to researchers to organizations like the FDA, by saving the immense amount of time and labor typically required to analyze sentiment, thereby also allowing these tasks to scale. We predict the sentiment rating on a scale of 1-10.

Methodology

I. Data & Preprocessing We use the Drug Review Dataset, which consists of patient reviews of drugs along with related conditions and a 10-star rating of overall satisfaction (1 = least satisfied, 10 = most satisfied). The data was obtained by crawling online pharmaceutical review sites and contains 84,138 reviews in the train set, 41,467 in the test set, and 26,054 in the validation set. Pre-processing follows the paper’s methodology: tokenizing input reviews, stemming each token, removing stop-words (common words, special characters, URLs), and removing integers and words with fewer than three letters, so that the dimensional space of raw reviews is “effectively reduced.”
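The pre-processing pipeline above can be sketched as follows. This is a minimal, self-contained illustration, not our exact implementation: the tiny stop-word list and the toy suffix-stripping stemmer are stand-ins for a full stop-word list and a real Porter stemmer.

```python
import re

# Tiny stop-word list for illustration; a real run would use a full English
# stop-word list. All names here are illustrative.
STOP_WORDS = {"the", "and", "was", "for", "this", "that", "with", "have", "had"}

def toy_stem(token):
    # Stand-in for a Porter stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(review):
    review = re.sub(r"https?://\S+", " ", review.lower())  # drop URLs
    tokens = re.findall(r"[a-z]+", review)                 # drop integers/special chars
    tokens = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if len(t) >= 3]              # drop words under 3 letters

print(preprocess("This drug was amazing!! See https://example.com - rated 10"))
# → ['drug', 'amaz', 'see', 'rat']
```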

II. Review Related Features (RRF) The hybrid feature vector achieves precision, robustness, and high efficacy by employing two feature extractions from the data. RRF extraction combines the traditional approaches of n-grams and TF-IDF to capture the polarity of every term within the text, including negations and emoticons. First, the vocab set (a word list) is obtained from the pre-processed reviews by n-gram feature extraction. We use an n-gram range of (2,3) to capture bigrams and trigrams.

Then, TF-IDF is calculated on the n-gram yield. The combination of both techniques not only shrinks the dimensionality, saving space and computational power, but also results in an effective representation of the review. TF-IDF is a product of two statistics: term frequency (tf) and inverse document frequency (idf). Tf is the relative frequency of a term w in a document d, and idf captures how many documents in the whole corpus D the term w appears in. The final TF-IDF score is calculated by multiplying the two values together, before finally a vector NT(i) is calculated for each feature within the document.

  • tf: f_w is the frequency of w in document d (relative to the document’s length). For words not in the document, the tf value should be 0, not 0.5.
  • idf: the denominator counts the number of documents that a term w appears in. There is one idf value for each word in the vocab.

*Ngram_i is the list of words associated with the i-th review. Ngram_d is the list of words associated with the complete data set d.

III. Aspect Related Features Extraction for ARF, derived from Schouten et al., counts 1) a lemma’s co-occurrences with sentence-annotated categories, 2) a lemma’s co-occurrences with aspect types, and 3) grammatical dependencies’ co-occurrences with aspect types. Counting a lemma’s co-occurrences with sentence-annotated categories helps identify patterns in the way words are used within grammatical contexts and represents their syntactic roles. Counting a lemma’s co-occurrences with aspect types, on the other hand, helps identify which words are associated with which particular aspects or topics, and represents how these aspects are used in text. Lastly, counting co-occurrences of grammatical dependencies with aspect types helps identify how syntactic structures are used to express aspects or topics, where positive sentiments may follow one pattern and negative sentiments another. We replicate the paper’s approach, which rebuilds the ARF procedure by approximating the weight matrix that corresponds to each category set for every review in the dataset. The algorithm that captures the representation is below.

Q denotes the training set of reviews. Sc denotes the list of aspect categories.

Step 1: Identify the review’s set of lemmas, dependencies, and categories. Each set s contains a list of lemmas and dependency forms, and for every input review there is a list sc of extracted aspect categories.
Step 2: Calculate vectorC (all input review aspect categories), vectorY (occurrence frequencies for all lemmas and dependency forms), and vectorX (co-occurrence rates against lemmas and dependencies).
Step 3: Calculate vectorW, the weight matrix for every pair of vectorX and vectorY (only if the vectorX frequency is > 0, to eliminate the challenge of finding an optimal threshold).
Step 4: Calculate vectorA, the highest co-occurrence value for each pair of weights in vectorW.
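The counting steps above can be sketched in a few lines. This is a simplified illustration under assumptions: the toy reviews, features, and category names are invented, and the weight here is taken as the plain co-occurrence rate (co-occurrence count over occurrence count), which is one reasonable reading of the vectorW/vectorA construction rather than the paper's exact formula.

```python
from collections import Counter, defaultdict

# Each training review: (set of lemmas and dependency forms, set of aspect
# categories). All data here is illustrative.
reviews = [
    ({"headache", "reliev", "nsubj(reliev,drug)"}, {"efficacy"}),
    ({"headache", "nausea"}, {"side-effects"}),
    ({"reliev", "cheap"}, {"efficacy", "price"}),
]

occ = Counter()               # vectorY: occurrence count per feature
cooc = defaultdict(Counter)   # vectorX: feature/category co-occurrence counts
for feats, cats in reviews:
    occ.update(feats)
    for f in feats:
        for c in cats:
            cooc[f][c] += 1

categories = sorted({c for _, cats in reviews for c in cats})  # Sc

def arf_vector(feats):
    # vectorW: weight = co-occurrence rate, computed only for features with
    # occurrence count > 0; vectorA: per-category maximum over the review's
    # features.
    return [max((cooc[f][c] / occ[f] for f in feats if occ[f] > 0), default=0.0)
            for c in categories]

print(arf_vector({"reliev", "nausea"}))
# → [1.0, 0.5, 1.0]  for ['efficacy', 'price', 'side-effects']
```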

IV. Overall Architecture Summary The hybrid feature vector (HFV) is the combination of ARF and RRF (simply a concatenation of the matrices). To classify the HFVs by sentiment, an LSTM classifier is employed. Training uses one LSTM layer and one densely connected layer. Refer to the figure below for a summary of the model:
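The classifier itself is compact: one LSTM layer followed by one dense layer. A minimal numpy sketch of the forward pass is below; the sizes, random weights, and 10-way softmax output are illustrative stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden, n_classes = 8, 16, 10   # illustrative sizes

# One LSTM layer: stacked gate weights (input, forget, cell, output gates).
W = rng.normal(scale=0.1, size=(4 * hidden, feat_dim + hidden))
b = np.zeros(4 * hidden)
# One dense layer mapping the final hidden state to the 10 rating classes.
Wd = rng.normal(scale=0.1, size=(n_classes, hidden))
bd = np.zeros(n_classes)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_classify(seq):
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in seq:                          # one LSTM step per input chunk
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    logits = Wd @ h + bd
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax over ratings 1-10

probs = lstm_classify(rng.normal(size=(5, feat_dim)))
```

In practice the same shape is a one-liner in a framework (e.g. an LSTM layer followed by a Dense layer with softmax in Keras); the sketch just makes the two-layer structure explicit.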

Challenges

One particular challenge has been the feature extraction step, as the hybrid feature vector extraction is a novel approach to linguistic representation for NLP. While RRF uses traditional approaches, the methodologies were still new to our team, and the ARF algorithm is an original contribution by the paper’s authors based on ABSA, or aspect-based sentiment analysis. Implementing the new method was unintuitive at first, but by thoroughly reviewing the paper’s documentation and referring to other resources to learn more about the calculations, we continually tweaked our featurization. For example, we referred to other course content within Brown’s CS department, such as Professor Pavlick’s CSCI 1460 Computational Linguistics, to develop a deeper understanding of TF-IDF (as well as the preprocessing of the data).

Results

We ended up with 43 percent accuracy (random chance being 10 percent) over 10 epochs. Given that the sentiment is on a scale of 1-10, the accuracy is decent (especially since our initial goals were based simply on classifying positive, negative, and neutral), and obtaining statistically significant results made for a rewarding experience and project.

Originally we categorized results into 3 classes (positive, neutral, negative), but we decided to make our results more granular. Also, the aspect-related sentiment analysis method uses 10 categories, so we wanted to preserve this approach from other papers and challenge ourselves with the finer-grained results.

Reflection

As mentioned in the Challenges section, the novel hybrid feature extraction proposed in the paper transforms pre-processed drug reviews into feature vectors for efficient and robust sentiment analysis. By developing a technical understanding and participating in hands-on implementation of the methodology, we not only gained exposure to a new approach in NLP but also recognized the creativity that can be employed in developing new methods for feature extraction, a particularly salient component of NLP models that uniquely differentiates them and can significantly improve task performance.

What do you think you can further improve on if you had more time? If we had more time, we could have experimented with the hyperparameters, especially since a pass through the training set took quite a while.

By running through the entire process, from preprocessing and feature extraction to building the model and training, testing, and evaluating the results, we have cohesively put together our learnings from throughout the semester and gained first-hand experience with novel approaches to sentiment analysis.
