What it does
This program has two parts: an aggregator/ranker and a web page. The back-end aggregator scrapes news outlets (we chose NDTV, The Indian Express, and Deccan Chronicle) and pulls out political articles. Each article is then run through VADER to determine its overall sentiment, but we keep only the magnitude of that score, since strongly negative wording can be just as loaded as strongly positive. Articles with higher magnitudes are ranked as weaker sources of news, since strongly charged word choice can signal strong bias on the author's part. This score, along with the article link, title, and summary, is compiled into a list.
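The magnitude-and-rank step can be sketched like this. The `compound` values stand in for what VADER's sentiment analyzer would report (a number in [-1, 1]); the titles are made up for illustration.

```python
# Rank articles by the absolute value ("magnitude") of a sentiment score.
# The `compound` values here stand in for VADER output; strongly positive
# and strongly negative wording both count as heavily loaded.

def bias_score(compound: float) -> float:
    """Magnitude of sentiment: -0.8 and +0.8 are equally 'loaded'."""
    return abs(compound)

articles = [
    {"title": "Budget session opens", "compound": 0.05},
    {"title": "Minister slams rivals", "compound": -0.72},
    {"title": "Party hails landslide win", "compound": 0.81},
]

for a in articles:
    a["score"] = bias_score(a["compound"])

# Least loaded wording first, as on the site.
ranked = sorted(articles, key=lambda a: a["score"])
```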
This list is passed to a Data Access Object (DAO) that writes the scraped data to our instance of DataStax Astra for long-term storage, removing the need to rescan unless a new source with more links needs to be examined. The data is stored in tabular form (owing to Apache Cassandra's structure), and queries and inserts are written entirely in CQL (Cassandra Query Language), which looks an awful lot like SQL. The database is still NoSQL in nature, even though it returns results in ResultSets.
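A minimal sketch of the kind of parameterized CQL the DAO might issue. The keyspace, table, and column names here are hypothetical; with the real driver you would pass `INSERT_CQL` and the tuple to `session.execute(...)`.

```python
# Sketch of a parameterized CQL insert for one scraped article.
# Keyspace/table names ("news.articles") are hypothetical stand-ins.

INSERT_CQL = (
    "INSERT INTO news.articles (link, title, score, summary) "
    "VALUES (%s, %s, %s, %s)"
)

def insert_params(article: dict) -> tuple:
    """Order the scraped fields to match the INSERT's placeholders."""
    return (article["link"], article["title"],
            article["score"], article["summary"])

row = insert_params({"link": "https://example.org/a", "title": "t",
                     "score": 0.4, "summary": "s"})
```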
The front end displays the database data following an API call. The data is sorted by score in ascending order, with the weakest sentiment at the top and the strongest wordings at the bottom. Data can be retrieved in two ways: Reload and Direct.
Additionally, we implemented our own summarization tool, which lets the reader skip the hassle of navigating and reading the original news source. Aditya coded a term frequency-inverse document frequency (TF-IDF) algorithm and a grammar-reconstruction function with the aid of nltk, so the important words and phrases read not as a random string of vernacular but as a smooth flow of thought akin to human writing.
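The TF-IDF scoring step can be sketched without nltk as follows: score each sentence by the summed TF-IDF weight of its words and keep the top one. This covers only the extraction step; the grammar-reconstruction pass is omitted.

```python
# Minimal TF-IDF extractive summarizer: weight words by how rare they are
# across the article's sentences, score each sentence, keep the top n.
import math
import re

def summarize(text: str, n: int = 1) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    docs = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = {w for d in docs for w in d}
    # Inverse document frequency over this article's sentences.
    idf = {w: math.log(len(docs) / sum(w in d for d in docs)) for w in vocab}

    def score(d):
        return sum(d.count(w) * idf[w] for w in set(d)) / max(len(d), 1)

    top = sorted(range(len(sentences)), key=lambda i: score(docs[i]),
                 reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # keep original order
```

Repeated filler sentences score near zero (their words appear everywhere), so the distinctive sentence wins.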
Reload: This asks the server to rerun the search, since something may have changed, such as the addition of new data sources. The server executes the web-scraping function and loads the new data into the database. The output is a simple declaration that the process has finished, which prompts the front end to redirect to the news listing. This lets our system pick up dynamic changes in the news and update the feed accordingly (assuming you reload regularly; the database also updates daily).
Direct: This lets the user instantly access the news sites we have already processed and see their scores on the UI we created. Each site shows its title, score, and summary, and clicking a card opens the article link in a new tab. Queries to Astra are remarkably quick; for us, this site loads faster than the same data served from MongoDB or DynamoDB. The new tab is intentional, so that redirecting the user away from the news list cannot cause reload errors.
How we built it
We used Flask to build a back-end server that responds to our React-based front end/UI. The React app is proxied so that we do not need CORS to send requests; everything effectively comes from a single port. The back end is connected to an instance of DataStax Astra, one of the fastest cloud versions of Apache Cassandra. A table was initialized in DataStax Astra Studio, and we created a database model representing a news site consisting of the following:
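Assuming the React app was bootstrapped with create-react-app, the proxy is a single line in `package.json` (port 5000 being Flask's default), which makes the dev server forward unmatched requests to the back end so fetch calls can use relative paths and CORS never comes into play:

```json
{
  "proxy": "http://localhost:5000"
}
```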
link: text
title: text
score: float
summary: text
These records are stored in a corresponding table to be retrieved later. On the front end, we included functionality for a reload, but also a direct query, which queries Astra, sorts the data accordingly, and displays it on the UI as "The News!"
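A sketch of what the direct-query path might do with the returned rows, assuming the table fields listed above; the ascending sort matches the weakest-sentiment-first display.

```python
# Shape query-result rows into the ascending-by-score payload the UI
# displays as "The News!". Field names mirror the table model above.
import json
from dataclasses import dataclass, asdict

@dataclass
class NewsRow:
    link: str
    title: str
    score: float
    summary: str

def news_payload(rows: list[NewsRow]) -> str:
    ordered = sorted(rows, key=lambda r: r.score)  # weakest sentiment first
    return json.dumps([asdict(r) for r in ordered])

payload = news_payload([
    NewsRow("https://example.org/b", "B", 0.9, "strong wording"),
    NewsRow("https://example.org/a", "A", 0.1, "neutral wording"),
])
```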
Challenges we ran into
Astra was notoriously hard to set up. You need not only a driver (which is only available inside the dashboard) but also a singleton session, CQL, and proper formatting instructions. Tony spent almost three hours just attaching the connections and services to get the proper data flow. Initially, I tried to run a CQL-to-String function on schema.cql, but gave up partway through and initialized the table in a CQL notebook in DataStax Astra Studio.
This was our first time web scraping for articles, and Aditya had to create a filter for the junk (i.e., non-relevant links, articles, and other clutter).
Accomplishments that we're proud of
We finished! And we didn't just do the back-end challenge: we did BOTH!
What we learned
Prepare to learn new topics early on. Tony learned Astra querying and setup at 11 PM Eastern the day before this was due. A late-learned tool may even become the critical element that makes your project feel unique.
TF-IDF: Aditya implemented this for our bonus summarization feature and for keyword identification with nltk.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is an amazing package. It judges how positive, negative, or neutral a piece of text sounds; the higher the magnitude of its score, the stronger the sentiment. This was our primary categorizer for the links we scraped from the news outlets.
What's next for Unbias.ly
We want to offer users a customized way to check the bias of any article they choose.