Fake news, clickbait, and extreme media bias have left people misinformed and led them to make choices they would not have made with better, trusted sources.
What it does
It gathers information from the main Brazilian newspapers with the aim of scoring their trustworthiness.
How I built it
It has several components, described below:
- page monitor: dispatched every 2 minutes to check the front pages of the main Brazilian newspapers. It is written as shell scripts that issue HTTP requests and inspect page checksums, native HTTP headers, and headers added by popular web servers. When it detects a change, it triggers the scraper to collect the new information from that newspaper's front page.
- scraper: scrapes newly published articles from the newspapers' front pages using Python Scrapy and saves them into a MongoDB database.
- database: a MongoDB database stores article information such as URL, title, subtitle, article content, publication date, and author.
- backend: a Go service renders the front page with the newest published articles. It also provides a service that checks whether new articles have been published since the last time the front page was rendered; if so, it returns the number of new articles.
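As a rough illustration of the backend's new-article check, here is a minimal Python sketch. The real backend is written in Go and reads from MongoDB; this version uses an in-memory list instead, and the field name `published_at` and the function name `count_new_articles` are assumptions made for the example, not the project's actual schema or API.

```python
from datetime import datetime, timezone
from typing import Iterable, Mapping


def count_new_articles(articles: Iterable[Mapping], last_rendered: datetime) -> int:
    """Count articles published strictly after the last front-page render.

    `published_at` is a hypothetical field name standing in for the
    publication date stored with each article.
    """
    return sum(1 for article in articles if article["published_at"] > last_rendered)


# Sample documents shaped like the stored article information (illustrative data).
sample_articles = [
    {"url": "https://example.com/a", "title": "A",
     "published_at": datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"url": "https://example.com/b", "title": "B",
     "published_at": datetime(2020, 1, 1, 12, 5, tzinfo=timezone.utc)},
    {"url": "https://example.com/c", "title": "C",
     "published_at": datetime(2020, 1, 1, 12, 9, tzinfo=timezone.utc)},
]

# The front page was last rendered at 12:04, so two articles are newer.
last_rendered = datetime(2020, 1, 1, 12, 4, tzinfo=timezone.utc)
new_count = count_new_articles(sample_articles, last_rendered)
```

Against a real MongoDB collection, the same idea would typically be a single count query with a "greater than" filter on the date field, so that the database does the filtering rather than the application.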
Challenges I ran into
Avoiding excessive requests to the newspaper pages while still getting near-instant data required a combination of techniques. Before issuing a full GET request, cheaper alternatives are tried first, such as HEAD requests, checksum comparison, and inspection of native HTTP headers and headers added by popular web servers.
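The "cheap checks first" idea can be sketched as follows. This is a minimal Python illustration, not the project's code: the actual monitor is built from shell scripts, and the names `PageState`, `validators_match`, `head_headers`, and `page_changed` are hypothetical. It assumes the server sends standard `ETag` or `Last-Modified` headers; when it does not, the sketch falls back to a GET and a SHA-256 checksum of the body.

```python
import hashlib
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional


@dataclass
class PageState:
    """What the monitor remembers about one front page between runs."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    checksum: Optional[str] = None


def _lower(headers: Mapping[str, str]) -> Dict[str, str]:
    # HTTP header names are case-insensitive, so normalise before lookup.
    return {name.lower(): value for name, value in headers.items()}


def validators_match(state: PageState, headers: Mapping[str, str]) -> Optional[bool]:
    """Compare HEAD-response validators with the stored ones.

    Returns True if a validator matches, False if it differs, and None
    when there is nothing to compare (no header sent or no stored value).
    """
    lowered = _lower(headers)
    if lowered.get("etag") is not None and state.etag is not None:
        return lowered["etag"] == state.etag
    if lowered.get("last-modified") is not None and state.last_modified is not None:
        return lowered["last-modified"] == state.last_modified
    return None


def head_headers(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Issue a HEAD request and return the response headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def page_changed(state: PageState, headers: Mapping[str, str],
                 fetch_body: Callable[[], bytes]) -> bool:
    """Decide whether the page changed, calling fetch_body (the GET) only if needed."""
    if validators_match(state, headers) is True:
        return False  # Headers say nothing changed: skip the GET entirely.
    checksum = hashlib.sha256(fetch_body()).hexdigest()
    changed = checksum != state.checksum
    lowered = _lower(headers)
    state.etag = lowered.get("etag")
    state.last_modified = lowered.get("last-modified")
    state.checksum = checksum
    return changed
```

In use, `headers` would come from `head_headers(url)` and `fetch_body` would wrap an ordinary GET. Passing the GET in as a callable keeps the decision logic testable without network access. One caveat: a checksum of the whole body can change on every request if the page embeds timestamps or rotating ads, so a real monitor would likely hash only the relevant part of the page.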
Accomplishments that I'm proud of
Headlines Center picks up new articles from the newspapers almost instantaneously, making it a trusted source for fresh information from the Brazilian media.
What I learned
- Improved my skills in Python.
- Learned how to scrape pages using Scrapy and how to set up the scrapyd service.
- Learned the basics of Go.
- Learned the basics of MongoDB.
- Improved my knowledge of HTTP headers and requests.
- Improved my knowledge of Linux services administration and configuration.
What's next for Headlines Center
Currently, it works as a news aggregator that already delivers value. A few users use it as an alternative to Google News because it is cleaner and simpler. For the moment, it only shows information from trusted sources curated by the developer. Since it already stores valuable information, the next steps are:
- Extracting semantic information from the articles and classifying them as favorable or unfavorable to a certain entity (e.g., a person, company, or institution).
- Delivering the same service as a mobile app. It is currently available as a responsive web page.
- Analyzing speech and media bias over time to help assess source truthfulness.