I really like the problem statement that the Wildlife Conservation Society wrote up about news article data, and my current career inspired me to try to make this process better. I advertise myself as a Software Engineer (I write code), but my formal title is Data Engineer: I spend a lot of time writing robust code to process petabytes of data efficiently. WCS's problem looked like a very satisfying data process to solve, one that would let them tell stories about the data they discover in wildlife trafficking articles faster.
What it does and how I built it
My solution includes a few things:
- Pulls text data from news articles related to specific keywords using Python, Jupyter, and NewsAPI.
- Derives new contextual data (such as whether an article mentions a particular wildlife organization or country) from the raw news article data using Natural Language Processing methods.
- Stores the raw data and the NLP data in a Postgres database.
- Makes the data in that Postgres database available through Apache Superset, a dashboarding tool.
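The first step above can be sketched roughly like this. The endpoint and query parameters follow NewsAPI's documented `/v2/everything` interface; the `extract_rows` helper and the exact fields kept are my own illustration, not necessarily what the project's notebook does.

```python
import json
import urllib.parse
import urllib.request

# Documented NewsAPI search endpoint.
NEWSAPI_URL = "https://newsapi.org/v2/everything"


def fetch_articles(query, api_key, page_size=50):
    """Fetch raw articles matching a keyword query from NewsAPI."""
    params = urllib.parse.urlencode(
        {"q": query, "pageSize": page_size, "language": "en", "apiKey": api_key}
    )
    with urllib.request.urlopen(f"{NEWSAPI_URL}?{params}", timeout=30) as resp:
        return json.load(resp)


def extract_rows(payload):
    """Flatten a NewsAPI response into rows ready for downstream storage."""
    return [
        {
            "title": a.get("title"),
            "source": (a.get("source") or {}).get("name"),
            "published_at": a.get("publishedAt"),
            "url": a.get("url"),
            # Full text often isn't in the API response; fall back to the blurb.
            "content": a.get("content") or a.get("description"),
        }
        for a in payload.get("articles", [])
    ]
```

A call like `extract_rows(fetch_articles("wildlife trafficking", api_key))` would yield one dict per article, which maps cleanly onto database rows.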
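The writeup doesn't say which NLP method is used for the contextual tagging step, so here is a deliberately simplified stand-in: a gazetteer (keyword-list) matcher. A real pipeline might use named-entity recognition from a library like spaCy instead; the organization and country lists below are placeholders, not WCS's actual lists.

```python
# Hypothetical gazetteers; real lists would be curated for the domain.
ORGANIZATIONS = {"Wildlife Conservation Society", "TRAFFIC", "Interpol"}
COUNTRIES = {"Kenya", "Vietnam", "Thailand", "Nigeria"}


def tag_article(text):
    """Derive contextual flags from raw article text by simple substring matching."""
    lowered = text.lower()
    found_orgs = sorted(o for o in ORGANIZATIONS if o.lower() in lowered)
    found_countries = sorted(c for c in COUNTRIES if c.lower() in lowered)
    return {
        "organizations": found_orgs,
        "countries": found_countries,
        "mentions_org": bool(found_orgs),
        "mentions_country": bool(found_countries),
    }
```

The derived flags ("does this article mention an organization or a country?") are exactly the kind of columns a dashboard can filter and aggregate on later.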
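For the storage step, something like the following would work with a psycopg2 cursor. The table and column names are my own guesses at a reasonable schema (one table for raw articles, one for the NLP-derived tags), not the project's actual DDL.

```python
# Assumed schema: raw articles plus a separate table of NLP-derived tags.
RAW_DDL = """
CREATE TABLE IF NOT EXISTS raw_articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    source TEXT,
    published_at TIMESTAMPTZ,
    url TEXT UNIQUE,
    content TEXT
);
CREATE TABLE IF NOT EXISTS article_tags (
    article_id INTEGER REFERENCES raw_articles (id),
    organizations TEXT[],
    countries TEXT[]
);
"""


def store_article(cur, row, tags):
    """Insert one raw article and its NLP tags; cur is a DB-API cursor."""
    cur.execute(
        """
        INSERT INTO raw_articles (title, source, published_at, url, content)
        VALUES (%(title)s, %(source)s, %(published_at)s, %(url)s, %(content)s)
        ON CONFLICT (url) DO NOTHING
        RETURNING id
        """,
        row,
    )
    result = cur.fetchone()  # None if the URL was already stored
    if result:
        cur.execute(
            "INSERT INTO article_tags (article_id, organizations, countries)"
            " VALUES (%s, %s, %s)",
            (result[0], tags["organizations"], tags["countries"]),
        )
```

Deduplicating on the article URL (`ON CONFLICT (url) DO NOTHING`) keeps repeated API pulls from inflating the dashboard's counts; Superset can then point at these two tables directly.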
Challenges I ran into
- Scope creep!
- How to make the solution appealing to the eye
- How to make the solution usable in a practical way (if it were stood up tomorrow)
Accomplishments that I'm proud of
- Performing all these steps in only 2 days!
What I learned
- Even if you're a remote participant, GET A TEAM!
What's next for Automated Pipelines for Wildlife Trafficking News Articles
- Improving its NLP functionality - I'd like to add automated summaries of wildlife articles.
- Prototyping this entire architecture in AWS so that it's easy to stand up and tear down in a scalable way.
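The automated-summaries idea above could start very simply before reaching for a proper model: a frequency-based extractive summary that keeps the highest-scoring sentences. This is a generic sketch of that technique, not anything implemented in the project.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=2):
    """Keep the n sentences whose words are most frequent in the article."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Re-emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

A sentence full of the article's most common words (e.g. the species or country being trafficked) scores highest, which is often a serviceable one-line gist.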