I really like the problem statement that the Wildlife Conservation Society wrote up about news article data, and my current career inspired me to try to improve this process. I advertise myself as a Software Engineer (I write code), but my formal title is Data Engineer: much of my work involves writing robust code to process petabytes of data efficiently. WCS's problem seemed like a very satisfying data problem to solve, one that would let them tell stories about the data they discover in wildlife trafficking articles more quickly.

What it does and how I built it

My solution includes a few things:

  1. Pulls text data from news articles related to specific keywords using Python, Jupyter, and NewsAPI.
  2. Derives new contextual data (such as whether an article mentions a particular wildlife organization or a particular country) from the raw news article data using Natural Language Processing methods.
  3. Stores the raw data and the NLP data in a Postgres database.
  4. Makes the data in that Postgres database available through Apache Superset, a dashboarding tool.
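The steps above can be sketched in a condensed form. This is a minimal illustration, not the project's actual code: it assumes the `newsapi-python` client for step 1, substitutes a simple keyword match for the NLP methods in step 2 (the write-up doesn't specify which techniques were used), and builds a parameterized `INSERT` for a hypothetical `articles` table in step 3. The keyword lists and table schema are invented for the example.

```python
# Hypothetical keyword lists -- the real ones would come from WCS's domain knowledge.
ORG_KEYWORDS = {"wcs", "traffic", "interpol"}
COUNTRY_KEYWORDS = {"kenya", "vietnam", "brazil"}

def derive_context(text):
    """Derive contextual flags from raw article text (a stand-in for the NLP step)."""
    words = set(text.lower().split())
    return {
        "mentions_org": bool(words & ORG_KEYWORDS),
        "mentions_country": bool(words & COUNTRY_KEYWORDS),
    }

def to_insert(article, context):
    """Build a parameterized INSERT for a hypothetical `articles` table."""
    sql = ("INSERT INTO articles (title, url, mentions_org, mentions_country) "
           "VALUES (%s, %s, %s, %s)")
    params = (article["title"], article["url"],
              context["mentions_org"], context["mentions_country"])
    return sql, params

def run_pipeline(api_key, db_conn):
    """Pull articles, derive context, and store both in Postgres."""
    # Step 1: pull raw articles for the target keywords.
    from newsapi import NewsApiClient  # pip install newsapi-python
    client = NewsApiClient(api_key=api_key)
    response = client.get_everything(q="wildlife trafficking", language="en")
    # Steps 2-3: derive contextual flags and store raw + derived data together.
    with db_conn.cursor() as cur:  # e.g. a psycopg2 connection
        for article in response["articles"]:
            context = derive_context(article.get("description") or "")
            cur.execute(*to_insert(article, context))
    db_conn.commit()
```

Step 4 needs no code: once the table exists, Superset connects to Postgres directly and the dashboards are built in its UI.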

Challenges I ran into

  • Scope creep!
  • How to make the solution visually appealing
  • How to make the solution usable in a practical way (if it were stood up tomorrow)

Accomplishments that I'm proud of

  • Performing all these steps in only 2 days!

What I learned

  • Even if you're a remote participant, GET A TEAM!

What's next for Automated Pipelines for Wildlife Trafficking News Articles

  • Improving its NLP functionality - I'd like to add automated summaries of wildlife articles.
  • Prototyping this entire architecture in AWS so that it's easy to stand up and tear down in a scalable way
