I really like the problem statement that the Wildlife Conservation Society wrote up about news article data, and my current career inspired me to try to make this process better. I advertise myself as a Software Engineer (I write code), but my formal title is Data Engineer: I spend a lot of time writing robust code to process petabytes of data efficiently. WCS's problem looked like a very satisfying data process to solve, one that would let them tell stories about the data they discover in wildlife trafficking articles faster.
What it does and how I built it
My solution includes a few things:
- Pulls text data from news articles related to specific keywords using Python, Jupyter, and NewsAPI.
- Derives new contextual data (such as whether an article mentions a particular wildlife organization or country) from the raw news article data using Natural Language Processing methods.
- Stores the raw data and the NLP data in a Postgres database.
- Makes the data in that Postgres database available through Apache Superset, a dashboarding tool.
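The first step above can be sketched roughly like this. The endpoint and query parameters follow NewsAPI's documented `/v2/everything` interface; the `extract_rows` helper and the exact fields kept are my own illustration, not necessarily what the project's notebook does.

```python
import json
import urllib.parse
import urllib.request

# Documented NewsAPI search endpoint.
NEWSAPI_URL = "https://newsapi.org/v2/everything"


def fetch_articles(query, api_key, page_size=50):
    """Fetch raw articles matching a keyword query from NewsAPI."""
    params = urllib.parse.urlencode(
        {"q": query, "pageSize": page_size, "language": "en", "apiKey": api_key}
    )
    with urllib.request.urlopen(f"{NEWSAPI_URL}?{params}", timeout=30) as resp:
        return json.load(resp)


def extract_rows(payload):
    """Flatten a NewsAPI response into rows ready for downstream storage."""
    return [
        {
            "title": a.get("title"),
            "source": (a.get("source") or {}).get("name"),
            "published_at": a.get("publishedAt"),
            "url": a.get("url"),
            # Full text often isn't in the API response; fall back to the blurb.
            "content": a.get("content") or a.get("description"),
        }
        for a in payload.get("articles", [])
    ]
```

A call like `extract_rows(fetch_articles("wildlife trafficking", api_key))` would yield one dict per article, which maps cleanly onto database rows.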
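The writeup doesn't say which NLP method is used for the contextual tagging step, so here is a deliberately simplified stand-in: a gazetteer (keyword-list) matcher. A real pipeline might use named-entity recognition from a library like spaCy instead; the organization and country lists below are placeholders, not WCS's actual lists.

```python
# Hypothetical gazetteers; real lists would be curated for the domain.
ORGANIZATIONS = {"Wildlife Conservation Society", "TRAFFIC", "Interpol"}
COUNTRIES = {"Kenya", "Vietnam", "Thailand", "Nigeria"}


def tag_article(text):
    """Derive contextual flags from raw article text by simple substring matching."""
    lowered = text.lower()
    found_orgs = sorted(o for o in ORGANIZATIONS if o.lower() in lowered)
    found_countries = sorted(c for c in COUNTRIES if c.lower() in lowered)
    return {
        "organizations": found_orgs,
        "countries": found_countries,
        "mentions_org": bool(found_orgs),
        "mentions_country": bool(found_countries),
    }
```

The derived flags ("does this article mention an organization or a country?") are exactly the kind of columns a dashboard can filter and aggregate on later.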
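For the storage step, something like the following would work with a psycopg2 cursor. The table and column names are my own guesses at a reasonable schema (one table for raw articles, one for the NLP-derived tags), not the project's actual DDL.

```python
# Assumed schema: raw articles plus a separate table of NLP-derived tags.
RAW_DDL = """
CREATE TABLE IF NOT EXISTS raw_articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    source TEXT,
    published_at TIMESTAMPTZ,
    url TEXT UNIQUE,
    content TEXT
);
CREATE TABLE IF NOT EXISTS article_tags (
    article_id INTEGER REFERENCES raw_articles (id),
    organizations TEXT[],
    countries TEXT[]
);
"""


def store_article(cur, row, tags):
    """Insert one raw article and its NLP tags; cur is a DB-API cursor."""
    cur.execute(
        """
        INSERT INTO raw_articles (title, source, published_at, url, content)
        VALUES (%(title)s, %(source)s, %(published_at)s, %(url)s, %(content)s)
        ON CONFLICT (url) DO NOTHING
        RETURNING id
        """,
        row,
    )
    result = cur.fetchone()  # None if the URL was already stored
    if result:
        cur.execute(
            "INSERT INTO article_tags (article_id, organizations, countries)"
            " VALUES (%s, %s, %s)",
            (result[0], tags["organizations"], tags["countries"]),
        )
```

Deduplicating on the article URL (`ON CONFLICT (url) DO NOTHING`) keeps repeated API pulls from inflating the dashboard's counts; Superset can then point at these two tables directly.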
Challenges I ran into
- Scope creep!
- How to make the solution appealing to the eye
- How to make the solution usable in a practical way (if it were stood up tomorrow)
Accomplishments that I'm proud of
- Performing all these steps in only 2 days!
What I learned
- Even if you're a remote participant, GET A TEAM!
What's next for Automated Pipelines for Wildlife Trafficking News Articles
- Improving its NLP functionality - I'd like to add automated summaries of wildlife articles.
- Prototyping this entire architecture in AWS so that it's easy to stand up and tear down in a scalable way.
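The automated-summaries idea above could start very simply before reaching for a proper model: a frequency-based extractive summary that keeps the highest-scoring sentences. This is a generic sketch of that technique, not anything implemented in the project.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=2):
    """Keep the n sentences whose words are most frequent in the article."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Re-emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

A sentence full of the article's most common words (e.g. the species or country being trafficked) scores highest, which is often a serviceable one-line gist.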