Project description: https://github.com/code-for-venezuela/2019-april-codeathon/tree/master/challenges/MPV-INF

Here's our code: https://github.com/cvielma/Code4Venezuela

What the code does

The code has four parts:

Twitter data extraction
Stream Processing (via Flink)
NLP processing (using CoreNLP)
Storage to database (MongoDB)

We provide two ways to get data from Twitter while approval for Premium API access is pending. The first polls the current free APIs and publishes the tweets to Kafka, which Flink then consumes (AppKafka.java). The second streams tweets through Flink's Twitter Connector (AppStream). Because the main solution is built on Flink, both entry points use the same Flink pipeline (Pipeline.java).
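The idea of two ingestion paths sharing one pipeline can be sketched in plain Java. This is only an illustration and not the project's actual code: `TweetSource`, `PolledSource`, `StreamedSource`, and `SharedPipeline` are hypothetical names, and in-memory collections stand in for Kafka and for Flink's Twitter Connector.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical abstraction: anything that can push tweet texts into a sink.
interface TweetSource {
    void emitTo(Consumer<String> sink);
}

// Stand-in for the polling path (free API -> Kafka -> Flink in the project).
// A poll returns a finite batch, modeled here as a list.
final class PolledSource implements TweetSource {
    private final List<String> batch;

    PolledSource(List<String> batch) {
        this.batch = batch;
    }

    @Override
    public void emitTo(Consumer<String> sink) {
        batch.forEach(sink);
    }
}

// Stand-in for the streaming path (Flink's Twitter Connector in the project).
// A stream is consumed element by element, modeled here as an iterator.
final class StreamedSource implements TweetSource {
    private final Iterator<String> stream;

    StreamedSource(Iterator<String> stream) {
        this.stream = stream;
    }

    @Override
    public void emitTo(Consumer<String> sink) {
        stream.forEachRemaining(sink);
    }
}

// The single pipeline that both entry points reuse.
final class SharedPipeline {
    private final List<String> processed = new ArrayList<>();

    // The processing step is the same regardless of where tweets come from.
    void run(TweetSource source) {
        source.emitTo(text -> processed.add(text.trim()));
    }

    List<String> processed() {
        return List.copyOf(processed);
    }
}
```

The point of the design is that swapping the ingestion path (for example, once Premium API access is approved) does not require touching the processing logic.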

We trained a simple Spanish-language model (in the resources/train_data folder) using existing tweets from the project's initial data set as well as other sources. It tags text with the following labels:

NEEDS: people needing medicine
MED: medicine names
OFFERS: people offering medicine
LOC: locations
CONTACT: contact information
SICK: illnesses or diseases

The pipeline then extracts the text from each tweet, tags it with the NLP model, and stores the result in the database (MongoDB).
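The extract, tag, and store steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in Pipeline.java: `RawTweet`, `Tagger`, `KeywordTagger`, `TweetStore`, `InMemoryStore`, and `TweetPipeline` are hypothetical names, the keyword lookup is a toy stand-in for the trained CoreNLP model, and the in-memory list stands in for MongoDB. Only the `Tag` values come from the label list above.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// The six labels described above.
enum Tag { NEEDS, MED, OFFERS, LOC, CONTACT, SICK }

// Hypothetical minimal tweet: only the fields this sketch needs.
final class RawTweet {
    final String id;
    final String text;

    RawTweet(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

// Result of tagging: the tweet text plus the tokens found for each label.
final class TaggedTweet {
    final String text;
    final Map<Tag, List<String>> entities;

    TaggedTweet(String text, Map<Tag, List<String>> entities) {
        this.text = text;
        this.entities = entities;
    }
}

// Assumed interface; the project uses a trained CoreNLP model here.
interface Tagger {
    Map<Tag, List<String>> tag(String text);
}

// Toy stand-in for the model: labels tokens found in a fixed lexicon.
final class KeywordTagger implements Tagger {
    private final Map<String, Tag> lexicon;

    KeywordTagger(Map<String, Tag> lexicon) {
        this.lexicon = lexicon;
    }

    @Override
    public Map<Tag, List<String>> tag(String text) {
        Map<Tag, List<String>> found = new EnumMap<>(Tag.class);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            Tag tag = lexicon.get(token);
            if (tag != null) {
                found.computeIfAbsent(tag, k -> new ArrayList<>()).add(token);
            }
        }
        return found;
    }
}

// Assumed interface; the project writes to MongoDB here.
interface TweetStore {
    void save(TaggedTweet tweet);
}

final class InMemoryStore implements TweetStore {
    final List<TaggedTweet> saved = new ArrayList<>();

    @Override
    public void save(TaggedTweet tweet) {
        saved.add(tweet);
    }
}

final class TweetPipeline {
    private final Tagger tagger;
    private final TweetStore store;

    TweetPipeline(Tagger tagger, TweetStore store) {
        this.tagger = tagger;
        this.store = store;
    }

    // Step 1: extract the text. Step 2: tag it. Step 3: store the result.
    TaggedTweet process(RawTweet tweet) {
        String text = tweet.text.strip();
        TaggedTweet tagged = new TaggedTweet(text, tagger.tag(text));
        store.save(tagged);
        return tagged;
    }
}
```

Keeping the tagger and the store behind interfaces means a better tagging model or a different database could be swapped in without changing the pipeline itself.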

The project includes more than is described here, and we may extend it in the future with deduplication, a better tagging model, storage of geolocation data, and other features that could help AI and data mining.
