Here's our code: https://github.com/cvielma/Code4Venezuela
What the code does
This code has different parts:
- Twitter data extraction
- Stream processing (via Flink)
- NLP processing (using CoreNLP)
- Storage to database (MongoDB)
While we wait for approval for Premium API access, we provide two ways to get data from Twitter. The first polls the current free APIs and publishes the results to Kafka, where Flink consumes them (AppKafka.java); we need this extra hop because the main solution is based on Flink. The second streams tweets through Flink's Twitter connector (AppStream). Both use the same Flink pipeline (Pipeline.java).
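The idea of two ingestion paths feeding one shared pipeline can be sketched in plain Java, without Flink or Kafka dependencies. This is only an illustration under stated assumptions: the class and method names (`IngestionSketch`, `SharedPipeline`, `pollAndPublish`, `consume`, `stream`) are hypothetical and do not come from the repository, and an in-memory queue stands in for the Kafka topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from the repository.
class IngestionSketch {

    // The shared pipeline: every tweet text ends up here, whatever its source.
    static class SharedPipeline implements Consumer<String> {
        final List<String> processed = new ArrayList<>();

        @Override
        public void accept(String tweetText) {
            processed.add(tweetText.trim());
        }
    }

    // Path 1 (polling), step a: publish a polled batch to a queue
    // that stands in for a Kafka topic.
    static void pollAndPublish(List<String> polledBatch, BlockingQueue<String> topic) {
        topic.addAll(polledBatch);
    }

    // Path 1 (polling), step b: drain the queue into the pipeline.
    static void consume(BlockingQueue<String> topic, Consumer<String> pipeline) {
        String tweet;
        while ((tweet = topic.poll()) != null) {
            pipeline.accept(tweet);
        }
    }

    // Path 2 (streaming): hand each tweet to the pipeline as it arrives.
    static void stream(Iterable<String> liveTweets, Consumer<String> pipeline) {
        for (String tweet : liveTweets) {
            pipeline.accept(tweet);
        }
    }

    public static void main(String[] args) {
        SharedPipeline pipeline = new SharedPipeline();
        BlockingQueue<String> topic = new LinkedBlockingQueue<>();

        pollAndPublish(List.of(" Necesito insulina ", "Ofrezco losartan"), topic);
        consume(topic, pipeline);
        stream(List.of("Busco amoxicilina en Caracas"), pipeline);

        System.out.println(pipeline.processed);
    }
}
```

Because both paths only depend on a `Consumer<String>`, the downstream processing does not need to know where a tweet came from, which is the property the shared Pipeline.java provides in the real project.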
We trained a simple Spanish-language model (in the resources/train_data folder) using existing tweets from the project's initial data set as well as other sources. It tags text with the following labels:
- NEEDS: people needing medicine
- MED: medicine names
- OFFERS: people offering medicine
- LOC: location
- CONTACT: contact information
- SICK: sickness or diseases
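For downstream code, the label set above could be captured as a Java enum so that unknown labels are handled explicitly. This is a minimal sketch: the `Tag` type and its `fromLabel` helper are hypothetical, and the repository may represent labels differently (for example as plain strings).

```java
import java.util.Optional;

// Hypothetical sketch: the repository may represent labels differently.
enum Tag {
    NEEDS("people needing medicine"),
    MED("medicine names"),
    OFFERS("people offering medicine"),
    LOC("location"),
    CONTACT("contact information"),
    SICK("sickness or diseases");

    final String description;

    Tag(String description) {
        this.description = description;
    }

    // Map a raw label string to a Tag. Labels outside the set
    // (for example a background label such as "O") yield an empty Optional.
    static Optional<Tag> fromLabel(String label) {
        for (Tag t : values()) {
            if (t.name().equals(label)) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromLabel("MED").map(t -> t.description).orElse("unknown"));
        System.out.println(fromLabel("O").isPresent());
    }
}
```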
The pipeline then extracts the text from each tweet, processes it with the NLP model, and stores the result in the DB.
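The three steps (extract, tag, store) can be sketched in plain Java as follows. This is not the code in Pipeline.java: all names here are hypothetical, a fixed lexicon lookup stands in for the trained CoreNLP model, a map stands in for a tweet, and an in-memory list stands in for the MongoDB collection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and data shapes are illustrative only.
class PipelineSketch {

    // Step 1: extract the text from a tweet, modelled here as a map with a "text" field.
    static String extractText(Map<String, String> tweet) {
        return tweet.getOrDefault("text", "");
    }

    // Step 2: toy tagger standing in for the trained NLP model.
    // It labels each whitespace-separated token found in a fixed lexicon;
    // a real model would generalize beyond known words.
    static Map<String, String> tag(String text, Map<String, String> lexicon) {
        Map<String, String> tagged = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            String label = lexicon.get(token);
            if (label != null) {
                tagged.put(token, label);
            }
        }
        return tagged;
    }

    // Step 3: store the tagged result; the list stands in for a database collection.
    static void store(List<Map<String, String>> db, Map<String, String> record) {
        db.add(record);
    }

    // Run the three steps in order for one tweet.
    static void process(Map<String, String> tweet,
                        Map<String, String> lexicon,
                        List<Map<String, String>> db) {
        String text = extractText(tweet);
        Map<String, String> tagged = tag(text, lexicon);
        store(db, tagged);
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = Map.of(
                "necesito", "NEEDS",
                "insulina", "MED",
                "caracas", "LOC");
        List<Map<String, String>> db = new ArrayList<>();

        process(Map.of("id", "1", "text", "Necesito insulina en Caracas"), lexicon, db);

        System.out.println(db);
    }
}
```

Keeping each step as its own function mirrors how a stream processor composes stages, so any one of them (for instance the tagger) can be swapped without touching the others.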
The project includes more than what is described here, and we might expand it in the future with features such as deduplication, a better tagging model, storage of geolocation data, and anything else that could help AI and data mining.