What the code does

The code has four parts:

Twitter data extraction
Stream Processing (via Flink)
NLP processing (using CoreNLP)
Storage to database (MongoDB)

While approval for Premium API access is pending, we provide two ways to get data from Twitter. The first polls the current free APIs and publishes the tweets to Kafka, from which Flink consumes them (AppKafka.java). The second streams tweets through Flink's Twitter Connector (AppStream). Because the main solution is built on Flink, both use the same Flink pipeline (Pipeline.java).
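The idea of two entry points sharing one pipeline can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Flink or Kafka APIs, an in-memory queue stands in for the Kafka topic, and the class names (SharedPipelineSketch, PollingEntryPointSketch, StreamingEntryPointSketch) are hypothetical, not the project's actual classes.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical stand-in for the shared pipeline (Pipeline.java in the project). */
final class SharedPipelineSketch {
    /** Builds a handler that trims each raw tweet and collects it into the given sink. */
    static Consumer<String> build(List<String> sink) {
        return rawTweet -> sink.add(rawTweet.trim());
    }
}

/** Polling path: tweets are buffered in a queue (stand-in for a Kafka topic), then drained. */
final class PollingEntryPointSketch {
    static void run(List<String> polledTweets, Consumer<String> pipeline) {
        Queue<String> topic = new ArrayDeque<>(polledTweets); // "publish" step
        while (!topic.isEmpty()) {
            pipeline.accept(topic.poll()); // "consume" step
        }
    }
}

/** Streaming path: tweets are handed to the pipeline directly as they arrive. */
final class StreamingEntryPointSketch {
    static void run(Iterable<String> liveTweets, Consumer<String> pipeline) {
        for (String tweet : liveTweets) {
            pipeline.accept(tweet);
        }
    }
}
```

Both entry points receive the same Consumer, so any change to the pipeline logic applies to both ingestion paths without duplication.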

We trained a simple Spanish-language model (in the resources/train_data folder) using existing tweets from the project's initial data set as well as other sources. It tags text with the following labels:

NEEDS: people needing medicine
MED: medicine names
OFFERS: people offering medicine
LOC: locations
CONTACT: contact information
SICK: sickness or diseases
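The label set above could be represented in code as an enum, which keeps tag names consistent between the NLP step and the stored documents. The following is a minimal sketch; the enum name TweetTag and the fromLabel helper are hypothetical and are not taken from the project.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical enum mirroring the six labels listed above. */
enum TweetTag {
    NEEDS("people needing medicine"),
    MED("medicine names"),
    OFFERS("people offering medicine"),
    LOC("location"),
    CONTACT("contact information"),
    SICK("sickness or diseases");

    private final String description;

    TweetTag(String description) {
        this.description = description;
    }

    String description() {
        return description;
    }

    /** Case-insensitive lookup; null or unknown labels yield an empty Optional. */
    static Optional<TweetTag> fromLabel(String label) {
        if (label == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(valueOf(label.trim().toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

Returning an Optional instead of throwing lets the caller decide how to treat labels that fall outside the trained set.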

The Pipeline then extracts the text from the tweet, processes it using the NLP model and stores it in the DB.
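The three steps (extract text, tag it, store the result) can be sketched in plain Java as follows. Everything here is an assumption made for illustration: the tweet is simplified to a map with a "text" field, the tagger is an arbitrary function standing in for the trained NLP model, and an in-memory list stands in for the MongoDB collection. The class name EnrichmentSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical extract -> tag -> store sketch; not the project's Flink pipeline. */
final class EnrichmentSketch {
    // Stand-in for the NLP model: maps tweet text to tag name -> matched phrases.
    private final Function<String, Map<String, List<String>>> tagger;
    // Stand-in for the MongoDB collection.
    private final List<Map<String, Object>> store = new ArrayList<>();

    EnrichmentSketch(Function<String, Map<String, List<String>>> tagger) {
        this.tagger = tagger;
    }

    /** Extracts the text, tags it, and stores one document; tweets without text are skipped. */
    boolean process(Map<String, Object> tweet) {
        Object text = tweet.get("text");
        if (!(text instanceof String) || ((String) text).trim().isEmpty()) {
            return false;
        }
        String tweetText = (String) text;
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("text", tweetText);
        document.put("tags", tagger.apply(tweetText));
        store.add(document);
        return true;
    }

    /** Returns the documents stored so far. */
    List<Map<String, Object>> stored() {
        return store;
    }
}
```

Passing the tagger in as a function keeps the sketch independent of any particular NLP library, so the model can be swapped without touching the extract and store steps.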

The project includes more than is described here, and we might expand it in the future with deduplication, a better tagging model, storage of geolocation data, and other features that could help AI and data mining.

Built With

flink
java
opennlp

Submitted to

Code for Venezuela SF Codeathon

Created by

I worked on the NLP training, Flink enrichment and MongoDB storage, and also on some administrative tasks such as setting up the repo and base project, adding some Docker machines and coordinating the submission with other sites.

Christian Vielma
I worked on the NLP part: suggesting the library, cleaning up the example data provided in the original work description and tagging words for the NLP training. I also added some Gradle compatibility with NetBeans.

Marcos Grillo
ManuelSalgado

Updates

Christian Vielma started this project — Apr 13, 2019 05:07 AM EDT
