Inspiration

Build a comprehensive text analysis app and dashboard to gain real-time insights from COVID-19 documents and social messages. There is an enormous opportunity to help people suffering from mental anxiety through positive messages and to share relevant insights with the corresponding agencies.

What it does

The goal of project COVITA (COVID-19 Text Analyzer) is to analyze COVID-19 texts, such as medical research documents and tweets, in order to find relevant topics, understand user intents, detect geospatial outbreaks, identify and help heal mental anxiety, and track the availability of medical kits. Once we have showcased the capability of our text analysis, we would like to extend it further to perform medical document recommendation, identify sensitive information, spread positive, uplifting messages, predict outbreaks, and create a marketplace for medical kit providers.

How I built it

  • Development Environments: Databricks, Google Colab, Azure, Google Notebook Instance (Jupyter)
  • Technologies: Spark (Spark-SQL, Spark-NLP), Python (core libraries, spaCy, Pandas, Folium, NumPy); a minimal NER pipeline sketch follows this list
  • Code: https://github.com/hacking-for-humanity/COVITA
  • Current data storage: Google Cloud Bucket
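As a rough illustration of how these pieces fit together, here is a minimal Spark-NLP sketch of the kind of pretrained NER pipeline we run over tweet text. It is a sketch, not our exact notebook code: the model names (glove_100d, ner_dl) are common public Spark-NLP pretrained models, and the text column and sample row are assumptions.

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter

spark = sparknlp.start()  # Spark session with Spark-NLP on the classpath

# Assemble raw text, tokenize, embed, tag entities, and merge tags into chunks.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl") \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
entities = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("entities")

pipeline = Pipeline(stages=[document, tokenizer, embeddings, ner, entities])

# Hypothetical one-tweet DataFrame; the real input is the hydrated tweet corpus.
sample = spark.createDataFrame([("WHO reports new cases in New York",)], ["text"])
result = pipeline.fit(sample).transform(sample)
result.selectExpr("entities.result").show(truncate=False)
```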

Challenges I ran into

We faced challenges storing large datasets, processing a massive volume of tweets, and finding persistent compute. We relentlessly fixed issues, leveraged multiple cloud platforms (Azure and Google Cloud), and used multiple development environments (Colab notebooks, Databricks, Dataproc, and Google VMs) in order to store the data and use as much free compute power as possible. We also faced issues running the licensed version of Spark-NLP inside a Databricks instance.

Accomplishments that I'm proud of

I was able to explore many different areas and gain insights, as explained in this document: https://github.com/hacking-for-humanity/COVITA/blob/master/Project%20COVITA%201.0.pdf

  • Analyze hashtag, user-mention, follower, and favourite-count data (see the sketch after this list)
  • Extract and analyze entities (persons, organizations, events, national groups, locations) from tweets
  • Analyze biological named entities in tweets and the research literature
  • Create topic clusters over a period of time
  • Analyze the sentiments associated with the topics
  • Analyze mental anxiety patterns
  • Perform geospatial analysis of outbreaks
  • Build and augment a prediction model for outbreaks

I found a lot of new problems to solve while working on this hackathon, and I am very excited about them.
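For reference, the hashtag analysis reduces to something like the following PySpark sketch; the tweets DataFrame and its full_text column are assumptions standing in for however the hydrated tweets were actually loaded.

```python
from pyspark.sql import functions as F

# "tweets" is an assumed DataFrame of hydrated tweets with a "full_text" column.
hashtag_counts = (
    tweets
    .select(F.explode(F.split(F.col("full_text"), r"\s+")).alias("token"))
    .filter(F.col("token").startswith("#"))
    .groupBy(F.lower(F.col("token")).alias("hashtag"))
    .count()
    .orderBy(F.desc("count"))
)
hashtag_counts.show(20, truncate=False)
```

User mentions follow the same pattern with an "@" filter, and follower and favourite counts come straight from the hydrated tweet JSON fields.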

What I learned

Spark-NLP, biological NER, advanced geospatial visualizations, and a deeper understanding of the problem domain; I also learned about many interesting new requirements to pursue as future work.

What's next for COVITA

  • extend our work further to perform medical document recommendation
  • identify sensitive information and cluster it into relevant topics over a period of time
  • spread positive, uplifting messages to users showing signs of chronic mental anxiety
  • build a robust outbreak prediction and alerting app
  • create a marketplace for medical kit providers by identifying demand and supply
  • build a recommendation model for suggesting and highlighting relevant, impactful hashtags and users
  • explore different types of NER models, such as en_core_web_lg and en_core_web_md
  • cluster the different entities based on metrics like sentiment, followers, and frequency with different statistical variations (rate of change, moving average, standard deviation, etc.), and create a time series of these variations to detect anomalies and patterns
  • create more sophisticated geospatial maps by correlating infection rates with outbreak locations (a minimal mapping sketch follows this list)
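As a hint of what those maps could look like, here is a minimal Folium sketch that plots outbreak locations with markers sized by case count. The coordinates and counts below are made up for illustration; real inputs would come from the geocoded tweets and infection-rate data.

```python
import folium

# Hypothetical (name, lat, lon, case_count) rows for illustration only.
outbreaks = [
    ("New York", 40.71, -74.01, 5000),
    ("Seattle", 47.61, -122.33, 1200),
]

m = folium.Map(location=[39.8, -98.6], zoom_start=4)
for name, lat, lon, count in outbreaks:
    folium.CircleMarker(
        location=[lat, lon],
        radius=max(4, count / 500),  # crude scaling of marker size by cases
        popup=f"{name}: {count} cases",
        color="crimson",
        fill=True,
    ).add_to(m)
m.save("outbreak_map.html")  # open in a browser to view the map
```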


Updates


Access Tweet Data

-- Tweet data will be kept available in the following bucket for testing during the hackathon: https://console.cloud.google.com/storage/browser/bucket-covid/TweetData/COVID-19-TweetIDs-master

-- If the above location is not accessible, simply download the Tweet IDs from https://github.com/echen102/COVID-19-TweetIDs and then hydrate them using this tool: https://github.com/DocNow/hydrator
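The DocNow Hydrator is a desktop app; if you prefer to script the hydration, twarc (from the same DocNow project) exposes the equivalent capability in Python. A minimal sketch, assuming you have Twitter API credentials and a file named tweet_ids.txt with one ID per line:

```python
import json
from twarc import Twarc

# Placeholder credentials; twarc needs real Twitter API keys to hydrate IDs.
t = Twarc("consumer_key", "consumer_secret", "access_token", "access_token_secret")

# Read one tweet ID per line and write full tweet JSON, one object per line.
with open("tweet_ids.txt") as ids, open("hydrated.jsonl", "w") as out:
    for tweet in t.hydrate(ids):
        out.write(json.dumps(tweet) + "\n")
```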

Run Notebook in Databricks

-- (1) For small amounts of data, upload the hydrated tweets into the Databricks notebook

-- (2) Remove the Google Cloud Hadoop storage class from the Spark config, since we shall access files directly from dbfs:// or /FileStore/
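Once uploaded, reading the tweets back is a one-liner; the path below is hypothetical and will be whatever the Databricks upload dialog reports (spark is predefined in Databricks notebooks):

```python
# Hypothetical FileStore path from the Databricks upload step.
tweets = spark.read.json("dbfs:/FileStore/tables/hydrated_tweets.jsonl")
tweets.select("id_str", "full_text", "created_at").show(5, truncate=False)
```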

Run Notebooks in a Jupyter Instance

-- (1) You can create a free Jupyter notebook instance in a Google VM with 60 GB RAM and a 400 GB HDD: https://cloud.google.com/ai-platform/notebooks/docs/create-new

-- (2) Simply set up local port forwarding: gcloud compute ssh --project --zone -b -- -L 8088:localhost:8080

-- (3) Access the notebook at http://localhost:8088/lab?

-- (4) Ensure the Google Hadoop connector jar is present under the Spark home so that tweets in the bucket can be accessed easily: gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar /opt/conda/lib/python3.7/site-packages/pyspark/jars/

-- (5) The scripts for installing Java and PySpark are specified in the corresponding notebooks
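Putting steps (2) through (4) together, a Spark session along these lines should be able to read the bucket directly. The two config keys are the standard GCS-connector filesystem settings; the trailing path segment is illustrative, not an exact file name.

```python
from pyspark.sql import SparkSession

# Assumes the gcs-connector jar from step (4) is already in pyspark's jars dir.
spark = (
    SparkSession.builder
    .appName("covita")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .getOrCreate()
)

# Illustrative path into the hackathon bucket described in the update above.
tweets = spark.read.json("gs://bucket-covid/TweetData/COVID-19-TweetIDs-master/2020-01/*")
```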
