Build a comprehensive text analysis app and dashboard to help gain real-time insights from Covid-19 documents and social media messages. There is an enormous opportunity to help people suffering from mental anxiety through positive messages and to share relevant insights with the corresponding agencies.
What it does
The goal of project COVITA (Covid19 Text Analyzer) is to analyze Covid-19 texts, such as medical research documents and tweets, in order to find the relevant topics, understand user intents, find geospatial outbreaks, identify and help heal mental anxieties, and find the availability of medical kits. Once we showcase the capability of our text analysis, we would like to extend it further to perform medical document recommendation, identify sensitive information, spread positive uplifting messages, predict outbreaks, and create a marketplace for medical kit providers.
How I built it
- Development Environments: Databricks, Google Colab, Azure, Google Notebook Instance (Jupyter)
- Technologies: Spark (Spark SQL, Spark NLP), Python (core libraries, spaCy, pandas, Helium, NumPy)
- Code: https://github.com/hacking-for-humanity/COVITA
- Current data storage: Google Cloud Bucket
Challenges I ran into
We faced challenges in storing large data, processing a massive volume of tweets, and finding persistent compute. We relentlessly fixed issues and leveraged multiple cloud platforms, such as Azure and Google Cloud, along with multiple development environments, such as Colab notebooks, Databricks, Dataproc, and Google VMs, in order to store data and use as much free compute power as possible. We also faced issues running the licensed version of Spark NLP inside a Databricks instance.
Accomplishments that I'm proud of
I was able to explore many different areas and gain insights, as explained in this document: https://github.com/hacking-for-humanity/COVITA/blob/master/Project%20COVITA%201.0.pdf
- Analyze the hashtag, user-mention, follower-count, and favourite-count data
- Extract and analyze entities (persons, organizations, events, national groups, locations) from tweets
- Analyze the biological named entities in tweets and research literature
- Create the Topic clusters over a period of time
- Analyze the sentiments associated with the Topics
- Analyze the Mental Anxiety Pattern
- GeoSpatial Analysis of Outbreaks
- Build and augment a prediction model for outbreaks
I found a lot of new problems to solve while working on this hackathon, and I am very excited about them.
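As a rough illustration of the first item in the list above, hashtag and user-mention counting can be sketched in plain Python. This is not the code from the COVITA repository: the function name `count_tags_and_mentions`, the regular expressions, and the sample tweets are all assumptions made up for this example.

```python
import re
from collections import Counter

# Simplified patterns (assumption): they do not cover every edge case of real
# tweet text, e.g. an email address would be picked up as a mention.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def count_tags_and_mentions(tweets):
    """Return (hashtag_counts, mention_counts) for an iterable of tweet texts."""
    hashtags = Counter()
    mentions = Counter()
    for text in tweets:
        # Lowercase so that #COVID19 and #covid19 are counted together.
        hashtags.update(tag.lower() for tag in HASHTAG_RE.findall(text))
        mentions.update(name.lower() for name in MENTION_RE.findall(text))
    return hashtags, mentions


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Stay home and stay safe #COVID19 #StayHome @WHO",
    "Masks are sold out again #covid19 @who @CDCgov",
]

hashtags, mentions = count_tags_and_mentions(sample_tweets)
print(hashtags.most_common(2))
print(mentions.most_common(2))
```

On a large tweet corpus the same counting would more likely be done as a Spark aggregation; the plain-Python version is only meant to show the idea.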
What I learned
I learned Spark NLP, biological NER, and advanced geospatial visualizations. I also gained a deeper understanding of the problem domain and discovered many interesting new requirements for future work.
What's next for COVITA
- extend our work further to perform medical document recommendation
- identify sensitive information and cluster it into relevant topics over a period of time
- spread positive uplifting messages to users showing signs of chronic mental anxieties
- build a robust outbreak prediction and alerting app
- create a marketplace for medical kit providers by identifying demand and supply
- build a recommendation model for suggesting and highlighting relevant, impactful hashtags and users
- explore different types of NER models like en_core_web_lg and en_core_web_md
- cluster the different entities based on metrics such as sentiment, followers, and frequency, using statistical variations of each metric (rate of change, moving average, standard deviation, etc.), and build a time series of those statistics to detect anomalies and patterns
- create more sophisticated geospatial maps by correlating infection rate with outbreak locations
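The time-series idea in the list above (rate of change, moving average, and standard deviation used to detect anomalies) can be sketched as follows. This is only a minimal sketch under stated assumptions, not a planned implementation: the function name `rolling_anomalies`, the window size, the threshold, and the daily counts are hypothetical values chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_anomalies(values, window=3, threshold=3.0):
    """Return rows of (index, value, moving_avg, std_dev, rate_of_change, is_anomaly).

    The moving average and standard deviation are computed over the `window`
    values before each point, so a spike does not inflate its own baseline.
    """
    history = deque(maxlen=window)
    rows = []
    for i, value in enumerate(values):
        if len(history) == window:
            avg = mean(history)
            std = pstdev(history)
            prev = history[-1]
            # Relative change versus the previous point; undefined if prev is 0.
            rate = (value - prev) / prev if prev else None
            # Flag points more than `threshold` standard deviations from the mean.
            is_anomaly = std > 0 and abs(value - avg) > threshold * std
            rows.append((i, value, avg, std, rate, is_anomaly))
        history.append(value)
    return rows


# Hypothetical daily mention counts for one entity, with a spike on day 5.
daily_mentions = [10, 12, 11, 13, 12, 40, 12, 11]

rows = rolling_anomalies(daily_mentions)
anomalies = [index for index, *_rest, flagged in rows if flagged]
print(anomalies)
```

The same statistics could be computed per entity for sentiment or follower counts; the window and threshold would need tuning against real data.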