Twitter may be a useful data resource for healthcare research, but the literature on its potential in this area is still limited. The purpose of this study was to contrast the processes by which a large collection of unstructured disease-related tweets could be converted into structured data for further analysis. The objective was to gain insight into the content and behavioral patterns associated with disease-specific communication on Twitter.

This is the overview of my dissertation, published in August 2016 ('Data Mining Twitter for Cancer, Diabetes and Asthma Insights', Purdue University 2016). My focus at that time was on cancer, diabetes and asthma. During the course of this research I found evidence to support the hypothesis that a large and generally untapped source of healthcare data is generated on Twitter. The study puts forth data cleansing methods to handle the data quality issues that must be resolved in Twitter data prior to any analysis, and then applies segmentation to the resulting files to examine content and patterns by disease.

I am now collecting a similar dataset on the keywords 'COVID', 'virus', etc., which will serve as a catch-all for related tweets. I am also searching for additional existing datasets; please reach out if you have data to share, and I'll set up a site to share mine as well. I will then perform data cleansing and analysis on geolocation, date, symptoms, and more, in order to track patterns over time by region, individual-level experience, symptoms, and demographics, all of which are absent from the publicly available COVID-19 datasets today.
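As a rough illustration of the cleansing step described above, here is a minimal Python sketch that turns one raw tweet into a structured record. It is not the pipeline used in the dissertation. The input field names (`text`, `created_at`, `location`), the ISO timestamp format, the `CleanTweet` record and the `clean_tweet` helper are all assumptions made for this example, and the keyword list contains only the two terms named above.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional

# Only the two keywords named in the text; a real list would be longer.
KEYWORDS = ("covid", "virus")

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
RT_PREFIX_RE = re.compile(r"^RT\s*:?\s*")
WHITESPACE_RE = re.compile(r"\s+")


@dataclass
class CleanTweet:
    """Hypothetical structured record derived from one raw tweet."""
    posted_on: date
    text: str
    hashtags: List[str]
    matched_keywords: List[str]
    is_retweet: bool
    location: Optional[str]


def clean_tweet(raw: dict) -> Optional[CleanTweet]:
    """Return a structured record, or None if no keyword matches.

    Assumes `raw` has 'text', an ISO-formatted 'created_at', and an
    optional free-text 'location' (assumed field names).
    """
    text = raw.get("text", "")
    is_retweet = text.startswith("RT @")
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]

    # Strip URLs, @mentions and the retweet prefix; keep hashtag words.
    body = URL_RE.sub(" ", text)
    body = MENTION_RE.sub(" ", body)
    body = RT_PREFIX_RE.sub("", body)
    body = body.replace("#", " ")
    body = WHITESPACE_RE.sub(" ", body).strip().lower()

    # Substring match, so 'virus' also catches 'coronavirus' (catch-all).
    matched = [kw for kw in KEYWORDS if kw in body]
    if not matched:
        return None

    location = (raw.get("location") or "").strip() or None
    posted_on = datetime.fromisoformat(raw["created_at"]).date()
    return CleanTweet(posted_on, body, hashtags, matched, is_retweet, location)


example = clean_tweet({
    "text": "RT @user: COVID cases rising https://t.co/x #Virus",
    "created_at": "2020-03-15T12:30:00",
    "location": " Indiana ",
})
print(example)
```

Substring matching is deliberately loose here, in keeping with the catch-all intent; stricter filtering, geocoding of the free-text location, and symptom extraction would be separate later steps.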

Research has shown that Twitter data can be used to accurately detect new outbreaks and patterns related to influenza and Ebola, providing data that is available 2 weeks prior to CDC data on related events. This project will hopefully provide a new dataset to support epidemiological studies that are stalled today due to the lack of epidemiological data and the absence of a global data governance strategy to manage global public health data.

This tool will provide this Twitter data to researchers on a daily basis, in both its raw and 'cleansed' formats, for analysis. It will also provide a daily report on insights generated from the file, along with forecasts and findings on country-, state- and county-level differences in patterns, covering both publicly reported COVID-19 morbidity and mortality data and the related keywords, concepts and structured variables derived from the unstructured Twitter data. Once the batch process is refined, it can be automated with streaming Twitter feeds through the API, giving continual updates that support the generation of real-time trends, patterns and insights. Essentially, given the number of patients, this is not a 'big data' or IoT problem. Other data sources (such as Twitter, search data from Google and other global search engines, mobile tracking data, etc.) do, however, represent supplemental data that falls into the category of 'big data' and IoT, and may serve as a proxy for the currently unavailable global patient-level data needed to develop key insights.
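To make the idea of a daily report more concrete, here is a minimal Python sketch of one possible aggregation: counting tweets per day, region and keyword. The record layout (`posted_on`, `region`, `keywords`) and the function names `daily_keyword_counts` and `report_rows` are assumptions for illustration only; the actual report format and forecasting methods are not specified here.

```python
from collections import Counter
from datetime import date


def daily_keyword_counts(records):
    """Count tweets per (day, region, keyword).

    Each record is assumed to be a dict with 'posted_on' (a date),
    'region' (a string or None) and 'keywords' (a list of strings).
    """
    counts = Counter()
    for rec in records:
        region = rec.get("region") or "unknown"
        # A tweet counts at most once per keyword, even if repeated.
        for keyword in set(rec["keywords"]):
            counts[(rec["posted_on"], region, keyword)] += 1
    return counts


def report_rows(counts):
    """Flatten the counts into sorted (day, region, keyword, n) rows."""
    return sorted((day, region, kw, n) for (day, region, kw), n in counts.items())


# Made-up sample records, for illustration only.
sample = [
    {"posted_on": date(2020, 3, 15), "region": "Indiana", "keywords": ["covid", "fever"]},
    {"posted_on": date(2020, 3, 15), "region": "Indiana", "keywords": ["covid"]},
    {"posted_on": date(2020, 3, 16), "region": None, "keywords": ["fever"]},
]

for row in report_rows(daily_keyword_counts(sample)):
    print(row)
```

The same function could be run once per batch file, or called repeatedly on small windows of a streaming feed once the batch process is automated.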
