The majority of our work comprised cleaning and initial exploratory plotting of the provided data sets. While we were able to fit a few statistical methods, we concluded that a detailed restructuring of the data is best performed before much time is spent applying and tuning models. We have also brainstormed several modeling ideas that we think could work well, and noted several probable interfaces with the other teams' work. Our progress thus far is summarized below.
During our initial brainstorming session, we decided to focus on the relationships between the count data and the placement of treatment centers. We wondered if we could predict the way their placement was selected (an easily verifiable value, which could perhaps be used to train a model), and also perform causal inference on their effects on cases and deaths. While we did not have time to create such a model within the time frame of the hackathon, this objective served to guide our exploration of the data, and precipitated the formulation of numerous questions that could be used in future analyses.
Munging and visualization of the Sub National Time Series data set:
As a step towards "Tidy Data," the time series count set was reshaped from long to wide format, with the data for each type of count now stored in its own column of the data frame.
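The long-to-wide reshape can be sketched in base R as follows. This is a minimal illustration on toy data; the real column names in the Sub National Time Series set may differ, so `region`, `date`, `category`, and `value` here are assumptions.

```r
# Toy long-format count data: one row per (region, date, count type).
long <- data.frame(
  region   = c("A", "A", "B", "B"),
  date     = as.Date(rep("2014-08-01", 4)),
  category = c("cases", "deaths", "cases", "deaths"),
  value    = c(10, 2, 5, 1)
)

# Pivot so each count type gets its own column (value.cases, value.deaths).
wide <- reshape(long,
                idvar     = c("region", "date"),
                timevar   = "category",
                direction = "wide")
```

The same reshape can also be done with `tidyr` or `reshape2` if those packages are preferred.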
We noticed that the count data are provided by five different organizations, which may well have used different methods to collect and record them. There also appeared to be numerous clusters by source in the time series. The vast majority of the data are government-sourced, but determining what use there might be for the other sources will require additional analysis.
Next, we extracted the total number of cases (maximum of the count) for each region. We used these pooled counts for several calculations detailed later, but with the understanding that they were problematic. For example, there were regions where the number of deaths exceeded confirmed cases.
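Extracting the pooled totals, and flagging the deaths-exceed-cases anomaly mentioned above, can be sketched like this (toy data; column names are assumptions):

```r
# Toy cumulative counts per region over time.
counts <- data.frame(
  region = c("A", "A", "B", "B"),
  cases  = c(10, 25, 3, 4),
  deaths = c(1, 5, 6, 8)
)

# Pooled totals: the maximum of each cumulative count per region.
totals <- aggregate(cbind(cases, deaths) ~ region, data = counts, FUN = max)

# Flag regions where recorded deaths exceed confirmed cases.
bad <- totals[totals$deaths > totals$cases, ]
```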
To look for further anomalies in the data, we created individual scatter plots of case and death counts for every region, with the date on the x axis and the count at each time point on the y axis. The points were colored by data source. A handful of these plots were inspected manually, and a number of interesting features were discovered over the course of the hackathon:
1) Vast amounts of missing data. Many of the missing points are closely bounded by non-missing values in the time series and so can be easily interpolated, but there are also large gaps and regions that are barely represented. There is a New Cases column, which could potentially be used to verify and/or repair the counts.
2) Clumps in the time series by source.
3) Varied start times, though there seems to be a higher density of observations in later months. Counts that start low and work their way up could indicate data collection beginning before an outbreak in that region, while counts that begin at a higher level were likely recorded after the region became infected.
4) It is not explicitly detailed whether a count was entered at the time the case was discovered, or when the person was known or believed to have become ill.
5) Some regions show discrete updates at a few time points, while others have smoother upward curves.
6) Finally, some regions' case counts are clearly recorded as a cumulative count of cases confirmed thus far at each time point (the way the death counts are), yet others show decreases in cases at later time points. We wonder whether some counts are being updated as people recover or die while others are not, or whether this observation is due to errors. It is possible that this question can be answered by other clues in the data or with a machine learning algorithm.
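The per-region diagnostic plot described above can be sketched with `ggplot2` as follows. The toy data frame and its column names (`date`, `cases`, `source`) are assumptions standing in for one region's slice of the real data:

```r
library(ggplot2)

# Toy slice for one region, with a gap and two reporting sources.
ts <- data.frame(
  date   = as.Date("2014-08-01") + 0:4,
  cases  = c(1, 3, NA, 7, 9),
  source = c("gov", "gov", "gov", "WHO", "WHO")
)

# Scatter of cumulative cases over time, colored by source.
p <- ggplot(ts, aes(x = date, y = cases, colour = source)) +
  geom_point() +
  labs(title = "Region A (toy)", x = "Date", y = "Cumulative cases")
```

Looping this over regions (e.g. with `lapply` over unique region names) produces the full set of plots for manual inspection.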
We read in the covariate data and calculated the correlation between each variable and the pooled case and death counts. Of the 22 covariates, five had correlations with the counts stronger than an absolute value of 0.4 and a sign that seemed correct: Years of Education, Years of Education for men and for women, age, and whether or not the location is urban. However, we found numerous reasons to believe that all of the above correlations are inaccurate due to collection bias and/or errors.
A series of linear regressions was then performed between the cumulative count data and each of the covariates, with similar results.
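The one-covariate-at-a-time regression pass can be sketched the same way (toy data; all names are illustrative):

```r
# Toy merged table: pooled case counts plus two covariates per region.
d <- data.frame(
  cases = c(10, 40, 15, 80, 55),
  educ  = c(2, 6, 3, 9, 7),
  urban = c(0, 1, 0, 1, 1)
)

# Fit cases ~ covariate separately for each covariate.
fits <- lapply(c("educ", "urban"),
               function(v) lm(reformulate(v, response = "cases"), data = d))

# Extract the slope from each fit for a quick sign/magnitude check.
slopes <- sapply(fits, function(f) unname(coef(f)[2]))
```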
Finally, we merged in and began to process the time series data on the opening and closure of treatment centers. We found issues with these data similar to those of the count data. For example, according to the data set, there is a treatment center that opened on August 20th, 2014 and closed on July 28th, 2014. Unfortunately, we did not have time to incorporate these data further.
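A date-consistency check of the kind that catches the anomaly above can be sketched as follows (toy rows; the center names and column names are assumptions, with the dates taken from the example just mentioned):

```r
# Toy treatment-center records.
centers <- data.frame(
  name   = c("ETC-1", "ETC-2"),
  opened = as.Date(c("2014-08-20", "2014-06-01")),
  closed = as.Date(c("2014-07-28", "2014-12-01"))
)

# Flag centers whose recorded closure precedes their opening.
bad_dates <- centers[centers$closed < centers$opened, ]
```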
A feasible next step towards our objective will be to go back to the count plots and add vertical lines to each plot at the dates when treatment centers opened in that region, then visually check for any apparent patterns. This could potentially guide the next step in this process, which will be to fit machine learning algorithms to the relationship between treatment center establishment and case/death counts.
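This proposed overlay can be sketched in `ggplot2` by adding `geom_vline` at the opening dates. The toy data frame, opening dates, and column names below are assumptions:

```r
library(ggplot2)

# Toy cumulative case series for one region.
set.seed(1)
ts <- data.frame(
  date  = as.Date("2014-08-01") + 0:9,
  cases = cumsum(rpois(10, 3))
)

# Toy treatment-center opening dates for the same region.
openings <- as.Date(c("2014-08-04", "2014-08-08"))

# Count plot with dashed vertical lines marking center openings.
p <- ggplot(ts, aes(x = date, y = cases)) +
  geom_point() +
  geom_vline(xintercept = openings, linetype = "dashed") +
  labs(x = "Date", y = "Cumulative cases")
```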
We have included R code with hopefully helpful commentary, and five example plots:
The first plot is the summed counts of all of West Africa. Each line is a country. Clearly, there are some issues...
The next plot is an example of a region where the data were from a few sources.
The next two are counts that started later in the epidemic.
The next two are the case and death counts from the same country: the first has missing values, the second shows discrete updates.
The last plot seems to be the most useful.