Ebola Prevalence Prediction in Africa

This report proposes a predictive model for the prevalence of Ebola in Africa. Dataset 1 contains information on 159,478 locations, with one observation per location; for each location it provides longitude, latitude, population, region name, and other fields. Dataset 2 reports total Ebola cases for 61 regions. The two datasets were merged by region name, which yields the longitude and latitude of 61 locations belonging to the 61 regions in dataset 2.

The prediction is framed as a classification problem. First, I labeled the prevalence of the 61 regions by total cases: regions whose total case counts rank in the top 25 percent were labeled high prevalence, regions in the bottom 25 percent were labeled low prevalence, and the regions in between were labeled medium prevalence. Each of the 61 extracted locations in the merged data was assumed to have the same prevalence level as the region it belongs to, which gives prevalence labels for 61 locations in dataset 1. The longitude, latitude, and prevalence label of these 61 locations make up the training dataset.

Because only a few features are available for all locations in dataset 1, longitude and latitude were used as the training features. The classification method was random forest, fitted with the R package 'randomForest' on the 61 labeled locations. The fitted model was then used to predict the prevalence level of all other locations in dataset 1, so every location in dataset 1 has a prevalence level. Currently the average classification error on the training dataset is 49.18 percent, which is very high. Adding one more feature (e.g. population) decreased the classification error to 39.34 percent, even though population has many missing values.
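The quartile-based labeling described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the project, and it rests on assumptions: the function name `label_prevalence`, the toy region names and case counts, and the handling of ties at the cut points are all hypothetical, since the report does not say how ties were resolved.

```python
from statistics import quantiles


def label_prevalence(total_cases):
    """Map each region to 'low', 'medium', or 'high' by quartile of total cases.

    Assumption: values at or above the upper quartile count as 'high' and
    values at or below the lower quartile count as 'low'.
    """
    q1, _median, q3 = quantiles(list(total_cases.values()), n=4, method="inclusive")
    labels = {}
    for region, cases in total_cases.items():
        if cases >= q3:
            labels[region] = "high"
        elif cases <= q1:
            labels[region] = "low"
        else:
            labels[region] = "medium"
    return labels


# Hypothetical toy data (not the real case counts from dataset 2).
toy_cases = {"A": 0, "B": 5, "C": 10, "D": 50, "E": 400, "F": 1200, "G": 3000, "H": 4000}
toy_labels = label_prevalence(toy_cases)
```

With eight toy regions, the two largest counts fall in the top quarter and the two smallest in the bottom quarter, which mirrors the 25/50/25 split used for the 61 regions.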
Given the drop in error from adding population, I expect that classification accuracy will improve further if more features shared by all locations become available.

In the prediction results, 8,559 locations in total are classified as high prevalence. All of them are in the countries GN (1,639), LR (2,731), and SL (4,189). Low prevalence locations are spread across 6 countries, most of them in NG, ML, CD, and SN; all locations in SN are low prevalence.

Future work may include adding more features (e.g. health facilities, Red Cross, wealth, education) to the training data to improve classification accuracy. I also plan to measure classification performance by plotting ROC curves. In addition, SVM classification and prediction could be tried and the results compared with those of random forest. Finally, the prediction results can be mapped for visualization.
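A per-country breakdown like the one reported for high prevalence locations can be produced with a simple tally over the predicted labels. The sketch below is in Python rather than R and is only illustrative: the function name `count_by_country` and the toy prediction list are hypothetical, while the three reported counts are taken from the text and checked against the stated total.

```python
from collections import Counter


def count_by_country(predictions, level):
    """Count locations predicted at `level`, grouped by country code."""
    return Counter(country for country, label in predictions if label == level)


# Hypothetical toy input: (country code, predicted prevalence level) pairs.
predictions = [
    ("GN", "high"), ("LR", "high"), ("LR", "high"),
    ("SL", "high"), ("SN", "low"), ("NG", "low"), ("NG", "medium"),
]
high_counts = count_by_country(predictions, "high")

# Consistency check on the counts reported in the text.
reported_high = {"GN": 1639, "LR": 2731, "SL": 4189}
reported_total = sum(reported_high.values())
```

The three reported country counts add up to the 8,559 high prevalence locations stated in the results.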
