A rise in immunizations over the last 35 years has directly contributed to a decline of about 20% in infectious diseases (UNICEF). However, the same organization recently published an article stating that the number of pertussis cases in the United States has reached levels not seen since the 1950s.

There has been a lot of debate about immunization and whether vaccination is necessary. However, herd immunity is a concept that is often overlooked by people who do not believe in vaccination or do not want to get vaccinated. The Herd is a web application geared towards both Individuals and Organizations that displays herd immunity rates for specific regions.

What it does

The Herd uses a database of patient records including information about the vaccination record of each patient in order to calculate the rates of Herd Immunity for a specific region selected by the user. Display of the information differs based on whether the user is an Individual or an Organization.
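As a rough illustration of the calculation described above, the sketch below computes the share of vaccinated patients per region. It is a minimal example under stated assumptions, not the application's actual code: the field names (`patient_id`, `region`, `vaccinated`) and the sample records are hypothetical and do not reflect the Synthea schema.

```python
from collections import defaultdict

# Hypothetical patient records. Field names and values are illustrative
# assumptions, not the Synthea schema or the app's real data.
patients = [
    {"patient_id": "p1", "region": "Boston", "vaccinated": True},
    {"patient_id": "p2", "region": "Boston", "vaccinated": True},
    {"patient_id": "p3", "region": "Boston", "vaccinated": False},
    {"patient_id": "p4", "region": "Boston", "vaccinated": True},
    {"patient_id": "p5", "region": "Worcester", "vaccinated": True},
    {"patient_id": "p6", "region": "Worcester", "vaccinated": False},
]


def coverage_by_region(records):
    """Return the fraction of vaccinated patients for each region."""
    totals = defaultdict(int)
    vaccinated = defaultdict(int)
    for record in records:
        totals[record["region"]] += 1
        if record["vaccinated"]:
            vaccinated[record["region"]] += 1
    return {region: vaccinated[region] / totals[region] for region in totals}


rates = coverage_by_region(patients)
for region, rate in sorted(rates.items()):
    print(f"{region}: {rate:.0%} vaccinated")
```

A real version would likely compute coverage per vaccine rather than a single yes/no flag per patient.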

For Individuals, the app could help them stay aware of the immunization rates in their neighbourhood.

For Organizations, the app provides them access to more information that allows for optimization of both time and money with respect to advocacy efforts.

How we built it

The project consists of three distinct components:

  1. Data Munging
  2. Frontend (Interactive Web Application)
  3. Backend (Database that holds the dataset)

To obtain patient records, Synthea (an open-source patient population simulator) was used. Synthea generates realistic data for fictional patients in Massachusetts. The dataset mirrors the real population while being free of Protected Health Information and of any information that can be used to identify individuals.

Because this project has a significant focus on Data Analysis, R Shiny was identified as the ideal tool for developing an interactive web application. R Shiny allows for the development of an elegant and intuitive interface while being able to handle advanced analytics.

MongoDB was identified as the best platform to host our dataset because it supports dynamic queries, does not require converting or mapping application objects to data objects, and scales simply.

Challenges we ran into

  1. The wireless network at Johns Hopkins was very restrictive in terms of both speed and security. It was almost impossible to maintain a steady connection to the MongoDB cluster, so it took several hours of repeatedly opening and dropping connections just to create a collection and upload the data.

  2. Background of the Team - Out of 5 team members, 2 have a Pure Mathematics and Quantitative Economics background, 1 has an Applied Math background, 1 has a background in Bioinformatics and 1 in IT Support and Data Administration. Collectively, the team's programming knowledge was very limited, and we often had to rely on each other and on our ability to search for information to create the web application.

  3. Cleaning up the Synthea dataset took almost a third of the duration of the hackathon. The dataset contained rows where data was present in the wrong columns. For example, Birthplace appeared in the PatientID variable and FirstName appeared in the Address variable. Since these fields were all classified as Strings, it took an enormous effort to clean the dataset and verify that it was in fact usable.

  4. We used our expertise in Mathematical Modelling to analyse research and build a Mathematical Model to increase the accuracy of calculating Herd Immunity. However, we were not able to identify public datasets with a majority of the variables required to implement the model due to privacy regulations like the GDPR and HIPAA.
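The misplaced-column problem from item 3 can be illustrated with a small validation pass. This is a minimal sketch, not the team's actual cleaning procedure: it assumes that patient IDs are UUID strings, and the row layout and field names are hypothetical. Rows whose ID does not parse as a UUID are set aside for manual review.

```python
import uuid


def looks_like_uuid(value):
    """Return True if value parses as a UUID string."""
    try:
        uuid.UUID(value)
        return True
    except (ValueError, AttributeError, TypeError):
        return False


# Hypothetical rows. The second one has a birthplace where the ID should be,
# mimicking the shifted-column problem described above.
rows = [
    {"patient_id": "3f2504e0-4f89-11d3-9a0c-0305e82c3301",
     "birthplace": "Boston", "first_name": "Ana"},
    {"patient_id": "Springfield",
     "birthplace": "", "first_name": "Luis"},
]

# Separate rows that pass the ID check from rows that need manual review.
clean = [row for row in rows if looks_like_uuid(row["patient_id"])]
suspect = [row for row in rows if not looks_like_uuid(row["patient_id"])]

print(f"{len(clean)} clean rows, {len(suspect)} rows flagged for review")
```

Because every field is a string, a type check alone cannot catch shifted values. A format check on one well-structured column is one way to flag affected rows.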

Accomplishments that we're proud of

In a short span of time, the team was able to create a web application that serves as a strong proof of concept for our idea. We are proud to have created an interface that is user friendly and generates robust data visualizations.

Given our limited programming experience, we relied heavily on our background in Math and Statistics, and on our strength in understanding complex functions and breaking them down into simple terms, to learn to program on the fly and build the application through trial and error.

Each member of the team was active and involved in the process, checking on the others regularly to offer support with technical issues or to act as a sounding board for discussing the algorithms used to clean and process the data.

If access to the data can be obtained, the members of the team are confident in their ability to implement the Mathematical Model and increase the accuracy of the predicted rates of Herd Immunity.

What we learned

With four out of five members lacking strong programming experience and three out of five never having taken a healthcare-related class, we were still able to create a strong proof of concept. We learnt from each other about developing web applications using R Shiny, creating databases, and research relating to immunizations and herd immunity.

What's next for The Herd

  1. Implementing the Mathematical Model - Provided that access to the variables required by the model is obtained, the rate of herd immunity can be calculated with greater accuracy than the application's current calculations allow.

  2. Including environmental variables - A Machine Learning model could be trained to look at a 2-dimensional map of a specific location and identify whether the area is urban, rural, wooded, etc. The model could also identify areas of high traffic like Airports, Malls and Office Buildings by analyzing the text marking those locations. These variables could then factor into generating a range within which the rate of Herd Immunity for that region lies. A single value would not be very reliable for areas of high traffic, because it is impossible to keep track of all individuals who visit the area and obtain their immunization records without violating privacy laws and being considered "Orwellian".

  3. Helping Advocacy efforts through identification of marketing avenues - Machine Learning models could generate automated advertising copy that identifies the location, the rate of Herd Immunity and the diseases that have a high probability of spreading in the region. Personalized advertisements have been shown to have higher rates of engagement. This could lead to an increase in the number of people who obtain immunizations and thus increase the rate of Herd Immunity for that region.
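The idea of reporting a range instead of a single value (item 2) can be sketched with a standard statistical tool. The example below uses a Wilson score interval, which is our substitution for illustration and not the team's Mathematical Model. It captures only sampling uncertainty, not the visitor-traffic effects described above. It also computes the classic herd immunity threshold 1 - 1/R0; the R0 value and the counts are illustrative assumptions.

```python
import math


def wilson_interval(vaccinated, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = vaccinated / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def herd_immunity_threshold(r0):
    """Classic threshold 1 - 1/R0, defined for R0 > 1."""
    if r0 <= 1:
        raise ValueError("r0 must be greater than 1")
    return 1 - 1 / r0


# Illustrative numbers only: 180 of 200 sampled residents vaccinated,
# and an assumed basic reproduction number of 15.
low, high = wilson_interval(180, 200)
threshold = herd_immunity_threshold(15)

print(f"Estimated coverage range: {low:.1%} to {high:.1%}")
print(f"Threshold for the assumed R0: {threshold:.1%}")
```

Comparing the whole range against the threshold, rather than a single point estimate, makes it visible when a region cannot be confidently classified as above or below the threshold.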


Walonoski, Jason, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott McLachlan. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25, no. 3 (2018): 230–238.

Anderson, Roy M., and Robert M. May. "Vaccination and herd immunity to infectious diseases." Nature 318, no. 6044 (1985): 323.

Fusco, Taryn. "If Vaccine Rates Keep Falling, These Diseases Could Make A Comeback." UNICEF USA, 2019.
