The COVID-19 outbreak has struck fear in many people. The speed and ease with which the disease spreads, together with its mortality, demand attention and care from everyone. As a result, we are seeing many changes in society: people are restricted from going to work, companies and schools are switching to all-online formats, and many cities are mandating quarantine protocols to limit the spread of the virus and decrease the burden on hospitals. However, there is a silver lining…
Because the outbreak is so hard to control, and because technology has advanced, scientists are rushing to understand as much as they can about the structure and function of SARS-CoV-2. They are looking for a weak spot where a drug can hinder the spread of the virus or fight the infection in those already infected. This race has expanded the resources available to people researching the virus, including sequencing data.
We have sequenced many organisms and pathogens, but never before have we sequenced so many genomes of a single pathogen in such a short period of time. COVID-19 is diagnosed with RT-PCR testing, and many of the positive samples are then sequenced, meaning a large number of COVID-19 cases have a genome attached to them. And that data is being released to the public.
Nextstrain, the creators of a set of genetic analysis tools, have been collecting this data, compiling it on their website, and creating visuals of the spread and mutation patterns of the virus. With the sequence data from many COVID-19 cases and their genomic analysis tools, they can produce phylogenetic trees, detect single nucleotide and amino acid mutations, and overlay the results on a map to show a change-over-time animation of how the virus has spread. Amazing as it is, it only shows past data. What if we could use this data to predict what will happen in the future?
That is what our team created. Mutare is a prediction tool that uses a binomial-negative binomial (BNB) stochastic model to predict levels of mutation within a virus. With the data and analysis tools provided by Nextstrain, we can organize and map the mutations within COVID-19. From that data, we can extract pseudo-birth, death, and mutation rates to input into our BNB stochastic model, which estimates how many mutations we can expect over time given the behavior of a virus. These estimates can be used to predict the severity of a virus and provide insight into changes in its protein structure. Mutations in important regions of a virus, such as changes in the spike protein of SARS-CoV-2, are what lead to epidemic stages, and predicting them allows scientists to monitor any worrisome changes.
This project is submitted under the Computational Biology track for the COVIDHacks Hackathon.
What it does
The tool takes in data from the Nextstrain GitHub repository, then sifts and organizes it by nucleotide and amino acid mutation, date, location, and virus. With that data, it analyzes the mutation rates, infection rates, and curative/death rates to create the pseudo-birth, death, and mutation rates for the BNB model. Once the model has its rates, it is run multiple times to create a distribution of probabilities that a certain number of mutations will be present at a certain time after the virus starts spreading (patient zero). That data can then be used to make predictions about the potential severity of a virus.
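To make the "run it multiple times" step concrete, here is a minimal sketch in Python. It is not the BNB algorithm from the paper, and it is not the code in the repository: it swaps in a plain Gillespie simulation of a birth-death-mutation process, which is slower but easier to read. The function names, the rate values, and the choice to count mutation events as the quantity of interest are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_once(birth, death, mutation, n0, t_end, rng):
    """One Gillespie run; returns the number of mutation events before t_end.

    birth, death, and mutation are constant per-capita rates (hypothetical
    stand-ins for the pseudo-rates described in the text).
    """
    t, n, mutations = 0.0, n0, 0
    per_capita_total = birth + death + mutation
    while n > 0 and per_capita_total > 0:
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(per_capita_total * n)
        if t > t_end:
            break
        # Pick which event happened, in proportion to its rate.
        u = rng.random() * per_capita_total
        if u < birth:
            n += 1
        elif u < birth + death:
            n -= 1
        else:
            mutations += 1
    return mutations


def mutation_distribution(birth, death, mutation, n0, t_end, runs, seed=0):
    """Repeat the simulation and return {mutation_count: estimated probability}."""
    rng = random.Random(seed)
    counts = Counter(
        simulate_once(birth, death, mutation, n0, t_end, rng) for _ in range(runs)
    )
    return {k: counts[k] / runs for k in sorted(counts)}


if __name__ == "__main__":
    # Illustrative rates only; they are not estimated from any real data.
    dist = mutation_distribution(0.3, 0.2, 0.1, n0=10, t_end=5.0, runs=1000)
    for k, p in dist.items():
        print(k, round(p, 3))
```

Each run gives a single mutation count, so only the collection of runs says anything about probability. The fixed seed is there to keep the sketch reproducible.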
The GitHub repository for this project is our main deliverable. It contains the process we followed, the scripts and data we used to produce the final data format, and the final data itself. Other people will be able to download the data and continue where we left off. The repository also holds the incomplete Python files for the BNB stochastic model, which people can download and edit as the foundation of the model software.
How we built it
All the data and analysis software were provided by Nextstrain, through its GitHub repository of data and the Augur bioinformatics analysis tool. To produce readable data files from the raw data, we wrote a series of Python scripts that alter the sequence files or edit the metadata files from the Nextstrain GitHub repository. The model was also written in Python, and the algorithm was inspired by Mather, Hasty, and Tsimring and their paper "Fast stochastic algorithm for simulating evolutionary population dynamics".
Challenges we ran into
We ran into many challenges, but the largest was the learning curve. Our teammates knew enough about biology to understand the final results of the analysis and enough about programming to get the data into a condensed and readable format, but we lacked the knowledge needed to fully implement a modeling technique and the subsequent analysis of the mutation data. Researching and gaining the necessary background took up most of our time, and when it came time to debug our program and run the analysis needed to get the inputs for the model, we ran out of time. The model is currently our main bottleneck, because it connects the data from the first part to the presentation of the results that we want to build so that everyone has access to our findings. Still, we believe that once the model produces legitimate and understandable results, integrating it into a website and making it interesting will be easy.
Accomplishments that we're proud of
Over the course of the hacking period, several milestones really pushed the project forward:
- A script that parses the data downloaded from the Nextstrain GitHub repository and creates the files the Augur library needs to perform its analysis.
- A script that runs all of the command-line commands needed to perform the analysis. For each disease, there are 6 commands that must be run from the terminal, each taking about 5 flags and anywhere from 5 seconds to 30+ minutes to complete. Having a script do that for us made the process much easier.
- Finding the stochastic model. None of us on the team are statisticians or have any modeling experience, but the model requires very few parameters, and we were able to understand and implement the strategy thanks to the authors' easy-to-understand steps.
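The command-running script mentioned in the list above can be sketched roughly as follows. This is an assumed shape, not the script in the repository: `run_pipeline` and `placeholder_steps` are hypothetical names, and the placeholder steps simply call the Python interpreter, because the actual Augur subcommands and flags are not listed in this writeup.

```python
import subprocess
import sys
import time


def run_pipeline(commands):
    """Run each command in order and stop at the first failure.

    commands is a list of argument lists, e.g. [["tool", "--flag", "value"], ...].
    Returns a list of (command, seconds_elapsed) for the steps that completed.
    """
    timings = []
    for cmd in commands:
        start = time.perf_counter()
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed step is not silently followed by the next one.
        subprocess.run(cmd, check=True)
        timings.append((cmd, time.perf_counter() - start))
    return timings


# Placeholder steps only. In a real run, each entry would be one of the
# analysis commands with its own flags (not reproduced here).
placeholder_steps = [
    [sys.executable, "-c", "print('step 1 done')"],
    [sys.executable, "-c", "print('step 2 done')"],
]

if __name__ == "__main__":
    for cmd, seconds in run_pipeline(placeholder_steps):
        print(f"{seconds:.2f}s  {' '.join(cmd)}")
```

Recording the time per step is useful when some commands finish in seconds and others take half an hour, since it shows where a full run spends its time.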
What we learned
Data is only as good as the person reading it. If the data is inaccessible and confusing, it means nothing and is just numbers and letters. Once it is put into a format that people can understand, it has immense value. That is what our product really focuses on. The large increase in viral genome data from scientists sequencing the virus is great, but when you look at it, it is just a case name followed by a long (and I mean LONG) sequence of A's, T's, G's, and C's. The data becomes useful once it is analyzed: the sequences are aligned and compared to a base sequence, the changes are translated into proteins, the amino acid mutations are identified, and we can see what those mutations look like in the actual protein.
Using our program, we are adding value to what the scientists are giving us. Genomics can have a large impact, but it is only as good as our ability to analyze it and translate it into useful information. That is what we have strived to do.
What's next for Mutation Predict
What we are presenting at the time of hack submission is far from a finished product; however, we have a pretty good idea of what needs to be done to make it so.
- Build the analysis protocols for the final output data. We have the data for 15 viruses ready to be analyzed, but we have yet to write scripts that parse the data and extract the pseudo-birth, death, and mutation rates for each virus.
- Edit the BNB stochastic model to give reliable data using the pseudo-birth, death, and mutation rates. The BNB model was made to model evolutionary patterns in animal species, not viruses. We say pseudo-birth, death, and mutation rates because these numbers are not birth, death, and mutation rates in the conventional sense. The model must therefore be edited to account for these different kinds of numbers so that it outputs something understandable and accurate.
- Make general edits to our code for the BNB model.
- Analyze the output of the model to produce usable information. Since the model is stochastic, it must be run many times before any usable information can be produced. The number of runs will need to balance the amount of data produced against the time the simulation takes. Because the BNB model is a very fast simulation, more runs can be completed than with other styles of stochastic modeling.
- Take the data we get from the model and output it to a website that displays all the relevant points in an easy-to-read format. This is an easy fix, but we cannot do it until we know the format of the model's output data, and we have yet to figure that out.
- Make the pseudo-birth, death, and mutation rates functions instead of constants. We know that these rates can change with time and with the number of people infected. In a future update, analysis of the data could show how the rates change with time and with the size of the infected population, allowing for a more accurate model. This update may also improve the BNB model, since at the moment it only accepts constants and not functions of time and population. This step is very important for getting reliable data.
- Connect the web app to the simulation in order to allow for easy access to the public, as well as easier data visualization.
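As a rough illustration of the "rates as functions" item in the list above, the sketch below passes each rate in as a function of time and population size. It uses a simple fixed-time-step approximation rather than the BNB algorithm, and every name in it (`simulate_variable_rates`, the example rate functions, the step size `dt`) is hypothetical.

```python
import random


def simulate_variable_rates(birth_fn, death_fn, mutation_fn, n0, t_end,
                            dt=0.01, seed=0):
    """Fixed-time-step stochastic run with rates that depend on (t, n).

    Each *_fn takes (time, population) and returns a per-capita rate.
    Returns (final_population, total_mutation_events).
    """
    rng = random.Random(seed)
    n, mutations = n0, 0
    for step in range(int(round(t_end / dt))):
        if n == 0:
            break
        t = step * dt

        def prob(rate_fn):
            # Chance that one individual has the event during this step,
            # clamped to [0, 1] in case the rate is negative or very large.
            return min(1.0, max(0.0, rate_fn(t, n) * dt))

        p_birth, p_death, p_mut = prob(birth_fn), prob(death_fn), prob(mutation_fn)
        # One Bernoulli trial per individual per event type.
        births = sum(rng.random() < p_birth for _ in range(n))
        deaths = sum(rng.random() < p_death for _ in range(n))
        mutations += sum(rng.random() < p_mut for _ in range(n))
        n = max(0, n + births - deaths)
    return n, mutations


if __name__ == "__main__":
    # Illustrative rate functions only; they are not fitted to any data.
    result = simulate_variable_rates(
        birth_fn=lambda t, n: max(0.0, 0.5 * (1 - n / 200)),  # slows as n grows
        death_fn=lambda t, n: 0.2,                            # constant
        mutation_fn=lambda t, n: 0.05 + 0.01 * t,             # rises with time
        n0=20,
        t_end=5.0,
    )
    print(result)
```

A small `dt` keeps the per-step probabilities low, which this kind of approximation depends on. The cost is more steps per run, so the trade-off between accuracy and runtime noted in the list applies here too.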