One of our group members has a solid background in biochemistry, and he showed us how simple sequence repeats (SSRs) have been exploited in forensic genetics, cancer biology, and plant genetics. An SSR is a type of repeated DNA sequence whose basic unit is one to six base pairs long and is repeated five to fifty times. SSRs are usually located in non-coding regions of the genome and are not function-related. Consequently, SSRs exhibit a very high mutation rate, which leads to high genetic diversity, much like fingerprints. This feature gives SSRs many applications in forensic identification, cancer diagnosis, and plant breeding. The problem we set out to solve is how to find every SSR in a given DNA sequence.
What it does
Our project has three parts:
- An SSR database generator that builds a database of every possible SSR.
- A DNA sequence simulator that generates a random DNA sequence of a user-specified length with a user-specified number of SSRs inserted. It produces test sequences for the SSR characterizer.
- An SSR characterizer that finds the SSR types present in a DNA sequence, counts how many times each type appears, and reports the start and end index of every occurrence.
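The simulator and the characterizer can be sketched together as follows. This is a minimal illustration and not our actual script: Python is assumed, the names `simulate_sequence`, `find_ssrs`, and `count_by_motif` are hypothetical, and a regular expression with a backreference stands in for whatever search the characterizer really uses. The sketch only enforces the lower bound of five repeats, not the upper bound of fifty.

```python
import random
import re
from collections import Counter

BASES = "ACGT"

def simulate_sequence(length, ssrs, seed=None):
    """Build a DNA string of exactly `length` bases containing the given SSRs.

    `ssrs` is a list of (motif, repeats) pairs. Each pair is inserted as one
    tandem block, in order, with random filler before, between, and after.
    The filler is random, so it may contain extra SSRs by chance.
    """
    rng = random.Random(seed)
    blocks = [motif * repeats for motif, repeats in ssrs]
    filler = length - sum(len(block) for block in blocks)
    if filler < 0:
        raise ValueError("length is too short for the requested SSRs")
    # Cut points split the filler into len(blocks) + 1 chunks (some may be empty).
    cuts = sorted(rng.randint(0, filler) for _ in blocks)
    bounds = [0] + cuts + [filler]
    parts = []
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        parts.append("".join(rng.choice(BASES) for _ in range(hi - lo)))
        if i < len(blocks):
            parts.append(blocks[i])
    return "".join(parts)

# A unit of 1 to 6 bases (shortest unit tried first) followed by at least
# 4 more copies of itself, i.e. 5 or more tandem copies in total.
SSR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{4,}")

def find_ssrs(sequence):
    """Return (motif, start, end, repeats) tuples; `end` is exclusive."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence):
        motif = match.group(1)
        repeats = (match.end() - match.start()) // len(motif)
        hits.append((motif, match.start(), match.end(), repeats))
    return hits

def count_by_motif(hits):
    """Count how many separate runs were found for each motif."""
    return Counter(motif for motif, _, _, _ in hits)

demo = "GG" + "AT" * 6 + "C" + "GCA" * 5 + "TT"
print(find_ssrs(demo))
# [('AT', 2, 14, 6), ('GCA', 15, 30, 5)]
```

Because the unit group is lazy, a run such as `AAAAAAAAAA` is reported with the unit `A` rather than `AA`, which keeps each run tied to its shortest repeating unit.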
How we built it
We divided the workload according to the difficulty of each subproject. Junqi was responsible for the SSR database generator and the DNA sequence simulator, while Aaron, Anthony, and Jason were responsible for the SSR characterizer. We all participated in the final review of the code.
Challenges we ran into
- How to eliminate duplicate SSRs in the database.
- How to speed up the linear search.
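One possible way to approach both challenges is sketched below. It rests on assumptions that are not stated above: that a "duplicate" is a unit that is itself a shorter unit repeated (so `AA` duplicates `A`, and `ATAT` duplicates `AT`), that rotations such as `AT` and `TA` may optionally be merged, and that the linear search is a scan over the database. The names `is_primitive`, `canonical_rotation`, and `build_motif_database` are hypothetical, and Python is assumed.

```python
from itertools import product

BASES = "ACGT"

def is_primitive(motif):
    """True if `motif` is not a shorter unit repeated ("AT" yes, "ATAT" no)."""
    # A string is a repetition of a shorter unit exactly when it occurs
    # inside its own doubling with the first and last characters removed.
    return motif not in (motif + motif)[1:-1]

def canonical_rotation(motif):
    """Alphabetically smallest rotation, so "TA" and "AT" share one key."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))

def build_motif_database(max_unit=6, merge_rotations=False):
    """Return the set of distinct repeat units of length 1..max_unit."""
    motifs = set()
    for k in range(1, max_unit + 1):
        for letters in product(BASES, repeat=k):
            motif = "".join(letters)
            if not is_primitive(motif):
                continue  # e.g. "AA" describes the same repeat as "A"
            motifs.add(canonical_rotation(motif) if merge_rotations else motif)
    return motifs

database = build_motif_database()
print(len(database))   # 5356 units remain out of 5460 raw strings
print("GCA" in database, "ATAT" in database)   # True False
```

Storing the units in a `set` means a membership test is a hash lookup rather than a pass over every entry, which is one common way to avoid a linear scan. With `merge_rotations=True` the database shrinks further, to 964 units.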
Accomplishments that we're proud of
We built a comprehensive SSR database that includes every possible SSR, and a fully functional SSR characterizer that reports the type, location indexes, and frequency of every SSR in a given DNA sequence.
What we learned
We learned how to apply computer science and biostatistics to genetic problems, and we learned about the current interdisciplinary trends between computer science and biology.
What's next for HackDavis2019
We will add a data visualization module and will probably set up a website for the script so that other genetics researchers can use the program too.