Inspiration

The inspiration for this project comes from the growing role of SNPs in both diagnosing diseases and guiding personalized treatments. While certain SNPs have long been used to identify disease risk, advances in genomics now allow us to use SNP profiles to tailor therapies to individual patients. By developing a low-cost machine learning tool that interprets Sanger sequencing data from key genes, we aim to make this precision approach fast, affordable, and accessible in clinical settings, empowering doctors to provide truly personalized care.

What it does

This machine learning code uses neural networks to identify SNPs that are causal mutations in various human diseases. It takes as input Sanger sequencing results of specific genes of interest and analyzes each variant to determine its association with particular diseases. The output provides actionable information for medical staff, allowing them to identify disease-linked alleles and tailor treatment strategies to the patient’s unique genetic profile.
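As a rough illustration of the first step, going from a Sanger read to a list of candidate variants, here is a minimal sketch. It assumes the read has already been base-called and aligned to the reference so that both strings have the same length. The function name `call_snvs` and the example sequences are hypothetical and are not taken from the project code.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position within the reference sequence
    ref: str       # reference base
    alt: str       # base observed in the Sanger read


def call_snvs(reference: str, read: str) -> List[Variant]:
    """List single-base differences between two pre-aligned, equal-length sequences."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    pairs = zip(reference.upper(), read.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        # Skip ambiguous base calls (N) and alignment gaps (-): they are not SNVs.
        if ref_base in "N-" or alt_base in "N-":
            continue
        if ref_base != alt_base:
            variants.append(Variant(position, ref_base, alt_base))
    return variants


# Made-up sequences that differ at a single position.
reference = "ATGGCCTTAGC"
read = "ATGGCTTTAGC"
print(call_snvs(reference, read))
```

Each variant found this way would then be passed on for annotation and scoring. Real Sanger traces also carry per-base quality values, which this sketch ignores.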

How we built it

We built the tool in Python for data processing and modelling, with Linux commands (run through WSL) to handle the large datasets. Public databases, including ClinVar, provided annotations for known SNPs. For modelling, we used a pre-trained nucleotide transformer language model (LM) to generate sequence embeddings and scores. These outputs, together with features such as chromosome number, gene name, and mutation type, were used to train a random forest regressor. Alongside neural networks and logistic regression models, this framework enables the tool to classify SNPs as pathogenic or benign from Sanger sequencing data, providing actionable insights for personalized treatment decisions.
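To make the feature step more concrete, here is a minimal sketch of how a sequence embedding could be joined with the categorical features into one numeric vector for a tree-based model. The one-hot encoding, the vocabularies, the gene names, and the mutation-type labels are all illustrative assumptions. The project's actual encoding is not described here.

```python
from typing import Dict, List, Sequence

# Illustrative vocabularies (assumptions, not taken from the project).
VOCAB: Dict[str, List[str]] = {
    "chromosome": [str(i) for i in range(1, 23)] + ["X", "Y"],
    "gene": ["BRCA1", "TP53"],
    "mutation_type": ["missense", "nonsense", "synonymous"],
}


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode a categorical value as a one-hot vector; unknown values map to all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def build_feature_vector(
    embedding: Sequence[float],
    chromosome: str,
    gene: str,
    mutation_type: str,
    vocab: Dict[str, List[str]] = VOCAB,
) -> List[float]:
    """Concatenate the embedding with one-hot encoded categorical features."""
    return (
        [float(x) for x in embedding]
        + one_hot(chromosome, vocab["chromosome"])
        + one_hot(gene, vocab["gene"])
        + one_hot(mutation_type, vocab["mutation_type"])
    )


# A made-up 3-dimensional embedding; real embeddings are much longer.
features = build_feature_vector([0.12, -0.40, 0.88], "17", "BRCA1", "missense")
print(len(features))  # 3 embedding values + 24 + 2 + 3 one-hot slots
```

A list of such vectors, one per variant, is the kind of input a random forest implementation would accept. One-hot encoding is only one option, and a gene vocabulary covering many genes would make the vectors much wider.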

Challenges we ran into

One of the key challenges was integrating the LM nucleotide transformer into our workflow. Although the model was pre-trained, obtaining accurate scores and embeddings for our sequences required careful handling, and making those outputs usable as features for downstream modelling was complex. We combined the embeddings with additional features such as chromosome number, gene name, and mutation type to train a random forest regressor, and handling these mixed data types (numeric embeddings alongside categorical annotations) created both technical and computational hurdles. Another challenge was covering different types of mutations, because the training dataset is currently limited. We hope that future, more comprehensive datasets will help overcome this constraint.
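One common way to make transformer outputs usable as fixed-size features is to average the per-token embeddings over the sequence (mean pooling). The sketch below shows that idea in plain Python. It is an assumption for illustration only, since the write-up does not state how the embeddings were reduced, and the helper name `mean_pool` is hypothetical.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    if any(len(vector) != dim for vector in token_embeddings):
        raise ValueError("all token embeddings must have the same dimension")
    count = len(token_embeddings)
    # Average each embedding dimension across all tokens.
    return [sum(vector[j] for vector in token_embeddings) / count for j in range(dim)]


# Two made-up sequences of different lengths yield vectors of the same size.
short_sequence = [[1.0, 2.0], [3.0, 4.0]]
long_sequence = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
print(mean_pool(short_sequence), mean_pool(long_sequence))
```

Because the pooled vector has the same length for every sequence, it can sit in the same feature table as the categorical annotations regardless of how long each input sequence was.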

Accomplishments that we're proud of

We are proud of the progress we made despite significant challenges. Most of our time was spent navigating complex data mining tasks to collect the necessary sequences, and running the LM nucleotide transformer presented a few technical hurdles. While we were not able to generate the full dataset required to complete the project as originally envisioned, the experience allowed us to develop a strong understanding of the workflow, tools, and potential obstacles in building a predictive pipeline for SNP pathogenicity.

What we learned

Through this project, we gained valuable experience in setting up a complete pipeline for analyzing SNPs. We learned how to integrate different tools, models, and data sources in a structured workflow, which provides a flexible framework that can be expanded to analyze other mutations in the future with only minor adjustments. This experience also gave us insight into potential improvements and optimizations for similar projects down the line.

What's next for Predicting if a SNV is pathogenic or benign

Next steps include expanding the dataset to cover more SNPs associated with a wider range of human diseases. We also aim to predict how identified SNPs impact splicing patterns, which could reveal novel therapeutic targets and disease biomarkers, further enhancing precision medicine strategies.

Built With

  • databases (ClinVar)
  • linux-commands/wsl
  • neuralnetworks
  • python
  • regression