Predicting if a SNV is pathogenic or benign

Inspiration

This project is inspired by the growing role of SNPs in both diagnosing diseases and guiding personalized treatments. While certain SNPs have long been used to identify disease risk, advances in genomics now allow SNP profiles to be used to tailor therapies to individual patients. By developing a low-cost machine learning tool that interprets Sanger sequencing data from key genes, we aim to make this precision approach fast, affordable, and accessible in clinical settings, so that doctors can provide truly personalized care.

What it does

The tool uses neural networks to identify SNPs that are causal mutations in human diseases. It takes Sanger sequencing results for specific genes of interest as input and analyzes each variant to determine its association with particular diseases. The output gives medical staff actionable information, allowing them to identify disease-linked alleles and tailor treatment strategies to the patient’s unique genetic profile.
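As a rough illustration of the input step described above, the sketch below lists single-base differences between a reference gene sequence and a Sanger read. It assumes the read has already been aligned to the reference and trimmed to the same length. The `SNV` type and the `call_snvs` function are hypothetical names chosen for this example, not part of the project's code.

```python
from typing import List, NamedTuple


class SNV(NamedTuple):
    position: int  # 1-based position within the reference sequence
    ref: str       # reference base
    alt: str       # base observed in the Sanger read


def call_snvs(reference: str, read: str) -> List[SNV]:
    """List single-base mismatches between a reference and a pre-aligned read."""
    if len(reference) != len(read):
        raise ValueError("sequences must be pre-aligned to the same length")
    snvs = []
    for i, (r, a) in enumerate(zip(reference.upper(), read.upper())):
        # Ambiguous calls (e.g. N) and gap characters are skipped: they are not SNVs.
        if r != a and r in "ACGT" and a in "ACGT":
            snvs.append(SNV(i + 1, r, a))
    return snvs


variants = call_snvs("ACGTAC", "ACGTTC")
```

Each variant found this way would then be passed to the downstream model for classification.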

How we built it

We built the tool using Python for data processing and modelling, combined with Linux commands (via WSL) to handle large datasets. We integrated public databases, including ClinVar, to provide annotations for known SNPs. For modelling, we used a pre-trained nucleotide transformer language model (LM) to generate sequence embeddings and scores. Features such as chromosome number, gene name, and mutation type were used to train a random forest regressor. This framework, alongside neural networks and logistic regression models, enables the tool to classify SNPs as pathogenic or benign from Sanger sequencing data, providing actionable insights for personalized treatment decisions.
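The paragraph above does not say how the language model scores were computed. One common way to score a substitution with a masked language model is a log-likelihood ratio between the alternate and reference base at the masked position. The sketch below shows only that arithmetic. The function name `substitution_score` and the probability values are assumptions made for illustration, and no real model is called.

```python
import math
from typing import Dict


def substitution_score(position_probs: Dict[str, float], ref: str, alt: str) -> float:
    """Return log p(alt) - log p(ref) for one masked sequence position.

    position_probs maps each base to the probability a language model
    assigns to it at that position. Negative scores mean the model finds
    the alternate base less likely than the reference base.
    """
    p_ref = position_probs[ref]
    p_alt = position_probs[alt]
    if p_ref <= 0.0 or p_alt <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_alt) - math.log(p_ref)


# Hypothetical model output for a single position (illustrative values only).
example_probs = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
score = substitution_score(example_probs, ref="A", alt="C")
```

A score like this could sit alongside the sequence embeddings as one more numeric feature for the downstream model.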

Challenges we ran into

One of the key challenges we faced was integrating the LM nucleotide transformer into our workflow. Although the model was pre-trained, obtaining accurate scores and embeddings for the sequences required careful handling, and making these outputs compatible as features for downstream modelling was complex. We combined the embeddings with additional features such as chromosome number, gene name, and mutation type to train a random forest regressor. Managing the interactions between these diverse data types presented both technical and computational hurdles throughout the process. Another challenge was incorporating different types of mutations, as the training dataset is currently limited. We hope that future, more comprehensive datasets will help overcome this constraint.
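To make the feature-compatibility problem concrete, the sketch below concatenates a numeric embedding with one-hot encodings of the categorical features into a single flat vector, which is the shape a tree-based model typically expects. One-hot encoding is an assumption here, since the text does not say which encoding was used. The function names, vocabularies, and gene names are hypothetical.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode a category as a 0/1 vector; unseen values become all zeros."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_feature_vector(
    embedding: Sequence[float],
    chromosome: str,
    gene: str,
    mutation_type: str,
    vocab: Dict[str, Sequence[str]],
) -> List[float]:
    """Concatenate the embedding with one-hot encoded categorical features."""
    return (
        list(embedding)
        + one_hot(chromosome, vocab["chromosome"])
        + one_hot(gene, vocab["gene"])
        + one_hot(mutation_type, vocab["mutation_type"])
    )


# Hypothetical vocabularies; a real pipeline would derive them from the training data.
vocab = {
    "chromosome": ["1", "17", "X"],
    "gene": ["BRCA1", "TP53"],
    "mutation_type": ["missense", "nonsense", "synonymous"],
}
features = build_feature_vector([0.2, -0.1], "17", "TP53", "nonsense", vocab)
```

Keeping the vocabularies fixed between training and prediction ensures every variant yields a vector of the same length and column order.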

Accomplishments that we're proud of

We are proud of the progress we made despite significant challenges. Most of our time was spent navigating complex data mining tasks to collect the necessary sequences, and running the LM nucleotide transformer presented a few technical hurdles. While we were not able to generate the full dataset required to complete the project as originally envisioned, the experience allowed us to develop a strong understanding of the workflow, tools, and potential obstacles in building a predictive pipeline for SNP pathogenicity.

What we learned

Through this project, we gained valuable experience in setting up a complete pipeline for analyzing SNPs. We learned how to integrate different tools, models, and data sources in a structured workflow, which provides a flexible framework that can be expanded to analyze other mutations in the future with only minor adjustments. This experience also gave us insight into potential improvements and optimizations for similar projects down the line.

What's next for Predicting if a SNV is pathogenic or benign

Next steps include expanding the dataset to cover more SNPs associated with a wider range of human diseases. We also aim to predict how identified SNPs impact splicing patterns, which could reveal novel therapeutic targets and disease biomarkers, further enhancing precision medicine strategies.

Built With

clinvar-database
linux-commands/wsl
neuralnetworks
python
regression

Submitted to

Toronto Bioinformatics Hackathon 2025

Created by

I worked on data mining, created a class to help the development of the ML model, and contributed to creating the presentation.

augustina2023
I proposed the project and was Team Lead. I also worked on coding the ML model, and on parsing the data to set up for inputs and outputs.

Emily Escalante
I worked on cleaning and collating the data. I integrated data from the ClinVar database and the human reference genome to produce training and validation datasets for the AI models. I used bash, awk, and Python for this task.

Juan Enciso
I worked on the literature review and the elevator pitch.

Pallavi Pilaka
Wrote code to generate SNV substitution marginal probabilities and embeddings for nucleotide sequences from nucleotide-transformer.

Katherine Leung
I worked on processing and mining the genomic datasets, as well as annotating each SNV with its respective gene, exon, or regulatory feature to enable downstream analysis. Additionally, I worked on writing the Devpost sections about the project.

tanaya2026 Datar
Developed the GUI for the model and assisted with generating the language model embeddings.

Suraj Acharya

Updates

tanaya2026 Datar started this project — Sep 21, 2025 12:57 PM EDT
