Inspiration
As medical technology has progressed over the past several decades, many ailments that were once serious have become treatable and are now less of a concern, leaving genetic diseases as an increasingly prominent share of what patients face. The ability to detect genetic disorders early, especially at the fetal stage, would allow medical practitioners to take preventative action and gain a better understanding of how to care for their patients.
What it does
Current fetal genomic examinations rely on invasive procedures for extracting amniotic fluid, which carry risks of infection and physical harm. Our approach instead uses minimally invasive extraction of cell-free DNA (cfDNA) from the mother's blood, which is then passed through our model to reconstruct the baby's genome and flag any concerning anomalies.
How we built it
Datasets of disease-labelled cfDNA fragments and full genomes were collected from FinaleDB and the Harvard Personal Genome Project (PGP) respectively. The cfDNA fragments and their genomic locations were compiled into a frequency distribution, which was normalized and used as a probability distribution to sample start indices and the associated DNA fragments from the full human genomes. The new DNA fragments and their disease labels were then tokenized and fed into our LLM as training data.
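The sampling step described above can be sketched roughly as follows. This is a minimal illustration and not our actual pipeline code: the function names, the toy genome string, the observed start positions, and the choice to draw fragment lengths uniformly from a small list are all assumptions made for the example, and no real FinaleDB or PGP file format is parsed here.

```python
import random
from collections import Counter


def build_start_distribution(fragment_starts):
    """Count observed fragment start positions and normalize the counts
    so they sum to 1, giving an empirical probability distribution."""
    counts = Counter(fragment_starts)
    total = sum(counts.values())
    positions = sorted(counts)
    probs = [counts[p] / total for p in positions]
    return positions, probs


def sample_fragments(genome, positions, probs, lengths, n, seed=0):
    """Draw n start indices according to probs, then cut a fragment of a
    randomly chosen length out of the genome at each sampled index."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    starts = rng.choices(positions, weights=probs, k=n)
    fragments = []
    for start in starts:
        length = rng.choice(lengths)  # assumption: uniform over lengths
        fragments.append((start, genome[start:start + length]))
    return fragments


# Toy stand-ins for the real data (hypothetical values).
genome = "ACGTTGCAAGCTTAGGCTAACGT"
observed_starts = [0, 0, 4, 4, 4, 10]
fragment_lengths = [5, 6, 8]

positions, probs = build_start_distribution(observed_starts)
samples = sample_fragments(genome, positions, probs, fragment_lengths, n=4)
for start, fragment in samples:
    print(start, fragment)
```

Each sampled fragment would then be paired with the disease label of the genome it came from before tokenization.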
Challenges we ran into
The limited availability of high-quality genomic data was the biggest hindrance in this project. Many datasets were protected for patient privacy, and of those that were public, most were either very small or messy, requiring extensive data cleaning. The majority of our hacking time was spent searching for datasets and then web scraping their sites because their APIs were poor or nonexistent.
Accomplishments that we're proud of
We're definitely proud of powering through the web scraping process, which ended up taking around 8 hours in total because of the massive size of the sequences and difficulties with data formatting.
