Inspiration
Multiple Sclerosis (MS) is an autoimmune disease that affects nearly 1 million people in the US alone. In MS, autoreactive T cells trigger an immune response against myelin, the insulating sheath that protects nerve fibers. As a result, MS causes progressive disability in the central nervous system, and no cure currently exists.
Diagnosis of MS is long, difficult, and expensive. It requires a combination of clinical symptoms as well as the identification of active lesions or scar tissue in the brain through MRI. This means that by the time MS patients are diagnosed and placed on disease-modifying therapies, permanent damage has often already been done.
Our vision is of a world where a simple blood test can accurately diagnose patients with MS, allowing for earlier, cheaper, and more convenient detection. We believe that the rise of scRNA-seq data can make this idea into a reality.
What it does
The algorithm takes scRNA-seq data from a patient blood sample and predicts whether the patient has multiple sclerosis (MS).
How we built it
The first step of the pipeline we built is preprocessing. This takes the raw FASTQ files (generated by sequencing of Peripheral Blood Mononuclear Cells, or PBMCs) and aligns them to a reference genome. Then, the reads are analyzed with TRUST4, a TCR assembly algorithm designed for speed.
This algorithm reconstructs the T-cell receptor (TCR) region for each T cell. The TCR is responsible for the recognition of foreign epitopes by the adaptive immune system. In autoimmune diseases or during infection, the relative abundance of T cells carrying a given TCR shifts in a process called clonal expansion.
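The degree of clonal expansion can be summarized numerically, for example as clonality: one minus the normalized Shannon entropy of the clonotype frequencies. A minimal sketch in Python (the metric choice here is for illustration, not a claim about our exact pipeline):

```python
import math

def clonality(clone_counts):
    """Clonality = 1 - normalized Shannon entropy of clonotype frequencies.

    0 means a perfectly even repertoire; values near 1 mean a few
    clones dominate, i.e. clonal expansion.
    """
    total = sum(clone_counts)
    freqs = [c / total for c in clone_counts if c > 0]
    if len(freqs) <= 1:
        return 1.0
    entropy = -sum(f * math.log(f) for f in freqs)
    return 1.0 - entropy / math.log(len(freqs))

even = clonality([10, 10, 10, 10])    # evenly distributed -> 0.0
expanded = clonality([97, 1, 1, 1])   # one dominant clone -> close to 1
```

An expanded repertoire scores markedly higher than an even one, which is the signal that shifts during autoimmunity or infection.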
After the TCR data is generated for each sample, we normalize the clonotype counts to relative abundances. This is the data that is fed to the ML algorithm.
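The normalization step amounts to counting each CDR3 clonotype and dividing by the total number of T cells in the sample, so that samples with different sequencing depths are comparable. A minimal sketch (the sequence strings are made-up examples):

```python
from collections import Counter

def relative_abundance(cdr3_sequences):
    """Convert raw per-cell CDR3 calls into relative clonotype frequencies."""
    counts = Counter(cdr3_sequences)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}

sample = ["CASSLGETQYF", "CASSLGETQYF", "CASSIRSSYEQYF"]
freqs = relative_abundance(sample)
# {"CASSLGETQYF": 2/3, "CASSIRSSYEQYF": 1/3}
```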
This processing was done using a Runpod Serverless GPU endpoint. Without the high compute and easy compatibility, the preprocessing would have been far more difficult.
First, we tried a transformer architecture based on BertTCR (https://github.com/zhangbeibei-min/BertTCR/), which uses TCR regions to diagnose disease (cancer). The code required significant modifications and some architecture changes to run. BertTCR is built upon a pretrained protein model, which is useful given our limited data and the high computational cost of training from scratch. This model encodes the CDR3 region (the most variable, and thus most informative, part of the TCR) into a vector representation. Ensemble learning then combines the MS prediction scores from each CDR3 sequence into one final score for a given patient. Training is conducted on the relative abundances of TCRs in MS vs. non-MS samples.
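The per-patient aggregation described above can be sketched as follows. An abundance-weighted mean is one simple choice; the exact ensemble weighting in BertTCR differs, so treat this as illustrative only:

```python
def patient_score(cdr3_scores, abundances):
    """Aggregate per-CDR3 disease scores into one patient-level score.

    Each CDR3 sequence gets an independent prediction; weighting by
    relative abundance lets expanded clones contribute more.
    (Illustrative sketch, not BertTCR's exact aggregation.)
    """
    total = sum(abundances)
    return sum(s * a for s, a in zip(cdr3_scores, abundances)) / total

# Three CDR3s, one strongly MS-associated and dominant in the repertoire:
score = patient_score([0.9, 0.2, 0.1], [0.7, 0.2, 0.1])  # -> 0.68
```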
However, this model was unable to converge on our limited data, so we pivoted to a Random Forest classifier, a much simpler model that requires less data. This new algorithm achieved an AUC of 0.960 ± 0.053 and a recall of 0.90 ± 0.17 after a short training period.
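This stage can be sketched with scikit-learn. The data below is a synthetic stand-in (the real features are per-patient clonotype abundance vectors), so the printed AUC is not our reported result:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for the real data: one row per patient, one column per
# clonotype feature (relative abundance); 1 = MS, 0 = control.
X = rng.random((75, 40))
y = rng.integers(0, 2, 75)
X[y == 1, :5] += 0.3  # make a few features weakly informative

clf = RandomForestClassifier(n_estimators=200, random_state=0)
aucs = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```

Cross-validation is what produces the mean ± standard deviation figures quoted above, which matters on a dataset this small.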
Challenges we ran into
The first challenge was data availability. In total, we were only able to identify about 150 scRNA-seq samples from the blood of MS patients. This is not a large amount of data to train an algorithm on.
The second, and largest, challenge was that preprocessing the data was extremely difficult, time consuming, and computationally expensive. To find TCR sequences, the raw reads must first be aligned to a reference genome. It turns out that with Cell Ranger, this pipeline takes 2-4 hours per sample, making the analysis of 100+ samples infeasible in our time frame.
We pivoted to a much faster algorithm at the cost of some TCR sensitivity. Our new analysis with TRUST4 yielded workable data, but still in small quantities (about 25 MS-positive samples and roughly 50 controls).
The final challenge was implementing the transformer model. In the end, the lack of data severely bottlenecked it: even after multiple hyperparameter sweeps, the model would not converge and simply overfit to the training data.
Thus, we again pivoted to a faster, and this time much simpler, algorithm. Finally, we used Flask to build a simple GUI for demos.
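A minimal Flask endpoint of the kind used for such a demo might look like the following; the route name and the run_model stub are hypothetical, not our actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_model(freqs):
    # Stand-in for the trained classifier; the real app would call
    # the Random Forest on the clonotype frequency vector.
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical route: accept per-patient clonotype frequencies
    # as JSON and return an MS probability.
    freqs = request.get_json()
    return jsonify({"ms_probability": run_model(freqs)})
```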
Accomplishments that we're proud of
We are proud of being able to process terabytes of data in a short period of time, validating the methodology. Our current results suggest that this method has the potential to break a barrier in MS diagnosis and treatment. We think this project can go even further in a research setting, given more time to process data and experiment with additional architectures.
What we learned
We learned that some architectures need more than 36 hours to get working. We learned how to go about processing complex datasets, and the value of efficiency in data processing. Further, sometimes simple models work better than complex ones, especially with limited time and resources.
What's next for Multiple Sclerosis Blood Test using TCR ML Algorithm
Given more time to process the datasets we gathered, we expect to see an improved model. We can also perform testing on a larger dataset to validate our results.
We want to try different pretrained model architectures. We had planned to use Meta's ESM-2 protein model, but did not have time to fully implement it. We also want to use models pretrained specifically on TCR data rather than general proteins (or, given our extensive dataset sweep, pretrain our own) to further compensate for our lack of data. Finally, we could expand on the Random Forest architecture and combine it with our deep learning approach.