The RNA modification N6-methyladenosine (m6A) is implicated in a variety of cellular and disease processes. The location of m6A sites on a given RNA transcript may determine its degradation, stability, signaling function, or dysregulation, and can be a marker of disease states. Identifying m6A sites has therefore become an important area of work. Molecular methods for doing this are slow and expensive: they require specific sequencing data in combination with precipitation using antibodies or binding proteins. In light of these limitations, it is pragmatic to predict m6A sites de novo from patterns learned by analyzing existing sequencing data.
What it does
We sought to predict the locations of m6A sites in an RNA directly from its sequence. Specifically, given a sequence of ACGT letters, we want to predict the index of each letter in the sequence that carries the m6A modification.
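To make the task concrete, here is a minimal sketch of how one labeled example could be represented. The function name `label_positions`, the 0-based indexing, and the check that every site falls on an "A" are assumptions made for illustration, not details taken from our pipeline.

```python
def label_positions(sequence, m6a_indices):
    """Return one 0/1 label per letter: 1 where an m6A site is annotated.

    Assumes 0-based indices (hypothetical convention for this sketch).
    """
    sites = set(m6a_indices)
    for i in sites:
        if not 0 <= i < len(sequence):
            raise ValueError(f"site index {i} is outside the sequence")
        if sequence[i] != "A":
            # m6A is a modification of adenosine, so a site must sit on an A.
            raise ValueError(f"site index {i} is {sequence[i]!r}, not 'A'")
    return [1 if i in sites else 0 for i in range(len(sequence))]


# Toy sequence with a single annotated site at index 2.
labels = label_positions("GGACTAGACA", [2])
```

A model for this task takes the sequence as input and is scored on how well it recovers the positions labeled 1.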
How we built it
We implemented the Gene2vec architecture described previously by Zou et al., 2019. This architecture splits RNA sequences into "words", which are encoded using Word2Vec and passed through a convolutional neural network that predicts m6A sites. Each RNA sequence is paired with its known m6A sites, which provides the labels for model training.
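The "words" and the skip-gram training pairs that Word2Vec learns from can be sketched in plain Python as follows. The word length `k=3`, the stride of 1, and the context size are assumed values chosen for illustration, and the function names are hypothetical; this is not the code from our repository.

```python
def kmer_words(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-letter "words"."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def skipgram_pairs(words, context=2):
    """Pair each word with its neighbors within `context` positions.

    Returns (center, neighbor) tuples, the training examples for a
    skip-gram model.
    """
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - context)
        hi = min(len(words), i + context + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


words = kmer_words("GGACT")          # ['GGA', 'GAC', 'ACT']
pairs = skipgram_pairs(words, context=1)
```

The number of pairs grows with both the number of words and the context size, which is part of why the full dataset becomes so large.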
Challenges we ran into
The most difficult challenge facing our project was the enormous size of the processed data. Final processing created hundreds of thousands of windows in RNA transcripts and, from these windows, millions of RNA sentences. In fact, iterating over all the windows proved computationally intractable in Python: skipgram_generation_dl_final.py had still not finished after using the full resources of an exploratory account on the Brown CCV Oscar system (32 CPUs, 48-hour runtime). Another difficult step was preprocessing the sequence data. The training data provided in the supplement to the paper gave only Ensembl transcript IDs, so we had to write our own R script to annotate these IDs with their actual RNA sequences and then parse each sequence into windows that could contain m6A sites.
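The windowing step can be sketched as a generator that yields one window at a time instead of building the full list in memory. Centering a window on every "A", the flank size, and padding with "N" at transcript ends are all assumptions for illustration, and `iter_windows` is a hypothetical name. Streaming keeps memory use flat, but it does not by itself reduce the total amount of work.

```python
def iter_windows(sequence, flank=2, pad="N"):
    """Yield (center_index, window) for every 'A' in the sequence.

    Each window has length 2 * flank + 1 with the candidate A in the
    middle; positions past either end of the transcript are filled
    with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    for i, base in enumerate(sequence):
        if base == "A":
            # Index i in `sequence` is index i + flank in `padded`,
            # so the window starts at i and ends at i + 2 * flank.
            yield i, padded[i:i + 2 * flank + 1]


# Windows are produced lazily, one per loop iteration.
for center, window in iter_windows("GGACTA", flank=2):
    print(center, window)
```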
You can find our one-page summary here.
You can find our in-depth plan (check-in 2) here.
You can find our final write-up here.