Abstract

Our project aims to train the DNABERT-2-117M machine learning model to predict which polyadenylation site a gene will use, and where it lies, based on features derived from the RNA sequence. By identifying sequence motifs (such as AAUAAA variants), nucleotide composition, and position within the transcript, the model will highlight key factors influencing site choice.

We will train and test the model using publicly available datasets from PolyASite 2.0 and PolyA_DB, which provide experimentally validated catalogs of polyadenylation sites, and GENCODE, which supplies comprehensive gene annotations. Together, these resources give us a high-confidence, genome-wide reference of polyadenylation sites and their transcript contexts. The model’s output will include site predictions along with visual aids for interpreting them.

This tool could help researchers understand APA regulation and potentially detect disease-associated changes in RNA processing.

Workflow

Phase 1: Planning and Research

Tasks were delegated to members as follows:

  1. Raw data processing
  2. Neural network research
  3. Presentation and visuals

For the ML model, we decided to use the transformer-based genome foundation model DNABERT-2-117M.

Why a transformer: self-attention captures long-range motif interactions (AAUAAA, UGUA, U/G-rich elements) across tens to hundreds of bases, and pretrained genomic representations plus parallel training make the model both fast to fine-tune and interpretable through k-mer attributions.

Why DNABERT-2-117M: a genome-native tokenizer, ~117M parameters (hackathon-friendly), easy loading via AutoTokenizer/AutoModelForSequenceClassification, a strong prior for motif-based classification, and token-level attributions that map cleanly to biological k-mers.
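As a toy illustration of the motif context the model has to reason over (this sketch is ours, not project code), a plain-Python scan can locate common PAS hexamer variants in a DNA-alphabet window; the canonical hexamer and downstream elements can sit tens of bases apart, which is exactly the long-range structure self-attention handles well:

```python
import re

# Common PAS hexamer variants: canonical AAUAAA plus frequent single-base
# variants, written in the DNA alphabet (U -> T).
PAS_VARIANTS = ["AATAAA", "ATTAAA", "AGTAAA", "TATAAA"]

def find_pas_hexamers(seq):
    """Return sorted (position, hexamer) pairs for every PAS variant in seq."""
    seq = seq.upper()
    hits = []
    for motif in PAS_VARIANTS:
        for m in re.finditer(motif, seq):
            hits.append((m.start(), motif))
    return sorted(hits)

# Two variant hexamers separated by 26 nt of intervening sequence.
window = "GCGC" + "AATAAA" + "C" * 20 + "TGTGTG" + "ATTAAA" + "GG"
print(find_pas_hexamers(window))  # [(4, 'AATAAA'), (36, 'ATTAAA')]
```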

We chose PolyASite 2.0 and PolyA_DB because they provide experimentally validated catalogs of human and mouse polyadenylation sites across multiple tissues and experimental conditions. These are the gold-standard references for APA studies, ensuring our model learns from real, biologically relevant examples rather than predictions.

GENCODE was used for high-quality gene annotations (including transcript boundaries, strand orientation, and exon/intron structures). This ensured that each polyadenylation site could be contextualized relative to its host gene and transcript.

From these datasets, we kept the following columns because they are most relevant for predicting APA site choice:

  • chromosome, strand, start/end position → defines site location
  • gene_id / transcript_id → links the site to its gene context
  • PAS motif (sequence window) → captures canonical and non-canonical motifs
  • distance to stop codon / transcript end → reflects positional bias in APA usage
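The positional feature in the last bullet can be derived from the GENCODE annotation fields; a minimal sketch (function and field names are ours, not from the project pipeline), assuming 0-based coordinates:

```python
def distance_to_transcript_end(site_pos, tx_start, tx_end, strand):
    """Distance (nt) from a polyadenylation site to the annotated 3' end.

    On the + strand the transcript's 3' end is tx_end; on the - strand it
    is tx_start. Coordinates are assumed 0-based; names are illustrative.
    """
    three_prime_end = tx_end if strand == "+" else tx_start
    return abs(three_prime_end - site_pos)

# A site 120 nt from the 3' end of a minus-strand transcript:
print(distance_to_transcript_end(site_pos=5120, tx_start=5000,
                                 tx_end=9000, strand="-"))  # 120
```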

We compiled these sources into one standardized .csv using genome_kit, which allowed us to lift coordinates to hg38, extract surrounding RNA sequence windows, and harmonize the annotation fields. The result is a unified dataset ready for sequence-based model input.
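The window-extraction step must respect strand orientation so every sequence reaches the model in sense direction. A minimal sketch of that logic, assuming 0-based coordinates and a plain string standing in for the hg38 chromosome sequence (the real pipeline does this lookup through genome_kit):

```python
def revcomp(seq):
    """Reverse-complement a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def sequence_window(chrom_seq, pos, strand, flank=50):
    """Extract a +/- flank window around pos, in sense orientation.

    chrom_seq is a placeholder for the chromosome sequence; minus-strand
    sites are reverse-complemented so motifs read 5' -> 3'.
    """
    lo, hi = max(0, pos - flank), min(len(chrom_seq), pos + flank + 1)
    window = chrom_seq[lo:hi]
    return revcomp(window) if strand == "-" else window

chrom = "ACGT" * 50  # toy stand-in for a chromosome
print(sequence_window(chrom, pos=100, strand="+", flank=5))  # TACGTACGTAC
```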

Phase 2: Preparation for Model Training

A data pre-processing pipeline was built to format raw RNA sequences for the chosen ML model, DNABERT-2-117M. The pipeline was then scaled up to work on batches of data.
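The real pipeline uses the DNABERT-2 tokenizer from Hugging Face; the sketch below (with toy integer ids, not real token ids) shows only the batching idea — padding variable-length token sequences into the rectangular `input_ids`/`attention_mask` arrays a tokenizer produces with `padding=True`:

```python
def pad_batch(token_id_lists, pad_id=0):
    """Right-pad variable-length token id lists into a rectangular batch.

    Returns (input_ids, attention_mask) as nested lists, mirroring the
    shapes a Hugging Face tokenizer emits when padding a batch.
    """
    width = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        pad = width - len(ids)
        input_ids.append(ids + [pad_id] * pad)      # pad ids with pad_id
        attention_mask.append([1] * len(ids) + [0] * pad)  # mask out padding
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 9, 2], [7, 1], [3, 3, 3, 3]])
print(ids)   # [[5, 9, 2, 0], [7, 1, 0, 0], [3, 3, 3, 3]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```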

Phase 3: Model Training

Trained the model on our dataset using Google Colab servers. Set up and ran the end-to-end transformer training and evaluation pipeline for DNABERT-2-117M in PyTorch and Hugging Face on A100 GPUs. Implemented batching, mixed-precision training, checkpointing, Git LFS for large artifacts, and run management. Trained on ~800,000 labeled sequence windows from PolyASite 2.0, PolyA_DB, and GENCODE.
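The mixed-precision and checkpointing setup maps onto standard Hugging Face `TrainingArguments`; a hedged configuration sketch — the values below are illustrative, not the exact hyperparameters of our run:

```python
from transformers import TrainingArguments

# Illustrative values only; the actual run used A100 GPUs on Colab.
args = TrainingArguments(
    output_dir="checkpoints",        # periodic checkpointing target
    per_device_train_batch_size=64,  # batching
    num_train_epochs=3,
    fp16=True,                       # mixed-precision training
    save_strategy="steps",
    save_steps=5000,
    logging_steps=500,
)
```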

Phase 4: Evaluation and Interpretation

Used built-in functions from transformers and scikit-learn to evaluate model accuracy. Compiled raw BED/GTF files into one CSV (including data points like ±50-nt windows, AAUAAA/variants, GC%) for model checks. Curated a smaller set of data for demoing the tuned model and generating visualizations using seaborn.
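The metric computation follows standard scikit-learn calls; a small sketch on toy labels and scores (stand-ins for held-out model outputs, not our actual predictions):

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)

# Toy binary labels and model scores; threshold at 0.5 for hard predictions.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.45, 0.6, 0.1, 0.7, 0.3]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print(accuracy_score(y_true, y_pred))           # 0.75
print(roc_auc_score(y_true, y_score))           # 0.9375
print(average_precision_score(y_true, y_score)) # 0.95
print(f1_score(y_true, y_pred))                 # 0.75
```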

Results and Metrics

The tuned model reached 86% polyA-site prediction accuracy, with AUROC 0.93, AUPRC 0.94, and F1 0.86.

Challenges We Faced and Things We Learned

Coming into this hackathon, our main objective was to learn. We signed up as undergrads from a variety of backgrounds—astrophysics, CS/math, mechanical engineering, medical sciences, and one in bioinformatics. Only two of us took biology in high school, and none of us knew machine learning going in, so nearly everything was new. We spent a lot of time on the basics (genomics terms, datasets, preprocessing, tokenizers, evaluation) and on tooling. We learned fast, asked good questions, split tasks, and kept iterating. We shared what we knew—whether that was how to use Git or what a protein is. By the end of the weekend, every member played an integral part, turning what we learned into something we can all be proud of (and finally pronounce the title of)!
