MeTrO - Analysis Pipeline

CCLE Dataset

Inspiration

The hope with multi-omics data integration is to learn more than any single data modality can reveal on its own. We wanted to see how gene expression and metabolite profiles vary across conditions. VAEs are well suited to this because they learn a shared latent space in which hidden biological patterns can emerge, and they can capture non-linear relationships that linear factor models such as MOFA+ cannot. By aligning the latent spaces of the two modalities, we hope to build a generative model that can capture perturbations in one modality.

What it does

Our pipeline ingests transcriptomic and metabolomic data, trains a joint VAE to learn a low-dimensional latent space, and then uses that space to identify hidden gene–metabolite relationships and predict phenotypic outcomes.

How we built it

We combined Python-based VAE architectures (PyTorch + scVI-style layers) with pre-processing modules for transcriptomics and metabolomics. Each omics layer has its own encoder, and the encoders feed a single latent space that is learned jointly. The decoder reconstructs both data types from that latent space, and we overlay pathway and network annotations on the latent factors. For validation, we ran gene set enrichment analysis (GSEA) and MOFA.
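The architecture described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not our exact model: the class name `JointVAE`, the layer sizes, the log1p transform on counts, and the choice to fuse the two encoders by concatenation are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class JointVAE(nn.Module):
    """Two modality-specific encoders, one shared latent space, two decoders."""

    def __init__(self, n_genes, n_metabolites, n_latent=16, n_hidden=64):
        super().__init__()
        # One encoder per omics layer.
        self.enc_rna = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
        self.enc_met = nn.Sequential(nn.Linear(n_metabolites, n_hidden), nn.ReLU())
        # Assumption: the two encodings are concatenated before the latent heads.
        self.to_mu = nn.Linear(2 * n_hidden, n_latent)
        self.to_logvar = nn.Linear(2 * n_hidden, n_latent)
        # One decoder per omics layer, both reading the same latent vector.
        self.dec_rna = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_genes)
        )
        self.dec_met = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_metabolites)
        )

    def encode(self, rna, met):
        # Assumption: counts are log1p-transformed before encoding.
        h = torch.cat([self.enc_rna(torch.log1p(rna)), self.enc_met(met)], dim=1)
        return self.to_mu(h), self.to_logvar(h)

    def forward(self, rna, met):
        mu, logvar = self.encode(rna, met)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec_rna(z), self.dec_met(z), mu, logvar


# Toy usage on random data (8 samples, 200 genes, 50 metabolites).
torch.manual_seed(0)
model = JointVAE(n_genes=200, n_metabolites=50)
rna = torch.poisson(torch.full((8, 200), 5.0))  # fake count data
met = torch.randn(8, 50)                        # fake continuous abundances
rna_out, met_out, mu, logvar = model(rna, met)
```

The decoder outputs here are unconstrained; in practice each would pass through a likelihood-specific head (for example, a positive mean for the count likelihood) before the reconstruction loss is computed.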

Challenges we ran into

The hardest part was getting two very different data distributions (counts vs. continuous metabolite abundances) to train stably in a single VAE. We had to choose the right likelihood functions (negative binomial for counts, Gaussian for metabolite data) without destabilizing the KL term, and scale to thousands of features without overfitting in a hackathon timeframe. We also had issues with our AWS instances and lost most of our results and data.
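To make the likelihood choice concrete, here is a small plain-Python sketch of the three loss terms involved: a negative binomial log-likelihood for counts, a Gaussian log-likelihood for metabolite abundances, and the KL divergence to a standard normal prior. The function names are hypothetical, the NB parameterization (mean `mu`, inverse dispersion `theta`) is the one commonly used in scVI-style models, and the linear KL warm-up in `kl_weight` is one common stabilization trick that we include as an assumption, not as a record of what our training loop did.

```python
import math


def nb_log_likelihood(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu > 0
    and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def gaussian_log_likelihood(x, mean, std):
    """Log-density of x under a normal distribution with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(mu, logvar))


def kl_weight(epoch, warmup_epochs=10):
    """Assumed linear warm-up: ramp the KL weight from 0 to 1 over the first epochs."""
    return min(1.0, epoch / warmup_epochs)


def negative_elbo(nb_ll, gauss_ll, kl, beta=1.0):
    """Loss to minimize: negative reconstruction log-likelihoods plus weighted KL."""
    return -(nb_ll + gauss_ll) + beta * kl


# Toy usage: one gene count, one metabolite value, a 2-dimensional latent.
loss = negative_elbo(
    nb_ll=nb_log_likelihood(4, mu=3.0, theta=2.0),
    gauss_ll=gaussian_log_likelihood(0.3, mean=0.0, std=1.0),
    kl=kl_to_standard_normal([0.5, -0.2], [0.0, -0.1]),
    beta=kl_weight(epoch=3),
)
```

Because the two reconstruction terms live on different scales (thousands of gene counts vs. far fewer metabolite features), the balance between them and the KL term is where instability tends to show up.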

Accomplishments that we're proud of

We built a joint-omics VAE from scratch in under a weekend and got training to run stably with a different likelihood for each modality. The latent factors aligned with known pathways and revealed new ones.