PRISM: Predicting Responses in Single-cell oMics

Inspiration

Drug discovery remains one of the most costly and time-consuming challenges in biotechnology. Understanding a compound’s mechanism of action (MoA) at the cellular level is critical for developing new therapeutics, yet existing approaches often require labor-intensive experiments and are difficult to scale. With the rapid growth of large-scale single-cell transcriptomic datasets such as Tahoe-100M, there is an opportunity to leverage machine learning to predict cellular responses and accelerate MoA identification.

What it does

PRISM is a machine learning framework designed to predict mechanisms of action directly from gene expression profiles. By combining multimodal representation learning with supervised classification, PRISM can:

  • Encode high-dimensional transcriptomic data into latent embeddings.
  • Learn contrastive representations between small molecules (via ChemBERTa-derived SMILES embeddings) and cellular responses.
  • Perform supervised classification to predict MoA labels from single-cell omics data.
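The CLIP-style alignment between molecule and expression embeddings can be sketched with a symmetric InfoNCE loss. This is an illustrative NumPy sketch, not PRISM's actual training code; batch size, embedding dimension, and the `temperature` value are placeholder assumptions.

```python
import numpy as np

def info_nce_loss(mol_emb, expr_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss; row i of each batch is a
    matched (molecule, expression-profile) positive pair."""
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    expr = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    logits = mol @ expr.T / temperature        # (B, B) cosine similarities
    diag = np.arange(logits.shape[0])

    def xent(l):
        # cross-entropy with the diagonal (matched pair) as the target class
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # average the molecule->expression and expression->molecule directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))
aligned = info_nce_loss(z, z)                         # matched pairs: low loss
mismatched = info_nce_loss(z, rng.normal(size=(8, 64)))  # random pairs: ~log(B)
```

A lower loss for matched pairs than for random pairings is what drives the two modalities into a shared embedding space.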

The platform enables researchers to generate actionable hypotheses on how compounds perturb cellular states, supporting both basic science and early-stage drug discovery pipelines.

How we built it

  • Data preprocessing: Gene expression data from Tahoe-100M was curated, normalized, and formatted into AnnData objects for downstream modeling.
  • Representation learning: We implemented a contrastive learning pipeline inspired by CLIP/InfoNCE to align small-molecule and transcriptomic modalities in a shared embedding space.
  • Supervised head: A multi-layer perceptron (MLP) was trained on top of embeddings to predict compound MoA.
  • Tools & frameworks: PyTorch, scVI-tools, Hugging Face (ChemBERTa), Polars for data handling, and Optuna for hyperparameter optimization.
  • Deployment: Training was orchestrated on an HPC cluster with SLURM scheduling, and the pipeline is structured for scalability to millions of datapoints.
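The supervised head in the second bullet can be sketched as a small MLP mapping a learned embedding to MoA class probabilities. This is a minimal NumPy forward-pass sketch (the real head was trained in PyTorch); the layer sizes and class count are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_head_forward(z, params):
    """Two-layer MLP head: embedding -> hidden (ReLU) -> MoA class probabilities."""
    h = np.maximum(z @ params["W1"] + params["b1"], 0.0)   # hidden layer, ReLU
    logits = h @ params["W2"] + params["b2"]
    e = np.exp(logits - logits.max(axis=1, keepdims=True)) # stable softmax
    return e / e.sum(axis=1, keepdims=True)

# placeholder dimensions: 32-d embedding, 64 hidden units, 10 MoA classes
emb_dim, hidden, n_moa = 32, 64, 10
params = {
    "W1": rng.normal(0, 0.1, (emb_dim, hidden)), "b1": np.zeros(hidden),
    "W2": rng.normal(0, 0.1, (hidden, n_moa)),   "b2": np.zeros(n_moa),
}
probs = mlp_head_forward(rng.normal(size=(4, emb_dim)), params)  # (4, 10)
```

In training, the argmax over these probabilities gives the predicted MoA label for each compound-response embedding.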

Challenges we ran into

  • Variable-length expression vectors: Raw transcriptomic data from Tahoe-100M required careful preprocessing to standardize input dimensions.
  • Balancing contrastive vs. supervised learning: Designing a training scheme that combines the contrastive alignment objective with the supervised MoA objective, without letting either one overfit, was a key challenge.
  • Data scale: Handling tens of millions of single-cell profiles demanded efficient preprocessing and distributed training strategies.
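The first challenge, standardizing variable-length expression vectors, amounts to mapping each cell's sparse gene counts onto a fixed gene panel before normalization. A minimal sketch, assuming dict-of-counts input and scanpy-style counts-per-10k + log1p normalization (gene names below are made-up examples, and library size is computed over panel genes only for brevity):

```python
import numpy as np

def align_to_panel(cells, panel):
    """Map variable-length {gene: count} profiles onto a fixed gene panel,
    then library-size normalize (counts per 10k) and log1p-transform."""
    idx = {g: i for i, g in enumerate(panel)}
    X = np.zeros((len(cells), len(panel)))
    for r, cell in enumerate(cells):
        for gene, count in cell.items():
            if gene in idx:                 # genes outside the panel are dropped
                X[r, idx[gene]] = count
    lib = X.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0                     # avoid dividing empty cells by zero
    return np.log1p(X / lib * 1e4)

panel = ["TP53", "EGFR", "MYC"]             # hypothetical fixed panel
cells = [{"TP53": 5, "BRCA1": 2},           # BRCA1 not in panel -> dropped
         {"MYC": 1}]
X = align_to_panel(cells, panel)            # fixed shape (2, 3)
```

Every cell now has the same dimensionality regardless of how many genes were detected, which is what the downstream encoder requires.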

Accomplishments that we’re proud of

  • Successfully built a working end-to-end pipeline from raw single-cell data to MoA classification.
  • Demonstrated the feasibility of contrastive multimodal learning for transcriptomics + chemistry.
  • Designed a modular architecture that can be extended to other omics modalities (proteomics, cell-painting, etc.).

What we learned

  • Contrastive learning can be adapted effectively for biological modalities, provided the embeddings are carefully aligned.
  • HPC workflows require thoughtful engineering for I/O and preprocessing bottlenecks.
  • Building scalable, reproducible bioinformatics pipelines depends as much on data engineering as on model architecture.

What’s next for PRISM

  • Scaling up: Extend training to the full Tahoe-100M dataset and other multimodal single-cell resources.
  • Generalization: Evaluate the framework across unseen compounds, dosages, and cell types to test robustness.
  • Integration: Explore integration into drug discovery loops, including docking proxies, generative chemistry, and active learning pipelines.
  • Open science: Package PRISM as a reproducible toolkit for the community, enabling researchers to benchmark models on public single-cell datasets.