PRISM: Predicting Responses in Single-cell oMics
Inspiration
Drug discovery remains one of the most costly and time-consuming challenges in biotechnology. Understanding a compound’s mechanism of action (MoA) at the cellular level is critical for developing new therapeutics, yet existing approaches often require labor-intensive experiments and are difficult to scale. With the rapid growth of large-scale single-cell transcriptomic datasets such as Tahoe-100M, there is an opportunity to leverage machine learning to predict cellular responses and accelerate MoA identification.
What it does
PRISM is a machine learning framework designed to predict mechanisms of action directly from gene expression profiles. By combining multimodal representation learning with supervised classification, PRISM can:
- Encode high-dimensional transcriptomic data into latent embeddings.
- Learn contrastive representations between small molecules (via ChemBERTa-derived SMILES embeddings) and cellular responses.
- Perform supervised classification to predict MoA labels from single-cell omics data.
The platform enables researchers to generate actionable hypotheses about how compounds perturb cellular states, supporting both basic science and early-stage drug discovery pipelines.
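The three capabilities above can be sketched as a small two-branch network: a projection head for chemical embeddings, a projection head for expression vectors, and an MLP that predicts MoA labels from the expression side. All names and dimensions below (e.g. `chem_dim=768` for ChemBERTa-sized inputs, `n_moa=10`) are illustrative assumptions, not the actual PRISM configuration.

```python
import torch
import torch.nn as nn

class PrismSketch(nn.Module):
    """Toy two-branch model: shared contrastive space + supervised MoA head."""

    def __init__(self, chem_dim=768, expr_dim=2000, shared_dim=128, n_moa=10):
        super().__init__()
        # Projection heads mapping each modality into the shared space
        self.chem_proj = nn.Sequential(nn.Linear(chem_dim, 256), nn.ReLU(),
                                       nn.Linear(256, shared_dim))
        self.expr_proj = nn.Sequential(nn.Linear(expr_dim, 256), nn.ReLU(),
                                       nn.Linear(256, shared_dim))
        # Supervised MLP head predicting MoA from the expression embedding
        self.moa_head = nn.Sequential(nn.Linear(shared_dim, 64), nn.ReLU(),
                                      nn.Linear(64, n_moa))

    def forward(self, chem_emb, expr):
        # L2-normalize so cosine similarity is a plain dot product
        z_chem = nn.functional.normalize(self.chem_proj(chem_emb), dim=-1)
        z_expr = nn.functional.normalize(self.expr_proj(expr), dim=-1)
        return z_chem, z_expr, self.moa_head(z_expr)

model = PrismSketch()
z_chem, z_expr, logits = model(torch.randn(4, 768), torch.randn(4, 2000))
print(z_chem.shape, z_expr.shape, logits.shape)
```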
How we built it
- Data preprocessing: Gene expression data from Tahoe-100M was curated, normalized, and formatted into AnnData objects for downstream modeling.
- Representation learning: We implemented a contrastive learning pipeline inspired by CLIP/InfoNCE to align small-molecule and transcriptomic modalities in a shared embedding space.
- Supervised head: A multi-layer perceptron (MLP) was trained on top of embeddings to predict compound MoA.
- Tools & frameworks: PyTorch, scVI-tools, Hugging Face (ChemBERTa), Polars for data handling, and Optuna for hyperparameter optimization.
- Deployment: Training was orchestrated on an HPC cluster with SLURM scheduling, and the pipeline is structured to scale to millions of data points.
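The CLIP/InfoNCE alignment step can be written as a symmetric cross-entropy over the batch similarity matrix, where matched (molecule, cell-response) pairs sit on the diagonal. This is a minimal sketch assuming L2-normalized embeddings; the temperature value is a common default, not PRISM's actual hyperparameter.

```python
import torch
import torch.nn.functional as F

def info_nce(z_chem, z_expr, temperature=0.07):
    """Symmetric InfoNCE: each modality must pick its matched partner
    out of the batch; positives are on the diagonal of the logits."""
    logits = z_chem @ z_expr.t() / temperature       # (B, B) cosine similarities
    targets = torch.arange(z_chem.size(0))           # diagonal = matched pairs
    loss_c2e = F.cross_entropy(logits, targets)      # molecule -> response
    loss_e2c = F.cross_entropy(logits.t(), targets)  # response -> molecule
    return 0.5 * (loss_c2e + loss_e2c)

# Perfectly aligned pairs give a near-zero loss; unrelated pairs sit near log(B)
z = F.normalize(torch.randn(8, 128), dim=-1)
print(info_nce(z, z).item())
```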
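For the SLURM-orchestrated training, a job script along these lines is typical; the partition name, resource numbers, module command, and entry-point script here are placeholders for the cluster-specific setup, not the actual configuration.

```shell
#!/bin/bash
# Hypothetical SLURM job script for one PRISM training run
#SBATCH --job-name=prism-train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00

module load cuda  # site-specific; depends on the cluster's module system
python train.py --config configs/contrastive.yaml  # hypothetical entry point
```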
Challenges we ran into
- Variable-length expression vectors: Raw transcriptomic data from Tahoe-100M required careful preprocessing to standardize input dimensions.
- Balancing contrastive vs. supervised learning: Designing a framework that leverages both modalities without overfitting was a key challenge.
- Data scale: Handling tens of millions of single-cell profiles demanded efficient preprocessing and distributed training strategies.
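One simple way to standardize variable-length expression vectors, as in the first challenge above, is to map each cell's measured genes onto a fixed reference gene list, leaving unmeasured genes at zero so every input has the same dimension. The gene names and tiny reference panel below are purely illustrative:

```python
import numpy as np

REFERENCE_GENES = ["TP53", "EGFR", "MYC", "KRAS", "BRCA1"]  # fixed input order
REF_INDEX = {g: i for i, g in enumerate(REFERENCE_GENES)}

def align_to_reference(genes, counts):
    """Return a fixed-length vector over REFERENCE_GENES; genes outside the
    reference panel are dropped, missing genes stay zero."""
    vec = np.zeros(len(REFERENCE_GENES), dtype=np.float32)
    for gene, count in zip(genes, counts):
        idx = REF_INDEX.get(gene)
        if idx is not None:
            vec[idx] = count
    return vec

cell = align_to_reference(["MYC", "TP53", "GAPDH"], [5.0, 2.0, 9.0])
print(cell)  # [2. 0. 5. 0. 0.] -- GAPDH is not in the reference panel
```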
Accomplishments that we’re proud of
- Successfully built a working end-to-end pipeline from raw single-cell data to MoA classification.
- Demonstrated the feasibility of contrastive multimodal learning for transcriptomics + chemistry.
- Designed a modular architecture that can be extended to other omics modalities (proteomics, cell-painting, etc.).
What we learned
- Contrastive learning can be adapted effectively for biological modalities, provided the embeddings are carefully aligned.
- HPC workflows require thoughtful engineering for I/O and preprocessing bottlenecks.
- Building scalable, reproducible bioinformatics pipelines depends as much on data engineering as on model architecture.
What’s next for PRISM
- Scaling up: Extend training to the full Tahoe-100M dataset and other multimodal single-cell resources.
- Generalization: Evaluate the framework across unseen compounds, dosages, and cell types to test robustness.
- Integration: Explore integration into drug discovery loops, including docking proxies, generative chemistry, and active learning pipelines.
- Open science: Package PRISM as a reproducible toolkit for the community, enabling researchers to benchmark models on public single-cell datasets.