PRISM: Predicting Responses in Single-cell oMics

Inspiration

Drug discovery remains one of the most costly and time-consuming challenges in biotechnology. Understanding a compound’s mechanism of action (MoA) at the cellular level is critical for developing new therapeutics, yet existing approaches often require labor-intensive experiments and are difficult to scale. With the rapid growth of large-scale single-cell transcriptomic datasets such as Tahoe-100M, there is an opportunity to leverage machine learning to predict cellular responses and accelerate MoA identification.

What it does

PRISM is a machine learning framework designed to predict mechanisms of action directly from gene expression profiles. By combining multimodal representation learning with supervised classification, PRISM can:

  • Encode high-dimensional transcriptomic data into latent embeddings.
  • Learn contrastive representations between small molecules (via ChemBERTa-derived SMILES embeddings) and cellular responses.
  • Perform supervised classification to predict MoA labels from single-cell omics data.
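The CLIP-style alignment between molecule and expression embeddings can be sketched with a symmetric InfoNCE loss. This is an illustrative NumPy sketch, not PRISM's actual training code; batch size, embedding dimension, and the `temperature` value are placeholder assumptions.

```python
import numpy as np

def info_nce_loss(mol_emb, expr_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss; row i of each batch is a
    matched (molecule, expression-profile) positive pair."""
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    expr = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    logits = mol @ expr.T / temperature        # (B, B) cosine similarities
    diag = np.arange(logits.shape[0])

    def xent(l):
        # cross-entropy with the diagonal (matched pair) as the target class
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # average the molecule->expression and expression->molecule directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))
aligned = info_nce_loss(z, z)                         # matched pairs: low loss
mismatched = info_nce_loss(z, rng.normal(size=(8, 64)))  # random pairs: ~log(B)
```

A lower loss for matched pairs than for random pairings is what drives the two modalities into a shared embedding space.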

The platform enables researchers to generate actionable hypotheses on how compounds perturb cellular states, supporting both basic science and early-stage drug discovery pipelines.

How we built it

  • Data preprocessing: Gene expression data from Tahoe-100M was curated, normalized, and formatted into AnnData objects for downstream modeling.
  • Representation learning: We implemented a contrastive learning pipeline inspired by CLIP/InfoNCE to align small-molecule and transcriptomic modalities in a shared embedding space.
  • Supervised head: A multi-layer perceptron (MLP) was trained on top of embeddings to predict compound MoA.
  • Tools & frameworks: PyTorch, scVI-tools, Hugging Face (ChemBERTa), Polars for data handling, and Optuna for hyperparameter optimization.
  • Deployment: Training was orchestrated on an HPC cluster with SLURM scheduling, and the pipeline is structured for scalability to millions of datapoints.
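The supervised head in the second bullet can be sketched as a small MLP mapping a learned embedding to MoA class probabilities. This is a minimal NumPy forward-pass sketch (the real head was trained in PyTorch); the layer sizes and class count are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_head_forward(z, params):
    """Two-layer MLP head: embedding -> hidden (ReLU) -> MoA class probabilities."""
    h = np.maximum(z @ params["W1"] + params["b1"], 0.0)   # hidden layer, ReLU
    logits = h @ params["W2"] + params["b2"]
    e = np.exp(logits - logits.max(axis=1, keepdims=True)) # stable softmax
    return e / e.sum(axis=1, keepdims=True)

# placeholder dimensions: 32-d embedding, 64 hidden units, 10 MoA classes
emb_dim, hidden, n_moa = 32, 64, 10
params = {
    "W1": rng.normal(0, 0.1, (emb_dim, hidden)), "b1": np.zeros(hidden),
    "W2": rng.normal(0, 0.1, (hidden, n_moa)),   "b2": np.zeros(n_moa),
}
probs = mlp_head_forward(rng.normal(size=(4, emb_dim)), params)  # (4, 10)
```

In training, the argmax over these probabilities gives the predicted MoA label for each compound-response embedding.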

Challenges we ran into

  • Variable-length expression vectors: Raw transcriptomic data from Tahoe-100M required careful preprocessing to standardize input dimensions.
  • Balancing contrastive vs. supervised learning: Designing a training scheme that combines the contrastive alignment objective with the supervised MoA objective, without letting either one overfit, was a key challenge.
  • Data scale: Handling tens of millions of single-cell profiles demanded efficient preprocessing and distributed training strategies.
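The first challenge, standardizing variable-length expression vectors, amounts to mapping each cell's sparse gene counts onto a fixed gene panel before normalization. A minimal sketch, assuming dict-of-counts input and scanpy-style counts-per-10k + log1p normalization (gene names below are made-up examples, and library size is computed over panel genes only for brevity):

```python
import numpy as np

def align_to_panel(cells, panel):
    """Map variable-length {gene: count} profiles onto a fixed gene panel,
    then library-size normalize (counts per 10k) and log1p-transform."""
    idx = {g: i for i, g in enumerate(panel)}
    X = np.zeros((len(cells), len(panel)))
    for r, cell in enumerate(cells):
        for gene, count in cell.items():
            if gene in idx:                 # genes outside the panel are dropped
                X[r, idx[gene]] = count
    lib = X.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0                     # avoid dividing empty cells by zero
    return np.log1p(X / lib * 1e4)

panel = ["TP53", "EGFR", "MYC"]             # hypothetical fixed panel
cells = [{"TP53": 5, "BRCA1": 2},           # BRCA1 not in panel -> dropped
         {"MYC": 1}]
X = align_to_panel(cells, panel)            # fixed shape (2, 3)
```

Every cell now has the same dimensionality regardless of how many genes were detected, which is what the downstream encoder requires.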

Accomplishments that we’re proud of

  • Successfully built a working end-to-end pipeline from raw single-cell data to MoA classification.
  • Demonstrated the feasibility of contrastive multimodal learning for transcriptomics + chemistry.
  • Designed a modular architecture that can be extended to other omics modalities (proteomics, cell-painting, etc.).

What we learned

  • Contrastive learning can be adapted effectively for biological modalities, provided the embeddings are carefully aligned.
  • HPC workflows require thoughtful engineering for I/O and preprocessing bottlenecks.
  • Building scalable, reproducible bioinformatics pipelines depends as much on data engineering as on model architecture.

What’s next for PRISM

  • Scaling up: Extend training to the full Tahoe-100M dataset and other multimodal single-cell resources.
  • Generalization: Evaluate the framework across unseen compounds, dosages, and cell types to test robustness.
  • Integration: Explore integration into drug discovery loops, including docking proxies, generative chemistry, and active learning pipelines.
  • Open science: Package PRISM as a reproducible toolkit for the community, enabling researchers to benchmark models on public single-cell datasets.