Inspiration

To accelerate scientific discovery, we need to identify a large number of compounds with high precision. MS/MS (tandem mass spectrometry) could do that. The catch: for 90% of molecules, we don't have a known spectrum.

What it does

Given a molecule, MSEffect predicts the molecule's MS/MS spectrum.

How we built it

We combine two state-of-the-art models:

  1. FraGNNet predicts the spectrum at high resolution, with interpretable annotations of peaks and their fragments.
  2. DreaMS can go beyond the usual training data through large-scale unsupervised learning on unlabelled spectra.

Challenges we ran into

Dataset size and quality. We filtered the Enveda dataset to remove missing entries and low-quality spectra. The filtered dataset ended up smaller than the NIST and MassSpecGym datasets, and more heterogeneous because of the mix of instrument types and adducts. There was also some imbalance with respect to instrument type (more Orbitrap than QTOF) and precursor adduct (more [M+H]+ than [M-H]-), which might also hurt performance. More work is needed to find the right data mixture.
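The filtering and imbalance check described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the record fields (`smiles`, `adduct`, `instrument`, `peaks`), the `min_peaks` threshold, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical record layout: a dict with "smiles", "adduct",
# "instrument" and "peaks" (a list of (m/z, intensity) pairs).
REQUIRED_FIELDS = ("smiles", "adduct", "instrument", "peaks")


def filter_spectra(records, min_peaks=5):
    """Drop records with missing entries or too few peaks.

    Using a peak count as the quality criterion is an assumption
    for illustration; other quality checks could be swapped in.
    """
    kept = []
    for rec in records:
        # A missing key, None, empty string or empty peak list counts as missing.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            continue
        if len(rec["peaks"]) < min_peaks:
            continue
        kept.append(rec)
    return kept


def class_balance(records, key):
    """Return the fraction of records per value of `key`."""
    counts = Counter(rec[key] for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}
```

Calling `class_balance(filtered, "instrument")` and `class_balance(filtered, "adduct")` on the filtered records would surface the kind of Orbitrap/QTOF and [M+H]+/[M-H]- skew mentioned above, which is a useful starting point when deciding how to reweight or mix datasets.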

Accomplishments that we're proud of

Our model beats all baselines on both the NIST and Enveda datasets!

What's next for MSEffect

How do you best incorporate DreaMS into FraGNNet to improve accuracy?

  • Regularize the FraGNNet latent representation to be similar to the DreaMS embedding
  • Train a molecule-to-embedding encoder to retrieve similar spectra from the DreaMS Atlas
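Both ideas come down to comparing vectors in a shared embedding space. The sketch below shows one possible form, under stated assumptions: cosine similarity as the measure (we have not settled on one), a FraGNNet latent that already has the same dimension as the DreaMS embedding (otherwise a projection layer would be needed), and hypothetical function names. It uses plain Python lists for clarity; in training, the penalty would be written as a differentiable loss in the model's deep learning framework.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def alignment_penalty(latent, target_embedding):
    """Regularization term: 0 when the vectors point the same way,
    growing up to 2 as they point in opposite directions."""
    return 1.0 - cosine_similarity(latent, target_embedding)


def retrieve_top_k(query, library, k=3):
    """Return the names of the k library embeddings closest to `query`.

    `library` is a hypothetical dict mapping a spectrum id to its
    embedding; a real atlas would need an approximate nearest-neighbor
    index rather than this linear scan.
    """
    scored = [(cosine_similarity(query, emb), name) for name, emb in library.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

The penalty would be added to the spectrum prediction loss with a small weight, while the retrieval function illustrates what the molecule-to-embedding encoder's output would be used for at inference time.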