Inspiration
To accelerate scientific discovery, we need to identify a large number of compounds with high precision. MS/MS could do that. The catch: we don't know the spectra of 90% of molecules.
What it does
Given a molecule, MSEffect predicts the molecule's MS/MS spectrum.
How we built it
We combine two state-of-the-art models:
- FraGNNet predicts the spectrum at high resolution, with interpretable annotations of peaks and their fragments
- DreaMS can go beyond the usual training data through large-scale unsupervised learning on unlabelled spectra
Challenges we ran into
Dataset size and quality. We filtered the Enveda dataset to remove missing entries and low-quality spectra. The filtered dataset ended up smaller than the NIST and MassSpecGym datasets, and more heterogeneous due to the mix of instrument types and adducts. There was also some imbalance w.r.t. instrument type (more Orbitrap than QTOF) and precursor adduct (more [M+H]+ than [M-H]-), which might also hurt performance. More work is needed to find the right data mixture.
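A minimal sketch of the kind of filtering and imbalance check described above. The record fields (`smiles`, `instrument`, `adduct`, `peaks`), the toy data, and the `MIN_PEAKS` threshold are all assumptions for illustration, not the actual Enveda schema or our exact quality criteria.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"smiles": "CCO", "instrument": "Orbitrap", "adduct": "[M+H]+",
     "peaks": [(31.02, 0.4), (45.03, 1.0), (47.05, 0.2)]},
    {"smiles": "CC(=O)O", "instrument": "QTOF", "adduct": "[M-H]-",
     "peaks": [(41.00, 0.2), (44.99, 0.5), (59.01, 1.0)]},
    {"smiles": None, "instrument": "Orbitrap", "adduct": "[M+H]+",
     "peaks": [(25.0, 0.1), (50.0, 0.5), (100.0, 1.0)]},   # missing molecule
    {"smiles": "c1ccccc1", "instrument": "Orbitrap", "adduct": "[M+H]+",
     "peaks": [(51.02, 0.1), (77.04, 0.3), (79.05, 1.0)]},
    {"smiles": "CCN", "instrument": "QTOF", "adduct": "[M+H]+",
     "peaks": [(46.07, 1.0)]},                              # too few peaks
]

REQUIRED_FIELDS = ("smiles", "instrument", "adduct", "peaks")
MIN_PEAKS = 3  # assumed stand-in for a "low quality spectrum" criterion


def is_usable(record, min_peaks=MIN_PEAKS):
    """Keep a record only if no required field is missing or empty
    and the spectrum has at least `min_peaks` peaks."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    return len(record["peaks"]) >= min_peaks


def filter_records(records, min_peaks=MIN_PEAKS):
    """Return the subset of records that pass `is_usable`."""
    return [r for r in records if is_usable(r, min_peaks)]


def category_fractions(records, key):
    """Fraction of records per value of `key`, e.g. per instrument type."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


kept = filter_records(records)
print(len(kept), "of", len(records), "records kept")
print("instrument:", category_fractions(kept, "instrument"))
print("adduct:", category_fractions(kept, "adduct"))
```

Fractions like these could then guide reweighting or resampling when mixing datasets, though which mixture works best is still open.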
Accomplishments that we're proud of
Our model beats all baselines on the NIST and Enveda datasets!
What's next for MSEffect
How do you best incorporate DreaMS into FraGNNet to improve accuracy?
- Regularize the FraGNNet latent to be similar to the DreaMS embedding
- Train a molecule-to-embedding encoder to retrieve similar spectra from DreaMS Atlas
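The two ideas above can be sketched in plain Python, using cosine similarity for both. This is only an illustration under stated assumptions: the function names, the `weight` value, and the toy vectors are hypothetical, the latent and the DreaMS embedding are assumed to share a dimension (in practice a projection layer would likely be needed), and nothing here calls the real FraGNNet or DreaMS code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_penalty(latent, target_embedding):
    """Idea 1: penalty that is 0 when the latent points in the same
    direction as the target embedding and grows as they diverge."""
    return 1.0 - cosine_similarity(latent, target_embedding)


def total_loss(spectrum_loss, latent, target_embedding, weight=0.1):
    """Spectrum prediction loss plus the weighted alignment penalty.
    `weight` is an assumed hyperparameter."""
    return spectrum_loss + weight * alignment_penalty(latent, target_embedding)


def retrieve_top_k(query_embedding, library, k=2):
    """Idea 2: rank library spectra by cosine similarity to a query
    embedding produced by a (hypothetical) molecule-to-embedding encoder."""
    scored = [(cosine_similarity(query_embedding, emb), spectrum_id)
              for spectrum_id, emb in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(spectrum_id, score) for score, spectrum_id in scored[:k]]


# Toy embedding library standing in for precomputed spectrum embeddings.
library = {
    "spec_a": [1.0, 0.0, 0.0],
    "spec_b": [0.0, 1.0, 0.0],
    "spec_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.1, 0.0]

print(total_loss(spectrum_loss=1.0, latent=query, target_embedding=library["spec_a"]))
print(retrieve_top_k(query, library, k=2))
```

In a real training setup the penalty would be computed on batched tensors inside the model's loss, and retrieval over a large collection would need an approximate nearest-neighbour index rather than a linear scan.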